| Directory | Description |
|---|---|
| `scilons_harvesting/` | Core package: TEI XML parsing (SAX-based), TEI-to-Parquet conversion with language detection, document schema, and writers |
| `postprocessors/` | Data cleaning pipelines (datatrove): Gopher quality filtering, MinHash deduplication, PII removal, statistics, dataset partitioning |
| `statistics/` | CLI tools for dataset analysis: citation counting, keyword extraction, language distribution, token statistics |
| `scripts/` | Slurm batch job scripts for running pipelines on HPC clusters |
The following tools are used upstream to harvest and ingest scientific full texts into a uniform TEI XML format, which is the starting point for this project.
Harvesters:
- biblio_glutton_harvester -- large-scale harvester for the full Unpaywall collection, supporting S3 and Swift cloud storage.
- arxiv_harvester -- creates a mirror of all arXiv resources in cloud storage, used by biblio_glutton_harvester for arXiv data.
Ingestion (PDF/XML/LaTeX to TEI XML):
- grobid -- transforms PDFs into TEI XML.
- grobid_client_python -- GROBID client for processing directories of PDFs in parallel.
- Pub2TEI -- transforms publisher XML formats (including NLM JATS) into TEI XML without loss of encoding information.
- LaTeXML custom fork -- transforms LaTeX files into TEI XML (including formulas).
Requires Python >= 3.10.

```sh
git clone https://github.com/scilons/harvesting.git
cd harvesting/
```

Set up a virtual environment:

```sh
virtualenv --system-site-packages -p python3 env
source env/bin/activate
```

Install the package with dependencies:

```sh
make install
```

Converts GROBID TEI XML documents into plain text and language-partitioned Parquet datasets. Serialization is configurable (formula rendering, reference markers, table format, bibliography inclusion) via `config.yml`.
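As a rough illustration of the serialization options mentioned above, a `config.yml` could look like the sketch below. The key names and values here are assumptions for illustration only, not the actual schema — consult the `config.yml` shipped in the repository for the real option names:

```yaml
# Illustrative only -- key names are assumptions; see the repository's
# config.yml for the actual option names and values.
serialization:
  formula_rendering: latex      # how formulas are rendered in the text
  reference_markers: true       # keep inline citation markers
  table_format: markdown        # serialization format for tables
  include_bibliography: false   # whether to append the bibliography
```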
```sh
python -m scilons_harvesting.grobid_to_parquet \
    --input_dir=<path to TEI ZIP files> \
    --output_dir=<output directory> \
    --workers=24
```

See `scilons_harvesting/README.md` for configuration and detailed usage.
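Since the output is partitioned by language, one partition can be loaded directly with pandas. The sketch below assumes a layout of `<output_dir>/<lang_code>/*.parquet` with `eng_Latn`-style codes (as used elsewhere in this README); the exact directory layout and column schema are defined by the package, so treat this as a starting point:

```python
from pathlib import Path

import pandas as pd

# Assumed layout: <output_dir>/<lang_code>/*.parquet (e.g. eng_Latn/).
output_dir = Path("/path/to/parquet")
shards = sorted((output_dir / "eng_Latn").glob("*.parquet"))

# Concatenate all shards of one language partition into a single frame;
# an empty frame is produced when no shards are present.
df = (
    pd.concat((pd.read_parquet(f) for f in shards), ignore_index=True)
    if shards
    else pd.DataFrame()
)
print(f"{len(df)} documents, columns: {list(df.columns)}")
```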
Quality filtering, deduplication, and PII removal pipeline built on datatrove. Includes dataset partitioning and HuggingFace dataset card generation.
```sh
python -m postprocessors.scilons_pipeline \
    --input-dir /path/to/parquet/eng_Latn/ \
    --output-dir /path/to/output
```

See `postprocessors/README.md` for pipeline details and Slurm execution.
CLI tools for citation counting, keyword extraction, language distribution analysis, and token statistics.
```sh
count-tei-citations --input-dir /path/to/zip/files --workers 12 --output results.csv
analyze-dataset-languages --input-dir /path/to/parquet/ --verbose
```

See `statistics/README.md` for all available tools.
Slurm batch job scripts for running the pipelines on HPC clusters.
See scripts/README.md for available scripts.
```sh
make test
make lint
```

Apache 2.0