| Directory | Description |
|---|---|
| `scilons_harvesting/` | Core package: TEI XML parsing (SAX-based), TEI-to-Parquet conversion with language detection, document schema, and writers |
| `postprocessors/` | Data cleaning pipelines (datatrove): Gopher quality filtering, MinHash deduplication, PII removal, statistics, dataset partitioning |
| `statistics/` | CLI tools for dataset analysis: citation counting, keyword extraction, language distribution, token statistics |
| `scripts/` | Slurm batch job scripts for running pipelines on HPC clusters |
The following tools are used upstream to harvest and ingest scientific full texts into a uniform TEI XML format, which is the starting point for this project.
Harvesters:
- biblio_glutton_harvester -- large-scale harvester for the full Unpaywall collection, supporting S3 and Swift cloud storage.
- arxiv_harvester -- creates a mirror of all arXiv resources in cloud storage, used by biblio_glutton_harvester for arXiv data.
Ingestion (PDF/XML/LaTeX to TEI XML):
- grobid -- transforms PDFs into TEI XML.
- grobid_client_python -- GROBID client for processing directories of PDFs in parallel.
- Pub2TEI -- transforms publisher XML formats (including NLM JATS) into TEI XML without loss of encoding information.
- LaTeXML custom fork -- transforms LaTeX files into TEI XML (including formulas).
Requires Python >= 3.10.

```sh
git clone https://github.com/scilons/harvesting.git
cd harvesting/
```

Set up a virtual environment:

```sh
virtualenv --system-site-packages -p python3 env
source env/bin/activate
```

Install the package with dependencies:

```sh
make install
```

Converts GROBID TEI XML documents into plain text and language-partitioned Parquet datasets. Serialization is configurable (formula rendering, reference markers, table format, bibliography inclusion) via `config.yml`.
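As a rough illustration of the serialization options mentioned above, a `config.yml` could look like the sketch below. The key names and values here are assumptions for illustration only, not the actual schema — consult the `config.yml` shipped in the repository for the real option names:

```yaml
# Illustrative only -- key names are assumptions; see the repository's
# config.yml for the actual option names and values.
serialization:
  formula_rendering: latex      # how formulas are rendered in the text
  reference_markers: true       # keep inline citation markers
  table_format: markdown        # serialization format for tables
  include_bibliography: false   # whether to append the bibliography
```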
```sh
python -m scilons_harvesting.grobid_to_parquet \
    --input_dir=<path to TEI ZIP files> \
    --output_dir=<output directory> \
    --workers=24
```

See `scilons_harvesting/README.md` for configuration and detailed usage.
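Since the output is partitioned by language, one partition can be loaded directly with pandas. The sketch below assumes a layout of `<output_dir>/<lang_code>/*.parquet` with `eng_Latn`-style codes (as used elsewhere in this README); the exact directory layout and column schema are defined by the package, so treat this as a starting point:

```python
from pathlib import Path

import pandas as pd

# Assumed layout: <output_dir>/<lang_code>/*.parquet (e.g. eng_Latn/).
output_dir = Path("/path/to/parquet")
shards = sorted((output_dir / "eng_Latn").glob("*.parquet"))

# Concatenate all shards of one language partition into a single frame;
# an empty frame is produced when no shards are present.
df = (
    pd.concat((pd.read_parquet(f) for f in shards), ignore_index=True)
    if shards
    else pd.DataFrame()
)
print(f"{len(df)} documents, columns: {list(df.columns)}")
```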
Quality filtering, deduplication, and PII removal pipeline built on datatrove. Includes dataset partitioning and HuggingFace dataset card generation.
```sh
python -m postprocessors.scilons_pipeline \
    --input-dir /path/to/parquet/eng_Latn/ \
    --output-dir /path/to/output
```

See `postprocessors/README.md` for pipeline details and Slurm execution.
CLI tools for citation counting, keyword extraction, language distribution analysis, and token statistics.
```sh
count-tei-citations --input-dir /path/to/zip/files --workers 12 --output results.csv
analyze-dataset-languages --input-dir /path/to/parquet/ --verbose
```

See `statistics/README.md` for all available tools.
Slurm batch job scripts for running the pipelines on HPC clusters.
See scripts/README.md for available scripts.
```sh
make test
make lint
```

Apache 2.0