
scilons/harvesting


A collection of parsers for scientific data, developed for the SciLonS project.

Repository structure

  • scilons_harvesting/ -- core package: SAX-based TEI XML parsing, TEI-to-Parquet conversion with language detection, document schema, and writers.
  • postprocessors/ -- data cleaning pipelines (datatrove): Gopher quality filtering, MinHash deduplication, PII removal, statistics, and dataset partitioning.
  • statistics/ -- CLI tools for dataset analysis: citation counting, keyword extraction, language distribution, and token statistics.
  • scripts/ -- Slurm batch job scripts for running pipelines on HPC clusters.

Prerequisites

The following tools are used upstream to harvest and ingest scientific full texts into a uniform TEI XML format, which is the starting point for this project.

Harvesters:

  • biblio_glutton_harvester -- large-scale harvester for the full Unpaywall collection, supporting S3 and Swift cloud storage.
  • arxiv_harvester -- creates a mirror of all arXiv resources in cloud storage, used by biblio_glutton_harvester for arXiv data.

Ingestion (PDF/XML/LaTeX to TEI XML):

  • grobid -- transforms PDFs into TEI XML.
  • grobid_client_python -- GROBID client for processing directories of PDFs in parallel.
  • Pub2TEI -- transforms publisher XML formats (including NLM JATS) into TEI XML without loss of encoding information.
  • LaTeXML custom fork -- transforms LaTeX files into TEI XML (including formulas).

Install

Requires Python >= 3.10.

git clone https://github.com/scilons/harvesting.git
cd harvesting/

Set up a virtual environment:

virtualenv --system-site-packages -p python3 env
source env/bin/activate

Install the package with dependencies:

make install

TEI2text -- parsing structured full text

Converts GROBID TEI XML documents into plain text and language-partitioned Parquet datasets. Serialization is configurable (formula rendering, reference markers, table format, bibliography inclusion) via config.yml.

python -m scilons_harvesting.grobid_to_parquet \
    --input_dir=<path to TEI ZIP files> \
    --output_dir=<output directory> \
    --workers=24
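
The exact schema of config.yml is documented in the package README; purely as an illustration, the options mentioned above could look something like this (all key names below are assumptions, not the actual schema):

```yaml
# Hypothetical sketch of config.yml -- key names are illustrative
# assumptions; see scilons_harvesting/README.md for the real schema.
formulas: latex        # how to render <formula> elements
references: markers    # emit inline reference markers in the text
tables: markdown       # serialization format for tables
bibliography: true     # include the bibliography section in the output
```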

See scilons_harvesting/README.md for configuration and detailed usage.

Postprocessing

Quality filtering, deduplication, and PII removal pipeline built on datatrove. Includes dataset partitioning and HuggingFace dataset card generation.

python -m postprocessors.scilons_pipeline \
    --input-dir /path/to/parquet/eng_Latn/ \
    --output-dir /path/to/output

See postprocessors/README.md for pipeline details and Slurm execution.
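
The quality filtering itself is delegated to datatrove's Gopher filter. As a rough, standalone illustration of the kind of heuristics that filter applies (the thresholds and checks below are illustrative, not the pipeline's actual configuration):

```python
# Toy sketch of Gopher-style quality heuristics -- illustrative only;
# the real pipeline uses datatrove's Gopher quality filter with its
# own checks and thresholds.

def passes_gopher_style_checks(text: str) -> bool:
    words = text.split()
    # Document length bounds (illustrative values).
    if not 50 <= len(words) <= 100_000:
        return False
    # Mean word length should fall in a plausible range.
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:
        return False
    # Most words should contain at least one alphabetic character.
    alpha = sum(1 for w in words if any(c.isalpha() for c in w))
    if alpha / len(words) < 0.8:
        return False
    # Reject documents that are almost entirely bullet lines.
    lines = text.splitlines()
    if lines and sum(l.strip().startswith("-") for l in lines) / len(lines) > 0.9:
        return False
    return True
```

A document failing any check is dropped before deduplication and PII removal.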

Statistics

CLI tools for citation counting, keyword extraction, language distribution analysis, and token statistics.

count-tei-citations --input-dir /path/to/zip/files --workers 12 --output results.csv
analyze-dataset-languages --input-dir /path/to/parquet/ --verbose

See statistics/README.md for all available tools.
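
GROBID encodes in-text citations as TEI ref elements with type "bibr". A minimal version of the citation count (a standard-library sketch, not the count-tei-citations implementation) looks like:

```python
# Minimal sketch of counting in-text citations in a GROBID TEI document --
# illustrative only; the repository's tool is count-tei-citations.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def count_bibr_refs(tei_xml: str) -> int:
    """Count <ref type="bibr"> elements, GROBID's marker for in-text citations."""
    root = ET.fromstring(tei_xml)
    return sum(
        1
        for ref in root.iter(f"{{{TEI_NS}}}ref")
        if ref.get("type") == "bibr"
    )

sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    'See <ref type="bibr">[1]</ref> and <ref type="bibr">[2]</ref>, '
    'figure <ref type="figure">1</ref>.'
    '</p></body></text></TEI>'
)
print(count_bibr_refs(sample))  # → 2
```

Note that figure and table references carry other type values and are excluded.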

Additional scripts

Slurm batch job scripts for running the pipelines on HPC clusters.

See scripts/README.md for available scripts.

Development & Testing

Unit tests

make test

Linting

make lint

License

Apache 2.0
