Undumper utility for EUR-Lex Data Dumps

A small Python command‑line tool that copies EUR-Lex data dump files (HTML, PDF, DOCX, …) from a UUID‑based archive, whose folders are named by technical identifiers, into a clean and customizable output folder hierarchy. Folder and file names are built from each file's corresponding metadata record, which must be downloaded separately.

Why? EU Cellar dumps are great, but the folder layout (one UUID folder per document with terse filenames) is a bit uncomfortable for browsing. This script lets you reorganise the dump by year, month, ELI, title — whatever you like — while keeping the original files intact.

Features

RDF‑powered – extracts creation date, ELI, resource‑type, English title & subtitle via a SPARQL query.
Template masks – design your own folder structure and filenames using placeholders such as {year}, {month}, {eli}, {title}, etc.
Conflict handling – name clashes are resolved by appending the UUID (or a fallback) plus a counter.
Test mode – --limit N processes only the first N files for fast experiments.
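
The conflict handling described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name resolve_conflict, the underscore separator, and the counter starting at 1 are all assumptions.

```python
from pathlib import Path


def resolve_conflict(target: Path, uuid: str) -> Path:
    """Return a free path derived from `target` (hypothetical helper)."""
    if not target.exists():
        return target
    # First fallback: append the UUID to the file stem.
    candidate = target.with_name(f"{target.stem}_{uuid}{target.suffix}")
    counter = 1
    # Second fallback: add an increasing counter until the name is free.
    while candidate.exists():
        candidate = target.with_name(
            f"{target.stem}_{uuid}_{counter}{target.suffix}"
        )
        counter += 1
    return candidate
```

The original file is never overwritten: the helper only ever returns a path that does not exist yet.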

Directory assumptions

ARCHIVE_DIR/<UUID>/<FILETYPE>/<original_file>.<ext>
METADATA_DIR/<UUID>/tree_non_inferred.rdf        # filename is configurable

Both directories must share the same set of UUID folders. The script should work for every archive type (html, xhtml, pdf, formex), although it has only been tested with html files.
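
Before a long run it can help to verify that assumption. The sketch below is not part of the tool; compare_uuid_folders is a hypothetical helper that only compares the names of the top-level folders in both directories.

```python
from pathlib import Path


def compare_uuid_folders(archive_dir: str, metadata_dir: str) -> tuple[set[str], set[str]]:
    """Return (folders without metadata, metadata folders without archive)."""
    archive = {p.name for p in Path(archive_dir).iterdir() if p.is_dir()}
    metadata = {p.name for p in Path(metadata_dir).iterdir() if p.is_dir()}
    # Set differences give the UUIDs present on one side only.
    return archive - metadata, metadata - archive
```

Two empty sets mean that every UUID folder in the archive has a metadata counterpart and vice versa.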

Quick‑start

0. Preparation

Download the archives you are interested in from https://datadump.publications.europa.eu/.
Download the metadata archive as well.
Extract both archives into separate folders.

1. Clone & set up a virtualenv

git clone https://github.com/openjusticebe/EURLex-unDump
cd EURLex-unDump
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt  # click, rdflib

(Or simply pip install click rdflib if you prefer.)

2. Run the script

python undump.py \
    /path/to/ARCHIVE_DIR \
    /path/to/OUTPUT_DIR \
    /path/to/METADATA_DIR \
    --folder-mask "{year}/{month}" \
    --file-mask "{eli}" \
    -v

Common flags

Option	Default	What it does
`--folder-mask`	`{year}/{month}`	Template for sub‑folders. Leave empty (`""`) for a flat layout.
`--file-mask`	`{eli}`	Template for file stem (extension is preserved).
`--limit N`	—	Process only the first N alphabetically sorted files.
`--language ENG`	`ENG`	Language to use when retrieving metadata attributes (three-letter code).
`-v / -vv`	—	Increase log verbosity (INFO / DEBUG).
`--help`	—	Full reference.

3. Templating reference

Placeholder	Example	Notes
`{year}`	`2022`	Parsed from legacy creation date.
`{month}`	`06`	—
`{day}`	`15`	—
`{date}`	`2022-06-15`	Full date string.
`{eli}`	`eli/reg/2022/922/oj`	ELI reference.
`{celex_identifier}`	`32022R0889`	Celex number.
`{title}`	`Commission Regulation …`	English title.
`{subtitle}`	`… amending Regulation (EC) …`	English subtitle.
`{type}`	`REG`	Resource type URI (stringified).
`{default_identifier}`	UUID (CELLAR ID)	Fallback used in conflicts.
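
To make the placeholder mechanics concrete, here is a minimal sketch of how a folder mask and a file mask could be combined into a target path using Python's str.format_map. The helper names, the "unknown" fallback for missing values, and the replacement of slashes in the file stem are assumptions made for illustration; the script may treat these cases differently.

```python
from collections import defaultdict
from pathlib import Path


def render_mask(mask: str, metadata: dict[str, str]) -> str:
    """Fill {placeholders} in `mask` from `metadata` (hypothetical helper)."""
    # Missing placeholders render as "unknown" instead of raising KeyError.
    values = defaultdict(lambda: "unknown", metadata)
    return mask.format_map(values)


def build_target(output_dir: str, folder_mask: str, file_mask: str,
                 metadata: dict[str, str], suffix: str) -> Path:
    """Combine both masks into an output path, keeping the file extension."""
    folder = render_mask(folder_mask, metadata)
    # Assumption: slashes in the stem (e.g. in an ELI) must not create folders.
    stem = render_mask(file_mask, metadata).replace("/", "_")
    return Path(output_dir) / folder / f"{stem}{suffix}"
```

With an empty folder mask the path collapses to OUTPUT_DIR/<stem><ext>, which matches the flat layout mentioned under Common flags.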

Development hints

The core slugification logic lives in slugify() – adjust the regex or MAX_SEGMENT_LEN as needed.
The SPARQL query is in parse_metadata() – extend its predicate selection to expose additional metadata fields as placeholders.
For unit tests, point ARCHIVE_DIR and METADATA_DIR at a small subset of documents and use --limit.
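
For orientation, a slugification routine of the kind mentioned above might look like the sketch below. It is not the implementation found in the repository: the regex, the MAX_SEGMENT_LEN value of 100, and the "untitled" fallback are assumptions.

```python
import re
import unicodedata

MAX_SEGMENT_LEN = 100  # assumed value; the script's actual limit may differ


def slugify(text: str, max_len: int = MAX_SEGMENT_LEN) -> str:
    """Turn arbitrary text into a filesystem-friendly path segment."""
    # Decompose accented characters, then drop whatever is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of characters outside [A-Za-z0-9._-] into one underscore.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_text).strip("_")
    # Truncate long segments and never return an empty name.
    return slug[:max_len].rstrip("_") or "untitled"
```

Loosening the character class keeps more of the original title, while a smaller max_len helps on filesystems with short path limits.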

License

See repository.
