A small Python command-line tool that copies EUR-Lex data dump files (HTML, PDF, DOCX, …) out of a UUID-based archive, where files are named by technical identifiers, into a clean and customizable output folder hierarchy. Folder and file names are built from each file's RDF metadata, which has to be downloaded separately.
Why? EU Cellar dumps are great, but the folder layout (one UUID folder per document with terse filenames) is a bit uncomfortable for browsing. This script lets you reorganise the dump by year, month, ELI, title — whatever you like — while keeping the original files intact.
- RDF‑powered – extracts creation date, ELI, resource‑type, English title & subtitle via a SPARQL query.
- Template masks – design your own folder structure and filenames using placeholders such as `{year}`, `{month}`, `{eli}`, `{title}`, etc.
- Conflict handling – name clashes are resolved by appending the UUID (or a fallback) plus a counter.
- Test mode – `--limit N` processes only the first N files for fast experiments.
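To make the template masks and conflict handling more concrete, here is a minimal sketch of how such a scheme can work. It is not taken from `unDump.py`: the function names `render_target` and `resolve_conflict`, the underscore separator, and the sample metadata values are all assumptions made for illustration.

```python
from pathlib import Path


def render_target(folder_mask: str, file_mask: str, meta: dict, ext: str) -> Path:
    """Fill both masks with metadata values and build a relative path."""
    folder = folder_mask.format(**meta)  # an empty mask yields a flat layout
    stem = file_mask.format(**meta)
    return Path(folder) / f"{stem}{ext}"


def resolve_conflict(target: Path, taken: set, fallback_id: str) -> Path:
    """Return a path not yet in `taken`: append the ID first, then a counter."""
    candidate = target
    if candidate in taken:
        candidate = target.with_name(f"{target.stem}_{fallback_id}{target.suffix}")
        counter = 1
        while candidate in taken:
            candidate = target.with_name(
                f"{target.stem}_{fallback_id}_{counter}{target.suffix}"
            )
            counter += 1
    taken.add(candidate)
    return candidate


# Hypothetical, already sanitised metadata values.
meta = {"year": "2022", "month": "06", "eli": "eli-reg-2022-922-oj"}
taken = set()
first = resolve_conflict(render_target("{year}/{month}", "{eli}", meta, ".html"), taken, "uuid1")
second = resolve_conflict(render_target("{year}/{month}", "{eli}", meta, ".html"), taken, "uuid1")
print(first)
print(second)
```

The second call targets the same path as the first, so it receives the identifier suffix; a third clash would add a counter on top.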
ARCHIVE_DIR/<UUID>/<FILETYPE>/<original_file>.<ext>
METADATA_DIR/<UUID>/tree_non_inferred.rdf # filename is configurable
Both directories must share the same set of UUID folders. The tool should work for every archive type (html, xhtml, pdf, formex), though it has only been tested with HTML files.
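Before a long run it can be useful to confirm that the two directories really do contain the same UUID folders. The following is a small standalone sketch, not part of the tool; `uuid_folders` and `compare_layouts` are hypothetical helper names, and the paths in the usage comment are placeholders.

```python
from pathlib import Path


def uuid_folders(root: Path) -> set:
    """Names of the immediate sub-folders of `root` (one per document)."""
    return {p.name for p in root.iterdir() if p.is_dir()}


def compare_layouts(archive_dir: Path, metadata_dir: Path) -> tuple:
    """Return (UUIDs lacking metadata, UUIDs lacking archive files)."""
    archive = uuid_folders(archive_dir)
    metadata = uuid_folders(metadata_dir)
    return sorted(archive - metadata), sorted(metadata - archive)


# Usage, with placeholder paths:
# no_meta, no_files = compare_layouts(Path("/path/to/ARCHIVE_DIR"),
#                                     Path("/path/to/METADATA_DIR"))
```

Two empty lists mean the folder sets match.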
- Download the archives you are interested in from https://datadump.publications.europa.eu/.
- Also download the metadata archive.
- Extract both archives into separate folders.
git clone https://github.com/openjusticebe/EURLex-unDump
cd EURLex-unDump
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt # click, rdflib

(Or simply `pip install click rdflib` if you prefer.)
python unDump.py \
/path/to/ARCHIVE_DIR \
/path/to/OUTPUT_DIR \
/path/to/METADATA_DIR \
--folder-mask "{year}/{month}" \
--file-mask "{eli}" \
-v

Common flags
| Option | Default | What it does |
|---|---|---|
| `--folder-mask` | `{year}/{month}` | Template for sub-folders. Leave empty (`""`) for a flat layout. |
| `--file-mask` | `{eli}` | Template for the file stem (the extension is preserved). |
| `--limit N` | none | Process only the first N alphabetically sorted files. |
| `--language ENG` | `ENG` | Language to use when retrieving metadata attributes (three-letter code). |
| `-v` / `-vv` | none | Increase log verbosity (INFO / DEBUG). |
| `--help` | none | Full reference. |
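As an illustration of what `--limit N` implies, the sketch below collects the files under an archive folder, sorts them, and keeps the first N. It assumes sorting by full path; the actual ordering used by the script may differ, and `select_files` is a hypothetical name.

```python
from pathlib import Path


def select_files(archive_dir: Path, limit=None) -> list:
    """List all files under `archive_dir` in sorted order, truncated to `limit`."""
    files = sorted(p for p in archive_dir.rglob("*") if p.is_file())
    return files if limit is None else files[:limit]


# Usage, with a placeholder path:
# for path in select_files(Path("/path/to/ARCHIVE_DIR"), limit=10):
#     print(path)
```

Because the selection is deterministic, repeated test runs with the same N touch the same documents.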
| Placeholder | Example | Notes |
|---|---|---|
| `{year}` | 2022 | Parsed from legacy creation date. |
| `{month}` | 06 | - |
| `{day}` | 15 | - |
| `{date}` | 2022-06-15 | Full date string. |
| `{eli}` | eli/reg/2022/922/oj | ELI reference. |
| `{celex_identifier}` | 32022R0889 | Celex number. |
| `{title}` | Commission Regulation … | English title. |
| `{subtitle}` | … amending Regulation (EC) … | English subtitle. |
| `{type}` | REG | Resource type URI (stringified). |
| `{default_identifier}` | UUID (CELLAR ID) | Fallback used in conflicts. |
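The date-related placeholders can all be derived from a single creation date. The sketch below assumes the date is available as an ISO `YYYY-MM-DD` string, as in the `{date}` example above; `date_placeholders` is a hypothetical helper, not a function of the script.

```python
from datetime import date


def date_placeholders(iso_date: str) -> dict:
    """Split an ISO date string into zero-padded placeholder values."""
    d = date.fromisoformat(iso_date)
    return {
        "year": f"{d.year:04d}",
        "month": f"{d.month:02d}",
        "day": f"{d.day:02d}",
        "date": d.isoformat(),
    }


values = date_placeholders("2022-06-15")
print("{year}/{month}".format(**values))  # prints 2022/06
```

Zero-padding the month and day keeps the resulting folders in chronological order when sorted by name.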
- The core slugification logic lives in `slugify()` – adjust the regex or `MAX_SEGMENT_LEN` as needed.
- The SPARQL query is in `parse_metadata()` – extend the predicate selection there.
- For unit tests, point `ARCHIVE_DIR` and `METADATA_DIR` at a small subset of documents and use `--limit`.
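For orientation, a slugification step of this kind typically replaces unsafe characters and caps the length of each path segment. The following is a generic sketch and not the repository's `slugify()`: the regex, the value of `MAX_SEGMENT_LEN`, the `"untitled"` fallback, and the name `slugify_sketch` are all assumptions.

```python
import re
import unicodedata

MAX_SEGMENT_LEN = 100  # assumed value; check the script for the real one


def slugify_sketch(text: str, max_len: int = MAX_SEGMENT_LEN) -> str:
    """Turn arbitrary text into a filesystem-friendly path segment."""
    # Decompose accented characters, then drop anything that is not ASCII.
    normalized = unicodedata.normalize("NFKD", text)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of unsafe characters into a single underscore.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_text).strip("_.")
    # Truncate, tidy the cut edge, and never return an empty segment.
    return slug[:max_len].rstrip("_.") or "untitled"


print(slugify_sketch("Commission Regulation (EU) 2022/922"))
```

Note that slashes are replaced too, so a value such as an ELI reference collapses into a single segment rather than creating nested folders.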
See repository.