openjusticebe/EURLex-unDump

Undumper utility for EUR-Lex Data Dumps

A small Python command-line tool that copies EUR-Lex data dump files (HTML, PDF, DOCX, …) from a UUID-based archive, where folders and files are named by technical identifiers, into a clean and customizable output folder hierarchy. Folder and file names are built from each document's RDF metadata, which must be downloaded separately.

Why? EU Cellar dumps are great, but the folder layout (one UUID folder per document with terse filenames) is a bit uncomfortable for browsing. This script lets you reorganise the dump by year, month, ELI, title — whatever you like — while keeping the original files intact.


Features

  • RDF‑powered – extracts creation date, ELI, resource‑type, English title & subtitle via a SPARQL query.
  • Template masks – design your own folder structure and filenames using placeholders such as {year}, {month}, {eli}, {title}, etc.
  • Conflict handling – name clashes are resolved by appending the UUID (or a fallback) plus a counter.
  • Test mode – --limit N processes only the first N files for fast experiments.
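The conflict-handling rule above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `resolve_conflict`, the underscore separator, and the in-memory `taken` set (instead of checking the filesystem) are all assumptions made for the example.

```python
from pathlib import Path


def resolve_conflict(target: Path, fallback_id: str, taken: set[Path]) -> Path:
    """Return a free path, appending the identifier and then a counter if needed.

    Hypothetical helper: `taken` stands in for "paths already written".
    """
    if target not in taken:
        taken.add(target)
        return target
    # First attempt: append the UUID (or another fallback identifier) to the stem.
    candidate = target.with_name(f"{target.stem}_{fallback_id}{target.suffix}")
    counter = 1
    while candidate in taken:
        # Still clashing: add an increasing counter after the identifier.
        candidate = target.with_name(
            f"{target.stem}_{fallback_id}_{counter}{target.suffix}"
        )
        counter += 1
    taken.add(candidate)
    return candidate
```

Calling it three times with the same target and identifier would yield the plain name, then the name with the identifier, then the name with the identifier and a counter.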

Directory assumptions

ARCHIVE_DIR/<UUID>/<FILETYPE>/<original_file>.<ext>
METADATA_DIR/<UUID>/tree_non_inferred.rdf        # filename is configurable

Both directories must share the same set of UUID folders. The script should work for every archive type (html, xhtml, pdf, formex), though it has only been tested with html files.
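To make the layout concrete, here is a rough sketch of how archive files could be paired with their metadata file. The function `iter_documents` and its `limit` parameter are hypothetical names chosen for this example, and skipping UUIDs without metadata is an assumption, not necessarily what unDump.py does.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple

METADATA_FILENAME = "tree_non_inferred.rdf"  # per the layout above; configurable


def iter_documents(
    archive_dir: Path, metadata_dir: Path, limit: Optional[int] = None
) -> Iterator[Tuple[Path, Path]]:
    """Yield (source_file, metadata_file) pairs in alphabetical order."""
    # Matches ARCHIVE_DIR/<UUID>/<FILETYPE>/<original_file>.<ext>
    files = sorted(p for p in archive_dir.glob("*/*/*") if p.is_file())
    if limit is not None:
        files = files[:limit]  # keep only the first N files, as a test mode would
    for source in files:
        uuid = source.relative_to(archive_dir).parts[0]
        metadata = metadata_dir / uuid / METADATA_FILENAME
        if not metadata.is_file():
            continue  # assumption: silently skip documents without metadata
        yield source, metadata
```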


Quick‑start

0. Preparation

1. Clone & set up a virtualenv

git clone https://github.com/openjusticebe/EURLex-unDump
cd EURLex-unDump
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt  # click, rdflib

(Or simply pip install click rdflib if you prefer.)

2. Run the script

python unDump.py \
    /path/to/ARCHIVE_DIR \
    /path/to/OUTPUT_DIR \
    /path/to/METADATA_DIR \
    --folder-mask "{year}/{month}" \
    --file-mask "{eli}" \
    -v

Common flags

| Option | Default | What it does |
| --- | --- | --- |
| --folder-mask | {year}/{month} | Template for sub-folders. Leave empty ("") for a flat layout. |
| --file-mask | {eli} | Template for the file stem (the extension is preserved). |
| --limit N | (none) | Process only the first N files, in alphabetical order. |
| --language ENG | ENG | Language to use when retrieving metadata attributes (three-letter code). |
| -v / -vv | (none) | Increase log verbosity (INFO / DEBUG). |
| --help | (none) | Full reference. |

3. Templating reference

| Placeholder | Example | Notes |
| --- | --- | --- |
| {year} | 2022 | Parsed from legacy creation date. |
| {month} | 06 | |
| {day} | 15 | |
| {date} | 2022-06-15 | Full date string. |
| {eli} | eli/reg/2022/922/oj | ELI reference. |
| {celex_identifier} | 32022R0889 | Celex number. |
| {title} | Commission Regulation … | English title. |
| {subtitle} | … amending Regulation (EC) … | English subtitle. |
| {type} | REG | Resource type URI (stringified). |
| {default_identifier} | UUID (CELLAR ID) | Fallback used in conflicts. |
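As an illustration of how such masks can be expanded, the sketch below uses Python's built-in `str.format_map`. It is not taken from unDump.py: `render_target`, the `_MissingAsUnknown` helper, the "unknown" fallback, and the replacement of "/" in the file stem are assumptions made for the example.

```python
from pathlib import Path


class _MissingAsUnknown(dict):
    """Hypothetical helper: substitute 'unknown' for placeholders with no value."""

    def __missing__(self, key: str) -> str:
        return "unknown"


def render_target(folder_mask: str, file_mask: str, meta: dict, ext: str) -> Path:
    """Build a relative output path from the two masks and a metadata dict."""
    values = _MissingAsUnknown(meta)
    folder = folder_mask.format_map(values)  # an empty mask gives a flat layout
    # Assumption: slashes in the stem (e.g. from an ELI) are replaced so the
    # file name stays a single path segment.
    stem = file_mask.format_map(values).replace("/", "_")
    return Path(folder) / f"{stem}{ext}"


meta = {"year": "2022", "month": "06", "eli": "eli/reg/2022/922/oj"}
print(render_target("{year}/{month}", "{eli}", meta, ".html"))
```

Using a dict subclass with `__missing__` keeps a single missing metadata field from raising `KeyError` halfway through a long run; whether the real script behaves this way is not stated here.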

Development hints

  • The core slugification logic lives in slugify() – adjust the regex or MAX_SEGMENT_LEN as needed.
  • The SPARQL query is in parse_metadata() – extend the predicate selection there as needed.
  • For unit tests, point ARCHIVE_DIR and METADATA_DIR at a small subset of documents and use --limit.
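For orientation, a slugification step of the kind mentioned in the first hint might look like the sketch below. The regex, the value of `MAX_SEGMENT_LEN`, and the "untitled" fallback are assumptions; the `slugify()` in unDump.py may differ in all of them.

```python
import re
import unicodedata

MAX_SEGMENT_LEN = 100  # assumed value; the script's actual limit may differ


def slugify(text: str, max_len: int = MAX_SEGMENT_LEN) -> str:
    """Reduce text to a filesystem-friendly ASCII path segment."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of characters outside [A-Za-z0-9._-] into one underscore.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_text).strip("_")
    # Truncate, then strip again in case the cut landed on a separator.
    return slug[:max_len].rstrip("_") or "untitled"
```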

License

See repository.

About

Undumps data from the Publications Office's legal data dump service.
