diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..205272c
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,70 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Commands
+
+```bash
+# Install with dev dependencies
+pip install -e .[tests,docs] duckdb_engine pytz
+
+# Run all tests including DB integration tests (uses in-memory DuckDB)
+TZ="Europe/Paris" DB_STRING="duckdb:///:memory:" python -m pytest
+
+# Run only non-database tests (no DB_STRING needed)
+pytest -m "not dbtest"
+
+# Run a single test file or test by name
+pytest tests/test_conversions.py
+pytest -k "test_iterative_recursive_parsing"
+
+# Run against a real database instead
+DB_STRING="postgresql+psycopg2://user:pass@localhost/testdb" pytest
+
+# Serve documentation locally
+mkdocs serve
+```
+
+## Architecture
+
+`xml2db` maps an XSD schema to a relational database schema and loads XML files into it. The top-level flow is:
+
+1. **`DataModel`** (`model.py`) reads an XSD file using `xmlschema` + `lxml`, traverses the schema tree, and builds a set of `DataModelTable` objects — one per XSD `complexType`. It then creates SQLAlchemy tables from those objects.
+2. **`DataModel.parse_xml()`** returns a **`Document`** (`document.py`), which holds the parsed flat data ready for insertion.
+3. **`XMLConverter`** (`xml_converter.py`) does the actual XML traversal, producing a nested "document tree" dict. Two strategies exist: iterative (`iterparse=True`) and recursive — tests assert they produce identical output.
+4. **`Document.insert_into_target_tables()`** inserts the flat data into the database. **`Document.to_xml()`** converts it back.
+
+### Table hierarchy (`table/`)
+
+Each XSD `complexType` becomes one of two concrete table classes:
+
+- **`DataModelTableReused`** — deduplicates identical subtrees via a SHA-256 hash column (`xml2db_record_hash`). This is the default. Relationships between a reused child and multiple parents require an intermediate join table (`DataModelRelationN` + `DataModelTransformedTable`).
+- **`DataModelTableDuplicated`** — stores rows without deduplication; parent FK lives directly in the child row. Set `"reuse": False` in `model_config` to use this per table.
+
+Relations are stored as `DataModelRelation1` (0..1 / 1..1) or `DataModelRelationN` (0..n / 1..n) in `DataModelTable.fields`.
+
+### Dialect system (`dialect/`)
+
+`DatabaseDialect` (base class) abstracts DB-specific behaviour: identifier length limits (truncated with MD5 suffix when too long), XSD→SQLAlchemy type mapping, and DDL generation. Each subclass (`postgresql.py`, `mysql.py`, `mssql.py`, `duckdb.py`) overrides only what differs. `get_dialect()` in `dialect/__init__.py` selects the right class from the SQLAlchemy engine dialect name.
+
+### Snapshot tests for model outputs
+
+`tests/test_models_output.py` compares generated ERDs, source/target trees, and SQL DDL against committed `.md`, `.txt`, and `.sql` files under `tests/sample_models/`. When a change intentionally modifies the data model or DDL output, regenerate these snapshots by running:
+
+```bash
+cd tests/sample_models && python models.py
+```
+
+then commit the updated snapshot files alongside the code change.
+
+### Key configuration options (`model_config`)
+
+| Option | Effect |
+|---|---|
+| `tables..reuse` | `False` → `DataModelTableDuplicated` |
+| `tables..choice_transform` | `False` → keep XSD `choice` fields separate instead of type+value columns |
+| `tables..fields..transform` | `False` / `"elevate_wo_prefix"` etc. → override field-level simplification |
+| `row_numbers` | Add ordering column tracking original XML element position |
+| `metadata_columns` | Extra SQLAlchemy columns appended to the root table |
+| `record_hash_column_name` / `record_hash_constructor` / `record_hash_size` | Customise the deduplication hash column |
+| `as_columnstore` | MS SQL Server columnstore index on a table |
diff --git a/docs/getting_started.md b/docs/getting_started.md
index 18e9f70..d438170 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -124,20 +124,29 @@ troubleshooting if need be.
 Actual values need to be passed to [`DataModel.parse_xml`](api/data_model.md#xml2db.model.DataModel.parse_xml) for each parsed documents, as a `dict`, using the `metadata` argument.
 
-!!! note
-    You can also load multiple documents at the same time to the database, which could make the process faster if you
-    have a lot of small XML files to load:
+!!! note "Loading multiple XML files in one database operation"
+    By default, each `parse_xml` + `insert_into_target_tables` call is an independent database operation. When you have
+    many small XML files to load, you can instead accumulate all of them in memory first and insert them in a single
+    batch, which reduces the number of database round-trips.
+
+    Pass the `flat_data` from the previous document into the next `parse_xml` call to accumulate records:
+
     ``` py
-    data = None
+    flat_data = None
     for xml_file in files:
         document = data_model.parse_xml(
-            xml_file="path/to/file.xml",
-            flat_data=data,
+            xml_file=xml_file,
+            metadata={"input_file_path": xml_file},
+            flat_data=flat_data,
         )
-        data = document.data
+        flat_data = document.data
     document.insert_into_target_tables()
     ```
+    Note that each file can carry its own `metadata` values (e.g. the file name or a loading timestamp), which will be
+    stored per root record in the columns defined by
+    [`metadata_columns`](configuring.md#model-configuration).
+
 
 ## Getting back the data into XML
diff --git a/src/xml2db/document.py b/src/xml2db/document.py
index 5471737..f64a9dc 100644
--- a/src/xml2db/document.py
+++ b/src/xml2db/document.py
@@ -56,8 +56,10 @@ def parse_xml(
             skip_validation: Should we validate the document against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
         """
 
         self.xml_file_path = xml_file[:255] if isinstance(xml_file, str) else ""
diff --git a/src/xml2db/model.py b/src/xml2db/model.py
index b2939a4..de0edd7 100644
--- a/src/xml2db/model.py
+++ b/src/xml2db/model.py
@@ -698,8 +698,10 @@ def parse_xml(
             skip_validation: Should we validate the documents against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`Document.insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
 
         Returns:
             A parsed [`Document`](document.md) object