Merged
70 changes: 70 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Commands

```bash
# Install with dev dependencies
pip install -e ".[tests,docs]" duckdb_engine pytz

# Run all tests including DB integration tests (uses in-memory DuckDB)
TZ="Europe/Paris" DB_STRING="duckdb:///:memory:" python -m pytest

# Run only non-database tests (no DB_STRING needed)
pytest -m "not dbtest"

# Run a single test file or test by name
pytest tests/test_conversions.py
pytest -k "test_iterative_recursive_parsing"

# Run against a real database instead
DB_STRING="postgresql+psycopg2://user:pass@localhost/testdb" pytest

# Serve documentation locally
mkdocs serve
```

## Architecture

`xml2db` maps an XSD schema to a relational database schema and loads XML files into it. The top-level flow is:

1. **`DataModel`** (`model.py`) reads an XSD file using `xmlschema` + `lxml`, traverses the schema tree, and builds a set of `DataModelTable` objects — one per XSD `complexType`. It then creates SQLAlchemy tables from those objects.
2. **`DataModel.parse_xml()`** returns a **`Document`** (`document.py`), which holds the parsed flat data ready for insertion.
3. **`XMLConverter`** (`xml_converter.py`) does the actual XML traversal, producing a nested "document tree" dict. Two strategies exist: iterative (`iterparse=True`) and recursive — tests assert they produce identical output.
4. **`Document.insert_into_target_tables()`** inserts the flat data into the database. **`Document.to_xml()`** converts it back.
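
The two traversal strategies in step 3 can be illustrated with a small, self-contained sketch. It uses the standard library's `xml.etree.ElementTree` rather than `lxml`, and the `node`, `parse_recursive` and `parse_iterative` helpers are hypothetical names for this illustration only, not the `XMLConverter` code. It shows why a recursive walk and an event-driven `iterparse` loop can be expected to build the same nested tree:

```python
import io
import xml.etree.ElementTree as ET

XML = b"<orders><order id='1'><item>apple</item><item>pear</item></order><order id='2'/></orders>"


def node(tag, attrib):
    """Build an empty node of the toy document tree."""
    return {"tag": tag, "attrib": dict(attrib), "text": None, "children": []}


def parse_recursive(xml_bytes):
    """Load the whole tree in memory, then walk it recursively."""

    def walk(elem):
        out = node(elem.tag, elem.attrib)
        out["text"] = (elem.text or "").strip() or None
        out["children"] = [walk(child) for child in elem]
        return out

    return walk(ET.fromstring(xml_bytes))


def parse_iterative(xml_bytes):
    """Stream start/end events and keep an explicit stack of open elements."""
    stack, root = [], None
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            # Attributes are already available when the start event fires.
            stack.append(node(elem.tag, elem.attrib))
        else:
            done = stack.pop()
            done["text"] = (elem.text or "").strip() or None
            if stack:
                stack[-1]["children"].append(done)
            else:
                root = done
            elem.clear()  # release the finished element to bound memory use
    return root


tree_recursive = parse_recursive(XML)
tree_iterative = parse_iterative(XML)
```

Comparing `tree_recursive` and `tree_iterative` for equality mirrors, in miniature, the kind of assertion described above for the real converter.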

### Table hierarchy (`table/`)

Each XSD `complexType` becomes one of two concrete table classes:

- **`DataModelTableReused`** — deduplicates identical subtrees via a SHA-256 hash column (`xml2db_record_hash`). This is the default. Relationships between a reused child and multiple parents require an intermediate join table (`DataModelRelationN` + `DataModelTransformedTable`).
- **`DataModelTableDuplicated`** — stores rows without deduplication; parent FK lives directly in the child row. Set `"reuse": False` in `model_config` to use this per table.

Relations are stored as `DataModelRelation1` (0..1 / 1..1) or `DataModelRelationN` (0..n / 1..n) in `DataModelTable.fields`.
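
As a rough illustration of hash-based deduplication, the sketch below stores a record only when its SHA-256 digest has not been seen before. It is a toy model under stated assumptions: the canonical JSON encoding, the in-memory `addresses` dict and the `record_hash` / `insert_reused` helpers are all hypothetical, and the actual hash input used for `xml2db_record_hash` may differ.

```python
import hashlib
import json


def record_hash(record):
    """SHA-256 over a canonical (sorted-key) JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def insert_reused(table, record):
    """Store the record only if its hash is new; return its primary key either way."""
    digest = record_hash(record)
    if digest not in table:
        table[digest] = {"pk": len(table) + 1, **record}
    return table[digest]["pk"]


addresses = {}  # hypothetical "reused" table, keyed by record hash
pk_a = insert_reused(addresses, {"city": "Paris", "zip": "75001"})
pk_b = insert_reused(addresses, {"zip": "75001", "city": "Paris"})  # same content, different key order
pk_c = insert_reused(addresses, {"city": "Lyon", "zip": "69001"})
```

Because `pk_a` and `pk_b` point at the same stored row, several parents can reference one child record, which is the situation that calls for a join table when the relation is 0..n / 1..n.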

### Dialect system (`dialect/`)

`DatabaseDialect` (base class) abstracts DB-specific behaviour: identifier length limits (truncated with MD5 suffix when too long), XSD→SQLAlchemy type mapping, and DDL generation. Each subclass (`postgresql.py`, `mysql.py`, `mssql.py`, `duckdb.py`) overrides only what differs. `get_dialect()` in `dialect/__init__.py` selects the right class from the SQLAlchemy engine dialect name.
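
The identifier-length handling can be pictured with the following sketch. The length limits are the documented maxima of the three engines, but the `shorten_identifier` helper, the 8-character suffix and the underscore separator are assumptions made for illustration; the exact truncation scheme in `DatabaseDialect` may differ.

```python
import hashlib

# Maximum identifier lengths: PostgreSQL 63, MySQL 64, SQL Server 128.
MAX_IDENTIFIER_LENGTH = {"postgresql": 63, "mysql": 64, "mssql": 128}


def shorten_identifier(name, dialect_name, suffix_len=8):
    """Return `name` unchanged if it fits, else a truncated prefix plus an MD5 suffix."""
    limit = MAX_IDENTIFIER_LENGTH[dialect_name]
    if len(name) <= limit:
        return name
    # The suffix is derived from the full name, so the result is deterministic.
    suffix = hashlib.md5(name.encode("utf-8")).hexdigest()[:suffix_len]
    return f"{name[: limit - suffix_len - 1]}_{suffix}"


long_name = "fk_" + "very_long_table_name_" * 5 + "id"
short_pg = shorten_identifier(long_name, "postgresql")
short_ms = shorten_identifier(long_name, "mssql")
```

Deriving the suffix from the full original name keeps the shortened identifier stable across runs, which matters when generated DDL is compared against committed snapshots.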

### Snapshot tests for model outputs

`tests/test_models_output.py` compares generated ERDs, source/target trees, and SQL DDL against committed `.md`, `.txt`, and `.sql` files under `tests/sample_models/`. When a change intentionally modifies the data model or DDL output, regenerate these snapshots by running:

```bash
cd tests/sample_models && python models.py
```

then commit the updated snapshot files alongside the code change.

### Key configuration options (`model_config`)

| Option | Effect |
|---|---|
| `tables.<name>.reuse` | `False` → `DataModelTableDuplicated` |
| `tables.<name>.choice_transform` | `False` → keep XSD `choice` fields separate instead of type+value columns |
| `tables.<name>.fields.<field>.transform` | `False` / `"elevate_wo_prefix"` etc. → override field-level simplification |
| `row_numbers` | Add ordering column tracking original XML element position |
| `metadata_columns` | Extra SQLAlchemy columns appended to the root table |
| `record_hash_column_name` / `record_hash_constructor` / `record_hash_size` | Customise the deduplication hash column |
| `as_columnstore` | MS SQL Server columnstore index on a table |
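
For orientation, a `model_config` dict combining a few of these options might look like the sketch below. The table and field names (`order_line`, `payment`, `amount`) and the hash column name are hypothetical, and the nesting simply follows the key paths as written in the table above; check the configuration docs before relying on the exact placement of each option.

```python
model_config = {
    # Options listed without a `tables.<name>` prefix in the table above.
    "row_numbers": True,
    "record_hash_column_name": "record_hash",  # hypothetical column name
    "tables": {
        # Hypothetical table: store rows without deduplication.
        "order_line": {"reuse": False},
        # Hypothetical table: keep XSD `choice` fields as separate columns
        # and override the simplification applied to one field.
        "payment": {
            "choice_transform": False,
            "fields": {"amount": {"transform": "elevate_wo_prefix"}},
        },
    },
}
```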
23 changes: 16 additions & 7 deletions docs/getting_started.md
@@ -124,20 +124,29 @@ troubleshooting if need be.
Actual values need to be passed to [`DataModel.parse_xml`](api/data_model.md#xml2db.model.DataModel.parse_xml) for
each parsed document, as a `dict`, using the `metadata` argument.

-!!! note
-    You can also load multiple documents at the same time to the database, which could make the process faster if you
-    have a lot of small XML files to load:
+!!! note "Loading multiple XML files in one database operation"
+    By default, each `parse_xml` + `insert_into_target_tables` call is an independent database operation. When you have
+    many small XML files to load, you can instead accumulate all of them in memory first and insert them in a single
+    batch, which reduces the number of database round-trips.
+
+    Pass the `flat_data` from the previous document into the next `parse_xml` call to accumulate records:

     ``` py
-    data = None
+    flat_data = None
     for xml_file in files:
         document = data_model.parse_xml(
-            xml_file="path/to/file.xml",
-            flat_data=data,
+            xml_file=xml_file,
+            metadata={"input_file_path": xml_file},
+            flat_data=flat_data,
         )
-        data = document.data
+        flat_data = document.data
     document.insert_into_target_tables()
     ```
+
+    Note that each file can carry its own `metadata` values (e.g. the file name or a loading timestamp), which will be
+    stored per root record in the columns defined by
+    [`metadata_columns`](configuring.md#model-configuration).



## Getting back the data into XML
6 changes: 4 additions & 2 deletions src/xml2db/document.py
@@ -56,8 +56,10 @@ def parse_xml(
skip_validation: Should we validate the document against the schema first?
iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-    a new one
+flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+    from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+    accumulated in memory and inserted together with a single
+    [`insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
"""
self.xml_file_path = xml_file[:255] if isinstance(xml_file, str) else "<stream>"

6 changes: 4 additions & 2 deletions src/xml2db/model.py
@@ -698,8 +698,10 @@ def parse_xml(
skip_validation: Should we validate the documents against the schema first?
iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-    a new one
+flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+    from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+    accumulated in memory and inserted together with a single
+    [`Document.insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.

Returns:
A parsed [`Document`](document.md) object