cre-dev · martinv13 · May 28, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,70 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Commands
+
+```bash
+# Install with dev dependencies
+pip install -e ".[tests,docs]" duckdb_engine pytz
+
+# Run all tests including DB integration tests (uses in-memory DuckDB)
+TZ="Europe/Paris" DB_STRING="duckdb:///:memory:" python -m pytest
+
+# Run only non-database tests (no DB_STRING needed)
+pytest -m "not dbtest"
+
+# Run a single test file or test by name
+pytest tests/test_conversions.py
+pytest -k "test_iterative_recursive_parsing"
+
+# Run against a real database instead
+DB_STRING="postgresql+psycopg2://user:pass@localhost/testdb" pytest
+
+# Serve documentation locally
+mkdocs serve
+```
+
+## Architecture
+
+`xml2db` maps an XSD schema to a relational database schema and loads XML files into it. The top-level flow is:
+
+1. **`DataModel`** (`model.py`) reads an XSD file using `xmlschema` + `lxml`, traverses the schema tree, and builds a set of `DataModelTable` objects — one per XSD `complexType`. It then creates SQLAlchemy tables from those objects.
+2. **`DataModel.parse_xml()`** returns a **`Document`** (`document.py`), which holds the parsed flat data ready for insertion.
+3. **`XMLConverter`** (`xml_converter.py`) does the actual XML traversal, producing a nested "document tree" dict. Two strategies exist: iterative (`iterparse=True`) and recursive — tests assert they produce identical output.
+4. **`Document.insert_into_target_tables()`** inserts the flat data into the database. **`Document.to_xml()`** converts the flat data back to XML.
+
+### Table hierarchy (`table/`)
+
+Each XSD `complexType` becomes one of two concrete table classes:
+
+- **`DataModelTableReused`**: deduplicates identical subtrees via a record hash column (`xml2db_record_hash` by default; the hash function and size are configurable, see `record_hash_constructor` and `record_hash_size` below). This table class is used unless configured otherwise. Relationships between a reused child and multiple parents require an intermediate join table (`DataModelRelationN` + `DataModelTransformedTable`).
+- **`DataModelTableDuplicated`**: stores rows without deduplication; the parent FK lives directly in the child row. Set `"reuse": False` for a table in `model_config` to use this class.
+
+Relations are stored as `DataModelRelation1` (0..1 / 1..1) or `DataModelRelationN` (0..n / 1..n) in `DataModelTable.fields`.
+
+### Dialect system (`dialect/`)
+
+`DatabaseDialect` (base class) abstracts DB-specific behaviour: identifier length limits (truncated with MD5 suffix when too long), XSD→SQLAlchemy type mapping, and DDL generation. Each subclass (`postgresql.py`, `mysql.py`, `mssql.py`, `duckdb.py`) overrides only what differs. `get_dialect()` in `dialect/__init__.py` selects the right class from the SQLAlchemy engine dialect name.
+
+### Snapshot tests for model outputs
+
+`tests/test_models_output.py` compares generated ERDs, source/target trees, and SQL DDL against committed `.md`, `.txt`, and `.sql` files under `tests/sample_models/`. When a change intentionally modifies the data model or DDL output, regenerate these snapshots by running:
+
+```bash
+cd tests/sample_models && python models.py
+```
+
+then commit the updated snapshot files alongside the code change.
+
+### Key configuration options (`model_config`)
+
+| Option | Effect |
+|---|---|
+| `tables.<name>.reuse` | `False` → `DataModelTableDuplicated` |
+| `tables.<name>.choice_transform` | `False` → keep XSD `choice` fields separate instead of type+value columns |
+| `tables.<name>.fields.<field>.transform` | `False` / `"elevate_wo_prefix"` etc. → override field-level simplification |
+| `row_numbers` | Add ordering column tracking original XML element position |
+| `metadata_columns` | Extra SQLAlchemy columns appended to the root table |
+| `record_hash_column_name` / `record_hash_constructor` / `record_hash_size` | Customise the deduplication hash column |
+| `as_columnstore` | MS SQL Server columnstore index on a table |
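
A minimal sketch of how the dotted option paths in the `model_config` table map onto a nested Python dict. The table name `orderItem`, the field name `price`, and the `get_option` helper are hypothetical and chosen only for illustration; the option keys are the ones listed in the table, and the values shown are assumptions rather than recommended settings.

```python
# Hypothetical table and field names; only the option keys come from the table.
model_config = {
    "row_numbers": True,  # assumed boolean flag, for illustration
    "tables": {
        "orderItem": {
            "reuse": False,  # would select DataModelTableDuplicated
            "choice_transform": False,  # keep XSD choice fields separate
            "fields": {
                "price": {"transform": False},  # no field-level simplification
            },
        },
    },
}


def get_option(config, dotted_path, default=None):
    """Resolve a dotted path such as 'tables.orderItem.reuse' in a nested dict.

    Illustrative helper only; it is not part of xml2db.
    """
    node = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node
```

With this layout, `get_option(model_config, "tables.orderItem.reuse")` resolves the same path that the table writes as `tables.<name>.reuse`. A dict of this shape is presumably what gets passed as `model_config` when the `DataModel` is built.
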
diff --git a/docs/getting_started.md b/docs/getting_started.md
@@ -124,20 +124,29 @@ troubleshooting if need be.
     Actual values need to be passed to [`DataModel.parse_xml`](api/data_model.md#xml2db.model.DataModel.parse_xml) for 
     each parsed documents, as a `dict`, using the `metadata` argument.
 
-!!! note
-    You can also load multiple documents at the same time to the database, which could make the process faster if you 
-    have a lot of small XML files to load:
+!!! note "Loading multiple XML files in one database operation"
+    By default, each `parse_xml` + `insert_into_target_tables` call is an independent database operation. When you have
+    many small XML files to load, you can instead accumulate all of them in memory first and insert them in a single
+    batch, which reduces the number of database round-trips.
+
+    Pass the previous document's `data` as the `flat_data` argument of the next `parse_xml` call to accumulate records:
+
     ``` py
-    data = None
+    flat_data = None
     for xml_file in files:
         document = data_model.parse_xml(
-            xml_file="path/to/file.xml",
-            flat_data=data,
+            xml_file=xml_file,
+            metadata={"input_file_path": xml_file},
+            flat_data=flat_data,
         )
-        data = document.data
+        flat_data = document.data
     document.insert_into_target_tables()
     ```
 
+    Note that each file can carry its own `metadata` values (e.g. the file name or a loading timestamp), which will be
+    stored per root record in the columns defined by
+    [`metadata_columns`](configuring.md#model-configuration).
+
 
 
 ## Getting back the data into XML
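
For reviewers, a toy sketch of the accumulator pattern that the note above documents. It does not use xml2db: `parse_file` is a hypothetical stand-in for `parse_xml`, and the `{"root": [...]}` layout is an assumption made for illustration, not the library's actual `document.data` structure.

```python
def parse_file(records, metadata, flat_data=None):
    """Toy stand-in for `parse_xml`: append one file's records to an accumulator."""
    # Start a fresh accumulator only when none is passed in.
    if flat_data is None:
        flat_data = {"root": []}
    for record in records:
        # Each root record carries the metadata of the file it came from.
        flat_data["root"].append({**record, **metadata})
    return flat_data


# Hypothetical inputs: file path -> already-extracted root records.
files = {
    "a.xml": [{"id": 1}],
    "b.xml": [{"id": 2}, {"id": 3}],
}

flat_data = None
for path, records in files.items():
    flat_data = parse_file(records, {"input_file_path": path}, flat_data)

# A single "insert" step would now see the records of every file.
total_records = len(flat_data["root"])
```

One thing the sketch makes visible: if `files` is empty, the loop body never runs. In the documented example that would leave `document` unbound, so the final `document.insert_into_target_tables()` call would raise a `NameError`; callers may want to guard against an empty input list.
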

diff --git a/src/xml2db/document.py b/src/xml2db/document.py
@@ -56,8 +56,10 @@ def parse_xml(
             skip_validation: Should we validate the document against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
         """
         self.xml_file_path = xml_file[:255] if isinstance(xml_file, str) else "<stream>"
 

diff --git a/src/xml2db/model.py b/src/xml2db/model.py
@@ -698,8 +698,10 @@ def parse_xml(
             skip_validation: Should we validate the documents against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`Document.insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
 
         Returns:
             A parsed [`Document`](document.md) object