Add bulk_insert dialect hook with DuckDB CSV implementation by martinv13 · Pull Request #57 · cre-dev/xml2db

martinv13 · 2026-05-28T22:18:14Z

Introduces DatabaseDialect.bulk_insert(conn, table, records) as the single insertion point for temp-table loading. The base implementation falls back to SQLAlchemy executemany (no behaviour change for PostgreSQL, MySQL, MSSQL). DuckDBDialect overrides it with a write-to-tempfile / read_csv approach that is significantly faster for large payloads:

- Records are serialised to a NamedTemporaryFile CSV (stdlib csv, no extra dependencies).
- read_csv is called with all_varchar=true; each column is then explicitly CAST to its target DuckDB type (BIGINT, DOUBLE, TIMESTAMPTZ, BOOLEAN, …) in the SELECT clause, avoiding auto_detect type mis-identification.
- LargeBinary (record-hash) columns are hex-encoded in the CSV and decoded with unhex() in SQL.
- SQLAlchemy Python-side scalar defaults (e.g. default=False on temp_exists) are materialised manually before writing the CSV, matching the behaviour of executemany.
- The temp file is deleted in a finally block even when an error occurs.
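
The steps listed above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the code from this PR: the function names (`write_csv`, `build_select_sql`, `bulk_insert_via_csv`), the `execute` callback, and the plain `column_types` dict are all hypothetical, and it assumes binary columns are marked with the DuckDB type `BLOB`.

```python
import csv
import os
import tempfile


def write_csv(records, columns, defaults):
    """Serialise records to a temporary CSV file and return its path."""
    tmp = tempfile.NamedTemporaryFile(
        "w", suffix=".csv", newline="", encoding="utf-8", delete=False
    )
    with tmp:
        writer = csv.writer(tmp)
        writer.writerow(columns)  # header row
        for record in records:
            row = []
            for col in columns:
                # Materialise Python-side scalar defaults for missing keys.
                value = record.get(col, defaults.get(col))
                if isinstance(value, (bytes, bytearray)):
                    value = bytes(value).hex()  # binary -> hex text
                row.append(value)
            writer.writerow(row)
    return tmp.name


def build_select_sql(path, column_types):
    """Build a SELECT that reads every column as text and casts it explicitly."""
    exprs = []
    for name, duck_type in column_types.items():
        if duck_type == "BLOB":
            exprs.append(f'unhex("{name}") AS "{name}"')
        else:
            exprs.append(f'CAST("{name}" AS {duck_type}) AS "{name}"')
    escaped = path.replace("'", "''")  # escape quotes in the file path
    return (
        f"SELECT {', '.join(exprs)} "
        f"FROM read_csv('{escaped}', header=true, all_varchar=true)"
    )


def bulk_insert_via_csv(execute, table_name, records, column_types, defaults=None):
    """Write records to CSV, run one INSERT ... SELECT, always delete the file."""
    if not records:
        return None  # empty input is a no-op
    path = write_csv(records, list(column_types), defaults or {})
    try:
        cols = ", ".join(f'"{name}"' for name in column_types)
        sql = f'INSERT INTO "{table_name}" ({cols}) ' + build_select_sql(
            path, column_types
        )
        execute(sql)  # hypothetical callback, e.g. a DuckDB connection's execute
        return sql
    finally:
        os.remove(path)  # runs even if execute() raises
```

Passing `execute` as a callback keeps the sketch independent of any particular driver; in a real dialect the statement would presumably be run on the connection handed to `bulk_insert`.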

document.py: insert_into_temp_tables now calls dialect.bulk_insert(conn, query.table, records) instead of conn.execute(query, records) directly.
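
The hook and its call site might look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the free-function form of `insert_into_temp_tables` and its argument list are hypothetical, the base class is assumed to build the statement with `table.insert()`, and the empty-records early return is assumed to live in the base class.

```python
class DatabaseDialect:
    """Base dialect: generic fallback shared by backends without an override."""

    def bulk_insert(self, conn, table, records):
        if not records:
            return  # nothing to insert (assumed no-op)
        # Passing a list of dicts makes SQLAlchemy run the INSERT as executemany.
        conn.execute(table.insert(), records)


def insert_into_temp_tables(conn, dialect, query, records):
    """Call site: delegate to the dialect instead of executing the query directly."""
    dialect.bulk_insert(conn, query.table, records)
```

A backend-specific dialect would then subclass `DatabaseDialect` and override `bulk_insert`, leaving the call site unchanged.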

Tests: new tests/test_bulk_insert.py covers base-class fallback, numeric types (incl. BigInteger/SmallInteger subclass ordering), boolean, datetime, binary, scalar defaults, and empty-records no-op.
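
The subclass-ordering case matters because, in SQLAlchemy, BigInteger and SmallInteger both subclass Integer, so an isinstance check against Integer also matches them. The sketch below shows why the more specific types have to be tested first. It uses stand-in classes that only mirror that hierarchy so it runs without SQLAlchemy; the `duckdb_type` helper, the mapping table, and the VARCHAR fallback are hypothetical.

```python
# Stand-ins mirroring the SQLAlchemy hierarchy (not the real classes).
class Integer:
    pass


class BigInteger(Integer):
    pass


class SmallInteger(Integer):
    pass


class Float:
    pass


# Order matters: subclasses must come before their base class.
_TYPE_MAP = [
    (BigInteger, "BIGINT"),
    (SmallInteger, "SMALLINT"),
    (Integer, "INTEGER"),
    (Float, "DOUBLE"),
]


def duckdb_type(column_type):
    """Return the DuckDB type name for the first matching entry."""
    for cls, name in _TYPE_MAP:
        if isinstance(column_type, cls):
            return name
    return "VARCHAR"  # assumed fallback for unmapped types
```

If the `Integer` entry were listed first, a `BigInteger` column would be cast to INTEGER and large values could overflow.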


cre-os merged commit 43ef530 into cre-dev:main May 29, 2026
9 checks passed

martinv13 mentioned this pull request May 29, 2026

Add bulk load methods or documentation #51

Open

cre-os merged 1 commit into cre-dev:main from martinv13:claude/sync-main-branch-tFAIj
