
Add bulk_insert dialect hook with DuckDB CSV implementation #57

Merged: cre-os merged 1 commit into cre-dev:main from martinv13:claude/sync-main-branch-tFAIj on May 29, 2026
Conversation

@martinv13
Collaborator

Introduces DatabaseDialect.bulk_insert(conn, table, records) as the single insertion point for temp-table loading. The base implementation falls back to SQLAlchemy executemany (no behaviour change for PostgreSQL, MySQL, MSSQL). DuckDBDialect overrides it with a write-to-tempfile / read_csv approach that is significantly faster for large payloads:

  • Records are serialised to a NamedTemporaryFile CSV (stdlib csv, no extra dependencies).
  • read_csv is called with all_varchar=true; each column is then explicitly CAST to its target DuckDB type (BIGINT, DOUBLE, TIMESTAMPTZ, BOOLEAN, …) in the SELECT clause, avoiding auto_detect type mis-identification.
  • LargeBinary (record-hash) columns are hex-encoded in the CSV and decoded with unhex() in SQL.
  • SQLAlchemy Python-side scalar defaults (e.g. default=False on temp_exists) are materialised manually before writing the CSV, matching the behaviour of executemany.
  • The temp file is deleted in a finally block even when an error occurs.
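
For reviewers who want a concrete picture of the steps above, here is a minimal stdlib-only sketch. It is not the code in this PR: the helper names (write_records_csv, build_insert_sql, bulk_insert_via_csv), the execute callable, and the column-to-type dict are hypothetical, and the sketch only builds the SQL string rather than talking to DuckDB. It assumes read_csv is called with header=true and that "BLOB" marks a hex-encoded binary column.

```python
import csv
import os
import tempfile


def write_records_csv(records, columns, defaults=None):
    """Write records to a temporary CSV file and return its path."""
    defaults = defaults or {}
    tmp = tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", newline="", delete=False, encoding="utf-8"
    )
    with tmp:
        writer = csv.writer(tmp)
        writer.writerow(columns)
        for record in records:
            row = []
            for name in columns:
                # Materialise Python-side scalar defaults for missing keys.
                value = record.get(name, defaults.get(name))
                if isinstance(value, (bytes, bytearray)):
                    value = bytes(value).hex()  # decoded later with unhex()
                row.append("" if value is None else value)
            writer.writerow(row)
    return tmp.name


def build_insert_sql(table_name, column_types, csv_path):
    """Build INSERT ... SELECT with one explicit cast per column."""
    select_items = []
    for name, duck_type in column_types.items():
        if duck_type == "BLOB":
            select_items.append(f'unhex("{name}") AS "{name}"')
        else:
            select_items.append(f'CAST("{name}" AS {duck_type}) AS "{name}"')
    column_list = ", ".join(f'"{name}"' for name in column_types)
    escaped_path = csv_path.replace("'", "''")
    return (
        f'INSERT INTO "{table_name}" ({column_list}) '
        f"SELECT {', '.join(select_items)} "
        f"FROM read_csv('{escaped_path}', header=true, all_varchar=true)"
    )


def bulk_insert_via_csv(execute, table_name, column_types, records, defaults=None):
    """Load records through a temp CSV; the file is removed even on error."""
    if not records:
        return None  # empty payload: nothing to write or execute
    path = write_records_csv(records, list(column_types), defaults)
    try:
        execute(build_insert_sql(table_name, column_types, path))
    finally:
        os.remove(path)
    return path
```

In a real dialect, execute would presumably wrap the connection's raw SQL execution; it is kept as a plain callable here so the sketch runs without DuckDB installed.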

document.py: insert_into_temp_tables now calls
dialect.bulk_insert(conn, query.table, records) instead of conn.execute(query, records) directly.
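
A rough sketch of the call-site change, to show how the hook centralises insertion. The signature of insert_into_temp_tables is hypothetical (the PR only names the function), and the empty-records guard in the base class is an assumption rather than something the description states. The base method relies only on table.insert() and conn.execute(), which is the usual SQLAlchemy executemany pattern.

```python
class DatabaseDialect:
    """Base dialect: generic fallback used by PostgreSQL, MySQL, MSSQL."""

    def bulk_insert(self, conn, table, records):
        if not records:
            return  # assumption: an empty payload is treated as a no-op
        # Passing a list of dicts makes SQLAlchemy use executemany.
        conn.execute(table.insert(), records)


def insert_into_temp_tables(dialect, conn, query, records):
    """Hypothetical call site: delegate to the dialect instead of
    calling conn.execute(query, records) directly."""
    dialect.bulk_insert(conn, query.table, records)
```

A dialect with a faster load path, such as DuckDBDialect in this PR, overrides bulk_insert and the call site stays unchanged.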

Tests: new tests/test_bulk_insert.py covers base-class fallback, numeric types (incl. BigInteger/SmallInteger subclass ordering), boolean, datetime, binary, scalar defaults, and empty-records no-op.
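
On the subclass-ordering point: in SQLAlchemy, BigInteger and SmallInteger are subclasses of Integer, so an isinstance-based type mapping has to check the more specific classes first. The sketch below illustrates that with stand-in classes that mirror the hierarchy, so it runs without SQLAlchemy. The rule table, the function name duckdb_type_for, and the VARCHAR fallback are assumptions, not the mapping used in this PR.

```python
# Stand-ins mirroring the SQLAlchemy type hierarchy (illustration only).
class Integer: ...
class BigInteger(Integer): ...
class SmallInteger(Integer): ...
class Float: ...
class Boolean: ...


# Order matters: subclasses must come before their parent class,
# otherwise BigInteger would match the Integer rule first.
_TYPE_RULES = [
    (BigInteger, "BIGINT"),
    (SmallInteger, "SMALLINT"),
    (Integer, "INTEGER"),
    (Float, "DOUBLE"),
    (Boolean, "BOOLEAN"),
]


def duckdb_type_for(column_type):
    """Return the DuckDB type name to CAST to for a column type instance."""
    for type_class, duckdb_name in _TYPE_RULES:
        if isinstance(column_type, type_class):
            return duckdb_name
    return "VARCHAR"  # assumed fallback for unmapped types
```

If the Integer rule were listed first, every BigInteger column would be cast to INTEGER, which could overflow on large values.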

@cre-os cre-os merged commit 43ef530 into cre-dev:main May 29, 2026
9 checks passed

3 participants