Replace batch-and-wait DuckDB loading with producer-consumer pattern #18

Open
kevinschaper wants to merge 4 commits into main from issue-17-bulkload-improvement
@kevinschaper

Summary

  • Eliminates straggler effect: replaces fixed batch-of-workers processing (which waited for all workers in a batch before starting the next) with continuous task submission to a thread pool
  • Eliminates O(n^2) LIMIT/OFFSET pagination: replaces per-chunk LIMIT X OFFSET Y queries with a single query using fetchmany(), avoiding repeated table scans for large tables
  • Bounds memory by draining completed futures inline during the read loop

Fixes #17
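
As a rough illustration of the pattern described in the summary, the sketch below reads one query result with fetchmany() and hands each chunk to a thread pool, draining finished futures inside the read loop so that only a bounded number of chunks is held in memory. It is not the code from this PR: `iter_chunks`, `load_chunks`, `upload`, and `max_pending` are hypothetical names, blocking once `max_pending` chunks are in flight is just one way to bound memory, and the stdlib sqlite3 module stands in for DuckDB only because its cursor also offers fetchmany().

```python
import sqlite3
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def iter_chunks(cursor, chunk_size):
    """Producer: yield lists of rows from a single, already-executed query."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            return
        yield rows


def load_chunks(cursor, upload, chunk_size=1000, max_workers=4, max_pending=8):
    """Submit chunks continuously and drain finished futures inline.

    `upload` is a hypothetical callable that takes a list of rows and
    returns how many rows it handled.
    """
    total = 0
    pending = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for rows in iter_chunks(cursor, chunk_size):
            pending.add(pool.submit(upload, rows))
            # Assumed policy: once max_pending chunks are in flight, block
            # until at least one finishes, so memory stays bounded.
            if len(pending) >= max_pending:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                total += sum(f.result() for f in done)
        # Collect whatever is still running after the read loop ends.
        for future in pending:
            total += future.result()
    return total


# Demo against an in-memory SQLite table (a stand-in for the DuckDB source).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (id INTEGER)")
con.executemany("INSERT INTO docs VALUES (?)", [(i,) for i in range(2500)])
cur = con.execute("SELECT id FROM docs")
uploaded = load_chunks(cur, upload=len, chunk_size=1000)
```

Because a new chunk is submitted as soon as it is read, a slow chunk only occupies one worker instead of holding up a whole batch.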

kevinschaper and others added 4 commits March 6, 2026 14:38
Eliminates two performance bottlenecks in bulkload_duckdb():
- Straggler effect from processing chunks in fixed batches of workers
- O(n^2) LIMIT/OFFSET pagination that re-scans preceding rows

Now uses a single query with fetchmany() (producer) feeding a thread
pool (consumer), with inline future draining to bound memory usage.

Fixes #17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
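
To make the O(n^2) claim concrete, here is a small back-of-the-envelope calculation. It assumes, as the commit message describes, that each LIMIT/OFFSET query walks past all preceding rows before returning its chunk; the function names are hypothetical and the figures are illustrative, not measurements from this PR.

```python
def rows_scanned_with_offset(total_rows, chunk_size):
    """Rows touched if every chunk re-scans the rows before its OFFSET."""
    scanned = 0
    for offset in range(0, total_rows, chunk_size):
        # Skipped rows plus the rows actually returned for this chunk.
        scanned += min(offset + chunk_size, total_rows)
    return scanned


def rows_scanned_single_pass(total_rows):
    """Rows touched by one query consumed incrementally with fetchmany()."""
    return total_rows


paged = rows_scanned_with_offset(1_000_000, 10_000)
single = rows_scanned_single_pass(1_000_000)
```

Under that assumption, 1,000,000 rows in chunks of 10,000 means 100 queries touching 50,500,000 rows in total, about 50 times the single pass, and the gap widens as the table grows.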
Transient connection errors (e.g. Early EOF) during bulk indexing could
silently drop entire batches. Add transport-level retry via urllib3 Retry
on the HTTP adapter, plus application-level retry loops in both
upload_rows_to_solr() and bulkload_file() with exponential backoff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
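
The application-level retry described in this commit could look roughly like the following. This is a minimal sketch, not the actual upload_rows_to_solr() or bulkload_file() code: `retry_with_backoff`, its parameters, and the choice of ConnectionError as the retryable exception are assumptions made for illustration. The transport-level urllib3 Retry setup is not shown.

```python
import time


def retry_with_backoff(func, attempts=4, base_delay=1.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func(), retrying on transient errors with exponential backoff.

    Delays grow as base_delay * 2**attempt (1s, 2s, 4s, ... by default).
    The last failure is re-raised so a batch is never dropped silently.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; a caller would wrap the upload call, for example `retry_with_backoff(lambda: upload(batch))` with `upload` and `batch` being whatever the caller already has.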
Large batch uploads to Solr can stall long enough to hit the default
120s Jetty idle timeout, causing IOException during JSON parsing.
Bump to 300s to match the client-side request timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reserve roughly half of CPU cores for Solr's indexing/merge threads
instead of using cpu_count * 1.5. Caps at 8 workers instead of 12.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
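
One way to express the worker-count rule from this commit is sketched below. The exact formula in the PR is not shown here, so the integer halving and the floor of one worker are assumptions, and `choose_worker_count` is a hypothetical name.

```python
import os


def choose_worker_count(cpu_count=None, cap=8):
    """Use about half the cores for upload workers, capped at `cap`.

    The other half is left for Solr's indexing and merge threads.
    """
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(cap, cpus // 2))
```

With this rule a 4-core machine gets 2 workers and anything with 16 or more cores gets 8.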
Development

Successfully merging this pull request may close these issues.

Improve bulk loading throughput: fix batch processing and LIMIT/OFFSET pagination