
perf: async span export to prevent event loop blocking#156

Merged
sohil-kshirsagar merged 2 commits into main from drift-node-sdk-pg-pooling-debug
Apr 13, 2026

Conversation

@sohil-kshirsagar
Contributor

Make the span export pipeline non-blocking to prevent event loop stalls under load.

Problem

The TdSpanExporter.export() method runs synchronously on the event loop. Every 2 seconds, BatchSpanProcessor fires and processes up to 512 spans in a tight loop: JSON parsing, schema generation, hashing, and protobuf encoding per span, followed by synchronous filesystem writes. Under production load with moderate query result sizes, this blocks the event loop for 1-1.6 seconds per batch — long enough to cause connection pool timeouts, delayed callbacks, and cascading failures in database-heavy applications.

Measured impact

Benchmark: 10 concurrent HTTP requests, each running 20 pg queries returning 200 rows.

| Variant | Max event loop stall | Request latency |
| --- | --- | --- |
| Baseline | 1,609ms | 1,773ms |
| This PR | 22ms | 221ms |
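For reference, a "max event loop stall" figure like the column above can be captured with Node's built-in sampler (the PR's benchmark script may measure it differently):

```typescript
import { monitorEventLoopDelay } from "perf_hooks";

// Sample event loop delay every 10ms while the workload runs, then
// report the worst observed stall. Histogram values are nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

// Call after the workload finishes.
function maxStallMs(): number {
  histogram.disable();
  return histogram.max / 1e6;
}
```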

Changes

TdSpanExporter: Extract the export() body into an async _exportAsync() helper. Replace the synchronous .map() over transformSpanToCleanJSON with a chunked loop that yields to the event loop via setImmediate every 20 spans, capping continuous blocking at ~14ms regardless of batch size. Also remove a dead duplicate adapter-length check and clean up the resultCallback flow so errors from the async path propagate correctly.
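A minimal sketch of the chunked-yield pattern (transformChunked and CHUNK_SIZE are illustrative names, not the SDK's actual API):

```typescript
// Illustrative sketch of the chunked-yield pattern described above;
// transformChunked and CHUNK_SIZE are hypothetical names.
const CHUNK_SIZE = 20;

const yieldToEventLoop = (): Promise<void> =>
  new Promise((resolve) => setImmediate(resolve));

async function transformChunked<T, R>(
  items: T[],
  transform: (item: T) => R,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += CHUNK_SIZE) {
    // Transform one chunk synchronously (~14ms of work in the SDK's case)...
    for (const item of items.slice(i, i + CHUNK_SIZE)) {
      results.push(transform(item));
    }
    // ...then yield so pool callbacks, timers, and I/O can fire.
    await yieldToEventLoop();
  }
  return results;
}
```

setImmediate (rather than a resolved promise) is what makes the yield real: microtasks run before I/O callbacks, but setImmediate queues behind them.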

FilesystemSpanAdapter: Replace fs.appendFileSync (one blocking syscall per span) with fs/promises.appendFile. Additionally, batch writes by trace file — spans sharing a trace ID are concatenated and written in a single appendFile call instead of one per span. This reduces N syscalls to K (number of distinct traces in the batch, typically 1-5).

pg e2e tests: Add test endpoints that reproduce pool stress patterns (health-check-per-query, concurrent pool exhaustion, large result sets, event loop blocking measurement, behavioral correctness). Add export pipeline benchmark script.

Why not a worker thread?

The ideal long-term solution would move transformSpanToCleanJSON and the adapters to a worker_threads Worker for full event loop isolation. The serialization boundary is clean — ReadableSpan attributes are already JSON strings that transfer cheaply via structured clone. However, this requires handling Rust native binding loading in workers, adapter lifecycle management across thread boundaries, graceful shutdown coordination, and error recovery for worker crashes. The chunked async approach gives 99% of the benefit (1,609ms → 22ms max stall) with minimal complexity and risk.

Regression risks

  • Graceful shutdown: BatchSpanProcessor.shutdown() calls forceFlush() which waits for the current export. The async path calls resultCallback after completion, so graceful shutdown is safe. Only SIGKILL during an in-flight export could lose spans.
  • OTel tracing suppression: The suppressTracing context set by BatchSpanProcessor propagates through setImmediate via AsyncLocalStorage. Verified no recursive span creation during yields.
  • File write atomicity: appendFile writes are sequential within each exportSpans call, and _isExporting prevents concurrent exports. No interleaving.
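The suppressTracing point above rests on AsyncLocalStorage propagating context into setImmediate callbacks, which can be checked in isolation (illustrative store shape; OTel's Node context manager is built on the same mechanism):

```typescript
import { AsyncLocalStorage } from "async_hooks";

// AsyncLocalStorage context set before a setImmediate yield is still
// visible inside the callback, which is why the suppressTracing context
// survives the exporter's chunked yields. Store shape is illustrative.
const als = new AsyncLocalStorage<{ suppressTracing: boolean }>();

function contextSurvivesYield(): Promise<boolean> {
  return als.run({ suppressTracing: true }, () =>
    new Promise<boolean>((resolve) =>
      setImmediate(() => resolve(als.getStore()?.suppressTracing === true)),
    ),
  );
}
```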

Validation

pg e2e suite: 12/12 tests passed (record → replay round-trip, all span types).

The span export pipeline ran synchronously on the event loop,
blocking for 1-1.6s every 2s under load. This caused pg pool
connection timeouts for customers like Greenboard.

- Chunk transformSpanToCleanJSON across event loop ticks (setImmediate
  yields between batches of 20) so pool callbacks and timers can fire
- Replace fs.appendFileSync with async fs/promises.appendFile and
  batch writes by trace file (one write per file instead of per span)
- Remove dead duplicate adapter-length check
- Add pg e2e test endpoints reproducing Greenboard's pool patterns
- Add export pipeline benchmark script

Measured: max event loop stall drops from 1,609ms to 22ms.
E2E record+replay: 12/12 tests pass.

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 89e4868.

Comment thread on src/instrumentation/libraries/pg/e2e-tests/cjs-pg/run-export-bench.sh (outdated):
Keep only the core export pipeline fixes (TdSpanExporter, FilesystemSpanAdapter).
The benchmark scripts and Greenboard-pattern test endpoints were investigation
tools, not e2e tests — they aren't wired into test_requests.mjs and wouldn't
run during record/replay.
@sohil-kshirsagar sohil-kshirsagar marked this pull request as ready for review April 11, 2026 20:10
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 2 files

@tusk-dev
Contributor

tusk-dev bot commented Apr 11, 2026

Generated 15 tests - 15 passed


Test Summary

  • TdSpanExporter._exportAsync - 4 ✓
  • TdSpanExporter.export - 5 ✓
  • FilesystemSpanAdapter.exportSpans - 6 ✓

Results

Tusk's tests all pass and validate the core performance improvements in this PR. The TdSpanExporter tests confirm that the async chunked export logic handles large batches without blocking, properly filters spans by library and trace status, and correctly propagates errors from adapters. The FilesystemSpanAdapter tests verify the new batching behavior — spans sharing a traceId are now grouped into single writes instead of one syscall per span, and file paths are cached for subsequent exports. Together, these tests confirm the PR's critical path: the exporter can process 512+ spans with setImmediate yields capping blocking to ~14ms, and the adapter reduces N filesystem syscalls to K (distinct traces), which directly validates the measured 1,609ms → 22ms max event loop stall improvement.

Avg +91% line coverage gain across 2 files

| Source file | Line | Branch |
| --- | --- | --- |
| src/core/tracing/TdSpanExporter.ts | 89% (+89%) | 85% (+85%) |
| src/core/tracing/adapters/FilesystemSpanAdapter.ts | 94% (+94%) | 74% (+74%) |

Coverage is calculated by running tests directly associated with each source file.


| Commit | Unit Tests | Created (UTC) |
| --- | --- | --- |
| f49c178 | 15 ✓, 0 ✗ | Apr 11, 2026 8:10PM |

Was Tusk helpful? Give feedback by reacting with 👍 or 👎

@sohil-kshirsagar sohil-kshirsagar merged commit ae6054d into main Apr 13, 2026
19 checks passed
@sohil-kshirsagar sohil-kshirsagar deleted the drift-node-sdk-pg-pooling-debug branch April 13, 2026 18:34