
perf: async span export to prevent event loop blocking#156

Merged
sohil-kshirsagar merged 2 commits into main from drift-node-sdk-pg-pooling-debug
Apr 13, 2026

Conversation

@sohil-kshirsagar
Contributor

Make the span export pipeline non-blocking to prevent event loop stalls under load.

Problem

The TdSpanExporter.export() method runs synchronously on the event loop. Every 2 seconds, BatchSpanProcessor fires and processes up to 512 spans in a tight loop: JSON parsing, schema generation, hashing, and protobuf encoding per span, followed by synchronous filesystem writes. Under production load with moderate query result sizes, this blocks the event loop for 1-1.6 seconds per batch — long enough to cause connection pool timeouts, delayed callbacks, and cascading failures in database-heavy applications.

Measured impact

Benchmark: 10 concurrent HTTP requests, each running 20 pg queries returning 200 rows.

| Variant | Max event loop stall | Request latency |
| --- | --- | --- |
| Baseline | 1,609ms | 1,773ms |
| This PR | 22ms | 221ms |
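For reference, a "max event loop stall" figure like the column above can be captured with Node's built-in sampler (the PR's benchmark script may measure it differently):

```typescript
import { monitorEventLoopDelay } from "perf_hooks";

// Sample event loop delay every 10ms while the workload runs, then
// report the worst observed stall. Histogram values are nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

// Call after the workload finishes.
function maxStallMs(): number {
  histogram.disable();
  return histogram.max / 1e6;
}
```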

Changes

TdSpanExporter: Extract the export() body into an async _exportAsync() helper. Replace the synchronous .map() over transformSpanToCleanJSON with a chunked loop that yields to the event loop via setImmediate every 20 spans, capping continuous blocking at ~14ms regardless of batch size. Also remove a dead duplicate adapter-length check and clean up the resultCallback flow so errors from the async path propagate correctly.
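A minimal sketch of the chunked-yield pattern (transformChunked and CHUNK_SIZE are illustrative names, not the SDK's actual API):

```typescript
// Illustrative sketch of the chunked-yield pattern described above;
// transformChunked and CHUNK_SIZE are hypothetical names.
const CHUNK_SIZE = 20;

const yieldToEventLoop = (): Promise<void> =>
  new Promise((resolve) => setImmediate(resolve));

async function transformChunked<T, R>(
  items: T[],
  transform: (item: T) => R,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += CHUNK_SIZE) {
    // Transform one chunk synchronously (~14ms of work in the SDK's case)...
    for (const item of items.slice(i, i + CHUNK_SIZE)) {
      results.push(transform(item));
    }
    // ...then yield so pool callbacks, timers, and I/O can fire.
    await yieldToEventLoop();
  }
  return results;
}
```

setImmediate (rather than a resolved promise) is what makes the yield real: microtasks run before I/O callbacks, but setImmediate queues behind them.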

FilesystemSpanAdapter: Replace fs.appendFileSync (one blocking syscall per span) with fs/promises.appendFile. Additionally, batch writes by trace file — spans sharing a trace ID are concatenated and written in a single appendFile call instead of one per span. This reduces N syscalls to K (number of distinct traces in the batch, typically 1-5).

pg e2e tests: Add test endpoints that reproduce pool stress patterns (health-check-per-query, concurrent pool exhaustion, large result sets, event loop blocking measurement, behavioral correctness). Add export pipeline benchmark script.

Why not a worker thread?

The ideal long-term solution would move transformSpanToCleanJSON and the adapters to a worker_threads Worker for full event loop isolation. The serialization boundary is clean — ReadableSpan attributes are already JSON strings that transfer cheaply via structured clone. However, this requires handling Rust native binding loading in workers, adapter lifecycle management across thread boundaries, graceful shutdown coordination, and error recovery for worker crashes. The chunked async approach gives 99% of the benefit (1,609ms → 22ms max stall) with minimal complexity and risk.

Regression risks

  • Graceful shutdown: BatchSpanProcessor.shutdown() calls forceFlush() which waits for the current export. The async path calls resultCallback after completion, so graceful shutdown is safe. Only SIGKILL during an in-flight export could lose spans.
  • OTel tracing suppression: The suppressTracing context set by BatchSpanProcessor propagates through setImmediate via AsyncLocalStorage. Verified no recursive span creation during yields.
  • File write atomicity: appendFile writes are sequential within each exportSpans call, and _isExporting prevents concurrent exports. No interleaving.
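The suppressTracing point above rests on AsyncLocalStorage propagating context into setImmediate callbacks, which can be checked in isolation (illustrative store shape; OTel's Node context manager is built on the same mechanism):

```typescript
import { AsyncLocalStorage } from "async_hooks";

// AsyncLocalStorage context set before a setImmediate yield is still
// visible inside the callback, which is why the suppressTracing context
// survives the exporter's chunked yields. Store shape is illustrative.
const als = new AsyncLocalStorage<{ suppressTracing: boolean }>();

function contextSurvivesYield(): Promise<boolean> {
  return als.run({ suppressTracing: true }, () =>
    new Promise<boolean>((resolve) =>
      setImmediate(() => resolve(als.getStore()?.suppressTracing === true)),
    ),
  );
}
```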

Validation

pg e2e suite: 12/12 tests passed (record → replay round-trip, all span types).

The span export pipeline ran synchronously on the event loop,
blocking for 1-1.6s every 2s under load. This caused pg pool
connection timeouts for customers like Greenboard.

- Chunk transformSpanToCleanJSON across event loop ticks (setImmediate
  yields between batches of 20) so pool callbacks and timers can fire
- Replace fs.appendFileSync with async fs/promises.appendFile and
  batch writes by trace file (one write per file instead of per span)
- Remove dead duplicate adapter-length check
- Add pg e2e test endpoints reproducing Greenboard's pool patterns
- Add export pipeline benchmark script

Measured: max event loop stall drops from 1,609ms to 22ms.
E2E record+replay: 12/12 tests pass.

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 89e4868.

Comment thread on src/instrumentation/libraries/pg/e2e-tests/cjs-pg/run-export-bench.sh (outdated):
Keep only the core export pipeline fixes (TdSpanExporter, FilesystemSpanAdapter).
The benchmark scripts and Greenboard-pattern test endpoints were investigation
tools, not e2e tests — they aren't wired into test_requests.mjs and wouldn't
run during record/replay.
@sohil-kshirsagar sohil-kshirsagar marked this pull request as ready for review April 11, 2026 20:10
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 2 files

@tusk-dev
Contributor

tusk-dev bot commented Apr 11, 2026

Generated 15 tests - 15 passed


Test Summary

  • TdSpanExporter._exportAsync - 4 ✓
  • TdSpanExporter.export - 5 ✓
  • FilesystemSpanAdapter.exportSpans - 6 ✓

Results

Tusk's tests all pass and validate the core performance improvements in this PR. The TdSpanExporter tests confirm that the async chunked export logic handles large batches without blocking, properly filters spans by library and trace status, and correctly propagates errors from adapters. The FilesystemSpanAdapter tests verify the new batching behavior — spans sharing a traceId are now grouped into single writes instead of one syscall per span, and file paths are cached for subsequent exports. Together, these tests confirm the PR's critical path: the exporter can process 512+ spans with setImmediate yields capping blocking to ~14ms, and the adapter reduces N filesystem syscalls to K (distinct traces), which directly validates the measured 1,609ms → 22ms max event loop stall improvement.

Avg +91% line coverage gain across 2 files

| Source file | Line | Branch |
| --- | --- | --- |
| src/core/tracing/TdSpanExporter.ts | 89% (+89%) | 85% (+85%) |
| src/core/tracing/adapters/FilesystemSpanAdapter.ts | 94% (+94%) | 74% (+74%) |

Coverage is calculated by running tests directly associated with each source file.


| Commit | Unit Tests | Created (UTC) |
| --- | --- | --- |
| f49c178 | 15 ✓, 0 ✗ | Apr 11, 2026 8:10PM |

Was Tusk helpful? Give feedback by reacting with 👍 or 👎

@sohil-kshirsagar sohil-kshirsagar merged commit ae6054d into main Apr 13, 2026
19 checks passed
@sohil-kshirsagar sohil-kshirsagar deleted the drift-node-sdk-pg-pooling-debug branch April 13, 2026 18:34