perf: async span export to prevent event loop blocking #156

sohil-kshirsagar merged 2 commits into main from
Conversation
The span export pipeline ran synchronously on the event loop, blocking for 1-1.6s every 2s under load. This caused pg pool connection timeouts for customers like Greenboard.

- Chunk `transformSpanToCleanJSON` across event loop ticks (`setImmediate` yields between batches of 20) so pool callbacks and timers can fire
- Replace `fs.appendFileSync` with async `fs/promises.appendFile` and batch writes by trace file (one write per file instead of per span)
- Remove dead duplicate adapter-length check
- Add pg e2e test endpoints reproducing Greenboard's pool patterns
- Add export pipeline benchmark script

Measured: max event loop stall drops from 1,609ms to 22ms. E2E record+replay: 12/12 tests pass.
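The "batch writes by trace file" change can be sketched as follows. This is a minimal, assumed shape (the real `FilesystemSpanAdapter` and span records carry much more); `SpanRecord` and `writeSpansBatched` are hypothetical names used only for illustration.

```typescript
import { appendFile } from "node:fs/promises";

// Hypothetical minimal span record; the real spans are richer.
interface SpanRecord {
  traceId: string;
  json: string; // already-serialized span line
}

// Group spans by trace ID and issue one appendFile per trace file,
// so a batch of N spans costs K writes (K = distinct traces, typically
// 1-5) instead of N blocking appendFileSync calls.
async function writeSpansBatched(dir: string, spans: SpanRecord[]): Promise<void> {
  const byTrace = new Map<string, string[]>();
  for (const span of spans) {
    const lines = byTrace.get(span.traceId) ?? [];
    lines.push(span.json);
    byTrace.set(span.traceId, lines);
  }
  for (const [traceId, lines] of byTrace) {
    // Single write per trace file for this export batch.
    await appendFile(`${dir}/${traceId}.jsonl`, lines.join("\n") + "\n");
  }
}
```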
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 89e4868.
Keep only the core export pipeline fixes (TdSpanExporter, FilesystemSpanAdapter). The benchmark scripts and Greenboard-pattern test endpoints were investigation tools, not e2e tests — they aren't wired into test_requests.mjs and wouldn't run during record/replay.
Tusk Test Summary: generated 15 tests - 15 passed. Tusk's tests all pass and validate the core performance improvements in this PR. Avg +91% line coverage gain across 2 files.



Make the span export pipeline non-blocking to prevent event loop stalls under load.
Problem

The `TdSpanExporter.export()` method runs synchronously on the event loop. Every 2 seconds, `BatchSpanProcessor` fires and processes up to 512 spans in a tight loop: JSON parsing, schema generation, hashing, and protobuf encoding per span, followed by synchronous filesystem writes. Under production load with moderate query result sizes, this blocks the event loop for 1-1.6 seconds per batch, long enough to cause connection pool timeouts, delayed callbacks, and cascading failures in database-heavy applications.

Measured impact
Benchmark: 10 concurrent HTTP requests, each running 20 pg queries returning 200 rows.
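The stall numbers above can be reproduced with a simple event-loop probe. This is an assumed harness, not the benchmark script from this PR: a timer fires at a fixed interval, and any delay beyond that interval is recorded as a stall.

```typescript
// Event-loop stall probe (illustrative sketch): schedule a timer every
// intervalMs and treat any extra delay beyond the interval as a stall.
function measureMaxStall(durationMs: number, intervalMs = 10): Promise<number> {
  return new Promise((resolve) => {
    let maxStall = 0;
    let last = process.hrtime.bigint();
    const timer = setInterval(() => {
      const now = process.hrtime.bigint();
      const elapsedMs = Number(now - last) / 1e6;
      maxStall = Math.max(maxStall, elapsedMs - intervalMs);
      last = now;
    }, intervalMs);
    setTimeout(() => {
      clearInterval(timer);
      resolve(maxStall);
    }, durationMs);
  });
}

// Demo: deliberately block the loop for ~100ms and observe the stall.
async function demo(): Promise<number> {
  const stallPromise = measureMaxStall(300);
  setTimeout(() => {
    const end = Date.now() + 100;
    while (Date.now() < end) {} // synchronous busy-wait, like a big export batch
  }, 50);
  return stallPromise;
}
```

Running such a probe alongside the export pipeline is how a "max event loop stall" figure like 1,609ms vs. 22ms can be observed.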
Changes
- `TdSpanExporter`: Extract the `export()` body into an async `_exportAsync()` helper. Replace the synchronous `.map()` over `transformSpanToCleanJSON` with a chunked loop that yields the event loop via `setImmediate` every 20 spans. This caps continuous blocking to ~14ms regardless of batch size. Also removed a dead duplicate adapter-length check and cleaned up the `resultCallback` flow to properly propagate errors from the async path.
- `FilesystemSpanAdapter`: Replace `fs.appendFileSync` (one blocking syscall per span) with `fs/promises.appendFile`. Additionally, batch writes by trace file: spans sharing a trace ID are concatenated and written in a single `appendFile` call instead of one per span. This reduces N syscalls to K (the number of distinct traces in the batch, typically 1-5).
- pg e2e tests: Add test endpoints that reproduce pool stress patterns (health-check-per-query, concurrent pool exhaustion, large result sets, event loop blocking measurement, behavioral correctness). Add an export pipeline benchmark script.
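The chunked-yield pattern described for `TdSpanExporter` can be sketched generically. The helper names here (`transformChunked`, `yieldToEventLoop`) are illustrative, not the PR's actual identifiers; the transform callback stands in for `transformSpanToCleanJSON`.

```typescript
const CHUNK_SIZE = 20; // matches the per-yield batch size described above

// Resolve on the next event loop iteration so queued I/O callbacks
// (pg pool checkouts, timers) can run between chunks.
const yieldToEventLoop = (): Promise<void> =>
  new Promise((resolve) => setImmediate(resolve));

// Process items in chunks of CHUNK_SIZE, yielding the event loop
// between chunks, so continuous blocking is bounded by one chunk's
// cost regardless of total batch size (up to 512 spans per export).
async function transformChunked<T, R>(
  items: T[],
  transform: (item: T) => R,
): Promise<R[]> {
  const out: R[] = [];
  for (let i = 0; i < items.length; i += CHUNK_SIZE) {
    for (const item of items.slice(i, i + CHUNK_SIZE)) {
      out.push(transform(item));
    }
    if (i + CHUNK_SIZE < items.length) await yieldToEventLoop();
  }
  return out;
}
```

A 512-span batch thus yields roughly 25 times per export instead of holding the loop for the whole transform.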
Why not a worker thread?
The ideal long-term solution would move `transformSpanToCleanJSON` and the adapters to a `worker_threads` Worker for full event loop isolation. The serialization boundary is clean: `ReadableSpan` attributes are already JSON strings that transfer cheaply via structured clone. However, this requires handling Rust native binding loading in workers, adapter lifecycle management across thread boundaries, graceful shutdown coordination, and error recovery for worker crashes. The chunked async approach gives 99% of the benefit (1,609ms → 22ms max stall) with minimal complexity and risk.

Regression risks
- `BatchSpanProcessor.shutdown()` calls `forceFlush()`, which waits for the current export. The async path calls `resultCallback` after completion, so graceful shutdown is safe. Only `SIGKILL` during an in-flight export could lose spans.
- The `suppressTracing` context set by `BatchSpanProcessor` propagates through `setImmediate` via `AsyncLocalStorage`. Verified no recursive span creation during yields.
- `appendFile` writes are sequential within each `exportSpans` call, and `_isExporting` prevents concurrent exports. No interleaving.

Validation
pg e2e suite: 12/12 tests passed (record → replay round-trip, all span types).
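The `suppressTracing` regression risk noted above rests on `AsyncLocalStorage` context surviving a `setImmediate` yield. A minimal standalone check of that Node behavior (hypothetical names; this is not the PR's test code, and the real suppression flag lives in OpenTelemetry's context API):

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Store a suppression flag, as BatchSpanProcessor does via the
// OpenTelemetry context (backed by AsyncLocalStorage in Node).
const store = new AsyncLocalStorage<{ suppressTracing: boolean }>();

function isSuppressed(): boolean {
  return store.getStore()?.suppressTracing === true;
}

// Sample the flag before and after a setImmediate yield, mirroring the
// chunked export loop's yields.
async function exportWithYield(): Promise<boolean[]> {
  const seen: boolean[] = [];
  seen.push(isSuppressed()); // before yielding
  await new Promise<void>((resolve) => setImmediate(resolve));
  seen.push(isSuppressed()); // after the yield: context still present
  return seen;
}

function run(): Promise<boolean[]> {
  return store.run({ suppressTracing: true }, exportWithYield);
}
```

Both samples come back `true`: the context is not lost across ticks, so spans created by the export's own I/O stay suppressed.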