fix: non-streaming event-channel overflow + baseline benchmark doc#8
fix: non-streaming event-channel overflow + baseline benchmark doc#8willamhou wants to merge 1 commit into
Conversation
…eline bench doc ## Non-streaming 500 fix A `stream:false` request with a long generation (observed at max_tokens=256) returned HTTP 500 `event channel full, cancelling slow consumer`. The engine emits Token events via `try_send` into a bounded 256-slot per-request channel and cancels the sequence if it's full (slow-consumer protection). The streaming consumer drains incrementally as it writes SSE, so 256 slack suffices; the non-streaming consumer accumulates tokens and only returns at Finish, so a burst of >256 tokens faster than the consumer task is scheduled overflows the channel. Fix: size the channel to `max_tokens + 16` (floor 256) for non-streaming requests; keep 256 for streaming (SSE backpressure handles it). A well-behaved non-streaming consumer can no longer be dropped mid-stream. Verified: max_tokens=256 non-streaming now returns finish=length with 256 completion tokens and zero channel-full warnings. ## Baseline benchmark doc docs/benchmarks/2026-05-25-baseline-tinyllama-gb10.md records the pre-CUDA-Graphs reference on GB10 (TinyLlama-1.1B, F16, paged attention): - single-stream decode 38.2 ms/token (26.2 tok/s) - concurrent throughput 179 / 531 / 730 tok/s at C = 4 / 16 / 32 Single-stream is launch-overhead-bound (~220 kernel launches/decode step), which is exactly what the upcoming CUDA Graphs end-to-end work targets. Doc notes the measurement method (differential to cancel prefill; wall-clock only, since urllib SSE line-buffering corrupts naive per-token timing) and this non-streaming bug (now fixed). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 421efd5143
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let channel_cap = if is_stream { | ||
| 256 | ||
| } else { | ||
| (params.max_tokens + 16).max(256) |
There was a problem hiding this comment.
Clamp non-stream channel capacity before allocation
max_tokens is user-controlled (ChatCompletionRequest.max_tokens) and is used directly in (params.max_tokens + 16).max(256) to size mpsc::channel. For sufficiently large values, this can overflow usize or exceed Tokio’s maximum bounded-channel capacity, which makes mpsc::channel(...) panic instead of returning a normal API error. In practice, a single non-streaming request with an extreme max_tokens can force a handler panic/connection abort; please bound max_tokens (or channel_cap) and use checked arithmetic before constructing the channel.
Useful? React with 👍 / 👎.
Summary
Two small things bundled (both surfaced during benchmarking):
Non-streaming 500 fix:
stream:false+ long generation (max_tokens≥~256) returned HTTP 500event channel full, cancelling slow consumer. The enginetry_sends Token events into a bounded 256-slot channel and drops the sequence if full. The non-streaming consumer accumulates tokens until Finish, so a burst faster than it's scheduled overflows 256. Fix: size the non-streaming channel tomax_tokens + 16(floor 256); streaming keeps 256 (SSE backpressure handles it). Verified max_tokens=256 now returns finish=length, 256 tokens, zero warnings.Baseline benchmark doc:
docs/benchmarks/2026-05-25-baseline-tinyllama-gb10.md— pre-CUDA-Graphs reference on GB10 (TinyLlama-1.1B F16, paged attention): single-stream 38.2 ms/tok (26.2 tok/s), concurrent 179/531/730 tok/s at C=4/16/32. Single-stream is launch-overhead-bound (~220 kernel launches/step) — the target for upcoming CUDA Graphs work.Test plan
cargo build --releaseclean🤖 Generated with Claude Code