fix: non-streaming event-channel overflow + baseline benchmark doc by willamhou · Pull Request #8 · willamhou/forge

willamhou · 2026-05-25T02:50:34Z

Summary

Two small things bundled (both surfaced during benchmarking):

Non-streaming 500 fix: stream:false + long generation (max_tokens≥~256) returned HTTP 500 event channel full, cancelling slow consumer. The engine try_sends Token events into a bounded 256-slot channel and drops the sequence if full. The non-streaming consumer accumulates tokens until Finish, so a burst faster than it's scheduled overflows 256. Fix: size the non-streaming channel to max_tokens + 16 (floor 256); streaming keeps 256 (SSE backpressure handles it). Verified max_tokens=256 now returns finish=length, 256 tokens, zero warnings.
Baseline benchmark doc: docs/benchmarks/2026-05-25-baseline-tinyllama-gb10.md — pre-CUDA-Graphs reference on GB10 (TinyLlama-1.1B F16, paged attention): single-stream 38.2 ms/tok (26.2 tok/s), concurrent 179/531/730 tok/s at C=4/16/32. Single-stream is launch-overhead-bound (~220 kernel launches/step) — the target for upcoming CUDA Graphs work.

Test plan

cargo build --release clean
Non-streaming max_tokens=256 → finish=length, 256 completion tokens, 0 channel-full warnings (previously 500)
Streaming path unchanged (still 256-slot channel)

🤖 Generated with Claude Code

…eline bench doc ## Non-streaming 500 fix A `stream:false` request with a long generation (observed at max_tokens=256) returned HTTP 500 `event channel full, cancelling slow consumer`. The engine emits Token events via `try_send` into a bounded 256-slot per-request channel and cancels the sequence if it's full (slow-consumer protection). The streaming consumer drains incrementally as it writes SSE, so 256 slack suffices; the non-streaming consumer accumulates tokens and only returns at Finish, so a burst of >256 tokens faster than the consumer task is scheduled overflows the channel. Fix: size the channel to `max_tokens + 16` (floor 256) for non-streaming requests; keep 256 for streaming (SSE backpressure handles it). A well-behaved non-streaming consumer can no longer be dropped mid-stream. Verified: max_tokens=256 non-streaming now returns finish=length with 256 completion tokens and zero channel-full warnings. ## Baseline benchmark doc docs/benchmarks/2026-05-25-baseline-tinyllama-gb10.md records the pre-CUDA-Graphs reference on GB10 (TinyLlama-1.1B, F16, paged attention): - single-stream decode 38.2 ms/token (26.2 tok/s) - concurrent throughput 179 / 531 / 730 tok/s at C = 4 / 16 / 32 Single-stream is launch-overhead-bound (~220 kernel launches/decode step), which is exactly what the upcoming CUDA Graphs end-to-end work targets. Doc notes the measurement method (differential to cancel prefill; wall-clock only, since urllib SSE line-buffering corrupts naive per-token timing) and this non-streaming bug (now fixed). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 421efd5143

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-25T02:54:03Z

+    let channel_cap = if is_stream {
+        256
+    } else {
+        (params.max_tokens + 16).max(256)


Clamp non-stream channel capacity before allocation

max_tokens is user-controlled (ChatCompletionRequest.max_tokens) and is used directly in (params.max_tokens + 16).max(256) to size mpsc::channel. For sufficiently large values, this can overflow usize or exceed Tokio’s maximum bounded-channel capacity, which makes mpsc::channel(...) panic instead of returning a normal API error. In practice, a single non-streaming request with an extreme max_tokens can force a handler panic/connection abort; please bound max_tokens (or channel_cap) and use checked arithmetic before constructing the channel.

Useful? React with 👍 / 👎.

chatgpt-codex-connector Bot reviewed May 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: non-streaming event-channel overflow + baseline benchmark doc#8

fix: non-streaming event-channel overflow + baseline benchmark doc#8
willamhou wants to merge 1 commit into
masterfrom
fix/nonstream-channel-cap

willamhou commented May 25, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

willamhou commented May 25, 2026

Summary

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant