Skip to content

fix: non-streaming event-channel overflow + baseline benchmark doc#8

Open
willamhou wants to merge 1 commit into
masterfrom
fix/nonstream-channel-cap
Open

fix: non-streaming event-channel overflow + baseline benchmark doc#8
willamhou wants to merge 1 commit into
masterfrom
fix/nonstream-channel-cap

Conversation

@willamhou

Copy link
Copy Markdown
Owner

Summary

Two small things bundled (both surfaced during benchmarking):

  1. Non-streaming 500 fix: stream:false + long generation (max_tokens≥~256) returned HTTP 500 event channel full, cancelling slow consumer. The engine try_sends Token events into a bounded 256-slot channel and drops the sequence if full. The non-streaming consumer accumulates tokens until Finish, so a burst faster than it's scheduled overflows 256. Fix: size the non-streaming channel to max_tokens + 16 (floor 256); streaming keeps 256 (SSE backpressure handles it). Verified max_tokens=256 now returns finish=length, 256 tokens, zero warnings.

  2. Baseline benchmark doc: docs/benchmarks/2026-05-25-baseline-tinyllama-gb10.md — pre-CUDA-Graphs reference on GB10 (TinyLlama-1.1B F16, paged attention): single-stream 38.2 ms/tok (26.2 tok/s), concurrent 179/531/730 tok/s at C=4/16/32. Single-stream is launch-overhead-bound (~220 kernel launches/step) — the target for upcoming CUDA Graphs work.

Test plan

  • cargo build --release clean
  • Non-streaming max_tokens=256 → finish=length, 256 completion tokens, 0 channel-full warnings (previously 500)
  • Streaming path unchanged (still 256-slot channel)

🤖 Generated with Claude Code

…eline bench doc

## Non-streaming 500 fix

A `stream:false` request with a long generation (observed at
max_tokens=256) returned HTTP 500 `event channel full, cancelling slow
consumer`. The engine emits Token events via `try_send` into a bounded
256-slot per-request channel and cancels the sequence if it's full
(slow-consumer protection). The streaming consumer drains incrementally
as it writes SSE, so 256 slack suffices; the non-streaming consumer
accumulates tokens and only returns at Finish, so a burst of >256 tokens
faster than the consumer task is scheduled overflows the channel.

Fix: size the channel to `max_tokens + 16` (floor 256) for non-streaming
requests; keep 256 for streaming (SSE backpressure handles it). A
well-behaved non-streaming consumer can no longer be dropped mid-stream.

Verified: max_tokens=256 non-streaming now returns finish=length with 256
completion tokens and zero channel-full warnings.

## Baseline benchmark doc

docs/benchmarks/2026-05-25-baseline-tinyllama-gb10.md records the
pre-CUDA-Graphs reference on GB10 (TinyLlama-1.1B, F16, paged attention):
- single-stream decode 38.2 ms/token (26.2 tok/s)
- concurrent throughput 179 / 531 / 730 tok/s at C = 4 / 16 / 32

Single-stream is launch-overhead-bound (~220 kernel launches/decode step),
which is exactly what the upcoming CUDA Graphs end-to-end work targets.
Doc notes the measurement method (differential to cancel prefill; wall-clock
only, since urllib SSE line-buffering corrupts naive per-token timing) and
this non-streaming bug (now fixed).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 421efd5143

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

let channel_cap = if is_stream {
256
} else {
(params.max_tokens + 16).max(256)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Clamp non-stream channel capacity before allocation

max_tokens is user-controlled (ChatCompletionRequest.max_tokens) and is used directly in (params.max_tokens + 16).max(256) to size mpsc::channel. For sufficiently large values, this can overflow usize or exceed Tokio’s maximum bounded-channel capacity, which makes mpsc::channel(...) panic instead of returning a normal API error. In practice, a single non-streaming request with an extreme max_tokens can force a handler panic/connection abort; please bound max_tokens (or channel_cap) and use checked arithmetic before constructing the channel.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant