Skip to content

fix: watchdog timeout for non-stream upstream + cleanup uses failed candidate status#5

Open
stabey wants to merge 5 commits into
personalfrom
ci/test-watchdog-and-cleanup
Open

fix: watchdog timeout for non-stream upstream + cleanup uses failed candidate status#5
stabey wants to merge 5 commits into
personalfrom
ci/test-watchdog-and-cleanup

Conversation

@stabey
Copy link
Copy Markdown
Owner

@stabey stabey commented May 20, 2026

Summary

  • fix(gateway): 客户端流式 + endpoint upstream_stream_policy=force_non_stream 时,stream candidate watchdog 改为优先用 timeouts.total_ms(非流上游不会提前给 first byte,原先按 first_byte_ms=300s abort 早于 request_timeout=599s)。从 report_context.upstream_is_stream 读取标志位决定优先级。
  • fix(data): cleanup_stale_pending_requests 不再无脑写 504/超时文案。会先查最近 failed/cancelled candidate 的 status_code + error_message,有就用上(默认 502);只有真正卡在 pending/streaming 的请求才落 504 timeout 文案。三个 SQL backend (postgres/mysql/sqlite) 同步改造。

CI 状态

Rust CI run 26153684136 — 14/14 job 全绿,包括之前在 personal HEAD 挂的 Test (Gateway)(被 upstream/main 修了那俩 image 测试 bug)。

Test plan

  • cargo test -p aether-gateway --lib executor::candidate_loop::tests:: — 7/7 通过(含 3 个 watchdog 新测试)
  • cargo test -p aether-data --lib repository::usage:: — 121/121 通过(含新 sqlite cleanup 端到端测试)
  • cargo fmt --all -- --check
  • Rust CI workspace(Clippy + Test + Data DB Smoke 全平台)✓

🤖 Generated with Claude Code

stabey and others added 5 commits May 20, 2026 17:25
When the endpoint forces upstream_stream_policy=force_non_stream while
the client streams, the local stream candidate watchdog still preferred
timeouts.first_byte_ms — a non-stream upstream produces no early first
byte, so the watchdog fired before the HTTP request_timeout and aborted
otherwise-healthy attempts at ~300s.

Read upstream_is_stream from report_context and invert the priority:
non-stream upstreams use total_ms first, falling back to first_byte_ms
and then the default; streaming upstreams keep the previous order.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The stale-pending cleanup task previously hardcoded status_code=504 and a
generic timeout message for every usage row it finalized. When a request
had already been observed as failing — e.g. upstream Connection reset by
peer, watchdog 504, or an authenticated 4xx — the cleanup overwrote that
context with a misleading "服务器超时" outcome and 504 status, hiding the
real cause from the dashboards and customer.

Pull the most recent failed/cancelled candidate per stale request_id and,
if present, finalize the usage row with the candidate's status_code
(defaulting to 502 when none was recorded) and error_message. Requests
that have no terminal candidate (truly stuck pending/streaming) keep the
existing 504 + timeout-message behavior, since they really are timeouts
from the cleanup's perspective. Applied to all three SQL backends with
parameterized UPDATE statements.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3cd533f3a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1674 to +1675
finished_at DESC NULLS LAST,
started_at DESC NULLS LAST,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use coalesced timestamp to select latest failed candidate

In the Postgres cleanup query, ordering by finished_at DESC NULLS LAST before started_at can pick an older candidate whenever the most recent failed/cancelled row has finished_at = NULL (the column is nullable), because any non-null finished_at is ranked ahead of a newer null row. This means stale cleanup may copy outdated status_code/error_message into usage instead of the latest failure context; MySQL/SQLite use COALESCE(finished_at, started_at, created_at) and do not have this mismatch.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant