fix: watchdog timeout for non-stream upstream + cleanup uses failed candidate status#5
fix: watchdog timeout for non-stream upstream + cleanup uses failed candidate status#5stabey wants to merge 5 commits into
Conversation
When the endpoint forces upstream_stream_policy=force_non_stream while the client streams, the local stream candidate watchdog still preferred timeouts.first_byte_ms — a non-stream upstream produces no early first byte, so the watchdog fired before the HTTP request_timeout and aborted otherwise-healthy attempts at ~300s. Read upstream_is_stream from report_context and invert the priority: non-stream upstreams use total_ms first, falling back to first_byte_ms and then the default; streaming upstreams keep the previous order. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The stale-pending cleanup task previously hardcoded status_code=504 and a generic timeout message for every usage row it finalized. When a request had already been observed as failing — e.g. upstream Connection reset by peer, watchdog 504, or an authenticated 4xx — the cleanup overwrote that context with a misleading "服务器超时" outcome and 504 status, hiding the real cause from the dashboards and customer. Pull the most recent failed/cancelled candidate per stale request_id and, if present, finalize the usage row with the candidate's status_code (defaulting to 502 when none was recorded) and error_message. Requests that have no terminal candidate (truly stuck pending/streaming) keep the existing 504 + timeout-message behavior, since they really are timeouts from the cleanup's perspective. Applied to all three SQL backends with parameterized UPDATE statements. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3cd533f3a6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| finished_at DESC NULLS LAST, | ||
| started_at DESC NULLS LAST, |
There was a problem hiding this comment.
Use coalesced timestamp to select latest failed candidate
In the Postgres cleanup query, ordering by finished_at DESC NULLS LAST before started_at can pick an older candidate whenever the most recent failed/cancelled row has finished_at = NULL (the column is nullable), because any non-null finished_at is ranked ahead of a newer null row. This means stale cleanup may copy outdated status_code/error_message into usage instead of the latest failure context; MySQL/SQLite use COALESCE(finished_at, started_at, created_at) and do not have this mismatch.
Useful? React with 👍 / 👎.
Summary
fix(gateway): 客户端流式 + endpointupstream_stream_policy=force_non_stream时,stream candidate watchdog 改为优先用timeouts.total_ms(非流上游不会提前给 first byte,原先按first_byte_ms=300sabort 早于request_timeout=599s)。从report_context.upstream_is_stream读取标志位决定优先级。fix(data):cleanup_stale_pending_requests不再无脑写 504/超时文案。会先查最近 failed/cancelled candidate 的status_code+error_message,有就用上(默认 502);只有真正卡在 pending/streaming 的请求才落 504 timeout 文案。三个 SQL backend (postgres/mysql/sqlite) 同步改造。CI 状态
Rust CI run 26153684136 — 14/14 job 全绿,包括之前在 personal HEAD 挂的
Test (Gateway)(被 upstream/main 修了那俩 image 测试 bug)。Test plan
cargo test -p aether-gateway --lib executor::candidate_loop::tests::— 7/7 通过(含 3 个 watchdog 新测试)cargo test -p aether-data --lib repository::usage::— 121/121 通过(含新 sqlite cleanup 端到端测试)cargo fmt --all -- --check✓🤖 Generated with Claude Code