fix: watchdog timeout for non-stream upstream + cleanup uses failed candidate status by stabey · Pull Request #5 · stabey/Aether-upstream-fork

stabey · 2026-05-20T10:46:41Z

Summary

fix(gateway): 客户端流式 + endpoint upstream_stream_policy=force_non_stream 时，stream candidate watchdog 改为优先用 timeouts.total_ms（非流上游不会提前给 first byte，原先按 first_byte_ms=300s abort 早于 request_timeout=599s）。从 report_context.upstream_is_stream 读取标志位决定优先级。
fix(data): cleanup_stale_pending_requests 不再无脑写 504/超时文案。会先查最近 failed/cancelled candidate 的 status_code + error_message，有就用上（默认 502）；只有真正卡在 pending/streaming 的请求才落 504 timeout 文案。三个 SQL backend (postgres/mysql/sqlite) 同步改造。

CI 状态

Rust CI run 26153684136 — 14/14 job 全绿，包括之前在 personal HEAD 挂的 Test (Gateway)（被 upstream/main 修了那俩 image 测试 bug）。

Test plan

cargo test -p aether-gateway --lib executor::candidate_loop::tests:: — 7/7 通过（含 3 个 watchdog 新测试）
cargo test -p aether-data --lib repository::usage:: — 121/121 通过（含新 sqlite cleanup 端到端测试）
cargo fmt --all -- --check ✓
Rust CI workspace（Clippy + Test + Data DB Smoke 全平台）✓

🤖 Generated with Claude Code

…ersonal

When the endpoint forces upstream_stream_policy=force_non_stream while the client streams, the local stream candidate watchdog still preferred timeouts.first_byte_ms — a non-stream upstream produces no early first byte, so the watchdog fired before the HTTP request_timeout and aborted otherwise-healthy attempts at ~300s. Read upstream_is_stream from report_context and invert the priority: non-stream upstreams use total_ms first, falling back to first_byte_ms and then the default; streaming upstreams keep the previous order. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The stale-pending cleanup task previously hardcoded status_code=504 and a generic timeout message for every usage row it finalized. When a request had already been observed as failing — e.g. upstream Connection reset by peer, watchdog 504, or an authenticated 4xx — the cleanup overwrote that context with a misleading "服务器超时" outcome and 504 status, hiding the real cause from the dashboards and customer. Pull the most recent failed/cancelled candidate per stale request_id and, if present, finalize the usage row with the candidate's status_code (defaulting to 502 when none was recorded) and error_message. Requests that have no terminal candidate (truly stuck pending/streaming) keep the existing 504 + timeout-message behavior, since they really are timeouts from the cleanup's perspective. Applied to all three SQL backends with parameterized UPDATE statements. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3cd533f3a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-20T10:50:08Z

+         finished_at DESC NULLS LAST,
+         started_at DESC NULLS LAST,


Use coalesced timestamp to select latest failed candidate

In the Postgres cleanup query, ordering by finished_at DESC NULLS LAST before started_at can pick an older candidate whenever the most recent failed/cancelled row has finished_at = NULL (the column is nullable), because any non-null finished_at is ranked ahead of a newer null row. This means stale cleanup may copy outdated status_code/error_message into usage instead of the latest failure context; MySQL/SQLite use COALESCE(finished_at, started_at, created_at) and do not have this mismatch.

Useful? React with 👍 / 👎.

stabey and others added 5 commits May 20, 2026 17:25

ci: add stabey dockerhub image workflow

311cd83

ci: auto sync master with upstream/main daily

48f282e

ci: trigger rust-ci on any branch push and dockerhub build on PR to p…

b772c83

…ersonal

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

stabey force-pushed the personal branch from 67c4691 to 2de2a79 Compare June 4, 2026 04:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: watchdog timeout for non-stream upstream + cleanup uses failed candidate status#5

fix: watchdog timeout for non-stream upstream + cleanup uses failed candidate status#5
stabey wants to merge 5 commits into
personalfrom
ci/test-watchdog-and-cleanup

stabey commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

stabey commented May 20, 2026

Summary

CI 状态

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant