fix(proxy/logs): fake-200 detection + status code inference + endpoint_pool_exhausted/404 error-cause display #765
tesgth032 wants to merge 25 commits into ding113:dev from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough: Adds bounded byte reading and decoding (32 KiB cap) for non-streaming upstream responses, introduces HTML document sniffing and BOM handling, maps strong fake-200 signals (such as FAKE_200_HTML_BODY / FAKE_200_JSON_*) to failures in the forwarder so that failover is triggered, and extends the related unit tests and test stubs.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: 4 passed, 1 failed (warning)
Summary of Changes: Hello @tesgth032, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a key reliability issue in the proxy service: an upstream can return an HTTP 200 status code while the body is actually an HTML error page or an empty response, causing the proxy layer to misclassify the request as a success. By introducing smart HTML error-page detection, enhancing the unified error-recognition utilities, and improving the handling of empty responses, this change ensures the proxy accurately identifies these "fake 200" cases and promptly triggers failover and circuit breaking, significantly improving system stability and user experience.
Code Review
This pull request introduces a mechanism to detect "fake 200" responses that are actually HTML error pages, which is a significant improvement to the proxy's reliability and failover logic. The implementation is solid, including memory-safe response-body sniffing and comprehensive unit tests. I also noticed a fix that prevents EmptyResponseError from being swallowed, which is an important correction to the error handling.
I have one suggestion in src/lib/utils/upstream-error-detection.ts to make the HTML detection logic more robust against potential false positives. Overall, this is a high-quality contribution.
Code Review Summary
No significant issues identified in this PR. The implementation is well-structured and conservative: HTML document detection uses strong signals only (doctype/html tags), the memory-safe body inspection has a sensible 1 MiB cap, and the EmptyResponseError re-throw fix correctly addresses a real bug where missing_content errors were being silently swallowed by the JSON parse catch block.
PR Size: M
- Lines changed: 521 (499 additions, 22 deletions)
- Files changed: 6
Review Coverage
- Logic and correctness - Clean. SSE early-return ensures non-streaming checks are properly guarded. The `isHtml || !contentLength` inspection condition correctly covers both HTML content-type and missing content-length scenarios. The `response.clone()` approach preserves the original response for downstream consumption.
- Security (OWASP Top 10) - Clean. The `truncateForDetail` and `sanitizeErrorTextForDetail` functions in upstream-error-detection.ts properly redact sensitive data before including error details.
- Error handling - Clean. The `readResponseTextUpTo` empty catch for `reader.cancel()` follows established codebase patterns (multiple identical patterns in the same file). The `EmptyResponseError` re-throw fix is a legitimate bug fix.
- Type safety - Clean. No `any` usage in new code. Types are properly narrowed.
- Documentation accuracy - Clean. Comments accurately describe the detection logic and its limitations.
- Test coverage - Adequate. New tests cover the fake 200 HTML detection and the `missing_content` swallowing fix. The `isLikelyHtmlDocument` function has both positive and negative test cases (including a test ensuring JSON containing `<html>` text is not falsely detected).
- Code clarity - Good. The `readResponseTextUpTo` utility is well-separated, and the inspection flow is linear and easy to follow.
Automated review by Claude AI
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 769-775: The ProxyError is currently constructed with the internal
detection code (detected.code) in the isStrongFake200 branch which exposes
internal tokens like "FAKE_200_HTML_BODY" to clients; update the throw in
forwarder.ts so that when isStrongFake200 is true you pass a user-friendly
message (e.g. "Upstream service returned an invalid response") instead of
detected.code while keeping the 502 status and the existing metadata (body,
providerId, providerName), or alternately add a 502-specific safe mapping inside
getClientSafeMessage() to map any 502/FAKE_200_* detector codes to a generic
client message; target symbols: forwarder.ts where isStrongFake200 is checked,
ProxyError constructor invocation, and getClientSafeMessage if choosing the
mapping approach.
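For orientation, here is a minimal sketch of the "safe mapping" approach the comment proposes, in TypeScript. The class shape and field names are assumptions for illustration; only `getClientSafeMessage()` and the `FAKE_200_*` detector codes come from the comment itself, and this is not the actual ProxyError implementation in errors.ts.

```typescript
// Hypothetical shape of ProxyError; only the FAKE_200_* -> generic-message mapping idea
// comes from the review comment above.
class ProxyError extends Error {
  constructor(
    public readonly status: number,
    public readonly code: string, // internal detector token, e.g. "FAKE_200_HTML_BODY"
    message: string,
  ) {
    super(message);
  }

  getClientSafeMessage(): string {
    // Never echo internal detector tokens to clients; collapse them into a generic 502 message.
    if (this.status === 502 && this.code.startsWith("FAKE_200_")) {
      return "Upstream service returned an invalid response";
    }
    return this.message;
  }
}
```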
🧹 Nitpick comments (2)
src/app/v1/_lib/proxy/forwarder.ts (2)
Lines 102-150: the `readResponseTextUpTo` implementation is sound overall; one minor point is worth noting.
The `remaining <= 0` check (line 122) happens after a new chunk has been read but before it is decoded, which means that chunk is silently dropped. For heuristic sniffing this is acceptable, but it could confuse anyone who later reuses the function for exact truncation, so a one-line comment explaining it would help.
Also, when `truncated = false` (i.e. the stream ends normally), the `reader` is neither `cancel()`ed nor `releaseLock()`ed. Since the caller reads the body of `response.clone()`, GC should be able to reclaim it, but an explicit release is more robust. Suggestion: release the reader on normal completion as well:

```diff
  if (truncated) {
    try {
      await reader.cancel();
    } catch {
      // ignore
    }
+ } else {
+   reader.releaseLock();
  }
```
Lines 740-750: when `Content-Length` is absent and the response body is large, `response.clone()` tees the entire body in memory. The condition on line 742, `isHtml || !contentLength`, triggers `response.clone()` for every non-SSE response that lacks `Content-Length`. Although `readResponseTextUpTo` will `cancel()` the cloned branch after 32 KiB, `Response.clone()` is backed by `ReadableStream.tee()`, and in some runtimes the backpressure handling of the cancelled branch can cause the original branch to buffer extra data. For normal LLM JSON responses (typically a few KB to a few tens of KB) this is not a problem, but if the upstream returns a very large non-streaming response without `Content-Length`, there may be a brief memory spike. The current 32 KiB cap already provides reasonable protection; it would just be worth documenting this trade-off in a comment for future maintainers.
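To make the pattern both nitpicks discuss concrete, here is a self-contained sketch of a bounded prefix read over a cloned response: at most 32 KiB is pulled, the cloned branch is cancelled when truncating, and the lock is released on normal completion. Function and variable names are illustrative assumptions, not the exact code in forwarder.ts.

```typescript
// Illustrative sketch of a bounded body prefix read; not the shipped readResponseTextUpTo.
const MAX_SNIFF_BYTES = 32 * 1024;

async function readResponsePrefix(response: Response, limit = MAX_SNIFF_BYTES): Promise<string> {
  // Clone so the original body stays consumable by the downstream client.
  const reader = response.clone().body?.getReader();
  if (!reader) return "";

  const decoder = new TextDecoder("utf-8", { fatal: false });
  const chunks: Uint8Array[] = [];
  let received = 0;
  let truncated = false;

  while (received < limit) {
    const { done, value } = await reader.read();
    if (done || !value) break;
    chunks.push(value);
    received += value.byteLength;
    if (received >= limit) {
      truncated = true;
      break;
    }
  }

  if (truncated) {
    try {
      await reader.cancel(); // stop pulling further bytes on the cloned branch
    } catch {
      // ignore cancellation errors
    }
  } else {
    reader.releaseLock(); // explicit release on normal end-of-stream
  }

  // Merge the collected chunks and decode only the first `limit` bytes.
  const merged = new Uint8Array(received);
  let offset = 0;
  for (const chunk of chunks) {
    merged.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return decoder.decode(merged.subarray(0, limit));
}
```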
- endpoint_pool_exhausted: write attemptNumber so the entry is not swallowed by the initial_selection/session_reuse de-duplication
- decision chain / technical timeline: add the failure state and explanation for resource_not_found
- update the provider-chain i18n copy and add unit-test coverage
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
messages/ru/provider-chain.json (1)
Line 42: ⚠️ Potential issue | 🟡 Minor. There is a pre-existing grammar issue in the Russian locale (not introduced by this PR).
"Пул конечная точкаов исчерпан" is grammatically incorrect; the correct Russian is "Пул конечных точек исчерпан" (the genitive plural of "конечная точка" is "конечных точек"). The same mistake appears in several places in the file (e.g. the endpoint_pool_exhausted, endpointStats*, and strictBlock* entries). It would be worth fixing them all consistently in a follow-up PR.
🧹 Nitpick comments (2)
src/lib/utils/provider-chain-formatter.ts (1)
Lines 443-482: the `resource_not_found` timeline block logic is correct; consider extracting the formatting logic it shares with `retry_failed`. This block and the `retry_failed` block below (lines 485-550) duplicate each other heavily in provider-detail formatting, duration calculation, `upstreamParsed`/`upstreamBody` rendering, and request-detail attachment; the main differences are the missing circuit-breaker state display and the different header/note copy. The current implementation is functionally correct, but if more similar reason types are added later, consider extracting a generic `formatProviderErrorBlock` helper to reduce duplication.
src/lib/utils/provider-chain-formatter.test.ts (1)
Lines 278-327: the `resource_not_found` test coverage is good and consistent with the existing test patterns. The new tests cover how the three core formatting functions (`formatProviderSummary`, `formatProviderDescription`, `formatProviderTimeline`) handle the `resource_not_found` reason, are clearly structured, and follow the same organization as the `endpoint_pool_exhausted` tests above. The following scenarios could be added to strengthen coverage:
- a multi-provider chain combining `resource_not_found` with a successful retry (similar to the `endpoint_pool_exhausted` test "renders exhausted alongside successful retry in multi-provider chain")
- the degraded rendering path when `errorDetails.provider` is missing (the else branch in `formatProviderTimeline`)
- also apply strong-signal sniffing to 2xx application/json responses with Content-Length <= 32 KiB
- add a regression test for 200 + JSON error (with Content-Length) triggering failover
- fake-200 / empty response: when verboseProviderError is enabled, return a detailed report and the raw upstream text in error.details (not persisted to the DB)
- forwarder: attach the detected raw-text snippet to ProxyError.upstreamError.rawBody
- tests: cover verbose details and rawBody passthrough
Additional Comments (2)
Prompt To Fix With AI: This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 885:888
Comment:
**Invalid Content-Length treated as missing**
`hasValidContentLength` considers any finite `parseInt()` result >= 0 as “valid” (lines ~879-886). That means headers like `Content-Length: 12abc` will parse as `12` and be treated as valid, potentially skipping the `!hasValidContentLength` inspection path and changing fake-200/empty-body detection behavior. If upstreams ever send malformed Content-Length values, this will misclassify them instead of treating them as invalid.
How can I resolve this? If you propose a fix, please make it concise.
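One concise way to resolve this is to treat anything that is not strictly numeric as a missing header, so malformed values fall back into the body-inspection path. The helper name below is illustrative, and the sketch assumes the current check is `parseInt()`-based as the comment describes:

```typescript
// Illustrative: accept only strictly numeric Content-Length values; anything else counts as missing.
function parseValidContentLength(header: string | null): number | null {
  if (header === null) return null;
  const trimmed = header.trim();
  // parseInt("12abc") === 12 would mask malformed values, so require digits only.
  if (!/^\d+$/.test(trimmed)) return null;
  const value = Number(trimmed);
  return Number.isSafeInteger(value) ? value : null;
}

// Usage sketch (inside the forwarder):
//   const contentLength = parseValidContentLength(response.headers.get("content-length"));
//   const hasValidContentLength = contentLength !== null; // null => take the body-inspection path
```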
Prompt To Fix With AI: This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/errors.ts
Line: 737:742
Comment:
**FAKE_200 detail source mismatch**
`ProxyError.getClientSafeMessage()` claims the FAKE_200_* path will only include a “sanitized/truncated snippet” (comment on lines ~739-740), but it uses `this.upstreamError?.body` (lines ~737-742) rather than `rawBody`. In `ProxyForwarder` the thrown `ProxyError` sets `body: detected.detail` and also sets `rawBody: inspectedText`. If future callers construct FAKE_200_* errors with `upstreamError.body` containing larger/less-sanitized content, this method would echo it (after whitespace normalization) into the client message, contradicting the intent and risking leaking more upstream content than expected.
How can I resolve this? If you propose a fix, please make it concise.
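A defensive sketch of the kind of fix this asks for (and that the follow-up commit describes as a second truncation plus light sanitization): re-truncate and redact whatever body ends up on the error before echoing it into the client message. Helper names, the redaction patterns, and the length limit are assumptions; the real helpers live in upstream-error-detection.ts and errors.ts.

```typescript
// Hypothetical helpers; the shipped sanitizer and limits may differ.
const CLIENT_DETAIL_MAX_CHARS = 512;

function sanitizeForClientDetail(text: string): string {
  return text
    .replace(/Bearer\s+[A-Za-z0-9._~+\/-]+=*/g, "Bearer [REDACTED]")
    .replace(/[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[EMAIL]")
    .replace(/\s+/g, " ")
    .trim();
}

function clientSafeFake200Detail(body: string | undefined): string | undefined {
  if (!body) return undefined;
  // Re-truncate and redact even if the caller already did, so any FAKE_200_* construction stays safe.
  return sanitizeForClientDetail(body).slice(0, CLIENT_DETAIL_MAX_CHARS);
}
```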
- forwarder: treat malformed Content-Length as invalid, so HTML/empty responses are not missed
- errors: apply a second truncation plus light sanitization to the FAKE_200 client detail (defensive)
- tests: add a regression test for missed detection under malformed Content-Length
Fixed both of Greptile's suggestions:
Ran npm run lint / typecheck / test / build locally; all pass.
Additional Comments (1)
Prompt To Fix With AI: This is a comment left during a code review.
Path: src/lib/utils/upstream-error-detection.ts
Line: 107:107
Comment:
Export `sanitizeErrorTextForDetail` for use in error-handler
This function needs to be exported so `error-handler.ts` can sanitize `rawBody` before exposing it to clients via the `verboseProviderError` feature.
```suggestion
export function sanitizeErrorTextForDetail(text: string): string {
```
How can I resolve this? If you propose a fix, please make it concise.
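For context, a possible call site once the function is exported; the import alias and the helper around it are assumptions, not code from this PR:

```typescript
// Hypothetical usage in error-handler.ts; the "@/" path alias is assumed.
import { sanitizeErrorTextForDetail } from "@/lib/utils/upstream-error-detection";

function buildVerboseUpstreamDetail(rawBody: string | undefined): string | undefined {
  if (!rawBody) return undefined;
  // Sanitize before exposing the snippet via the verboseProviderError feature.
  return sanitizeErrorTextForDetail(rawBody);
}
```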
Additional notes:
Additional Comments (1)
Prompt To Fix With AI: This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 733:741
Comment:
Streaming responses return immediately before fake-200 detection can occur. Detection happens later internally to update circuit breaker stats, but client already received HTTP 200. This is correctly documented in the UI.
How can I resolve this? If you propose a fix, please make it concise.
Background (Fixes #749)
Some upstreams return HTTP 200 in auth / risk-control / WAF / Cloudflare scenarios while the body is actually an error page or error JSON (for example a text/html interception page, or JSON containing a non-empty error). This causes CCH to mis-record the request as a success, polluting the circuit-breaker/failover statistics and smart session binding (subsequent retries may keep reusing the same bad provider).

Changes
- Non-streaming 2xx: for text/html / application/xhtml+xml, missing or malformed Content-Length, and small JSON bodies (Content-Length <= 32 KiB), sniff at most a 32 KiB prefix (see the sketch after this list):
  - HTML document prefix (<!doctype html / <html) => FAKE_200_HTML_BODY
  - non-empty error / non-empty error.message => FAKE_200_JSON_*
- SSE streaming: keep the "deferred settlement" strategy; wait for the stream to end naturally, then run the same detection on the final content. If it is judged a fake 200, settle it internally as a failure (the content/status code the client already received is not changed).
- New: fake-200 status code inference (low priority; only triggered after a response has already been judged a fake 200; never overrides non-200 responses or other detections). Also illustrated in the sketch after this list.
- Logs/UI: when the status code is an inferred value, it is clearly marked as inferred: statusCodeInferred=true, "Status Code (inferred)", and "(推测) / (inferred)" appended after "HTTP xxx".
- verboseProviderError: when the system setting is enabled, upstream anomalies such as fake-200/empty responses carry more detailed diagnostics and a raw upstream-response snippet in error.details.upstreamError of the error response (kept briefly in memory only; not written to the DB or the decision chain).
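As a rough illustration of the detection and inference rules above (the real logic lives in upstream-error-detection.ts; the names, regexes, and status mapping here are simplified assumptions, not the shipped implementation):

```typescript
// Simplified sketch of the fake-200 classification and status inference described above.
type Fake200Code = "FAKE_200_HTML_BODY" | "FAKE_200_JSON_ERROR" | null;

function classifyFake200(bodyPrefix: string): Fake200Code {
  const head = bodyPrefix.replace(/^\uFEFF/, "").trimStart().toLowerCase();

  // Strong HTML signal: a document prefix, not merely "<html>" appearing inside JSON text.
  if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
    return "FAKE_200_HTML_BODY";
  }

  // JSON bodies with a non-empty error / error.message field.
  try {
    const parsed = JSON.parse(bodyPrefix) as { error?: unknown };
    const err = parsed.error;
    if (typeof err === "string" && err.trim().length > 0) return "FAKE_200_JSON_ERROR";
    if (err !== null && typeof err === "object") {
      const message = (err as { message?: unknown }).message;
      if (typeof message === "string" && message.trim().length > 0) return "FAKE_200_JSON_ERROR";
    }
  } catch {
    // not valid JSON (or a truncated prefix); fall through
  }
  return null;
}

// Low-priority status inference, applied only after the body was already judged a fake 200.
function inferStatusFromFake200(bodyPrefix: string): number {
  const text = bodyPrefix.toLowerCase();
  if (/rate.?limit|too many requests/.test(text)) return 429;
  if (/not found/.test(text)) return 404;
  if (/forbidden|access denied|cloudflare/.test(text)) return 403;
  if (/unauthorized|invalid api key/.test(text)) return 401;
  return 502; // default when nothing more specific can be inferred
}
```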
Incidental fixes (log readability)
- Chain attribution for endpoint_pool_exhausted, so it is no longer swallowed by the de-duplication logic.
- Failure display and i18n for resource_not_found in the technical timeline / decision chain.

Trade-offs / security
- The raw snippet is only returned when verboseProviderError is enabled and is never persisted to the DB. Basic sanitization (Bearer/key/JWT/email, etc.) is applied before returning it, but it may still contain sensitive information, so enabling this is only recommended in trusted environments.

Tests
- npm run lint
- npm run typecheck
- npm run test
- npm run build

Greptile Overview
Greptile Summary
This PR implements comprehensive fake-200 detection and status code inference to prevent upstream errors disguised as successful responses from polluting circuit breaker stats and session binding.
Key Changes
- Non-streaming 2xx bodies that are HTML error pages or JSON with non-empty `error` fields are detected as fake 200s
- `resource_not_found` errors excluded from circuit breaker (distinguishes missing resources from provider failures) but still trigger failover after retries
- `verboseProviderError` system setting exposes sanitized `rawBody` excerpts in error responses for debugging

Fixes Addressed from Previous Threads
✅ Security: `rawBody` properly sanitized via `sanitizeErrorTextForDetail()` before client exposure (removes Bearer tokens, API keys, JWTs, emails)
✅ JSON detection: Small JSON responses with Content-Length ≤ 32 KiB ARE inspected via the `shouldInspectJson` flag
✅ Test coverage: Comprehensive tests validate JSON fake-200 detection with Content-Length headers
✅ Reader cleanup: Not an issue - reader released by Response.clone() mechanism
Trade-offs
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant Forwarder
    participant Upstream
    participant ResponseHandler
    participant ErrorHandler
    participant CircuitBreaker
    participant DB

    Note over Client,DB: Non-Streaming Fake-200 Detection Flow
    Client->>Forwarder: Request
    Forwarder->>Upstream: Forward request
    Upstream-->>Forwarder: HTTP 200 + error body
    Note over Forwarder: Inspect body (up to 32KiB)
    Forwarder->>Forwarder: detectUpstreamErrorFromSseOrJsonText()
    alt Strong signal detected (HTML/JSON error)
        Forwarder->>Forwarder: inferUpstreamErrorStatusCodeFromText()
        Forwarder->>Forwarder: Throw ProxyError(inferred 4xx/5xx)
        Forwarder->>CircuitBreaker: recordFailure() (skip if 404)
        Forwarder->>ErrorHandler: Handle fake-200 error
        ErrorHandler->>ErrorHandler: sanitizeErrorTextForDetail(rawBody)
        ErrorHandler-->>Client: Error response with inferred status
    else No error detected
        Forwarder-->>Client: Success response
    end

    Note over Client,DB: SSE Streaming Fake-200 Detection Flow
    Client->>Forwarder: Request
    Forwarder->>Upstream: Forward request
    Upstream-->>Forwarder: HTTP 200 + SSE stream
    Note over Forwarder: Defer finalization
    Forwarder-->>Client: Start streaming (200 OK)
    Client->>ResponseHandler: Stream completes
    ResponseHandler->>ResponseHandler: detectUpstreamErrorFromSseOrJsonText()
    alt Fake-200 detected in stream
        ResponseHandler->>ResponseHandler: inferUpstreamErrorStatusCodeFromText()
        ResponseHandler->>CircuitBreaker: recordFailure() (skip if 404)
        ResponseHandler->>DB: Record with inferred status + statusCodeInferred=true
        Note over Client: Client already received 200<br/>but internal stats show failure
    else Normal completion
        ResponseHandler->>CircuitBreaker: recordSuccess()
        ResponseHandler->>DB: Record success
    end
```