fix(proxy): retry transient upstream errors that opt in via x-should-retry#54
Draft
TaprootFreak wants to merge 3 commits into
Draft
fix(proxy): retry transient upstream errors that opt in via x-should-retry#54TaprootFreak wants to merge 3 commits into
TaprootFreak wants to merge 3 commits into
Conversation
…retry
Anthropic responds to HTTP 529 ("Overloaded") and similar transient 5xx
conditions with `x-should-retry: true`, signalling that the same request
can be safely retried against the same upstream. ccflare previously only
treated HTTP 429 as a rate-limit/failover trigger, so 529s flowed
straight through to the client with no retry and no failover.
Add an inner retry loop in `proxyWithAccount` that honours the header:
- triggers on `status >= 500 && x-should-retry: true`
- retries the same account up to `runtime.retry.attempts` times
- exponential backoff `delayMs * backoff^attempt`
- on exhausted retries returns null so the outer loop in handleProxy
fails over to the next account
Persist the per-request retry count via a new `retry_attempt` column on
the `requests` table so operators can quantify the recovery effect with
`SELECT COUNT(*) FROM requests WHERE retry_attempt > 0`. The column is
added through the existing additive-migration path (fresh schema,
in-flight v2 migration, and ALTER on legacy DBs all covered).
Pure exponential backoff under load is a well-known anti-pattern: parallel clients synchronize their retries into waves that amplify the saturation they are reacting to. Adopt the "Exponential Backoff and Jitter" pattern described by the AWS Architecture Blog to spread retries across the backoff window and improve recovery time. - Full jitter: Math.random() * (delayMs * backoff^attempt) - Honor the upstream `retry-after` header as a floor: max(jittered, retryAfterMs). The upstream knows its own pool capacity better than the client does. - Cap the base delay at 30s before jitter is applied. This protects against config footguns such as RETRY_ATTEMPTS=10 with backoff=2, which would otherwise wait 1024s on the final attempt.
…mmary New retry_attempt DB column was written but not read on the API path. Add it to the row->object mapper, the DbRequest/RequestSummary interfaces, the summary mapper, and the isRequestSummary type guard so /api/requests responses surface the same-account retry count alongside failoverAttempts.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Anthropic responds to HTTP 529 ("Overloaded") and similar transient 5xx conditions with the
x-should-retry: trueresponse header. This tells callers that the same request can be safely retried against the same upstream — that is the documented signal to retry, not status code alone.ccflarecurrently only treats HTTP 429 as a retry/failover trigger (response-processor.tsisRateLimited), so 529s flow straight through to the client with no retry and no failover. Under Anthropic capacity bursts this is a user-visible failure even though the upstream is explicitly telling the proxy "try this again, it'll likely work."This is a wider blind spot in the ecosystem — Anthropic's own TypeScript SDK has the same bug, tracked in
anthropic-sdk-typescript#791. Fixing it here is high-leverage: every client behind ccflare benefits.The existing retry config (
runtime.retry.attempts,runtime.retry.delayMs,runtime.retry.backoff) is already declared but unused in the new flow — there is a comment inresponse-handler.tsreadingAlways 0 in new flow, but kept for message compatibility. This PR finally puts it to work.Fix
Add an inner retry loop in
proxyWithAccountthat triggers onstatus >= 500 && x-should-retry: true:runtime.retry.attemptstimes before giving upnullso the existing outer loop inhandleProxyfalls over to the next account (preserves current failover semantics)Math.random() * (delayMs * backoff^attempt)retry-afterheader as a floor:max(jittered, retryAfterMs)RETRY_ATTEMPTS=10 backoff=2would otherwise wait 1024s)retry_attemptcolumn on therequeststable for observabilityWhy this design
Same-account retry, not failover. HTTP 529 is global Anthropic pool saturation, not an account-scoped condition. Failing over to a sibling account that shares the same upstream pool just shifts which account gets blamed in metrics — it does not improve success rate. The right move is to wait briefly and re-ask the same account.
Full jitter (AWS pattern). Pure exponential backoff under load is a known anti-pattern: parallel clients synchronize their retries into waves that amplify the saturation they are reacting to. Full jitter (see AWS Architecture Blog — Exponential Backoff and Jitter) spreads retries across the backoff window.
retry-afteras a floor, not a replacement. When Anthropic sendsretry-after, it knows its pool capacity better than the client does — never wait less than that. But never wait less than the locally-jittered backoff either, so a smallretry-aftervalue cannot cause a tight retry loop.30s cap. Protects against config errors making individual requests hang for many minutes.
Same-account retry vs. existing failover are complementary. The inner loop handles transient overload that benefits from waiting on the same account; the outer loop handles account-specific rate-limits that benefit from switching. Both stay in scope.
Reuses existing retry config that was already declared in
RuntimeConfigbut unused in the new request flow. No new config surface.Database migration
The
retry_attemptcolumn is added through the existing additive-migration path:ensureSchema) — column included from the startensureRequestLinkageColumns) —ALTER TABLE requests ADD COLUMN retry_attempt INTEGER DEFAULT 0(O(1) in SQLite, no rewrite)Backward compatible. Old rows read back as
retry_attempt = 0.Observability
Operators can quantify the recovery effect:
retry_attemptis also exposed via the/api/requestssummary response (alongside the existingfailoverAttempts), so it surfaces in the dashboard request list without extra wiring.Tests
packages/proxy/src/handlers/proxy-operations.test.ts(new file, 11 tests): retry happy path, exhausted retries, non-retryable 500, instant 200, smaller-than-default attempt budget, jitter is applied,retry-afterhonored as floor, 30s cap, retry counter trackingpackages/proxy/src/handlers/response-processor.test.ts(+13 tests):isRetryableUpstreamErrormatrix (429, 500, 502, 503, 529, 200, missing header,x-should-retry: false),parseRetryAfterfor delta-seconds, HTTP-date, missing header, invalid valuespackages/types/src/request.test.tsandpackages/database/src/models/request-row.test.ts: extended to coverretryAttemptmapping through the summary and row layersAll tests in
packages/proxy,packages/types,packages/databasepass locally.Local verification
Pre-existing failures unrelated to this PR (also present on
mainat95c4c6a):apps/server/src/graceful-shutdown.integration.test.ts—Bun.spawn(["bun", ...])fails whenbunis not on the spawned child's$PATHpackages/http/src/client.test.ts— one biome formatter diffThese are noted but not fixed here to keep the PR scoped.
Production note
This patch has been running in production at a small organization sitting between active users and Anthropic. We observed a substantial reduction in user-visible 529 errors during Anthropic capacity bursts — exactly what
x-should-retry: truewas designed to enable.References
anthropic-sdk-typescript#791— same bug in Anthropic's official TypeScript SDKx-should-retry,overloaded_error