fix(proxy): retry transient upstream errors that opt in via x-should-retry by TaprootFreak · Pull Request #54 · snipeship/ccflare

TaprootFreak · 2026-05-26T08:40:23Z

Problem

Anthropic responds to HTTP 529 ("Overloaded") and similar transient 5xx conditions with the x-should-retry: true response header. This tells callers that the same request can be safely retried against the same upstream — that is the documented signal to retry, not status code alone.

ccflare currently only treats HTTP 429 as a retry/failover trigger (response-processor.ts isRateLimited), so 529s flow straight through to the client with no retry and no failover. Under Anthropic capacity bursts this is a user-visible failure even though the upstream is explicitly telling the proxy "try this again, it'll likely work."

This is a wider blind spot in the ecosystem — Anthropic's own TypeScript SDK has the same bug, tracked in anthropic-sdk-typescript#791. Fixing it here is high-leverage: every client behind ccflare benefits.

The existing retry config (runtime.retry.attempts, runtime.retry.delayMs, runtime.retry.backoff) is already declared but unused in the new flow — there is a comment in response-handler.ts reading Always 0 in new flow, but kept for message compatibility. This PR finally puts it to work.

Fix

Add an inner retry loop in proxyWithAccount that triggers on status >= 500 && x-should-retry: true:

Retry the same account up to runtime.retry.attempts times before giving up
On exhausted retries, return null so the existing outer loop in handleProxy falls over to the next account (preserves current failover semantics)
Full jitter on backoff: Math.random() * (delayMs * backoff^attempt)
Honor the upstream retry-after header as a floor: max(jittered, retryAfterMs)
Cap the base delay at 30s before jitter to protect against config footguns (e.g. RETRY_ATTEMPTS=10 backoff=2 would otherwise wait 1024s)
Persist the per-request retry count in a new retry_attempt column on the requests table for observability

Why this design

Same-account retry, not failover. HTTP 529 is global Anthropic pool saturation, not an account-scoped condition. Failing over to a sibling account that shares the same upstream pool just shifts which account gets blamed in metrics — it does not improve success rate. The right move is to wait briefly and re-ask the same account.
Full jitter (AWS pattern). Pure exponential backoff under load is a known anti-pattern: parallel clients synchronize their retries into waves that amplify the saturation they are reacting to. Full jitter (see AWS Architecture Blog — Exponential Backoff and Jitter) spreads retries across the backoff window.
retry-after as a floor, not a replacement. When Anthropic sends retry-after, it knows its pool capacity better than the client does — never wait less than that. But never wait less than the locally-jittered backoff either, so a small retry-after value cannot cause a tight retry loop.
30s cap. Protects against config errors making individual requests hang for many minutes.
Same-account retry vs. existing failover are complementary. The inner loop handles transient overload that benefits from waiting on the same account; the outer loop handles account-specific rate-limits that benefit from switching. Both stay in scope.
Reuses existing retry config that was already declared in RuntimeConfig but unused in the new request flow. No new config surface.

Database migration

The retry_attempt column is added through the existing additive-migration path:

Fresh schema (ensureSchema) — column included from the start
In-flight v2 migration — column included when migrating from legacy
Legacy ALTER (ensureRequestLinkageColumns) — ALTER TABLE requests ADD COLUMN retry_attempt INTEGER DEFAULT 0 (O(1) in SQLite, no rewrite)

Backward compatible. Old rows read back as retry_attempt = 0.

Observability

Operators can quantify the recovery effect:

SELECT COUNT(*) FROM requests WHERE retry_attempt > 0;       -- rescued requests
SELECT MAX(retry_attempt), AVG(retry_attempt) FROM requests; -- retry distribution

retry_attempt is also exposed via the /api/requests summary response (alongside the existing failoverAttempts), so it surfaces in the dashboard request list without extra wiring.

Tests

packages/proxy/src/handlers/proxy-operations.test.ts (new file, 11 tests): retry happy path, exhausted retries, non-retryable 500, instant 200, smaller-than-default attempt budget, jitter is applied, retry-after honored as floor, 30s cap, retry counter tracking
packages/proxy/src/handlers/response-processor.test.ts (+13 tests): isRetryableUpstreamError matrix (429, 500, 502, 503, 529, 200, missing header, x-should-retry: false), parseRetryAfter for delta-seconds, HTTP-date, missing header, invalid values
packages/types/src/request.test.ts and packages/database/src/models/request-row.test.ts: extended to cover retryAttempt mapping through the summary and row layers

All tests in packages/proxy, packages/types, packages/database pass locally.

Local verification

bun install
cd packages/proxy   && bun test  # 76 pass / 0 fail
cd ../types         && bun test  # 17 pass / 0 fail
cd ../database      && bun test  # 31 pass / 0 fail
bun run typecheck                # all packages green
bunx --bun biome check packages/proxy packages/database packages/types
# Checked 109 files in 36ms. No fixes applied.

Pre-existing failures unrelated to this PR (also present on main at 95c4c6a):

apps/server/src/graceful-shutdown.integration.test.ts — Bun.spawn(["bun", ...]) fails when bun is not on the spawned child's $PATH
packages/http/src/client.test.ts — one biome formatter diff

These are noted but not fixed here to keep the PR scoped.

Production note

This patch has been running in production at a small organization sitting between active users and Anthropic. We observed a substantial reduction in user-visible 529 errors during Anthropic capacity bursts — exactly what x-should-retry: true was designed to enable.

References

anthropic-sdk-typescript#791 — same bug in Anthropic's official TypeScript SDK
AWS Architecture Blog — Exponential Backoff and Jitter — full-jitter rationale
Anthropic API errors — x-should-retry, overloaded_error

…retry Anthropic responds to HTTP 529 ("Overloaded") and similar transient 5xx conditions with `x-should-retry: true`, signalling that the same request can be safely retried against the same upstream. ccflare previously only treated HTTP 429 as a rate-limit/failover trigger, so 529s flowed straight through to the client with no retry and no failover. Add an inner retry loop in `proxyWithAccount` that honours the header: - triggers on `status >= 500 && x-should-retry: true` - retries the same account up to `runtime.retry.attempts` times - exponential backoff `delayMs * backoff^attempt` - on exhausted retries returns null so the outer loop in handleProxy fails over to the next account Persist the per-request retry count via a new `retry_attempt` column on the `requests` table so operators can quantify the recovery effect with `SELECT COUNT(*) FROM requests WHERE retry_attempt > 0`. The column is added through the existing additive-migration path (fresh schema, in-flight v2 migration, and ALTER on legacy DBs all covered).

Pure exponential backoff under load is a well-known anti-pattern: parallel clients synchronize their retries into waves that amplify the saturation they are reacting to. Adopt the "Exponential Backoff and Jitter" pattern described by the AWS Architecture Blog to spread retries across the backoff window and improve recovery time. - Full jitter: Math.random() * (delayMs * backoff^attempt) - Honor the upstream `retry-after` header as a floor: max(jittered, retryAfterMs). The upstream knows its own pool capacity better than the client does. - Cap the base delay at 30s before jitter is applied. This protects against config footguns such as RETRY_ATTEMPTS=10 with backoff=2, which would otherwise wait 1024s on the final attempt.

…mmary New retry_attempt DB column was written but not read on the API path. Add it to the row->object mapper, the DbRequest/RequestSummary interfaces, the summary mapper, and the isRequestSummary type guard so /api/requests responses surface the same-account retry count alongside failoverAttempts.

TaprootFreak added 3 commits May 26, 2026 10:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(proxy): retry transient upstream errors that opt in via x-should-retry#54

fix(proxy): retry transient upstream errors that opt in via x-should-retry#54
TaprootFreak wants to merge 3 commits into
snipeship:mainfrom
DFXswiss:contrib/retry-on-x-should-retry

TaprootFreak commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TaprootFreak commented May 26, 2026

Problem

Fix

Why this design

Database migration

Observability

Tests

Local verification

Production note

References

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant