Skip to content

fix(proxy): retry transient upstream errors that opt in via x-should-retry#54

Draft
TaprootFreak wants to merge 3 commits into
snipeship:mainfrom
DFXswiss:contrib/retry-on-x-should-retry
Draft

fix(proxy): retry transient upstream errors that opt in via x-should-retry#54
TaprootFreak wants to merge 3 commits into
snipeship:mainfrom
DFXswiss:contrib/retry-on-x-should-retry

Conversation

@TaprootFreak

Copy link
Copy Markdown

Problem

Anthropic responds to HTTP 529 ("Overloaded") and similar transient 5xx conditions with the x-should-retry: true response header. This tells callers that the same request can be safely retried against the same upstream — that is the documented signal to retry, not status code alone.

ccflare currently only treats HTTP 429 as a retry/failover trigger (response-processor.ts isRateLimited), so 529s flow straight through to the client with no retry and no failover. Under Anthropic capacity bursts this is a user-visible failure even though the upstream is explicitly telling the proxy "try this again, it'll likely work."

This is a wider blind spot in the ecosystem — Anthropic's own TypeScript SDK has the same bug, tracked in anthropic-sdk-typescript#791. Fixing it here is high-leverage: every client behind ccflare benefits.

The existing retry config (runtime.retry.attempts, runtime.retry.delayMs, runtime.retry.backoff) is already declared but unused in the new flow — there is a comment in response-handler.ts reading Always 0 in new flow, but kept for message compatibility. This PR finally puts it to work.

Fix

Add an inner retry loop in proxyWithAccount that triggers on status >= 500 && x-should-retry: true:

  • Retry the same account up to runtime.retry.attempts times before giving up
  • On exhausted retries, return null so the existing outer loop in handleProxy falls over to the next account (preserves current failover semantics)
  • Full jitter on backoff: Math.random() * (delayMs * backoff^attempt)
  • Honor the upstream retry-after header as a floor: max(jittered, retryAfterMs)
  • Cap the base delay at 30s before jitter to protect against config footguns (e.g. RETRY_ATTEMPTS=10 backoff=2 would otherwise wait 1024s)
  • Persist the per-request retry count in a new retry_attempt column on the requests table for observability

Why this design

  • Same-account retry, not failover. HTTP 529 is global Anthropic pool saturation, not an account-scoped condition. Failing over to a sibling account that shares the same upstream pool just shifts which account gets blamed in metrics — it does not improve success rate. The right move is to wait briefly and re-ask the same account.

  • Full jitter (AWS pattern). Pure exponential backoff under load is a known anti-pattern: parallel clients synchronize their retries into waves that amplify the saturation they are reacting to. Full jitter (see AWS Architecture Blog — Exponential Backoff and Jitter) spreads retries across the backoff window.

  • retry-after as a floor, not a replacement. When Anthropic sends retry-after, it knows its pool capacity better than the client does — never wait less than that. But never wait less than the locally-jittered backoff either, so a small retry-after value cannot cause a tight retry loop.

  • 30s cap. Protects against config errors making individual requests hang for many minutes.

  • Same-account retry vs. existing failover are complementary. The inner loop handles transient overload that benefits from waiting on the same account; the outer loop handles account-specific rate-limits that benefit from switching. Both stay in scope.

  • Reuses existing retry config that was already declared in RuntimeConfig but unused in the new request flow. No new config surface.

Database migration

The retry_attempt column is added through the existing additive-migration path:

  • Fresh schema (ensureSchema) — column included from the start
  • In-flight v2 migration — column included when migrating from legacy
  • Legacy ALTER (ensureRequestLinkageColumns) — ALTER TABLE requests ADD COLUMN retry_attempt INTEGER DEFAULT 0 (O(1) in SQLite, no rewrite)

Backward compatible. Old rows read back as retry_attempt = 0.

Observability

Operators can quantify the recovery effect:

SELECT COUNT(*) FROM requests WHERE retry_attempt > 0;       -- rescued requests
SELECT MAX(retry_attempt), AVG(retry_attempt) FROM requests; -- retry distribution

retry_attempt is also exposed via the /api/requests summary response (alongside the existing failoverAttempts), so it surfaces in the dashboard request list without extra wiring.

Tests

  • packages/proxy/src/handlers/proxy-operations.test.ts (new file, 11 tests): retry happy path, exhausted retries, non-retryable 500, instant 200, smaller-than-default attempt budget, jitter is applied, retry-after honored as floor, 30s cap, retry counter tracking
  • packages/proxy/src/handlers/response-processor.test.ts (+13 tests): isRetryableUpstreamError matrix (429, 500, 502, 503, 529, 200, missing header, x-should-retry: false), parseRetryAfter for delta-seconds, HTTP-date, missing header, invalid values
  • packages/types/src/request.test.ts and packages/database/src/models/request-row.test.ts: extended to cover retryAttempt mapping through the summary and row layers

All tests in packages/proxy, packages/types, packages/database pass locally.

Local verification

bun install
cd packages/proxy   && bun test  # 76 pass / 0 fail
cd ../types         && bun test  # 17 pass / 0 fail
cd ../database      && bun test  # 31 pass / 0 fail
bun run typecheck                # all packages green
bunx --bun biome check packages/proxy packages/database packages/types
# Checked 109 files in 36ms. No fixes applied.

Pre-existing failures unrelated to this PR (also present on main at 95c4c6a):

  • apps/server/src/graceful-shutdown.integration.test.tsBun.spawn(["bun", ...]) fails when bun is not on the spawned child's $PATH
  • packages/http/src/client.test.ts — one biome formatter diff

These are noted but not fixed here to keep the PR scoped.

Production note

This patch has been running in production at a small organization sitting between active users and Anthropic. We observed a substantial reduction in user-visible 529 errors during Anthropic capacity bursts — exactly what x-should-retry: true was designed to enable.

References

…retry

Anthropic responds to HTTP 529 ("Overloaded") and similar transient 5xx
conditions with `x-should-retry: true`, signalling that the same request
can be safely retried against the same upstream. ccflare previously only
treated HTTP 429 as a rate-limit/failover trigger, so 529s flowed
straight through to the client with no retry and no failover.

Add an inner retry loop in `proxyWithAccount` that honours the header:
- triggers on `status >= 500 && x-should-retry: true`
- retries the same account up to `runtime.retry.attempts` times
- exponential backoff `delayMs * backoff^attempt`
- on exhausted retries returns null so the outer loop in handleProxy
  fails over to the next account

Persist the per-request retry count via a new `retry_attempt` column on
the `requests` table so operators can quantify the recovery effect with
`SELECT COUNT(*) FROM requests WHERE retry_attempt > 0`. The column is
added through the existing additive-migration path (fresh schema,
in-flight v2 migration, and ALTER on legacy DBs all covered).
Pure exponential backoff under load is a well-known anti-pattern: parallel
clients synchronize their retries into waves that amplify the saturation
they are reacting to. Adopt the "Exponential Backoff and Jitter" pattern
described by the AWS Architecture Blog to spread retries across the
backoff window and improve recovery time.

- Full jitter: Math.random() * (delayMs * backoff^attempt)
- Honor the upstream `retry-after` header as a floor:
  max(jittered, retryAfterMs). The upstream knows its own pool capacity
  better than the client does.
- Cap the base delay at 30s before jitter is applied. This protects
  against config footguns such as RETRY_ATTEMPTS=10 with backoff=2,
  which would otherwise wait 1024s on the final attempt.
…mmary

New retry_attempt DB column was written but not read on the API path.
Add it to the row->object mapper, the DbRequest/RequestSummary interfaces,
the summary mapper, and the isRequestSummary type guard so /api/requests
responses surface the same-account retry count alongside failoverAttempts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant