Context
After a series of bouncer error-handling improvements (DecartAI/api PRs #888, #889, #891, #892, #904, #909), the bouncer now sends structured, actionable error messages and close codes to clients. The SDK retry logic should be updated to match.
Current behavior
The SDK retries all errors during initial connect except a hardcoded PERMANENT_ERRORS blocklist ("permission denied", "invalid api key", "401", etc.). This means transient AND permanent server errors are retried identically.
Problem
Errors like "Upstream error (500)" or "Setup failed" are retried 5 times with exponential backoff, but retrying will never resolve them: they indicate a server bug or misconfiguration, not a transient condition. Each such failure wastes ~30s of user time and puts unnecessary load on the server.
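To make the ~30s figure concrete, here is a rough sketch of the arithmetic. The base delay and growth factor are assumptions chosen for illustration (this issue does not state the SDK's actual backoff parameters), and jitter is ignored:

```typescript
// Hypothetical backoff parameters; the SDK's real values are not given in this issue.
const BASE_DELAY_MS = 1000;
const FACTOR = 2;
const MAX_RETRIES = 5;

/** Delay before retry number `attempt` (0-based), without jitter. */
function backoffDelayMs(attempt: number): number {
  return BASE_DELAY_MS * Math.pow(FACTOR, attempt);
}

/** Total time spent waiting across `retries` retries. */
function totalBackoffMs(retries: number): number {
  let total = 0;
  for (let attempt = 0; attempt < retries; attempt++) {
    total += backoffDelayMs(attempt);
  }
  return total;
}

// 1000 + 2000 + 4000 + 8000 + 16000
console.log(totalBackoffMs(MAX_RETRIES)); // 31000
```

Under these assumed values, a permanent error costs about 31s of waiting before the user sees it, which is in line with the ~30s estimate.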
Bouncer error messages (after API PRs)
| Bouncer error message | Close code | Meaning | Should retry? |
| --- | --- | --- | --- |
| "Server at capacity. Please try again later." | 1013 | All inference pods full | Yes, capacity is transient |
| "Server is shutting down" | 1001 | Pod rolling update | Yes, other pods available |
| "Upstream error (503)" | 1011 | Inference pod rejected (non-capacity 503) | Yes, may be transient |
| "Upstream error (500)" | 1011 | Inference server bug | No, won't resolve |
| "Setup failed" | 1011 | Generic setup failure | No, won't resolve |
| "Insufficient credits" | 1008 | User has no balance | No, permanent until top-up |
| "Invalid API key" | 1008 | Auth failure | No, already in PERMANENT_ERRORS |
| "Model server disconnected. Please reconnect." | 1012 | Upstream died mid-session | Yes (reconnect path only) |
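The table above can also be written down as data, which makes one property easier to see: close code 1011 covers both retryable and non-retryable errors, so the close code alone cannot drive the retry decision. The sketch below is illustrative only; `RetryPolicy`, `BouncerError`, `BOUNCER_ERRORS`, and `policyFor` are hypothetical names, not part of the SDK or the bouncer.

```typescript
// Sketch only: all names here are hypothetical, not SDK API.
type RetryPolicy = "retry" | "no-retry" | "reconnect-only";

interface BouncerError {
  message: string;
  closeCode: number;
  policy: RetryPolicy;
}

// One entry per row of the table in this issue.
const BOUNCER_ERRORS: BouncerError[] = [
  { message: "Server at capacity. Please try again later.", closeCode: 1013, policy: "retry" },
  { message: "Server is shutting down", closeCode: 1001, policy: "retry" },
  { message: "Upstream error (503)", closeCode: 1011, policy: "retry" },
  { message: "Upstream error (500)", closeCode: 1011, policy: "no-retry" },
  { message: "Setup failed", closeCode: 1011, policy: "no-retry" },
  { message: "Insufficient credits", closeCode: 1008, policy: "no-retry" },
  { message: "Invalid API key", closeCode: 1008, policy: "no-retry" },
  { message: "Model server disconnected. Please reconnect.", closeCode: 1012, policy: "reconnect-only" },
];

/** Looks up the policy for an error message (case-insensitive substring match). */
function policyFor(message: string): RetryPolicy | undefined {
  const msg = message.toLowerCase();
  const entry = BOUNCER_ERRORS.find((e) => msg.includes(e.message.toLowerCase()));
  return entry ? entry.policy : undefined;
}

/** Returns the distinct policies that share a given close code. */
function policiesForCloseCode(closeCode: number): RetryPolicy[] {
  const policies = BOUNCER_ERRORS.filter((e) => e.closeCode === closeCode).map((e) => e.policy);
  return policies.filter((p, i) => policies.indexOf(p) === i);
}
```

For example, `policiesForCloseCode(1011)` yields both `"retry"` and `"no-retry"`, which is why the proposal matches on message text rather than on close codes.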
Proposed change
Initial connect (connect() in webrtc-manager.ts): flip from blocklist to allowlist.

    // Allowlist: only errors known to be transient are retried on initial connect.
    const RETRYABLE_ERRORS = [
      "server at capacity",
      "server is shutting down",
      "websocket timeout",
      "websocket error",
      "websocket closed",
    ];

    // In connect()'s shouldRetry:
    shouldRetry: (error) => {
      // Never retry after the caller asked to disconnect.
      if (this.intentionalDisconnect) return false;
      const msg = error.message.toLowerCase();
      // Known-permanent errors always win over the allowlist.
      if (PERMANENT_ERRORS.some((err) => msg.includes(err))) return false;
      return RETRYABLE_ERRORS.some((err) => msg.includes(err));
    },

Reconnect (reconnect() in webrtc-manager.ts): keep as-is (blocklist). If a user had a working session, it's worth retrying broadly since the disconnect is likely transient.
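The two policies can be compared side by side in a self-contained sketch. The function names `shouldRetryConnect` and `shouldRetryReconnect` and the helper `matchesAny` are hypothetical; `PERMANENT_ERRORS` contains only the entries quoted in this issue (the real list is longer), and checking `intentionalDisconnect` on the reconnect path is an assumption.

```typescript
// Partial list: only the entries quoted in this issue; the SDK's real list has more.
const PERMANENT_ERRORS: string[] = ["permission denied", "invalid api key", "401"];

const RETRYABLE_ERRORS: string[] = [
  "server at capacity",
  "server is shutting down",
  "websocket timeout",
  "websocket error",
  "websocket closed",
];

/** Case-insensitive substring match against a list of lowercase patterns. */
function matchesAny(message: string, patterns: string[]): boolean {
  const msg = message.toLowerCase();
  return patterns.some((p) => msg.includes(p));
}

/** Initial connect: allowlist. Unknown errors are NOT retried. */
function shouldRetryConnect(error: Error, intentionalDisconnect = false): boolean {
  if (intentionalDisconnect) return false;
  if (matchesAny(error.message, PERMANENT_ERRORS)) return false;
  return matchesAny(error.message, RETRYABLE_ERRORS);
}

/** Reconnect: blocklist. Unknown errors ARE retried. */
function shouldRetryReconnect(error: Error, intentionalDisconnect = false): boolean {
  // Assumption: the reconnect path also honors an intentional disconnect.
  if (intentionalDisconnect) return false;
  return !matchesAny(error.message, PERMANENT_ERRORS);
}

const serverBug = new Error("Upstream error (500)");
console.log(shouldRetryConnect(serverBug));   // false
console.log(shouldRetryReconnect(serverBug)); // true
```

One thing this sketch makes visible: with the allowlist exactly as proposed, "Upstream error (503)" is not retried on initial connect, even though the table marks it as retryable. Adding "upstream error (503)" to `RETRYABLE_ERRORS` would likely close that gap, if retrying it on initial connect is intended.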
Impact
- Users hitting a server bug get an immediate error instead of waiting ~30s for 5 retries
- Capacity errors still retry correctly
- Rolling deploys still retry correctly
- Reconnect behavior unchanged