SDK should only retry on known-retryable bouncer errors during initial connect #107

@eilon-decart

Context

After a series of bouncer error-handling improvements (DecartAI/api PRs #888, #889, #891, #892, #904, #909), the bouncer now sends structured, actionable error messages and close codes to clients. The SDK retry logic should be updated to match.

Current behavior

The SDK retries all errors during initial connect except a hardcoded PERMANENT_ERRORS blocklist ("permission denied", "invalid api key", "401", etc.). This means transient AND permanent server errors are retried identically.

Problem

Errors like "Upstream error (500)" or "Setup failed" are retried 5 times with exponential backoff even though retrying cannot resolve them: they indicate a server bug or misconfiguration, not a transient condition. This wastes ~30s of user time and puts unnecessary load on the server.
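
To make the ~30s figure concrete, here is a minimal sketch of how five exponentially spaced retries add up. The base delay and growth factor are assumptions chosen for illustration (the SDK's actual backoff parameters are not stated in this issue), and `backoffDelays` and `totalWaitMs` are hypothetical helper names.

```typescript
// Hypothetical backoff parameters; the SDK's real values are not stated in this issue.
const BASE_DELAY_MS = 1000;
const FACTOR = 2;
const MAX_RETRIES = 5;

// Delay before retry number `attempt` (0-based): base * factor^attempt.
function backoffDelays(baseMs: number, factor: number, retries: number): number[] {
  return Array.from({ length: retries }, (_, attempt) => baseMs * Math.pow(factor, attempt));
}

// Total time spent waiting between attempts, ignoring the attempts themselves.
function totalWaitMs(delays: number[]): number {
  return delays.reduce((sum, d) => sum + d, 0);
}

const delays = backoffDelays(BASE_DELAY_MS, FACTOR, MAX_RETRIES);
// delays: 1000, 2000, 4000, 8000, 16000 ms; total: 31000 ms
console.log(delays, totalWaitMs(delays));
```

Under these assumed values, a permanent error costs the user about 31 seconds of waiting before the final failure is surfaced.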

Bouncer error messages (after API PRs)

| Bouncer error message | Close code | Meaning | Should retry? |
| --- | --- | --- | --- |
| "Server at capacity. Please try again later." | 1013 | All inference pods full | Yes: capacity is transient |
| "Server is shutting down" | 1001 | Pod rolling update | Yes: other pods available |
| "Upstream error (503)" | 1011 | Inference pod rejected (non-capacity 503) | Yes: may be transient |
| "Upstream error (500)" | 1011 | Inference server bug | No: won't resolve |
| "Setup failed" | 1011 | Generic setup failure | No: won't resolve |
| "Insufficient credits" | 1008 | User has no balance | No: permanent until top-up |
| "Invalid API key" | 1008 | Auth failure | No: already in PERMANENT_ERRORS |
| "Model server disconnected. Please reconnect." | 1012 | Upstream died mid-session | Yes (reconnect path only) |

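One thing the table implies is that the close code alone cannot drive the retry decision, which is why the proposal matches on message text. The sketch below transcribes the rows as data and finds close codes whose rows disagree on retryability. The `BouncerError` type, the `retryOnConnect` field, and `ambiguousCloseCodes` are hypothetical names introduced for illustration, not part of the SDK.

```typescript
// Rows transcribed from the table above; type and function names are hypothetical.
interface BouncerError {
  message: string;
  closeCode: number;
  retryOnConnect: boolean;
}

const BOUNCER_ERRORS: BouncerError[] = [
  { message: "Server at capacity. Please try again later.", closeCode: 1013, retryOnConnect: true },
  { message: "Server is shutting down", closeCode: 1001, retryOnConnect: true },
  { message: "Upstream error (503)", closeCode: 1011, retryOnConnect: true },
  { message: "Upstream error (500)", closeCode: 1011, retryOnConnect: false },
  { message: "Setup failed", closeCode: 1011, retryOnConnect: false },
  { message: "Insufficient credits", closeCode: 1008, retryOnConnect: false },
  { message: "Invalid API key", closeCode: 1008, retryOnConnect: false },
  // Retryable on the reconnect path only, so not on initial connect.
  { message: "Model server disconnected. Please reconnect.", closeCode: 1012, retryOnConnect: false },
];

// Returns the close codes that map to both retryable and non-retryable rows.
function ambiguousCloseCodes(rows: BouncerError[]): number[] {
  const decisionsByCode = new Map<number, Set<boolean>>();
  for (const row of rows) {
    const decisions = decisionsByCode.get(row.closeCode) ?? new Set<boolean>();
    decisions.add(row.retryOnConnect);
    decisionsByCode.set(row.closeCode, decisions);
  }
  return Array.from(decisionsByCode.entries())
    .filter(([, decisions]) => decisions.size > 1)
    .map(([code]) => code);
}

// Only 1011 is ambiguous: it covers "Upstream error (503)" (retry) and
// "Upstream error (500)" / "Setup failed" (do not retry).
console.log(ambiguousCloseCodes(BOUNCER_ERRORS));
```

Because 1011 is shared by retryable and non-retryable errors, a check based on close codes alone would have to treat all three 1011 rows the same way.
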
Proposed change

Initial connect (connect() in webrtc-manager.ts): flip from blocklist to allowlist.

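// Lowercase substrings of error messages known to be transient.
const RETRYABLE_ERRORS = [
  "server at capacity",
  "server is shutting down",
  "websocket timeout",
  "websocket error",
  "websocket closed",
];

// In connect()'s shouldRetry:
shouldRetry: (error) => {
  // Never retry after a deliberate disconnect.
  if (this.intentionalDisconnect) return false;
  const msg = error.message.toLowerCase();
  // Permanent errors win even if the message also matches an allowlist entry.
  if (PERMANENT_ERRORS.some((err) => msg.includes(err))) return false;
  // Anything not explicitly allowlisted fails immediately.
  return RETRYABLE_ERRORS.some((err) => msg.includes(err));
},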

Reconnect (reconnect() in webrtc-manager.ts): keep as-is (blocklist). If a user had a working session, it's worth retrying broadly since the disconnect is likely transient.
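
The difference between the two paths can be sketched as a pair of standalone predicates. This is an illustration under assumptions, not the code in webrtc-manager.ts: `matches`, `shouldRetryConnect`, and `shouldRetryReconnect` are hypothetical names, PERMANENT_ERRORS contains only the entries quoted in this issue, and the `intentionalDisconnect` check is omitted to keep the example self-contained.

```typescript
// PERMANENT_ERRORS lists only the entries quoted in this issue; the real list is longer.
const PERMANENT_ERRORS = ["permission denied", "invalid api key", "401"];
const RETRYABLE_ERRORS = [
  "server at capacity",
  "server is shutting down",
  "websocket timeout",
  "websocket error",
  "websocket closed",
];

// Case-insensitive substring match against a list of lowercase patterns.
const matches = (message: string, patterns: string[]): boolean =>
  patterns.some((pattern) => message.toLowerCase().includes(pattern));

// Initial connect: allowlist. Unknown errors fail fast.
function shouldRetryConnect(error: Error): boolean {
  if (matches(error.message, PERMANENT_ERRORS)) return false;
  return matches(error.message, RETRYABLE_ERRORS);
}

// Reconnect: blocklist. Unknown errors are retried.
function shouldRetryReconnect(error: Error): boolean {
  return !matches(error.message, PERMANENT_ERRORS);
}

const samples = [
  "Server at capacity. Please try again later.", // connect: retry, reconnect: retry
  "Upstream error (500)",                        // connect: fail fast, reconnect: retry
  "Invalid API key",                             // connect: fail fast, reconnect: fail fast
];
for (const text of samples) {
  const err = new Error(text);
  console.log(text, shouldRetryConnect(err), shouldRetryReconnect(err));
}
```

With the allowlist exactly as proposed, "Upstream error (503)" would not be retried on initial connect even though the table marks it as retryable. If retrying it is intended, adding an "upstream error (503)" entry to RETRYABLE_ERRORS would be one way to align the two.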

Impact

  • Users hitting a server bug get an immediate error instead of waiting ~30s for 5 retries
  • Capacity errors still retry correctly
  • Rolling deploys still retry correctly
  • Reconnect behavior unchanged
