110 changes: 101 additions & 9 deletions docs/proxy/prod.md
@@ -84,24 +84,116 @@ We recommend running **1 Uvicorn worker per pod** and scaling out horizontally w
CMD ["--port", "4000", "--config", "./proxy_server_config.yaml", "--num_workers", "1"]
```

> **Optional:** If you observe gradual memory growth under sustained load, consider recycling workers after a fixed number of requests to mitigate leaks.
> You can configure this either via CLI or environment variable:
### Gunicorn vs. Uvicorn: when to use each

| Scenario | Recommendation | Reason |
|---|---|---|
| Kubernetes with HPA | 1 Uvicorn worker per pod (`--num_workers 1`) | Lowest latency, predictable memory footprint, scales horizontally |
| Non-Kubernetes (VM / bare metal) | Gunicorn (`--run_gunicorn --num_workers $(nproc)`) | Gunicorn manages the worker process group; crashed workers are automatically respawned |
| Worker recycling needed (`--max_requests_before_restart`) | Gunicorn | Gunicorn's `max_requests` recycles workers one at a time, avoiding request spikes. Uvicorn's equivalent is less battle-tested. |

### Recycling workers to cap memory growth

If you observe gradual memory growth under sustained load, recycle workers after a fixed number of requests.

```shell
# CLI
CMD ["--port", "4000", "--config", "./proxy_server_config.yaml", "--num_workers", "$(nproc)", "--max_requests_before_restart", "10000"]
# Uvicorn (Kubernetes, 1 worker per pod)
CMD ["--port", "4000", "--config", "./proxy_server_config.yaml", "--num_workers", "1", "--max_requests_before_restart", "10000"]

# or ENV (for deployment manifests / containers)
export MAX_REQUESTS_BEFORE_RESTART=10000
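# Gunicorn (non-Kubernetes, multiple workers per host)
# Note: exec-form CMD is not run through a shell, so "$(nproc)" is passed literally.
# Replace it with an explicit worker count (for example, the host's CPU count).
CMD ["--port", "4000", "--config", "./proxy_server_config.yaml", "--run_gunicorn", "--num_workers", "$(nproc)", "--max_requests_before_restart", "10000"]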
```

> **Tip:** When using `--max_requests_before_restart`, the `--run_gunicorn` flag is more stable and mature as it uses Gunicorn's battle-tested worker recycling mechanism instead of Uvicorn's implementation.
You can also set this via environment variable instead of a CLI flag:

```shell
# Use Gunicorn for more stable worker recycling
CMD ["--port", "4000", "--config", "./proxy_server_config.yaml", "--num_workers", "$(nproc)", "--run_gunicorn", "--max_requests_before_restart", "10000"]
export MAX_REQUESTS_BEFORE_RESTART=10000
```

:::tip Multi-worker thundering herd
On non-Kubernetes hosts where you run multiple Gunicorn workers, all workers that started at roughly the same time will hit their `max_requests` limit at roughly the same time, causing a brief throughput dip. To stagger restarts, Gunicorn supports a `max_requests_jitter` option. Set it with a custom Gunicorn config file (e.g., `max_requests_jitter = 1000`) alongside `--run_gunicorn`. On a single-worker-per-pod Kubernetes deployment this is not an issue.
:::
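
To see why jitter helps, here is a minimal Python sketch of the staggering idea. It assumes Gunicorn's documented behavior of adding a random offset between 0 and `max_requests_jitter` to each worker's limit. The function name `restart_thresholds` and the fixed seed are illustrative only and are not part of Gunicorn or LiteLLM.

```python
import random


def restart_thresholds(num_workers, max_requests, jitter, seed=0):
    """Return the request count at which each worker would be recycled.

    Illustrative model only: each worker gets the base limit plus a random
    offset in [0, jitter], which is the idea behind max_requests_jitter.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [max_requests + rng.randint(0, jitter) for _ in range(num_workers)]


# Without jitter, every worker shares the same limit and restarts together.
no_jitter = restart_thresholds(num_workers=4, max_requests=10000, jitter=0)

# With jitter, the limits are spread across the range 10000..11000.
with_jitter = restart_thresholds(num_workers=4, max_requests=10000, jitter=1000)

print(no_jitter)
print(with_jitter)
```

A jitter of roughly 10% of `max_requests` is a common starting point; the right value depends on how evenly requests are spread across workers.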

### Hitless restarts on Kubernetes

A "hitless" (zero-downtime) restart means no request is dropped when a pod is replaced. Three things need to work together:

**1. Rolling update strategy**

Configure your Deployment so new pods are ready before old ones terminate:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1        # spin up 1 extra pod before terminating any
    maxUnavailable: 0  # never remove a pod until a healthy replacement is ready
```

**2. Probe configuration**

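Kubernetes stops routing traffic to a pod when its readiness probe fails or when the pod starts terminating. Tune the probes so that an unhealthy pod leaves rotation promptly, while the liveness probe stays lenient enough not to restart a pod that is merely busy with long LLM calls.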

```yaml
startupProbe:
  httpGet:
    path: /health/readiness
    port: 4000
  failureThreshold: 120  # 120 × 5 s = 10 min max startup time
  periodSeconds: 5
  timeoutSeconds: 15

readinessProbe:
  httpGet:
    path: /health/readiness
    port: 4000
  initialDelaySeconds: 0
  periodSeconds: 15
  timeoutSeconds: 15
  failureThreshold: 4  # marks unready after ~60 s of DB / cache unavailability

livenessProbe:
  httpGet:
    path: /health/liveliness
    port: 4000
  initialDelaySeconds: 0
  periodSeconds: 35
  timeoutSeconds: 15
  failureThreshold: 3  # kills pod only after ~105 s of liveness failure; set higher (5–10) if your LLM calls are long
```
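
The timing comments in the probe configuration follow from `periodSeconds × failureThreshold`. The sketch below reproduces that arithmetic so you can check your own values. It is an approximation that ignores `timeoutSeconds` and probe scheduling offsets, and the helper name `failure_window_seconds` is hypothetical.

```python
def failure_window_seconds(period_seconds, failure_threshold):
    """Approximate time of continuous failure before Kubernetes acts on a probe."""
    return period_seconds * failure_threshold


# (periodSeconds, failureThreshold) taken from the probe configuration above.
probes = {
    "startup": (5, 120),
    "readiness": (15, 4),
    "liveness": (35, 3),
}

windows = {
    name: failure_window_seconds(period, threshold)
    for name, (period, threshold) in probes.items()
}

for name, seconds in windows.items():
    print(f"{name}: ~{seconds} s")
```

With the values shown, the startup window is about 600 s (10 min), readiness about 60 s, and liveness about 105 s.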

**3. preStop hook and termination grace period**

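When Kubernetes decides to terminate a pod, it marks the pod as terminating, begins removing it from the endpoint slice, and runs the `preStop` hook in parallel. `SIGTERM` is sent only after the hook completes. Endpoint removal takes a few seconds to propagate to the load balancer, so without a `preStop` hook `SIGTERM` would arrive while traffic is still being routed to the pod. The `preStop` sleep bridges that gap.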

```yaml
containers:
  - name: litellm
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "15"]  # wait for endpoint slice propagation before SIGTERM is delivered
terminationGracePeriodSeconds: 60  # must be > preStop sleep + longest expected in-flight request
```

The termination sequence with this configuration:

```
Pod marked for deletion (terminationGracePeriodSeconds countdown of 60 s starts)
        │
        ├─► preStop sleep (15 s): load balancer drains the pod from rotation
        │
        ▼ (SIGTERM delivered after preStop completes)
Uvicorn / Gunicorn begins shutdown
        │
        └─► remaining grace period (60 s - 15 s = ~45 s): in-flight requests finish
                └─► SIGKILL if process has not exited by the deadline
```

:::info Streaming requests
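If you serve long-running streaming requests, set `terminationGracePeriodSeconds` above the preStop sleep plus your p99 request duration. The grace period starts when the pod is marked for deletion, so the preStop sleep counts against it: with a 15 s preStop sleep and a 60 s termination grace period, any stream that needs more than ~45 s after `SIGTERM` will be killed on pod restart.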
:::
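
The ~45 s figure is the grace period minus the preStop sleep. Here is a small sketch of that budget calculation, assuming the preStop sleep counts against `terminationGracePeriodSeconds` as described above. The helper names and the 5 s safety margin are illustrative assumptions, not LiteLLM or Kubernetes settings.

```python
def drain_budget_seconds(termination_grace_period, pre_stop_sleep):
    """Seconds left for in-flight requests after SIGTERM is delivered."""
    budget = termination_grace_period - pre_stop_sleep
    if budget <= 0:
        raise ValueError("terminationGracePeriodSeconds must exceed the preStop sleep")
    return budget


def required_grace_period(pre_stop_sleep, longest_request_seconds, margin_seconds=5):
    """Smallest grace period that lets the longest expected request finish."""
    return pre_stop_sleep + longest_request_seconds + margin_seconds


# Values from the configuration above: 60 s grace period, 15 s preStop sleep.
print(drain_budget_seconds(60, 15))

# Example: a p99 stream duration of 120 s (hypothetical figure).
print(required_grace_period(15, 120))
```

If the second number is larger than your current `terminationGracePeriodSeconds`, streams at the tail of your latency distribution are at risk of being cut off during a rollout.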


## 4. Use Redis 'port','host', 'password'. NOT 'redis_url'
