Lightweight reverse proxy for OpenAI API rate limiting, per-key token budgets, and cost dashboards. Deploy in 5 minutes with Docker.
```
┌──────────┐      ┌──────────────────────────────────────────────┐      ┌──────────┐
│  Client  │────▶│               llm-budget-proxy               │────▶│  OpenAI  │
│          │◀────│                                              │◀────│   API    │
└──────────┘      │ ┌──────┐ ┌───────────┐ ┌────────┐ ┌───────┐│      └──────────┘
                  │ │ Auth │▶│Rate Limit │▶│Budget  │▶│Cache  ││
                  │ └──────┘ └───────────┘ └────────┘ └───────┘│
                  │                                            │
                  │ ┌───────────┐ ┌───────────┐ ┌──────────┐   │
                  │ │ SQLite DB │ │ Dashboard │ │ Webhooks │   │
                  │ └───────────┘ └───────────┘ └──────────┘   │
                  └──────────────────────────────────────────────┘
```
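The Auth → Rate Limit → Budget → Cache flow in the diagram can be sketched as an ordered middleware pipeline. This is an illustrative sketch, not the project's actual code; the stage and field names are assumptions.

```typescript
// Illustrative pipeline: each stage either passes the request on or rejects it.
type Ctx = { key: string; allowed: boolean; rejectedBy?: string };
type Stage = { name: string; run: (ctx: Ctx) => Ctx };

const auth: Stage = {
  name: "auth",
  run: (c) => (c.key.startsWith("lbp_") ? c : { ...c, allowed: false, rejectedBy: "auth" }),
};

// Rate-limit, budget, and cache checks are elided; they follow the same shape.
const passthrough = (name: string): Stage => ({ name, run: (c) => c });

const pipeline: Stage[] = [auth, passthrough("rateLimit"), passthrough("budget"), passthrough("cache")];

function handle(ctx: Ctx): Ctx {
  // A stage only runs if no earlier stage has rejected the request.
  return pipeline.reduce((c, s) => (c.allowed ? s.run(c) : c), ctx);
}

console.log(handle({ key: "lbp_abc", allowed: true }).allowed);      // true
console.log(handle({ key: "sk-wrong", allowed: true }).rejectedBy); // "auth"
```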
```shell
git clone https://github.com/yourusername/llm-budget-proxy.git
cd llm-budget-proxy
cp .env.example .env
# Edit .env — add your OPENAI_API_KEY

# Option A: Docker (recommended)
docker compose up --build

# Option B: Local development
npm install
npm run seed -- alice team-a 10.00   # Create an API key
npm run dev
```

API keys use the `lbp_` prefix and are stored as SHA-256 hashes. The plaintext key is shown once, at creation time.
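The hash-at-rest scheme described above can be sketched as follows. The `issueKey` function and the key length are assumptions for illustration; only the SHA-256 part is stated by this README.

```typescript
import { createHash, randomBytes } from "node:crypto";

// Hypothetical sketch of issuing an lbp_ key: the plaintext is returned once,
// and only its SHA-256 hash would be persisted for later lookups.
function issueKey(): { plaintext: string; hash: string } {
  const plaintext = "lbp_" + randomBytes(24).toString("hex");
  const hash = createHash("sha256").update(plaintext).digest("hex");
  return { plaintext, hash };
}

const { plaintext, hash } = issueKey();
console.log(plaintext.startsWith("lbp_")); // true
console.log(hash.length);                  // 64 (hex-encoded SHA-256)
```

Incoming requests would then be authenticated by hashing the presented key and comparing against the stored hash, so a database leak never exposes usable keys.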
Create a key via the seed script:

```shell
npm run seed -- <name> [team] [daily-budget]
npm run seed -- alice team-a 10.00
```

Create a key via the API:

```shell
curl -X POST http://localhost:3000/api/keys \
  -H "Authorization: Bearer $ADMIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "alice", "team": "team-a", "budgetPeriod": "daily", "budgetLimit": 10.00}'
```

Use a key:
```shell
curl http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer lbp_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

All configuration lives in `config/config.yml`. Environment variables are substituted using `${VAR_NAME}` syntax.
| Key | Type | Default | Description |
|---|---|---|---|
| `server.port` | Port | `3000` | Server port |
| `server.adminKey` | — | required | Admin API key for dashboard and management |
| `provider.apiKey` | — | required | OpenAI API key |
| `rateLimits.default.rpm` | RPM | `60` | Requests per minute per key |
| `rateLimits.default.tpm` | TPM | `100000` | Tokens per minute per key |
| `budgets.defaultDaily` | Budget | `10.00` | Default daily budget in USD |
| `budgets.defaultMonthly` | Budget | `100.00` | Default monthly budget in USD |
| `modelDowngrade.enabled` | Flag | `false` | Enable model downgrade on budget pressure |
| `cache.enabled` | Flag | `true` | Enable exact-match response caching |
| `cache.defaultTtlSeconds` | TTL | `3600` | Cache entry lifetime in seconds |
| `alerts.webhookUrl` | URL | — | Webhook URL for budget alerts |
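The `${VAR_NAME}` substitution can be sketched as a single regex pass over the raw config text before YAML parsing. The function name and the behavior for missing variables are assumptions, not the project's actual loader.

```typescript
// Sketch: replace ${VAR_NAME} tokens with values from an environment map.
// Missing variables become empty strings here; the real loader may error instead.
function substituteEnv(raw: string, env: Record<string, string | undefined>): string {
  return raw.replace(/\$\{([A-Za-z0-9_]+)\}/g, (_, name) => env[name] ?? "");
}

const raw = "provider:\n  apiKey: ${OPENAI_API_KEY}";
console.log(substituteEnv(raw, { OPENAI_API_KEY: "sk-test" }));
// provider:
//   apiKey: sk-test
```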
Match API keys by name pattern:

```yaml
rateLimits:
  overrides:
    - keyPattern: "premium-*"
      rpm: 120
      tpm: 500000
```

Budget alert thresholds escalate as spend approaches the limit:

```yaml
budgets:
  alertThresholds:
    - percent: 80
      action: warn       # X-Budget-Warning header
    - percent: 95
      action: downgrade  # Switch to a cheaper model (if enabled)
    - percent: 100
      action: block      # Reject the request (HTTP 402)
```

Model downgrade is disabled by default. Opt in via config:
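Evaluating those thresholds amounts to picking the highest tier the current spend has crossed. A minimal sketch, mirroring the `alertThresholds` values above (the function name is an assumption):

```typescript
type Action = "warn" | "downgrade" | "block" | "allow";

// Same tiers as the config above, checked from most to least severe.
const thresholds: { percent: number; action: Action }[] = [
  { percent: 100, action: "block" },
  { percent: 95, action: "downgrade" },
  { percent: 80, action: "warn" },
];

function budgetAction(spentUsd: number, limitUsd: number): Action {
  const pct = (spentUsd / limitUsd) * 100;
  // The highest matching threshold wins.
  for (const t of thresholds) if (pct >= t.percent) return t.action;
  return "allow";
}

console.log(budgetAction(8.5, 10)); // "warn"   (85% of budget)
console.log(budgetAction(9.6, 10)); // "downgrade" (96%)
console.log(budgetAction(10, 10));  // "block"  (100%)
```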
```yaml
modelDowngrade:
  enabled: true
  rules:
    - from: "gpt-4o"
      to: "gpt-4o-mini"
```

When triggered, the response includes `X-Model-Downgraded: true` and `X-Original-Model` headers.
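Applying those rules is a simple map lookup gated on budget pressure. A sketch under the assumption that downgrade fires at the 95% threshold described earlier (the function name and signature are illustrative):

```typescript
// Downgrade map mirroring the rules config above.
const downgradeRules: Record<string, string> = { "gpt-4o": "gpt-4o-mini" };

function maybeDowngrade(model: string, budgetPct: number): { model: string; downgraded: boolean } {
  // Only downgrade when budget pressure crosses the threshold and a rule exists.
  if (budgetPct >= 95 && downgradeRules[model]) {
    return { model: downgradeRules[model], downgraded: true };
  }
  return { model, downgraded: false };
}

console.log(maybeDowngrade("gpt-4o", 96)); // { model: "gpt-4o-mini", downgraded: true }
console.log(maybeDowngrade("gpt-4o", 50)); // { model: "gpt-4o", downgraded: false }
```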
Model pricing lives in `config/pricing.yml`. Update this file when OpenAI changes its pricing.

```yaml
version: "2026-03-14"
provider: openai
models:
  gpt-4o:
    inputPer1k: 0.0025
    outputPer1k: 0.01
    cachedInputPer1k: 0.00125
    maxOutputTokens: 16384
```

Every proxied response includes:
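The per-1k prices above translate directly into a cost formula: tokens divided by 1000, times the rate, summed over input and output. A sketch using the `gpt-4o` numbers from the pricing file (the function name is an assumption):

```typescript
// Cost in USD for one request, using the gpt-4o rates above:
// $0.0025 per 1k input tokens, $0.01 per 1k output tokens.
function requestCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * 0.0025 + (outputTokens / 1000) * 0.01;
}

// 1000 input + 500 output tokens → 0.0025 + 0.005 USD.
console.log(requestCost(1000, 500).toFixed(4)); // "0.0075"
```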
| Header | Description |
|---|---|
| `X-Request-Cost` | Actual cost of this request in USD |
| `X-Estimated-Cost` | Pre-flight estimated worst-case cost |
| `X-Input-Tokens` | Input token count |
| `X-Tokens-Used` | Total tokens (input + output) |
| `X-Budget-Remaining` | Remaining budget in USD |
| `X-Budget-Period` | Budget period (`daily` or `monthly`) |
| `X-Budget-Warning` | Set when approaching the budget limit |
| `X-Cache` | `HIT` or `MISS` |
| `X-Model-Downgraded` | `true` if the model was downgraded |
| `X-RateLimit-Limit-RPM` | RPM limit for this key |
| `X-RateLimit-Remaining-RPM` | Remaining requests in the current minute |
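On the client side, these headers can be read straight off the response. A small sketch assuming Node 18+ (global `Headers`); the helper name is illustrative:

```typescript
// Pull budget telemetry out of a proxied response's headers.
// Header names come from the table above; lookups are case-insensitive.
function readBudget(headers: Headers) {
  return {
    cost: Number(headers.get("X-Request-Cost")),
    remaining: Number(headers.get("X-Budget-Remaining")),
    cache: headers.get("X-Cache"),
  };
}

// Simulated response headers for demonstration.
const h = new Headers({
  "X-Request-Cost": "0.0075",
  "X-Budget-Remaining": "9.99",
  "X-Cache": "MISS",
});
console.log(readBudget(h));
```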
Open http://localhost:3000/dashboard and enter your admin API key. Shows:
- Cost by API key (bar chart)
- Cost over time (line chart)
- Budget status (doughnut chart)
- Recent requests table
Configure a webhook URL to receive budget notifications:

```yaml
alerts:
  webhookUrl: "https://hooks.slack.com/services/xxx/yyy/zzz"
  events:
    - budgetWarning
    - budgetExceeded
```

Alerts are debounced: the same event and key combination fires at most once per hour.
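The once-per-hour debounce can be sketched as a timestamp map keyed on the event–key pair. This is an assumption about the mechanism, not the project's actual code:

```typescript
// Last-fired timestamps, keyed by "event:keyName".
const lastSent = new Map<string, number>();
const HOUR_MS = 60 * 60 * 1000;

function shouldSend(event: string, keyName: string, now: number = Date.now()): boolean {
  const id = `${event}:${keyName}`;
  const prev = lastSent.get(id);
  // Suppress if the same event + key fired within the last hour.
  if (prev !== undefined && now - prev < HOUR_MS) return false;
  lastSent.set(id, now);
  return true;
}

console.log(shouldSend("budgetWarning", "alice", 0));       // true  (first fire)
console.log(shouldSend("budgetWarning", "alice", 1000));    // false (within the hour)
console.log(shouldSend("budgetWarning", "alice", HOUR_MS)); // true  (hour elapsed)
```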
LiteLLM (~39k stars) is a mature, full-featured LLMOps platform with 100+ provider integrations, virtual keys, per-key budgets, load balancing, guardrails, and a Postgres-backed dashboard.
llm-budget-proxy is deliberately simpler:
|  | LiteLLM | llm-budget-proxy |
|---|---|---|
| Providers | 100+ | OpenAI only (MVP) |
| Database | Postgres/Redis | SQLite |
| Deployment | Multi-service | Single container |
| Setup time | ~30 min | ~5 min |
| Dashboard | Full admin UI | Single-page Chart.js |
| Use case | Enterprise, multi-provider | Dev/staging, single-provider, learning |
Use LiteLLM when you need enterprise scale, multi-provider support, or a full observability platform.
Use llm-budget-proxy when you want a lightweight, self-contained proxy you can understand, modify, and deploy in minutes.
- Single-instance only — SQLite does not support multi-node deployment. For horizontal scaling, migrate to Postgres or Redis.
- OpenAI only — This MVP proxies OpenAI's `/v1/chat/completions` endpoint. Anthropic support is a documented future extension.
- Estimated cost — Pre-flight cost checks use estimated input tokens plus a worst-case output ceiling. Actual cost is recorded after the response completes.
- No semantic caching — The cache uses exact request-body matching only. Semantic similarity caching requires embeddings and vector search, which is out of scope.
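The estimated-cost limitation can be made concrete: before forwarding a request, the worst case is the estimated input cost plus the cost of the model's full output ceiling. The 4-characters-per-token heuristic and the `gpt-4o` rates below are assumptions for illustration only:

```typescript
// Worst-case pre-flight estimate: estimated input tokens plus the model's
// maximum possible output. Rates are the gpt-4o prices from pricing.yml;
// the chars/4 token estimate is a rough illustrative heuristic.
function estimateWorstCaseUsd(promptChars: number, maxOutputTokens: number): number {
  const inputTokens = Math.ceil(promptChars / 4);
  return (inputTokens / 1000) * 0.0025 + (maxOutputTokens / 1000) * 0.01;
}

// A 4000-char prompt with gpt-4o's 16384-token output ceiling:
console.log(estimateWorstCaseUsd(4000, 16384).toFixed(4)); // "0.1663"
```

The gap between `X-Estimated-Cost` and `X-Request-Cost` is exactly this ceiling: most responses stop well short of `maxOutputTokens`, so the recorded cost is usually far lower.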
```shell
npm install
npm run dev           # Start with hot reload
npm test              # Run tests
npm run test:watch    # Watch mode
npm run build         # Compile TypeScript
```

MIT — AGR Group