Merged
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,13 @@ Every version listed here must correspond to a slice in [`PLAN.md`](./PLAN.md) w

---

## [0.9.6] — 2026-05-28

### Added
- **Load-test harness** (`backend/loadtest/`) — a reusable open-loop load tester for the backend warm `/analyze` path, reporting latency percentiles, error rate, and achieved throughput against pass/fail thresholds. Includes a runbook for a local 100 RPS warm-cache run and for pointing at a deployed target.

---

## [0.9.5] — 2026-05-28

### Security
18 changes: 14 additions & 4 deletions PLAN.md
@@ -44,7 +44,7 @@
| **v0.9.3** | Deletable `/me` history + back-nav loading fix + creator flair | ✅ shipped |
| **v0.9.4** | DB pool size env-tunable + real back-nav spinner fix | ✅ shipped |
| **v0.9.5** | Security review + hardening (OAuth scope ↓ `read:user`, HTTP security headers) | ✅ shipped |
| **v0.9.6** | Load test to 100 RPS | pending |
| **v0.9.6** | Load-test harness (warm /analyze; full 100 RPS run = operator step) | ✅ shipped |
| **v0.9.7** | Privacy policy + terms (legal docs) | pending |
| **v1.0.0** | Public launch | pending |

@@ -670,11 +670,21 @@ The narrative-mode CHECK constraint was a third drift in the same family — the

---

## v0.9.6 — Load test to 100 RPS (deferred)
## v0.9.6 — Load-test harness (shipped 2026-05-28)

**Goal:** Load-test to 100 RPS sustained and verify the error budget holds. Needs a deliberate design: target (prod vs preview vs local), cost ceiling (Vercel Active-CPU pricing), and how to handle the v0.9.2 rate limits (a naive test from one IP just measures 429s — raise limits for the window, test `/health` + a warm-cached path, or use a bypass).
**Goal:** Reusable Python/httpx open-loop load harness for the backend warm `/analyze` path; the full 100 RPS validation run is an operator step (hardware-gated).

**Exit criteria:** TBD when the slice begins.
**Delivered:** `backend/loadtest/run.py` (open-loop dispatcher, p50/p95/p99, error rate, achieved RPS, pass/fail thresholds, ramp), unit-tested stats helpers, and `backend/loadtest/README.md` runbook (local SRH warm-cache setup + deploy target). Local warm-cache uses SRH (Upstash-compatible Redis over Docker) — real Upstash's ~10k/day free tier can't absorb a 100 RPS run. Anonymous load + unset `INTERNAL_PROXY_SECRET` means the analyze limiter skips enforcement, so no bypass is needed.

**Design spec:** [`docs/superpowers/specs/2026-05-28-v0.9.6-load-test-harness-design.md`](./docs/superpowers/specs/2026-05-28-v0.9.6-load-test-harness-design.md).
**Sub-plan:** [`docs/superpowers/plans/2026-05-28-v0.9.6-load-test-harness.md`](./docs/superpowers/plans/2026-05-28-v0.9.6-load-test-harness.md).

**Exit criteria:**
- [x] `loadtest/run.py` + unit-tested stats helpers; ruff clean; backend suite green.
- [x] Runbook complete (local SRH + deploy target).
- [x] Light `/health`-class sanity run passes (ran against `/openapi.json`: 10 RPS × 5 s, 0 errors, p95 6.2 ms, PASS).
- [x] Docs ritual + version bump to 0.9.6; tag + release.
- [ ] Full 100 RPS warm-`/analyze` result recorded — operator step, filled in when run.

---

4 changes: 2 additions & 2 deletions README.md
@@ -43,7 +43,7 @@ Engineering insight first. AI flavor second. Scoring is deterministic and explai

## Status

Pre-alpha. Latest shipped release is **v0.9.5** (a full pre-launch security audit — no high/critical findings — that tightened the GitHub OAuth scope to read-only and added HTTP security headers). v0.9.4 before it made the DB connection pool size env-tunable and genuinely fixed the back-nav search spinner; v0.9.3 added deletable `/me` history with undo, a golden "creator" scorecard for the project's creator account, and a first (incomplete) attempt at the back-nav spinner fix. Live at https://skill-issue-tau.vercel.app — GitHub OAuth sign-in, Neon Postgres persistence, `/me` history, opt-in `/share/[slug]` public links. The AI narrative layer (Roast + Mentor) runs on **Groq** (`llama-3.3-70b-versatile`). v0.7.0 added Upstash Redis caching (warm `/analyze` ≤ 200 ms); v0.7.2 prod-certified the perf budget (CLS 0.080 → **0** structurally, perf 90 → 94, LCP 2,804 → 2,773 ms); v0.8.0 shipped Sentry (FE+BE), PostHog (events + web vitals), structlog JSON logging, on-voice 404, and a full axe a11y pass; v0.8.1 ships the nightly cron with bearer auth; v0.8.2 pairs it with the manual force-refresh button on `/me`; v0.8.3 hotfixes the empty-repo crash; v0.8.4 fixes the silent narrative misattribution; v0.8.5 closes the post-deploy-Sentry loop with a pre-merge CI gate; v0.8.6 closes v0.7.1's deferred share-page caching; v0.8.7 modernizes project config; v0.9.0 opens Beta hardening with bounded GH fan-out; v0.9.1 closes the /me N+1 + adds per-namespace Report cache versioning; v0.9.2 adds rate limiting (per-IP for anonymous, higher per-user caps for signed-in) on `/analyze` and `/narrative`; v0.9.3 adds deletable `/me` history with undo, attempts the back-nav search-spinner fix, and gilds the creator's scorecard. 
v0.9.4 makes the DB connection pool size env-tunable (defaults unchanged — RUM showed no pool exhaustion) and lands the real back-nav spinner fix (the v0.9.3 attempt addressed the wrong mechanism); v0.9.5 runs a full pre-launch security audit (no high/critical findings), tightens the OAuth scope to `read:user`, and adds HTTP security headers. **v0.9.6 — load test to 100 RPS** is next. See [`CHANGELOG.md`](./CHANGELOG.md) for shipped slices, [`PLAN.md`](./PLAN.md) for the full roadmap, and [`docs/PROGRESS_LOG.md`](./docs/PROGRESS_LOG.md) for the most recent session handoff.
Pre-alpha. Latest shipped release is **v0.9.6** (a reusable load-test harness for the warm `/analyze` path; the full 100 RPS run is an operator step). v0.9.5 before it ran a full pre-launch security audit — no high/critical findings — tightening the GitHub OAuth scope to read-only and adding HTTP security headers; v0.9.4 made the DB connection pool size env-tunable and genuinely fixed the back-nav search spinner; v0.9.3 added deletable `/me` history with undo, a golden "creator" scorecard for the project's creator account, and a first (incomplete) attempt at the back-nav spinner fix. Live at https://skill-issue-tau.vercel.app — GitHub OAuth sign-in, Neon Postgres persistence, `/me` history, opt-in `/share/[slug]` public links. The AI narrative layer (Roast + Mentor) runs on **Groq** (`llama-3.3-70b-versatile`). v0.7.0 added Upstash Redis caching (warm `/analyze` ≤ 200 ms); v0.7.2 prod-certified the perf budget (CLS 0.080 → **0** structurally, perf 90 → 94, LCP 2,804 → 2,773 ms); v0.8.0 shipped Sentry (FE+BE), PostHog (events + web vitals), structlog JSON logging, on-voice 404, and a full axe a11y pass; v0.8.1 ships the nightly cron with bearer auth; v0.8.2 pairs it with the manual force-refresh button on `/me`; v0.8.3 hotfixes the empty-repo crash; v0.8.4 fixes the silent narrative misattribution; v0.8.5 closes the post-deploy-Sentry loop with a pre-merge CI gate; v0.8.6 closes v0.7.1's deferred share-page caching; v0.8.7 modernizes project config; v0.9.0 opens Beta hardening with bounded GH fan-out; v0.9.1 closes the /me N+1 + adds per-namespace Report cache versioning; v0.9.2 adds rate limiting (per-IP for anonymous, higher per-user caps for signed-in) on `/analyze` and `/narrative`; v0.9.3 adds deletable `/me` history with undo, attempts the back-nav search-spinner fix, and gilds the creator's scorecard. 
v0.9.4 makes the DB connection pool size env-tunable (defaults unchanged — RUM showed no pool exhaustion) and lands the real back-nav spinner fix (the v0.9.3 attempt addressed the wrong mechanism); v0.9.5 runs a full pre-launch security audit (no high/critical findings), tightens the OAuth scope to `read:user`, and adds HTTP security headers; v0.9.6 adds a reusable load-test harness for the warm `/analyze` path (the full 100 RPS run is an operator step). **v0.9.7 — privacy policy + terms** is next. See [`CHANGELOG.md`](./CHANGELOG.md) for shipped slices, [`PLAN.md`](./PLAN.md) for the full roadmap, and [`docs/PROGRESS_LOG.md`](./docs/PROGRESS_LOG.md) for the most recent session handoff.

---

@@ -76,7 +76,7 @@ cp .env.example .env # then edit .env and add your GITHUB_TOKEN and OPENA
uv run uvicorn app.main:app --reload --port 8000
```

Verify: `curl http://localhost:8000/health` → `{"status":"ok","version":"0.9.5","db":"up"|"down","cache":"up"|"down"|"unconfigured"}`. The `db` field reports DB reachability when `DATABASE_URL` is configured; the `cache` field reports Upstash reachability (`unconfigured` when `UPSTASH_REDIS_REST_URL` isn't set — perfectly fine for local dev, the in-process fallback covers it).
Verify: `curl http://localhost:8000/health` → `{"status":"ok","version":"0.9.6","db":"up"|"down","cache":"up"|"down"|"unconfigured"}`. The `db` field reports DB reachability when `DATABASE_URL` is configured; the `cache` field reports Upstash reachability (`unconfigured` when `UPSTASH_REDIS_REST_URL` isn't set — perfectly fine for local dev, the in-process fallback covers it).
Hit the analyzer: `curl http://localhost:8000/analyze/octocat`.

### Frontend (`:3000`)
2 changes: 1 addition & 1 deletion backend/app/settings.py
@@ -2,7 +2,7 @@

from pydantic_settings import BaseSettings, SettingsConfigDict

VERSION = "0.9.5"
VERSION = "0.9.6"


class Settings(BaseSettings):
92 changes: 92 additions & 0 deletions backend/loadtest/README.md
@@ -0,0 +1,92 @@
# Load test harness

Open-loop load tester for the backend. Drives a target endpoint at a fixed RPS
and reports p50/p95/p99 latency, error rate, achieved throughput, and PASS/FAIL
against thresholds. Run from `backend/`:

```bash
uv run python loadtest/run.py --help
```

> **Windows / Git Bash note:** a bare `--path /health` argument gets mangled by
> MSYS into a Windows path (corrupting the URL). Prefix the command with
> `MSYS_NO_PATHCONV=1`, or run it from PowerShell, or use `--path=//health`.
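
For readers unfamiliar with the term "open-loop" used above, here is a minimal, idealized sketch of the difference from a closed-loop tester. The helper names `tick_offsets` and `closed_loop_sends` are hypothetical and are not part of `run.py`. The real scheduler in `run_stage` also applies an in-flight cap and may differ by a tick at the stage boundary.

```python
def tick_offsets(rps: float, duration: float) -> list[float]:
    """Idealized open-loop schedule: offsets (seconds from stage start) of each send.

    Tick i fires at i / rps no matter how long earlier requests take.
    """
    if rps <= 0:
        raise ValueError("rps must be positive")
    offsets: list[float] = []
    i = 0
    while i / rps < duration:
        offsets.append(i / rps)
        i += 1
    return offsets


def closed_loop_sends(latency_s: float, duration: float) -> int:
    """Closed-loop contrast: one worker that waits for each reply before sending again."""
    return int(duration / latency_s)


# Open-loop at 10 RPS for 1 s schedules 10 sends regardless of server latency.
open_loop_count = len(tick_offsets(10, 1.0))

# A single closed-loop worker facing 250 ms latency only manages 4 sends in 1 s,
# so a slow server silently lowers the offered load.
closed_loop_count = closed_loop_sends(0.25, 1.0)

print(open_loop_count, closed_loop_count)
```

The open-loop design keeps the offered load constant, so a slow server shows up as latency growth and dropped ticks rather than as a quietly reduced request rate.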

## Quick sanity check (no Docker, no cache)

Start the backend, then hit a cheap endpoint to confirm the harness works.
Locally there's usually no `DATABASE_URL`, so `/health` blocks ~20 s per request
on a doomed DB ping — use `/openapi.json` instead for a clean check:

```bash
uv run uvicorn app.main:app --port 8000 # terminal 1
uv run python loadtest/run.py --target http://localhost:8000 --path /openapi.json \
--rps 10 --duration 5 --warmup 1 --p95-ms 1000 # terminal 2
```

Expect `errors=0` and `RESULT: PASS`.

## Full warm-`/analyze` 100 RPS run (local)

The warm path needs a populated Report cache. `get_cache()` returns `None`
without Upstash, and real Upstash's free tier (~10k commands/day) can't absorb a
100 RPS run — so use a **local** Upstash-compatible Redis via SRH.
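
A rough back-of-the-envelope check of that claim, as a sketch. It assumes each warm request costs at least one cache command, which is an assumption for illustration and not a measured figure. The ~10k/day budget is the approximate number quoted above.

```python
FREE_TIER_COMMANDS_PER_DAY = 10_000  # approximate figure quoted in this README


def requests_in_run(stages_rps: list[float], duration_s: float) -> int:
    """Total requests offered when each stage runs for duration_s seconds."""
    return round(sum(rps * duration_s for rps in stages_rps))


# Step 3 below: a single 100 RPS stage for 60 s.
single_run = requests_in_run([100], 60)

# Step 4 below: the 50:100:200:400 ramp at 30 s per stage.
ramp_run = requests_in_run([50, 100, 200, 400], 30)

# Assumption: one cache command per request (the real count may be higher).
print(f"single run: {single_run} requests "
      f"({single_run / FREE_TIER_COMMANDS_PER_DAY:.0%} of the daily budget)")
print(f"ramp run:   {ramp_run} requests "
      f"({ramp_run / FREE_TIER_COMMANDS_PER_DAY:.0%} of the daily budget)")
```

Under that assumption a single 60 s run already uses most of the daily budget, and the ramp exceeds it on its own, so repeated runs are only practical against a local Redis.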

1. **Start a local Upstash-compatible Redis (SRH over Redis):**

```bash
docker run -d --name si-redis -p 6379:6379 redis:7
docker run -d --name si-srh -p 8079:80 \
-e SRH_MODE=env -e SRH_TOKEN=local-token \
-e SRH_CONNECTION_STRING="redis://host.docker.internal:6379" \
hiett/serverless-redis-http:latest
```

2. **Start the backend pointed at SRH, with a real GitHub token, and the proxy
secret UNSET** (so the analyze limiter skips anonymous enforcement):

```bash
UPSTASH_REDIS_REST_URL=http://localhost:8079 \
UPSTASH_REDIS_REST_TOKEN=local-token \
GITHUB_TOKEN=<your_token> \
uv run uvicorn app.main:app --port 8000
```
(Ensure `INTERNAL_PROXY_SECRET` is **not** set in the environment, and send no
session cookie — the harness is anonymous by default.)

3. **Run the load test** (the `--warmup` request cold-ingests once to prime the
cache; the timed run is then pure cache hits):

```bash
MSYS_NO_PATHCONV=1 uv run python loadtest/run.py \
--target http://localhost:8000 --path /analyze/octocat \
--rps 100 --duration 60 --warmup 1
```

4. **Find the knee** with a ramp:

```bash
MSYS_NO_PATHCONV=1 uv run python loadtest/run.py --path /analyze/octocat \
--ramp 50:100:200:400 --duration 30 --warmup 1
```
Record the highest RPS stage that still PASSes (error rate ≤ 1%, achieved
RPS ≥ 95% of target, p95 at or under `--p95-ms`).

5. **Tear down:** `docker rm -f si-srh si-redis`.
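
The knee-finding rule from step 4 can be sketched as follows. `StageOutcome`, `stage_passes`, and `find_knee` are hypothetical names for illustration, and the stage numbers are made up, not measured results. The comparisons mirror those in `evaluate_thresholds` in `run.py`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StageOutcome:
    target_rps: float
    error_rate: float    # fraction, e.g. 0.004 == 0.4%
    achieved_rps: float
    p95_ms: float


def stage_passes(
    s: StageOutcome, *, max_error_rate: float = 0.01, p95_ms: float = 250.0
) -> bool:
    """A stage passes when all three thresholds hold."""
    return (
        s.error_rate <= max_error_rate
        and s.achieved_rps >= 0.95 * s.target_rps
        and s.p95_ms <= p95_ms
    )


def find_knee(
    stages: list[StageOutcome], *, max_error_rate: float = 0.01, p95_ms: float = 250.0
) -> float | None:
    """Highest target RPS whose stage still passes, or None if none pass."""
    passing = [
        s.target_rps
        for s in stages
        if stage_passes(s, max_error_rate=max_error_rate, p95_ms=p95_ms)
    ]
    return max(passing) if passing else None


# Illustrative (invented) outcomes for a 50:100:200:400 ramp.
example = [
    StageOutcome(50, 0.0, 50.0, 40.0),
    StageOutcome(100, 0.0, 99.8, 55.0),
    StageOutcome(200, 0.004, 197.0, 180.0),
    StageOutcome(400, 0.03, 310.0, 900.0),  # fails on all three thresholds
]
print(find_knee(example))
```

With these invented numbers the 400 RPS stage fails every threshold, so the recorded knee would be 200 RPS.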

## Pointing at a deployed target

```bash
uv run python loadtest/run.py --target https://<host>/_/backend --path /analyze/octocat --rps 100 --duration 30
```
Mind the cost (Vercel Active-CPU) and rate limits: a deployed backend with
`INTERNAL_PROXY_SECRET` set WILL rate-limit anonymous `/analyze` — sign in or
raise the limits for the window. Keep deployed runs short.

## Thresholds (PASS/FAIL)

- error rate `<= --max-error-rate` (default 1%)
- achieved RPS `>= 95%` of `--rps`
- p95 latency `<= --p95-ms` (default 250 ms; tune from the warm baseline)

Exit code is 0 on PASS, non-zero on FAIL.
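
Because the result is carried by the exit code, the harness can gate a script or CI step. The sketch below is one possible wrapper, not part of this repo: `build_cmd` and `run_gate` are hypothetical helpers, and only flags documented above are used. It assumes it is invoked from `backend/` with `uv` on the PATH.

```python
import subprocess


def build_cmd(
    target: str,
    path: str,
    *,
    rps: float,
    duration: float,
    p95_ms: float = 250.0,
    max_error_rate: float = 0.01,
) -> list[str]:
    """Assemble the harness command line from the documented flags."""
    return [
        "uv", "run", "python", "loadtest/run.py",
        "--target", target,
        "--path", path,
        "--rps", str(rps),
        "--duration", str(duration),
        "--p95-ms", str(p95_ms),
        "--max-error-rate", str(max_error_rate),
    ]


def run_gate(cmd: list[str]) -> bool:
    """Run the harness; True means PASS (exit code 0), False means FAIL."""
    return subprocess.run(cmd, check=False).returncode == 0


cmd = build_cmd("http://localhost:8000", "/analyze/octocat", rps=100, duration=60)
print(" ".join(cmd))
# To actually gate on the result (requires a running backend):
#     raise SystemExit(0 if run_gate(cmd) else 1)
```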
Empty file added backend/loadtest/__init__.py
Empty file.
212 changes: 212 additions & 0 deletions backend/loadtest/run.py
@@ -0,0 +1,212 @@
"""Open-loop load-test harness for the Skill Issue backend.

Drives a target endpoint at a fixed request rate and reports latency
percentiles, error rate, and achieved throughput against pass/fail
thresholds. See backend/loadtest/README.md for the local warm-/analyze
runbook. Run: uv run python loadtest/run.py --help
"""

from __future__ import annotations

import argparse
import asyncio
import contextlib
import sys
import time
from dataclasses import dataclass

import httpx


@dataclass
class Result:
    latency_ms: float
    status: int | None  # None = connection error / timeout

@dataclass
class Summary:
    sent: int
    completed: int
    dropped: int
    error_count: int
    error_rate: float
    achieved_rps: float
    p50_ms: float
    p95_ms: float
    p99_ms: float
    errors_by_status: dict[str, int]
    duration_s: float


def percentile(values: list[float], p: float) -> float:
    """Linear-interpolated p-th percentile (p in [0, 100]). 0.0 for empty input."""
    if not values:
        return 0.0
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    k = (len(s) - 1) * (p / 100.0)
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    frac = k - lo
    return round(s[lo] * (1 - frac) + s[hi] * frac, 10)


def summarize(results: list[Result], *, dropped: int, wall_seconds: float) -> Summary:
    completed = len(results)
    latencies = [r.latency_ms for r in results]
    errors_by_status: dict[str, int] = {}
    error_count = 0
    for r in results:
        if r.status is None or r.status >= 400:
            key = "connection_error" if r.status is None else str(r.status)
            errors_by_status[key] = errors_by_status.get(key, 0) + 1
            error_count += 1
    error_rate = (error_count / completed) if completed else 1.0
    achieved_rps = (completed / wall_seconds) if wall_seconds > 0 else 0.0
    return Summary(
        sent=completed + dropped,
        completed=completed,
        dropped=dropped,
        error_count=error_count,
        error_rate=error_rate,
        achieved_rps=achieved_rps,
        p50_ms=percentile(latencies, 50),
        p95_ms=percentile(latencies, 95),
        p99_ms=percentile(latencies, 99),
        errors_by_status=errors_by_status,
        duration_s=wall_seconds,
    )


def evaluate_thresholds(
    summary: Summary, *, target_rps: float, max_error_rate: float, p95_ms: float
) -> tuple[bool, list[str]]:
    failures: list[str] = []
    if summary.error_rate > max_error_rate:
        failures.append(f"error rate {summary.error_rate:.3%} > {max_error_rate:.3%}")
    if summary.achieved_rps < target_rps * 0.95:
        failures.append(f"achieved RPS {summary.achieved_rps:.1f} < 95% of target {target_rps:.0f}")
    if summary.p95_ms > p95_ms:
        failures.append(f"p95 {summary.p95_ms:.1f}ms > {p95_ms:.1f}ms")
    return (not failures, failures)


async def _one_request(client: httpx.AsyncClient, url: str, results: list[Result]) -> None:
    t0 = time.perf_counter()
    try:
        resp = await client.get(url)
        status: int | None = resp.status_code
    except (httpx.HTTPError, OSError):
        status = None
    results.append(Result(latency_ms=(time.perf_counter() - t0) * 1000.0, status=status))


async def run_stage(
    client: httpx.AsyncClient,
    url: str,
    *,
    rps: float,
    duration: float,
    max_inflight: int,
) -> tuple[list[Result], int]:
    """Open-loop: schedule requests at a fixed rate for `duration` seconds.

    Returns (results, dropped). `dropped` counts ticks skipped because
    `max_inflight` was saturated — a "server can't keep up" signal. The
    scheduler never blocks on in-flight requests, so a slow server shows up as
    dropped ticks + latency growth rather than self-throttled load.
    """
    results: list[Result] = []
    tasks: set[asyncio.Task[None]] = set()
    inflight = 0
    dropped = 0
    interval = 1.0 / rps
    loop = asyncio.get_running_loop()

    def _done(t: asyncio.Task[None]) -> None:
        nonlocal inflight
        inflight -= 1
        tasks.discard(t)

    start = loop.time()
    i = 0
    while loop.time() - start < duration:
        delay = (start + i * interval) - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        i += 1
        if inflight >= max_inflight:
            dropped += 1
            continue
        inflight += 1
        task = asyncio.create_task(_one_request(client, url, results))
        task.add_done_callback(_done)
        tasks.add(task)

    if tasks:
        await asyncio.gather(*tasks, return_exceptions=True)
    return results, dropped


def _parse_ramp(ramp: str) -> list[float]:
    """'10:50:100' -> [10.0, 50.0, 100.0]."""
    return [float(part) for part in ramp.split(":") if part]


def _print_summary(url: str, rps: float, s: Summary, ok: bool, failures: list[str]) -> None:
    print(f"\n=== {url} @ target {rps:.0f} RPS for {s.duration_s:.1f}s ===")
    print(f" sent={s.sent} completed={s.completed} dropped={s.dropped}")
    print(f" achieved_rps={s.achieved_rps:.1f}")
    print(f" errors={s.error_count} ({s.error_rate:.3%}) {s.errors_by_status or ''}")
    print(f" latency p50={s.p50_ms:.1f}ms p95={s.p95_ms:.1f}ms p99={s.p99_ms:.1f}ms")
    print(f" RESULT: {'PASS' if ok else 'FAIL'}")
    for f in failures:
        print(f" - {f}")


async def _amain(args: argparse.Namespace) -> int:
    url = args.target.rstrip("/") + args.path
    cap = args.max_inflight + 50
    limits = httpx.Limits(max_connections=cap, max_keepalive_connections=cap)
    overall_ok = True
    async with httpx.AsyncClient(timeout=args.timeout, limits=limits) as client:
        for _ in range(args.warmup):
            with contextlib.suppress(httpx.HTTPError, OSError):
                await client.get(url)
        stages = _parse_ramp(args.ramp) if args.ramp else [args.rps]
        for rps in stages:
            wall0 = time.perf_counter()
            results, dropped = await run_stage(
                client, url, rps=rps, duration=args.duration, max_inflight=args.max_inflight
            )
            summary = summarize(results, dropped=dropped, wall_seconds=time.perf_counter() - wall0)
            ok, failures = evaluate_thresholds(
                summary,
                target_rps=rps,
                max_error_rate=args.max_error_rate,
                p95_ms=args.p95_ms,
            )
            _print_summary(url, rps, summary, ok, failures)
            overall_ok = overall_ok and ok
    return 0 if overall_ok else 1


def main() -> None:
    p = argparse.ArgumentParser(description="Open-loop load tester for the Skill Issue backend.")
    p.add_argument("--target", default="http://localhost:8000")
    p.add_argument("--path", default="/analyze/octocat")
    p.add_argument("--rps", type=float, default=100.0)
    p.add_argument("--duration", type=float, default=60.0)
    p.add_argument("--warmup", type=int, default=1)
    p.add_argument("--ramp", default=None, help="colon-separated RPS stages, e.g. 10:50:100")
    p.add_argument("--max-inflight", type=int, default=500, dest="max_inflight")
    p.add_argument("--timeout", type=float, default=20.0)
    p.add_argument("--p95-ms", type=float, default=250.0, dest="p95_ms")
    p.add_argument("--max-error-rate", type=float, default=0.01, dest="max_error_rate")
    sys.exit(asyncio.run(_amain(p.parse_args())))


if __name__ == "__main__":
    main()
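
The PR mentions unit-tested stats helpers, but the tests are not shown in this diff. As a hedged illustration of what `percentile` computes, the sketch below copies the function from `run.py` so it runs on its own and works through a four-sample input by hand. The sample values are invented for the example and are not measurements.

```python
def percentile(values: list[float], p: float) -> float:
    """Copy of the helper in run.py: linear-interpolated p-th percentile."""
    if not values:
        return 0.0
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    k = (len(s) - 1) * (p / 100.0)
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    frac = k - lo
    return round(s[lo] * (1 - frac) + s[hi] * frac, 10)


samples = [10.0, 20.0, 30.0, 40.0]  # illustrative latencies in ms

# p50: k = 3 * 0.50 = 1.5, halfway between s[1]=20 and s[2]=30.
p50 = percentile(samples, 50)

# p95: k = 3 * 0.95 = 2.85, 85% of the way from s[2]=30 to s[3]=40.
p95 = percentile(samples, 95)

print(p50, p95)  # 25.0 38.5
```

With few samples the interpolated p95 sits close to the maximum, which is one reason to prefer longer runs when reading tail percentiles.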