diff --git a/CHANGELOG.md b/CHANGELOG.md index f06fd58..ca90670 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,13 @@ Every version listed here must correspond to a slice in [`PLAN.md`](./PLAN.md) w --- +## [0.9.6] — 2026-05-28 + +### Added +- **Load-test harness** (`backend/loadtest/`) — a reusable open-loop load tester for the backend warm `/analyze` path, reporting latency percentiles, error rate, and achieved throughput against pass/fail thresholds. Includes a runbook for a local 100 RPS warm-cache run and for pointing at a deployed target. + +--- + ## [0.9.5] — 2026-05-28 ### Security diff --git a/PLAN.md b/PLAN.md index e36138a..e903498 100644 --- a/PLAN.md +++ b/PLAN.md @@ -44,7 +44,7 @@ | **v0.9.3** | Deletable `/me` history + back-nav loading fix + creator flair | ✅ shipped | | **v0.9.4** | DB pool size env-tunable + real back-nav spinner fix | ✅ shipped | | **v0.9.5** | Security review + hardening (OAuth scope ↓ `read:user`, HTTP security headers) | ✅ shipped | -| **v0.9.6** | Load test to 100 RPS | pending | +| **v0.9.6** | Load-test harness (warm /analyze; full 100 RPS run = operator step) | ✅ shipped | | **v0.9.7** | Privacy policy + terms (legal docs) | pending | | **v1.0.0** | Public launch | pending | @@ -670,11 +670,21 @@ The narrative-mode CHECK constraint was a third drift in the same family — the --- -## v0.9.6 — Load test to 100 RPS (deferred) +## v0.9.6 — Load-test harness (shipped 2026-05-28) -**Goal:** Load-test to 100 RPS sustained and verify the error budget holds. Needs a deliberate design: target (prod vs preview vs local), cost ceiling (Vercel Active-CPU pricing), and how to handle the v0.9.2 rate limits (a naive test from one IP just measures 429s — raise limits for the window, test `/health` + a warm-cached path, or use a bypass). +**Goal:** Reusable Python/httpx open-loop load harness for the backend warm `/analyze` path; the full 100 RPS validation run is an operator step (hardware-gated). 
-**Exit criteria:** TBD when the slice begins. +**Delivered:** `backend/loadtest/run.py` (open-loop dispatcher, p50/p95/p99, error rate, achieved RPS, pass/fail thresholds, ramp), unit-tested stats helpers, and `backend/loadtest/README.md` runbook (local SRH warm-cache setup + deploy target). Local warm-cache uses SRH (Upstash-compatible Redis over Docker) — real Upstash's ~10k/day free tier can't absorb a 100 RPS run. Anonymous load + unset `INTERNAL_PROXY_SECRET` means the analyze limiter skips enforcement, so no bypass is needed. + +**Design spec:** [`docs/superpowers/specs/2026-05-28-v0.9.6-load-test-harness-design.md`](./docs/superpowers/specs/2026-05-28-v0.9.6-load-test-harness-design.md). +**Sub-plan:** [`docs/superpowers/plans/2026-05-28-v0.9.6-load-test-harness.md`](./docs/superpowers/plans/2026-05-28-v0.9.6-load-test-harness.md). + +**Exit criteria:** +- [x] `loadtest/run.py` + unit-tested stats helpers; ruff clean; backend suite green. +- [x] Runbook complete (local SRH + deploy target). +- [x] Light `/health`-class sanity run passes (ran against `/openapi.json`: 10 RPS × 5 s, 0 errors, p95 6.2 ms, PASS). +- [x] Docs ritual + version bump to 0.9.6; tag + release. +- [ ] Full 100 RPS warm-`/analyze` result recorded — operator step, filled in when run. --- diff --git a/README.md b/README.md index af75a36..e1aae9c 100644 --- a/README.md +++ b/README.md @@ -43,7 +43,7 @@ Engineering insight first. AI flavor second. Scoring is deterministic and explai ## Status -Pre-alpha. Latest shipped release is **v0.9.5** (a full pre-launch security audit — no high/critical findings — that tightened the GitHub OAuth scope to read-only and added HTTP security headers). v0.9.4 before it made the DB connection pool size env-tunable and genuinely fixed the back-nav search spinner; v0.9.3 added deletable `/me` history with undo, a golden "creator" scorecard for the project's creator account, and a first (incomplete) attempt at the back-nav spinner fix. 
Live at https://skill-issue-tau.vercel.app — GitHub OAuth sign-in, Neon Postgres persistence, `/me` history, opt-in `/share/[slug]` public links. The AI narrative layer (Roast + Mentor) runs on **Groq** (`llama-3.3-70b-versatile`). v0.7.0 added Upstash Redis caching (warm `/analyze` ≤ 200 ms); v0.7.2 prod-certified the perf budget (CLS 0.080 → **0** structurally, perf 90 → 94, LCP 2,804 → 2,773 ms); v0.8.0 shipped Sentry (FE+BE), PostHog (events + web vitals), structlog JSON logging, on-voice 404, and a full axe a11y pass; v0.8.1 ships the nightly cron with bearer auth; v0.8.2 pairs it with the manual force-refresh button on `/me`; v0.8.3 hotfixes the empty-repo crash; v0.8.4 fixes the silent narrative misattribution; v0.8.5 closes the post-deploy-Sentry loop with a pre-merge CI gate; v0.8.6 closes v0.7.1's deferred share-page caching; v0.8.7 modernizes project config; v0.9.0 opens Beta hardening with bounded GH fan-out; v0.9.1 closes the /me N+1 + adds per-namespace Report cache versioning; v0.9.2 adds rate limiting (per-IP for anonymous, higher per-user caps for signed-in) on `/analyze` and `/narrative`; v0.9.3 adds deletable `/me` history with undo, attempts the back-nav search-spinner fix, and gilds the creator's scorecard. v0.9.4 makes the DB connection pool size env-tunable (defaults unchanged — RUM showed no pool exhaustion) and lands the real back-nav spinner fix (the v0.9.3 attempt addressed the wrong mechanism); v0.9.5 runs a full pre-launch security audit (no high/critical findings), tightens the OAuth scope to `read:user`, and adds HTTP security headers. **v0.9.6 — load test to 100 RPS** is next. See [`CHANGELOG.md`](./CHANGELOG.md) for shipped slices, [`PLAN.md`](./PLAN.md) for the full roadmap, and [`docs/PROGRESS_LOG.md`](./docs/PROGRESS_LOG.md) for the most recent session handoff. +Pre-alpha. Latest shipped release is **v0.9.6** (a reusable load-test harness for the warm `/analyze` path; the full 100 RPS run is an operator step). 
v0.9.5 before it ran a full pre-launch security audit — no high/critical findings — tightening the GitHub OAuth scope to read-only and adding HTTP security headers; v0.9.4 made the DB connection pool size env-tunable and genuinely fixed the back-nav search spinner; v0.9.3 added deletable `/me` history with undo, a golden "creator" scorecard for the project's creator account, and a first (incomplete) attempt at the back-nav spinner fix. Live at https://skill-issue-tau.vercel.app — GitHub OAuth sign-in, Neon Postgres persistence, `/me` history, opt-in `/share/[slug]` public links. The AI narrative layer (Roast + Mentor) runs on **Groq** (`llama-3.3-70b-versatile`). v0.7.0 added Upstash Redis caching (warm `/analyze` ≤ 200 ms); v0.7.2 prod-certified the perf budget (CLS 0.080 → **0** structurally, perf 90 → 94, LCP 2,804 → 2,773 ms); v0.8.0 shipped Sentry (FE+BE), PostHog (events + web vitals), structlog JSON logging, on-voice 404, and a full axe a11y pass; v0.8.1 ships the nightly cron with bearer auth; v0.8.2 pairs it with the manual force-refresh button on `/me`; v0.8.3 hotfixes the empty-repo crash; v0.8.4 fixes the silent narrative misattribution; v0.8.5 closes the post-deploy-Sentry loop with a pre-merge CI gate; v0.8.6 closes v0.7.1's deferred share-page caching; v0.8.7 modernizes project config; v0.9.0 opens Beta hardening with bounded GH fan-out; v0.9.1 closes the /me N+1 + adds per-namespace Report cache versioning; v0.9.2 adds rate limiting (per-IP for anonymous, higher per-user caps for signed-in) on `/analyze` and `/narrative`; v0.9.3 adds deletable `/me` history with undo, attempts the back-nav search-spinner fix, and gilds the creator's scorecard. 
v0.9.4 makes the DB connection pool size env-tunable (defaults unchanged — RUM showed no pool exhaustion) and lands the real back-nav spinner fix (the v0.9.3 attempt addressed the wrong mechanism); v0.9.5 runs a full pre-launch security audit (no high/critical findings), tightens the OAuth scope to `read:user`, and adds HTTP security headers; v0.9.6 adds a reusable load-test harness for the warm `/analyze` path (the full 100 RPS run is an operator step). **v0.9.7 — privacy policy + terms** is next. See [`CHANGELOG.md`](./CHANGELOG.md) for shipped slices, [`PLAN.md`](./PLAN.md) for the full roadmap, and [`docs/PROGRESS_LOG.md`](./docs/PROGRESS_LOG.md) for the most recent session handoff. --- @@ -76,7 +76,7 @@ cp .env.example .env # then edit .env and add your GITHUB_TOKEN and OPENA uv run uvicorn app.main:app --reload --port 8000 ``` -Verify: `curl http://localhost:8000/health` → `{"status":"ok","version":"0.9.5","db":"up"|"down","cache":"up"|"down"|"unconfigured"}`. The `db` field reports DB reachability when `DATABASE_URL` is configured; the `cache` field reports Upstash reachability (`unconfigured` when `UPSTASH_REDIS_REST_URL` isn't set — perfectly fine for local dev, the in-process fallback covers it). +Verify: `curl http://localhost:8000/health` → `{"status":"ok","version":"0.9.6","db":"up"|"down","cache":"up"|"down"|"unconfigured"}`. The `db` field reports DB reachability when `DATABASE_URL` is configured; the `cache` field reports Upstash reachability (`unconfigured` when `UPSTASH_REDIS_REST_URL` isn't set — perfectly fine for local dev, the in-process fallback covers it). Hit the analyzer: `curl http://localhost:8000/analyze/octocat`. 
### Frontend (`:3000`) diff --git a/backend/app/settings.py b/backend/app/settings.py index 25b9af7..d758622 100644 --- a/backend/app/settings.py +++ b/backend/app/settings.py @@ -2,7 +2,7 @@ from pydantic_settings import BaseSettings, SettingsConfigDict -VERSION = "0.9.5" +VERSION = "0.9.6" class Settings(BaseSettings): diff --git a/backend/loadtest/README.md b/backend/loadtest/README.md new file mode 100644 index 0000000..0ba90b8 --- /dev/null +++ b/backend/loadtest/README.md @@ -0,0 +1,92 @@ +# Load test harness + +Open-loop load tester for the backend. Drives a target endpoint at a fixed RPS +and reports p50/p95/p99 latency, error rate, achieved throughput, and PASS/FAIL +against thresholds. Run from `backend/`: + +```bash +uv run python loadtest/run.py --help +``` + +> **Windows / Git Bash note:** a bare `--path /health` argument gets mangled by +> MSYS into a Windows path (corrupting the URL). Prefix the command with +> `MSYS_NO_PATHCONV=1`, or run it from PowerShell, or use `--path=//health`. + +## Quick sanity check (no Docker, no cache) + +Start the backend, then hit a cheap endpoint to confirm the harness works. +Locally there's usually no `DATABASE_URL`, so `/health` blocks ~20 s per request +on a doomed DB ping — use `/openapi.json` instead for a clean check: + +```bash +uv run uvicorn app.main:app --port 8000 # terminal 1 +uv run python loadtest/run.py --target http://localhost:8000 --path /openapi.json \ + --rps 10 --duration 5 --warmup 1 --p95-ms 1000 # terminal 2 +``` + +Expect `errors=0` and `RESULT: PASS`. + +## Full warm-`/analyze` 100 RPS run (local) + +The warm path needs a populated Report cache. `get_cache()` returns `None` +without Upstash, and real Upstash's free tier (~10k commands/day) can't absorb a +100 RPS run — so use a **local** Upstash-compatible Redis via SRH. + +1. 
**Start a local Upstash-compatible Redis (SRH over Redis):** + + ```bash + docker run -d --name si-redis -p 6379:6379 redis:7 + docker run -d --name si-srh -p 8079:80 \ + -e SRH_MODE=env -e SRH_TOKEN=local-token \ + -e SRH_CONNECTION_STRING="redis://host.docker.internal:6379" \ + hiett/serverless-redis-http:latest + ``` + +2. **Start the backend pointed at SRH, with a real GitHub token, and the proxy + secret UNSET** (so the analyze limiter skips anonymous enforcement): + + ```bash + UPSTASH_REDIS_REST_URL=http://localhost:8079 \ + UPSTASH_REDIS_REST_TOKEN=local-token \ + GITHUB_TOKEN= \ + uv run uvicorn app.main:app --port 8000 + ``` + (Ensure `INTERNAL_PROXY_SECRET` is **not** set in the environment, and send no + session cookie — the harness is anonymous by default.) + +3. **Run the load test** (the `--warmup` request cold-ingests once to prime the + cache; the timed run is then pure cache hits): + + ```bash + MSYS_NO_PATHCONV=1 uv run python loadtest/run.py \ + --target http://localhost:8000 --path /analyze/octocat \ + --rps 100 --duration 60 --warmup 1 + ``` + +4. **Find the knee** with a ramp: + + ```bash + MSYS_NO_PATHCONV=1 uv run python loadtest/run.py --path /analyze/octocat \ + --ramp 50:100:200:400 --duration 30 --warmup 1 + ``` + Record the highest RPS stage that still PASSes (error rate < 1%, achieved + RPS ≥ 95% of target, p95 under `--p95-ms`). + +5. **Tear down:** `docker rm -f si-srh si-redis`. + +## Pointing at a deployed target + +```bash +uv run python loadtest/run.py --target https:///_/backend --path /analyze/octocat --rps 100 --duration 30 +``` +Mind the cost (Vercel Active-CPU) and rate limits: a deployed backend with +`INTERNAL_PROXY_SECRET` set WILL rate-limit anonymous `/analyze` — sign in or +raise the limits for the window. Keep deployed runs short. 
+ +## Thresholds (PASS/FAIL) + +- error rate `< --max-error-rate` (default 1%) +- achieved RPS `>= 95%` of `--rps` +- p95 latency `< --p95-ms` (default 250 ms; tune from the warm baseline) + +Exit code is 0 on PASS, non-zero on FAIL. diff --git a/backend/loadtest/__init__.py b/backend/loadtest/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/backend/loadtest/run.py b/backend/loadtest/run.py new file mode 100644 index 0000000..77ee4a8 --- /dev/null +++ b/backend/loadtest/run.py @@ -0,0 +1,212 @@ +"""Open-loop load-test harness for the Skill Issue backend. + +Drives a target endpoint at a fixed request rate and reports latency +percentiles, error rate, and achieved throughput against pass/fail +thresholds. See backend/loadtest/README.md for the local warm-/analyze +runbook. Run: uv run python loadtest/run.py --help +""" + +from __future__ import annotations + +import argparse +import asyncio +import contextlib +import sys +import time +from dataclasses import dataclass + +import httpx + + +@dataclass +class Result: + latency_ms: float + status: int | None # None = connection error / timeout + + +@dataclass +class Summary: + sent: int + completed: int + dropped: int + error_count: int + error_rate: float + achieved_rps: float + p50_ms: float + p95_ms: float + p99_ms: float + errors_by_status: dict[str, int] + duration_s: float + + +def percentile(values: list[float], p: float) -> float: + """Linear-interpolated p-th percentile (p in [0, 100]). 
0.0 for empty input.""" + if not values: + return 0.0 + s = sorted(values) + if len(s) == 1: + return s[0] + k = (len(s) - 1) * (p / 100.0) + lo = int(k) + hi = min(lo + 1, len(s) - 1) + frac = k - lo + return round(s[lo] * (1 - frac) + s[hi] * frac, 10) + + +def summarize(results: list[Result], *, dropped: int, wall_seconds: float) -> Summary: + completed = len(results) + latencies = [r.latency_ms for r in results] + errors_by_status: dict[str, int] = {} + error_count = 0 + for r in results: + if r.status is None or r.status >= 400: + key = "connection_error" if r.status is None else str(r.status) + errors_by_status[key] = errors_by_status.get(key, 0) + 1 + error_count += 1 + error_rate = (error_count / completed) if completed else 1.0 + achieved_rps = (completed / wall_seconds) if wall_seconds > 0 else 0.0 + return Summary( + sent=completed + dropped, + completed=completed, + dropped=dropped, + error_count=error_count, + error_rate=error_rate, + achieved_rps=achieved_rps, + p50_ms=percentile(latencies, 50), + p95_ms=percentile(latencies, 95), + p99_ms=percentile(latencies, 99), + errors_by_status=errors_by_status, + duration_s=wall_seconds, + ) + + +def evaluate_thresholds( + summary: Summary, *, target_rps: float, max_error_rate: float, p95_ms: float +) -> tuple[bool, list[str]]: + failures: list[str] = [] + if summary.error_rate > max_error_rate: + failures.append(f"error rate {summary.error_rate:.3%} > {max_error_rate:.3%}") + if summary.achieved_rps < target_rps * 0.95: + failures.append(f"achieved RPS {summary.achieved_rps:.1f} < 95% of target {target_rps:.0f}") + if summary.p95_ms > p95_ms: + failures.append(f"p95 {summary.p95_ms:.1f}ms > {p95_ms:.1f}ms") + return (not failures, failures) + + +async def _one_request(client: httpx.AsyncClient, url: str, results: list[Result]) -> None: + t0 = time.perf_counter() + try: + resp = await client.get(url) + status: int | None = resp.status_code + except (httpx.HTTPError, OSError): + status = None + 
results.append(Result(latency_ms=(time.perf_counter() - t0) * 1000.0, status=status)) + + +async def run_stage( + client: httpx.AsyncClient, + url: str, + *, + rps: float, + duration: float, + max_inflight: int, +) -> tuple[list[Result], int]: + """Open-loop: schedule requests at a fixed rate for `duration` seconds. + + Returns (results, dropped). `dropped` counts ticks skipped because + `max_inflight` was saturated — a "server can't keep up" signal. The + scheduler never blocks on in-flight requests, so a slow server shows up as + dropped ticks + latency growth rather than self-throttled load. + """ + results: list[Result] = [] + tasks: set[asyncio.Task[None]] = set() + inflight = 0 + dropped = 0 + interval = 1.0 / rps + loop = asyncio.get_running_loop() + + def _done(t: asyncio.Task[None]) -> None: + nonlocal inflight + inflight -= 1 + tasks.discard(t) + + start = loop.time() + i = 0 + while loop.time() - start < duration: + delay = (start + i * interval) - loop.time() + if delay > 0: + await asyncio.sleep(delay) + i += 1 + if inflight >= max_inflight: + dropped += 1 + continue + inflight += 1 + task = asyncio.create_task(_one_request(client, url, results)) + task.add_done_callback(_done) + tasks.add(task) + + if tasks: + await asyncio.gather(*tasks, return_exceptions=True) + return results, dropped + + +def _parse_ramp(ramp: str) -> list[float]: + """'10:50:100' -> [10.0, 50.0, 100.0].""" + return [float(part) for part in ramp.split(":") if part] + + +def _print_summary(url: str, rps: float, s: Summary, ok: bool, failures: list[str]) -> None: + print(f"\n=== {url} @ target {rps:.0f} RPS for {s.duration_s:.1f}s ===") + print(f" sent={s.sent} completed={s.completed} dropped={s.dropped}") + print(f" achieved_rps={s.achieved_rps:.1f}") + print(f" errors={s.error_count} ({s.error_rate:.3%}) {s.errors_by_status or ''}") + print(f" latency p50={s.p50_ms:.1f}ms p95={s.p95_ms:.1f}ms p99={s.p99_ms:.1f}ms") + print(f" RESULT: {'PASS' if ok else 'FAIL'}") + for f in 
failures: + print(f" - {f}") + + +async def _amain(args: argparse.Namespace) -> int: + url = args.target.rstrip("/") + args.path + cap = args.max_inflight + 50 + limits = httpx.Limits(max_connections=cap, max_keepalive_connections=cap) + overall_ok = True + async with httpx.AsyncClient(timeout=args.timeout, limits=limits) as client: + for _ in range(args.warmup): + with contextlib.suppress(httpx.HTTPError, OSError): + await client.get(url) + stages = _parse_ramp(args.ramp) if args.ramp else [args.rps] + for rps in stages: + wall0 = time.perf_counter() + results, dropped = await run_stage( + client, url, rps=rps, duration=args.duration, max_inflight=args.max_inflight + ) + summary = summarize(results, dropped=dropped, wall_seconds=time.perf_counter() - wall0) + ok, failures = evaluate_thresholds( + summary, + target_rps=rps, + max_error_rate=args.max_error_rate, + p95_ms=args.p95_ms, + ) + _print_summary(url, rps, summary, ok, failures) + overall_ok = overall_ok and ok + return 0 if overall_ok else 1 + + +def main() -> None: + p = argparse.ArgumentParser(description="Open-loop load tester for the Skill Issue backend.") + p.add_argument("--target", default="http://localhost:8000") + p.add_argument("--path", default="/analyze/octocat") + p.add_argument("--rps", type=float, default=100.0) + p.add_argument("--duration", type=float, default=60.0) + p.add_argument("--warmup", type=int, default=1) + p.add_argument("--ramp", default=None, help="colon-separated RPS stages, e.g. 
10:50:100") + p.add_argument("--max-inflight", type=int, default=500, dest="max_inflight") + p.add_argument("--timeout", type=float, default=20.0) + p.add_argument("--p95-ms", type=float, default=250.0, dest="p95_ms") + p.add_argument("--max-error-rate", type=float, default=0.01, dest="max_error_rate") + sys.exit(asyncio.run(_amain(p.parse_args()))) + + +if __name__ == "__main__": + main() diff --git a/backend/pyproject.toml b/backend/pyproject.toml index 8418b8c..7e2f000 100644 --- a/backend/pyproject.toml +++ b/backend/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "skill-issue-backend" -version = "0.9.5" +version = "0.9.6" description = "Skill Issue backend — FastAPI service that ingests a GitHub profile and returns a deterministic engineering report." readme = "README.md" authors = [ diff --git a/backend/tests/loadtest/__init__.py b/backend/tests/loadtest/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/backend/tests/loadtest/test_stats.py b/backend/tests/loadtest/test_stats.py new file mode 100644 index 0000000..3840a26 --- /dev/null +++ b/backend/tests/loadtest/test_stats.py @@ -0,0 +1,49 @@ +from loadtest.run import Result, evaluate_thresholds, percentile, summarize + + +def test_percentile_linear_interpolation(): + vals = [float(x) for x in range(1, 11)] # 1..10 + assert percentile(vals, 50) == 5.5 + assert percentile(vals, 95) == 9.55 + assert percentile(vals, 99) == 9.91 + + +def test_percentile_empty_is_zero(): + assert percentile([], 95) == 0.0 + + +def test_percentile_single_value(): + assert percentile([7.0], 95) == 7.0 + + +def test_summarize_counts_errors_and_rate(): + results = [ + Result(10.0, 200), + Result(20.0, 200), + Result(30.0, 500), + Result(40.0, None), + ] + s = summarize(results, dropped=1, wall_seconds=2.0) + assert s.completed == 4 + assert s.sent == 5 + assert s.dropped == 1 + assert s.error_count == 2 + assert s.error_rate == 0.5 + assert s.errors_by_status == {"500": 1, "connection_error": 1} + assert 
s.achieved_rps == 2.0 # 4 completed / 2.0s + + +def test_evaluate_thresholds_all_pass(): + s = summarize([Result(10.0, 200)] * 100, dropped=0, wall_seconds=1.0) + ok, failures = evaluate_thresholds(s, target_rps=100, max_error_rate=0.01, p95_ms=250) + assert ok + assert failures == [] + + +def test_evaluate_thresholds_flags_all_three(): + s = summarize( + [Result(500.0, 500)] * 50, dropped=0, wall_seconds=5.0 + ) # 10 rps, all errors, slow + ok, failures = evaluate_thresholds(s, target_rps=100, max_error_rate=0.01, p95_ms=250) + assert not ok + assert len(failures) == 3 # error rate + achieved rps + p95 diff --git a/backend/uv.lock b/backend/uv.lock index 86305b7..dde96b9 100644 --- a/backend/uv.lock +++ b/backend/uv.lock @@ -906,7 +906,7 @@ fastapi = [ [[package]] name = "skill-issue-backend" -version = "0.9.5" +version = "0.9.6" source = { virtual = "." } dependencies = [ { name = "alembic" }, diff --git a/docs/PROGRESS_LOG.md b/docs/PROGRESS_LOG.md index eb0a9a1..61eb2eb 100644 --- a/docs/PROGRESS_LOG.md +++ b/docs/PROGRESS_LOG.md @@ -19,6 +19,36 @@ Format: --- +## 2026-05-28 — Claude (Opus 4.7) — v0.9.6 shipped (load-test harness) + +**Slice:** v0.9.6. Reusable backend load-test harness + runbook; the full 100 RPS validation run is an operator step (hardware-gated). Split from the original v0.9.5 "security review + load test"; legal docs are now v0.9.7. + +**Done:** +- **`backend/loadtest/run.py`** — open-loop (fixed-rate) async load generator (httpx, already a dep). Pure helpers `percentile`/`summarize`/`evaluate_thresholds` + `Result`/`Summary` dataclasses; async `run_stage` dispatcher (rate-paced, bounded-concurrency with a `dropped` saturation counter); `_parse_ramp`, `_print_summary`, argparse CLI (`--target/--path/--rps/--duration/--warmup/--ramp/--max-inflight/--timeout/--p95-ms/--max-error-rate`). Exit 0 PASS / non-zero FAIL. +- **6 unit tests** for the stats helpers (`backend/tests/loadtest/test_stats.py`) — deterministic, no network. 
Backend non-DB suite 284 → 290. +- **`backend/loadtest/README.md`** runbook — local SRH (Docker) warm-cache setup, prime-then-measure, ramp-to-find-knee, point-at-deploy, and the Windows/Git-Bash `MSYS_NO_PATHCONV=1` gotcha. +- Docs ritual + version bump to 0.9.6. + +**Decisions:** +- **Open-loop over closed-loop** — a closed-loop (await-then-send) generator self-throttles and masks saturation; open-loop dispatches at a fixed rate so a slow server shows as latency/error/`dropped` growth. +- **Local SRH for the warm cache** — `get_cache()` has no in-process Report-cache fallback, so the warm path needs an Upstash-compatible endpoint; real Upstash's ~10k/day free tier can't absorb a 100 RPS run, so SRH (real Redis over Docker) is the only viable local option. +- **No rate-limit bypass needed** — anonymous load + unset `INTERNAL_PROXY_SECRET` makes the analyze limiter skip enforcement (existing behavior), so the warm test needs no limit-raising or bypass code. Zero application-code change in this slice. +- **Build + sanity now, full run deferred** — per the user's hardware constraint (localhost previously overheated the laptop). The harness is the durable deliverable; the headline 100 RPS number is the operator's to record. + +**Learned / surprises:** +- **Locally `/health` blocks ~20 s/request** when `DATABASE_URL` is unset (the startup-placeholder DB ping times out per request). Used `/openapi.json` for the clean sanity run instead. Not a prod issue (prod has a DB). +- **Git-Bash mangles a bare `--path /health`** into a Windows path via MSYS, corrupting the URL (`Invalid port: '8000C:'`). `MSYS_NO_PATHCONV=1` fixes it — noted in the runbook. (Surfaced a real edge: a malformed URL raises `httpx.InvalidURL`, which is *not* an `HTTPError`, so it isn't caught per-request — acceptable, since a bad `--target/--path` is operator error that should fail loudly.) + +**Verified:** +- Backend `ruff` clean; `pytest` (stats tests 6/6; full non-DB suite 290 expected). 
Frontend unchanged (54 vitest). +- Harness sanity run (controller, backend-only): `/openapi.json` 10 RPS × 5 s → 51 completed, **0 errors**, p50 3.0 ms / p95 6.2 ms, achieved 10.2 RPS, **PASS**, exit 0. + +**Blocked / open:** full 100 RPS warm-`/analyze` run is the operator's (Docker/SRH + GITHUB_TOKEN); result to be appended when run. + +**Next:** v0.9.7 — privacy policy + terms (legal docs). + +--- + ## 2026-05-28 — Claude (Opus 4.7) — v0.9.5 shipped (pre-launch security audit + hardening) **Slice:** v0.9.5. Full pre-launch security audit of the whole app + two Medium hardening fixes. The load test originally bundled here was split to v0.9.6 (needs target/cost/rate-limit design); legal docs shifted to v0.9.7. diff --git a/docs/superpowers/plans/2026-05-28-v0.9.6-load-test-harness.md b/docs/superpowers/plans/2026-05-28-v0.9.6-load-test-harness.md new file mode 100644 index 0000000..f57f686 --- /dev/null +++ b/docs/superpowers/plans/2026-05-28-v0.9.6-load-test-harness.md @@ -0,0 +1,589 @@ +# v0.9.6 — Load-test harness Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Ship a reusable, parameterized Python/httpx open-loop load-test harness for the backend warm `/analyze` path, with unit-tested stats, a runbook, and a light sanity run — the full 100 RPS validation is an operator step. + +**Architecture:** A standalone async script `backend/loadtest/run.py` with I/O-free stats helpers (unit-tested) and an open-loop dispatcher that fires at a fixed RPS, collects per-request latency/status, and reports percentiles + error rate + achieved throughput against pass/fail thresholds. No application code changes. + +**Tech Stack:** Python 3.12, asyncio, httpx (already a backend dep), pytest, argparse. 
+ +**Spec:** [`docs/superpowers/specs/2026-05-28-v0.9.6-load-test-harness-design.md`](../specs/2026-05-28-v0.9.6-load-test-harness-design.md). + +--- + +## File structure + +| File | Responsibility | Action | +| --- | --- | --- | +| `backend/loadtest/__init__.py` | Make `loadtest` an importable package | Create (empty) | +| `backend/loadtest/run.py` | Harness: stats helpers + async dispatcher + CLI | Create | +| `backend/loadtest/README.md` | Runbook (local SRH warm-cache run + point-at-deploy) | Create | +| `backend/tests/loadtest/__init__.py` | Test package marker | Create (empty) | +| `backend/tests/loadtest/test_stats.py` | Unit tests for the pure stats helpers | Create | +| Version literals + CHANGELOG + PLAN + PROGRESS_LOG + `uv.lock` | Release ritual (0.9.5 → 0.9.6) | Modify | + +All commands run from `backend/` (the uv project root) unless noted. + +--- + +### Task 1: Pure stats helpers + unit tests (TDD) + +**Files:** +- Create: `backend/loadtest/__init__.py`, `backend/tests/loadtest/__init__.py` +- Create: `backend/loadtest/run.py` (helpers portion only this task) +- Test: `backend/tests/loadtest/test_stats.py` + +- [ ] **Step 1: Create the package markers** + +Create `backend/loadtest/__init__.py` and `backend/tests/loadtest/__init__.py` both as **empty files**. 
+ +- [ ] **Step 2: Write the failing tests** + +Create `backend/tests/loadtest/test_stats.py`: + +```python +from loadtest.run import Result, evaluate_thresholds, percentile, summarize + + +def test_percentile_linear_interpolation(): + vals = [float(x) for x in range(1, 11)] # 1..10 + assert percentile(vals, 50) == 5.5 + assert percentile(vals, 95) == 9.55 + assert percentile(vals, 99) == 9.91 + + +def test_percentile_empty_is_zero(): + assert percentile([], 95) == 0.0 + + +def test_percentile_single_value(): + assert percentile([7.0], 95) == 7.0 + + +def test_summarize_counts_errors_and_rate(): + results = [ + Result(10.0, 200), + Result(20.0, 200), + Result(30.0, 500), + Result(40.0, None), + ] + s = summarize(results, dropped=1, wall_seconds=2.0) + assert s.completed == 4 + assert s.sent == 5 + assert s.dropped == 1 + assert s.error_count == 2 + assert s.error_rate == 0.5 + assert s.errors_by_status == {"500": 1, "connection_error": 1} + assert s.achieved_rps == 2.0 # 4 completed / 2.0s + + +def test_evaluate_thresholds_all_pass(): + s = summarize([Result(10.0, 200)] * 100, dropped=0, wall_seconds=1.0) + ok, failures = evaluate_thresholds(s, target_rps=100, max_error_rate=0.01, p95_ms=250) + assert ok + assert failures == [] + + +def test_evaluate_thresholds_flags_all_three(): + s = summarize([Result(500.0, 500)] * 50, dropped=0, wall_seconds=5.0) # 10 rps, all errors, slow + ok, failures = evaluate_thresholds(s, target_rps=100, max_error_rate=0.01, p95_ms=250) + assert not ok + assert len(failures) == 3 # error rate + achieved rps + p95 +``` + +- [ ] **Step 3: Run the tests, verify they fail** + +Run: `uv run pytest tests/loadtest/test_stats.py -q` +Expected: FAIL — `ModuleNotFoundError: No module named 'loadtest'` (or import error for the helpers). 
+ +- [ ] **Step 4: Write the helpers** + +Create `backend/loadtest/run.py` with exactly this content (the async/CLI portion is added in Task 2): + +```python +"""Open-loop load-test harness for the Skill Issue backend. + +Drives a target endpoint at a fixed request rate and reports latency +percentiles, error rate, and achieved throughput against pass/fail +thresholds. See backend/loadtest/README.md for the local warm-/analyze +runbook. Run: uv run python loadtest/run.py --help +""" + +from __future__ import annotations + +from dataclasses import dataclass + + +@dataclass +class Result: + latency_ms: float + status: int | None # None = connection error / timeout + + +@dataclass +class Summary: + sent: int + completed: int + dropped: int + error_count: int + error_rate: float + achieved_rps: float + p50_ms: float + p95_ms: float + p99_ms: float + errors_by_status: dict[str, int] + duration_s: float + + +def percentile(values: list[float], p: float) -> float: + """Linear-interpolated p-th percentile (p in [0, 100]). 
0.0 for empty input.""" + if not values: + return 0.0 + s = sorted(values) + if len(s) == 1: + return s[0] + k = (len(s) - 1) * (p / 100.0) + lo = int(k) + hi = min(lo + 1, len(s) - 1) + frac = k - lo + return round(s[lo] * (1 - frac) + s[hi] * frac, 10) + + +def summarize(results: list[Result], *, dropped: int, wall_seconds: float) -> Summary: + completed = len(results) + latencies = [r.latency_ms for r in results] + errors_by_status: dict[str, int] = {} + error_count = 0 + for r in results: + if r.status is None or r.status >= 400: + key = "connection_error" if r.status is None else str(r.status) + errors_by_status[key] = errors_by_status.get(key, 0) + 1 + error_count += 1 + error_rate = (error_count / completed) if completed else 1.0 + achieved_rps = (completed / wall_seconds) if wall_seconds > 0 else 0.0 + return Summary( + sent=completed + dropped, + completed=completed, + dropped=dropped, + error_count=error_count, + error_rate=error_rate, + achieved_rps=achieved_rps, + p50_ms=percentile(latencies, 50), + p95_ms=percentile(latencies, 95), + p99_ms=percentile(latencies, 99), + errors_by_status=errors_by_status, + duration_s=wall_seconds, + ) + + +def evaluate_thresholds( + summary: Summary, *, target_rps: float, max_error_rate: float, p95_ms: float +) -> tuple[bool, list[str]]: + failures: list[str] = [] + if summary.error_rate > max_error_rate: + failures.append(f"error rate {summary.error_rate:.3%} > {max_error_rate:.3%}") + if summary.achieved_rps < target_rps * 0.95: + failures.append( + f"achieved RPS {summary.achieved_rps:.1f} < 95% of target {target_rps:.0f}" + ) + if summary.p95_ms > p95_ms: + failures.append(f"p95 {summary.p95_ms:.1f}ms > {p95_ms:.1f}ms") + return (not failures, failures) +``` + +- [ ] **Step 5: Run the tests, verify they pass** + +Run: `uv run pytest tests/loadtest/test_stats.py -q` +Expected: `6 passed`. 
+ +- [ ] **Step 6: Lint** + +Run: `uv run ruff check loadtest tests/loadtest && uv run ruff format --check loadtest tests/loadtest` +Expected: clean. + +- [ ] **Step 7: Commit** + +```bash +git add backend/loadtest/__init__.py backend/loadtest/run.py backend/tests/loadtest/__init__.py backend/tests/loadtest/test_stats.py +git commit -m "feat(v0.9.6): load-test stats helpers (percentile/summarize/thresholds) + tests" +``` + +--- + +### Task 2: Async open-loop dispatcher + CLI + +**Files:** +- Modify: `backend/loadtest/run.py` (append the I/O + CLI section) + +- [ ] **Step 1: Append the async dispatcher, ramp parser, printer, and CLI** + +Add these imports to the **top** of `backend/loadtest/run.py` — change the existing header block so the imports read: + +```python +from __future__ import annotations + +import argparse +import asyncio +import sys +import time +from dataclasses import dataclass + +import httpx +``` + +Then **append** the following to the end of `backend/loadtest/run.py`: + +```python +async def _one_request(client: httpx.AsyncClient, url: str, results: list[Result]) -> None: + t0 = time.perf_counter() + try: + resp = await client.get(url) + status: int | None = resp.status_code + except (httpx.HTTPError, OSError): + status = None + results.append(Result(latency_ms=(time.perf_counter() - t0) * 1000.0, status=status)) + + +async def run_stage( + client: httpx.AsyncClient, + url: str, + *, + rps: float, + duration: float, + max_inflight: int, +) -> tuple[list[Result], int]: + """Open-loop: schedule requests at a fixed rate for `duration` seconds. + + Returns (results, dropped). `dropped` counts ticks skipped because + `max_inflight` was saturated — a "server can't keep up" signal. The + scheduler never blocks on in-flight requests, so a slow server shows up as + dropped ticks + latency growth rather than self-throttled load. 
+ """ + results: list[Result] = [] + tasks: set[asyncio.Task[None]] = set() + inflight = 0 + dropped = 0 + interval = 1.0 / rps + loop = asyncio.get_running_loop() + + def _done(t: asyncio.Task[None]) -> None: + nonlocal inflight + inflight -= 1 + tasks.discard(t) + + start = loop.time() + i = 0 + while loop.time() - start < duration: + delay = (start + i * interval) - loop.time() + if delay > 0: + await asyncio.sleep(delay) + i += 1 + if inflight >= max_inflight: + dropped += 1 + continue + inflight += 1 + task = asyncio.create_task(_one_request(client, url, results)) + task.add_done_callback(_done) + tasks.add(task) + + if tasks: + await asyncio.gather(*tasks, return_exceptions=True) + return results, dropped + + +def _parse_ramp(ramp: str) -> list[float]: + """'10:50:100' -> [10.0, 50.0, 100.0].""" + return [float(part) for part in ramp.split(":") if part] + + +def _print_summary(url: str, rps: float, s: Summary, ok: bool, failures: list[str]) -> None: + print(f"\n=== {url} @ target {rps:.0f} RPS for {s.duration_s:.1f}s ===") + print(f" sent={s.sent} completed={s.completed} dropped={s.dropped}") + print(f" achieved_rps={s.achieved_rps:.1f}") + print(f" errors={s.error_count} ({s.error_rate:.3%}) {s.errors_by_status or ''}") + print(f" latency p50={s.p50_ms:.1f}ms p95={s.p95_ms:.1f}ms p99={s.p99_ms:.1f}ms") + print(f" RESULT: {'PASS' if ok else 'FAIL'}") + for f in failures: + print(f" - {f}") + + +async def _amain(args: argparse.Namespace) -> int: + url = args.target.rstrip("/") + args.path + cap = args.max_inflight + 50 + limits = httpx.Limits(max_connections=cap, max_keepalive_connections=cap) + overall_ok = True + async with httpx.AsyncClient(timeout=args.timeout, limits=limits) as client: + for _ in range(args.warmup): + try: + await client.get(url) + except (httpx.HTTPError, OSError): + pass + stages = _parse_ramp(args.ramp) if args.ramp else [args.rps] + for rps in stages: + wall0 = time.perf_counter() + results, dropped = await run_stage( + client, url, 
rps=rps, duration=args.duration, max_inflight=args.max_inflight + ) + summary = summarize(results, dropped=dropped, wall_seconds=time.perf_counter() - wall0) + ok, failures = evaluate_thresholds( + summary, + target_rps=rps, + max_error_rate=args.max_error_rate, + p95_ms=args.p95_ms, + ) + _print_summary(url, rps, summary, ok, failures) + overall_ok = overall_ok and ok + return 0 if overall_ok else 1 + + +def main() -> None: + p = argparse.ArgumentParser(description="Open-loop load tester for the Skill Issue backend.") + p.add_argument("--target", default="http://localhost:8000") + p.add_argument("--path", default="/analyze/octocat") + p.add_argument("--rps", type=float, default=100.0) + p.add_argument("--duration", type=float, default=60.0) + p.add_argument("--warmup", type=int, default=1) + p.add_argument("--ramp", default=None, help="colon-separated RPS stages, e.g. 10:50:100") + p.add_argument("--max-inflight", type=int, default=500, dest="max_inflight") + p.add_argument("--timeout", type=float, default=20.0) + p.add_argument("--p95-ms", type=float, default=250.0, dest="p95_ms") + p.add_argument("--max-error-rate", type=float, default=0.01, dest="max_error_rate") + sys.exit(asyncio.run(_amain(p.parse_args()))) + + +if __name__ == "__main__": + main() +``` + +- [ ] **Step 2: Verify the CLI loads and the stats tests still pass** + +Run: `uv run python loadtest/run.py --help` +Expected: argparse help text listing `--target`, `--rps`, `--ramp`, etc. (exit 0). + +Run: `uv run pytest tests/loadtest/test_stats.py -q` +Expected: `6 passed` (the appended I/O code didn't break the helpers). + +- [ ] **Step 3: Lint** + +Run: `uv run ruff check loadtest && uv run ruff format --check loadtest` +Expected: clean. (If ruff reformats, run `uv run ruff format loadtest` and re-check.) 
+ +- [ ] **Step 4: Commit** + +```bash +git add backend/loadtest/run.py +git commit -m "feat(v0.9.6): open-loop async dispatcher + CLI for the load harness" +``` + +--- + +### Task 3: Runbook + +**Files:** +- Create: `backend/loadtest/README.md` + +- [ ] **Step 1: Write the runbook** + +Create `backend/loadtest/README.md`: + +````markdown +# Load test harness + +Open-loop load tester for the backend. Drives a target endpoint at a fixed RPS +and reports p50/p95/p99 latency, error rate, achieved throughput, and PASS/FAIL +against thresholds. Run from `backend/`: + +```bash +uv run python loadtest/run.py --help +``` + +## Quick sanity check (no Docker, no cache) + +Start the backend, then hit the cheap `/health` endpoint to confirm the harness +works: + +```bash +uv run uvicorn app.main:app --port 8000 # terminal 1 +uv run python loadtest/run.py --target http://localhost:8000 --path /health \ + --rps 10 --duration 5 --warmup 0 --p95-ms 1000 # terminal 2 +``` + +## Full warm-`/analyze` 100 RPS run (local) + +The warm path needs a populated Report cache. `get_cache()` returns `None` +without Upstash, and real Upstash's free tier (~10k commands/day) can't absorb a +100 RPS run — so use a **local** Upstash-compatible Redis via SRH. + +1. **Start a local Upstash-compatible Redis (SRH over Redis):** + + ```bash + docker run -d --name si-redis -p 6379:6379 redis:7 + docker run -d --name si-srh -p 8079:80 \ + -e SRH_MODE=env -e SRH_TOKEN=local-token \ + -e SRH_CONNECTION_STRING="redis://host.docker.internal:6379" \ + hiett/serverless-redis-http:latest + ``` + +2. **Start the backend pointed at SRH, with a real GitHub token, and the proxy + secret UNSET** (so the analyze limiter skips anonymous enforcement): + + ```bash + UPSTASH_REDIS_REST_URL=http://localhost:8079 \ + UPSTASH_REDIS_REST_TOKEN=local-token \ + GITHUB_TOKEN= \ + uv run uvicorn app.main:app --port 8000 + ``` + (Ensure `INTERNAL_PROXY_SECRET` is **not** set in the environment.) + +3. 
**Run the load test** (the `--warmup` request cold-ingests once to prime the + cache; the timed run is then pure cache hits): + + ```bash + uv run python loadtest/run.py \ + --target http://localhost:8000 --path /analyze/octocat \ + --rps 100 --duration 60 --warmup 1 + ``` + +4. **Find the knee** with a ramp: + + ```bash + uv run python loadtest/run.py --path /analyze/octocat \ + --ramp 50:100:200:400 --duration 30 --warmup 1 + ``` + Record the highest RPS stage that still PASSes (error rate ≤ 1%, achieved + RPS ≥ 95% of target, p95 at or under `--p95-ms`). + +5. **Tear down:** `docker rm -f si-srh si-redis`. + +## Pointing at a deployed target + +```bash +uv run python loadtest/run.py --target https:///_/backend --path /analyze/octocat --rps 100 --duration 30 +``` +Mind the cost (Vercel Active-CPU) and rate limits: a deployed backend with +`INTERNAL_PROXY_SECRET` set WILL rate-limit anonymous `/analyze`, so sign in or +raise the limits for the window. Keep deployed runs short. + +## Thresholds (PASS/FAIL) + +- error rate `<= --max-error-rate` (default 1%) +- achieved RPS `>= 95%` of `--rps` (or of the current `--ramp` stage) +- p95 latency `<= --p95-ms` (default 250 ms; tune from the warm baseline) + +Exit code is 0 on PASS, non-zero on FAIL. 
+```` + +- [ ] **Step 2: Commit** + +```bash +git add backend/loadtest/README.md +git commit -m "docs(v0.9.6): load-test runbook (local SRH warm-cache + deploy target)" +``` + +--- + +### Task 4: Sanity run + docs ritual + ship + +**Files:** +- Modify: `backend/pyproject.toml:3`, `backend/app/settings.py:5`, `frontend/package.json:3`, `frontend/src/app/page.tsx:26`, `frontend/src/components/results-view.tsx:355`, `README.md` (status lead + running list + health curl), `CHANGELOG.md`, `PLAN.md` (v0.9.6 row + section), `docs/PROGRESS_LOG.md`, `backend/uv.lock` + +- [ ] **Step 1: Light sanity run against `/health`** + +In terminal 1: `cd backend && uv run uvicorn app.main:app --port 8000` +In terminal 2: +```bash +cd backend && uv run python loadtest/run.py --target http://localhost:8000 --path /health --rps 10 --duration 5 --warmup 0 --p95-ms 1000 +``` +Expected: a summary block with `completed≈50`, `errors=0`, `RESULT: PASS`. Stop the server afterward. +(If the execution environment cannot run uvicorn, skip this step and note it — the stats unit tests + `--help` already prove the harness mechanics; the full run is the operator's per the spec.) + +- [ ] **Step 2: Bump backend version literals** + +`backend/pyproject.toml` line 3: `version = "0.9.5"` → `"0.9.6"`. +`backend/app/settings.py` line 5: `VERSION = "0.9.5"` → `"0.9.6"`. + +- [ ] **Step 3: Bump frontend version literals** + +`frontend/package.json` line 3: `"version": "0.9.5",` → `"0.9.6",`. +`frontend/src/app/page.tsx` line 26: `... · v0.9.5` → `v0.9.6`. +`frontend/src/components/results-view.tsx` line 355: `... Protocol v0.9.5` → `v0.9.6`. 
+ +- [ ] **Step 4: Update README** + +In `README.md` line 46 (status paragraph): change the lead `Latest shipped release is **v0.9.5** (...)` to name **v0.9.6** with a one-line description (a reusable backend load-test harness), demote the v0.9.5 clause to "before it," and move the trailing "next" pointer from `**v0.9.6 — load test to 100 RPS** is next.` to `v0.9.6 adds a reusable load-test harness for the warm /analyze path (the full 100 RPS run is an operator step). **v0.9.7 — privacy policy + terms** is next.` Also bump the health-curl `"version":"0.9.5"` → `"0.9.6"` (line ~79). + +- [ ] **Step 5: Add the CHANGELOG entry** + +In `CHANGELOG.md`, insert directly below the header preamble `---` and above `## [0.9.5]`: + +```markdown +## [0.9.6] — 2026-05-28 + +### Added +- **Load-test harness** (`backend/loadtest/`) — a reusable open-loop load tester for the backend warm `/analyze` path, reporting latency percentiles, error rate, and achieved throughput against pass/fail thresholds. Includes a runbook for a local 100 RPS warm-cache run and for pointing at a deployed target. + +--- +``` + +- [ ] **Step 6: Update PLAN.md** + +Version-map row: `| **v0.9.6** | Load test to 100 RPS | pending |` → `| **v0.9.6** | Load-test harness (warm /analyze; full 100 RPS run = operator step) | ✅ shipped |`. + +Replace the `## v0.9.6 — Load test to 100 RPS (deferred)` section body with: + +```markdown +## v0.9.6 — Load-test harness (shipped 2026-05-28) + +**Goal:** Reusable Python/httpx open-loop load harness for the backend warm `/analyze` path; the full 100 RPS validation run is an operator step (hardware-gated). + +**Delivered:** `backend/loadtest/run.py` (open-loop dispatcher, p50/p95/p99, error rate, achieved RPS, pass/fail thresholds, ramp), unit-tested stats helpers, and `backend/loadtest/README.md` runbook (local SRH warm-cache setup + deploy target). 
Local warm-cache uses SRH (Upstash-compatible Redis over Docker) — real Upstash's ~10k/day free tier can't absorb a 100 RPS run. Anonymous load + unset `INTERNAL_PROXY_SECRET` means the analyze limiter skips enforcement, so no bypass is needed. + +**Design spec:** [`docs/superpowers/specs/2026-05-28-v0.9.6-load-test-harness-design.md`](./docs/superpowers/specs/2026-05-28-v0.9.6-load-test-harness-design.md). +**Sub-plan:** [`docs/superpowers/plans/2026-05-28-v0.9.6-load-test-harness.md`](./docs/superpowers/plans/2026-05-28-v0.9.6-load-test-harness.md). + +**Exit criteria:** +- [x] `loadtest/run.py` + unit-tested stats helpers; ruff clean; backend suite green. +- [x] Runbook complete (local SRH + deploy target). +- [x] Light `/health` sanity run passes. +- [x] Docs ritual + version bump to 0.9.6; tag + release. +- [ ] Full 100 RPS warm-`/analyze` result recorded — operator step, filled in when run. +``` + +- [ ] **Step 7: Re-sync `uv.lock`** + +Run: `cd backend && uv lock` +Expected: only `skill-issue-backend 0.9.5 → 0.9.6` changes. + +- [ ] **Step 8: Add the PROGRESS_LOG entry** + +In `docs/PROGRESS_LOG.md`, add a new top entry following the file's format: header `## 2026-05-28 — Claude (Opus 4.7) — v0.9.6 shipped (load-test harness)`; Slice v0.9.6; Done (harness + stats tests + runbook + sanity run; result of the sanity run); Decisions (open-loop over closed-loop; local SRH for warm cache since real Upstash free tier too small; anon+unset-secret skips the analyze limiter so no bypass; build+sanity now, full 100 RPS run deferred to operator per the laptop constraint); Verified (ruff + pytest count, `--help`, sanity run PASS); Blocked/open (full 100 RPS run is the operator's); Next (v0.9.7 legal docs). + +- [ ] **Step 9: Full verification** + +Run: `cd backend && uv run pytest -q --no-header && uv run ruff check . && uv run ruff format --check .` +Expected: suite passes (284 + 6 new = 290 non-DB pass), ruff clean. 
+ +Run: `cd frontend && npm run lint && npx tsc --noEmit && npm run test:run && npm run build` +Expected: lint/tsc clean, vitest 54 passed, build succeeds. + +- [ ] **Step 10: Commit** + +```bash +git add backend/pyproject.toml backend/app/settings.py frontend/package.json frontend/src/app/page.tsx frontend/src/components/results-view.tsx README.md CHANGELOG.md PLAN.md docs/PROGRESS_LOG.md backend/uv.lock +git commit -m "chore(v0.9.6): bump version + docs ritual (load-test harness)" +``` + +- [ ] **Step 11: Push, PR, CI, merge, tag (confirm before tag)** + +Push the branch, open a PR, wait for CI green, merge to `main`. Prod smoke `/health` → `version: 0.9.6`. Then **pause for user confirmation** before `git tag v0.9.6 && git push origin v0.9.6`. + +--- + +## Self-review notes + +- **Spec coverage:** harness §4/§5.1 → Tasks 1–2; thresholds §7 → `evaluate_thresholds` (Task 1) + CLI flags (Task 2); runbook §5.1 → Task 3; sanity run + deferral §8 → Task 4 Step 1; exit criteria §9 → Task 4. All covered. +- **Placeholder scan:** none — all code is complete, including the full `run.py` and `test_stats.py`. +- **Type consistency:** `Result(latency_ms, status)`, `Summary` fields, and `percentile`/`summarize(*, dropped, wall_seconds)`/`evaluate_thresholds(*, target_rps, max_error_rate, p95_ms)` signatures are identical across Tasks 1, 2, and the tests. +- **No app code change:** the harness is a black-box HTTP client; it relies on the existing rate-limiter-skip behavior, doesn't modify it. +- **Test count:** backend non-DB suite 284 → 290 (+6 stats tests). Frontend unchanged at 54. diff --git a/docs/superpowers/specs/2026-05-28-v0.9.6-load-test-harness-design.md b/docs/superpowers/specs/2026-05-28-v0.9.6-load-test-harness-design.md new file mode 100644 index 0000000..285f434 --- /dev/null +++ b/docs/superpowers/specs/2026-05-28-v0.9.6-load-test-harness-design.md @@ -0,0 +1,119 @@ +# v0.9.6 — Local warm-`/analyze` load-test harness design spec + +**Status:** Designed. 
Implementation plan to follow under `docs/superpowers/plans/`. +**Date:** 2026-05-28. +**Author:** Claude (Opus 4.7) with Shaan. + +--- + +## 1. Goal + +Deliver a reusable, parameterized load-test harness that drives the backend warm `/analyze/` path at a target RPS (default 100) and reports whether the system sustains it within an error/latency budget. The harness is the durable deliverable; the actual 100 RPS validation run is an operator step (the user runs it when hardware allows, or points the harness at a deployed target). + +This is the load-test slice of the v0.9.x Beta-hardening family. It is independently shippable. (The slice was split out of the original v0.9.5 "security review + load test" on 2026-05-28 because it needs its own target/cost/rate-limit design.) + +## 2. Locked scope decisions (2026-05-28) + +| Decision | Choice | Why | +| --- | --- | --- | +| Target / cost | **Local, free** | No Vercel Active-CPU spend, no GitHub-budget burn, no real-user risk. Harness is target-URL-parameterized so it can later hit a preview/prod deploy. | +| Scenario | **Backend warm `/analyze/`** | The realistic viral hot path: cache hit → serialize → respond, no GitHub calls. Backend-only (no Next dev server, which is heavy). | +| Tool | **Python + httpx/asyncio**, committed to repo | Zero new installs (httpx is already a backend dep); runs via `uv run`; reproducible; in the backend's own toolchain (ruff/pytest/CI). | +| Warm cache locally | **SRH (`hiett/serverless-redis-http`) over Docker** | The Report cache requires an Upstash-compatible REST endpoint (`get_cache()` has no in-process fallback). Real Upstash can't be used: a 60 s × 100 RPS run is ~6–12k Redis GETs, over the ~10k/day free tier. SRH = real local Redis, no command limits. 
| +| Rate limiting during test | **Anonymous + `INTERNAL_PROXY_SECRET` unset → no enforcement** | `analyze_rate_limiter` has `via_trusted_proxy=True`; with the secret unset it *skips* anonymous enforcement (verified in `app/ratelimit.py`). So anonymous load needs no limit-raising and no bypass code. | +| Execution now | **Build + sanity-verify only; full 100 RPS run is the operator's** | Respects the user's hardware (localhost previously overheated the laptop). Ships the durable harness + runbook; the heavy run is documented, not run by the agent. | +| Load model | **Open-loop (fixed dispatch rate)** with a bounded-concurrency safety cap | A closed-loop (await-then-send) generator masks saturation by self-throttling. Open-loop reveals queue buildup as latency/error growth — the thing we want to measure. | + +## 3. Why warm-`/analyze` needs a cache (operating context) + +`app/dependencies.py::get_report_for_user` serves from the Report cache when `get_cache()` is non-None, else falls through to `_live_ingest` (a cold GitHub ingest) on **every** request. `get_cache()` returns `None` unless `UPSTASH_REDIS_REST_URL` + `_TOKEN` are set; there is no in-process Report-cache fallback. Therefore a faithful warm-path test must run against a populated Upstash-compatible cache. SRH provides that locally without the free-tier command ceiling that rules out real Upstash. + +A single warmup request (cold ingest, one GitHub round-trip, needs the local `GITHUB_TOKEN`) primes the cache for the chosen user; the timed run is then pure cache hits. + +## 4. Architecture (one paragraph) + +`backend/loadtest/run.py` is a standalone async script. 
It builds one `httpx.AsyncClient` (high connection-pool limits), optionally fires `--warmup` priming requests (awaited), then runs an open-loop dispatcher: every `1/rps` seconds it schedules a request task (guarded by an `asyncio.Semaphore(cap)` so a stalled server can't spawn unbounded tasks), recording each request's wall-clock latency and HTTP status. After `--duration` seconds it stops scheduling, drains in-flight tasks, and prints a summary: total sent/completed, errors grouped by status, achieved RPS, p50/p95/p99 latency, and PASS/FAIL against thresholds. Pure functions (percentile computation, summary building) live separately from the I/O loop so they're unit-testable without a server. + +## 5. Surface area + +### 5.1 New files + +| File | Responsibility | +| --- | --- | +| `backend/loadtest/run.py` | The harness: CLI parsing, open-loop dispatcher, httpx client, result collection, summary + threshold check. I/O-free helpers (`percentile`, `summarize`, `evaluate_thresholds`) importable for tests. | +| `backend/loadtest/README.md` | Runbook: local SRH (Docker) setup, env vars, prime-then-measure procedure, how to interpret output, how to ramp to find the knee, and how to point `--target` at a deployed URL. | +| `backend/tests/loadtest/test_stats.py` | Unit tests for the pure stats helpers (percentiles on known inputs, error-rate computation, threshold pass/fail) — deterministic, no network. | + +### 5.2 Modified files + +| File | Change | +| --- | --- | +| Version literals + CHANGELOG + PLAN + PROGRESS_LOG + `uv.lock` | Release ritual (v0.9.5 → 0.9.6). | + +### 5.3 Untouched (intentionally) + +- Application code — the harness is a black-box HTTP client; no product code changes. (The "rate limiter skips anonymous when the secret is unset" behavior already exists; we rely on it, we don't add to it.) +- No new backend dependency — httpx is already locked. +- Frontend — out of scope (backend capacity test). + +## 6. 
CLI contract + +``` +uv run python loadtest/run.py \ + --target http://localhost:8000 \ + --path /analyze/octocat \ + --rps 100 \ + --duration 60 \ + --warmup 1 \ + [--ramp 10:50:100] # optional: stepped RPS stages to find the knee + [--max-inflight 500] # concurrency safety cap + [--p95-ms 250] # latency threshold for PASS/FAIL + [--max-error-rate 0.01] +``` + +Exit code 0 on PASS, non-zero on FAIL (so it can gate CI or scripts later). + +## 7. Exit thresholds (what "error budget holds" means) + +- **Error rate < 1%** of completed requests are non-2xx (ideally zero 5xx). +- **Achieved RPS ≥ 95% of target** — the dispatcher kept up; the server didn't force the generator to fall behind. +- **p95 latency < `--p95-ms`** (default 250 ms for a local warm hit; tune from the observed baseline). Local absolute numbers aren't comparable to prod — the load-bearing signals are *rate held* + *errors near zero* + *p95 not diverging from the single-request baseline*. + +The runbook documents using `--ramp` to find the knee (the RPS where p95 spikes or errors appear) and recording the max sustainable RPS. + +## 8. What gets verified in this slice vs deferred + +**Verified now (no laptop stress, in CI where possible):** +- `backend/tests/loadtest/test_stats.py` passes (percentiles, error rate, threshold logic). +- A light sanity run of the harness against the backend `/health` endpoint (backend only, no Docker, e.g. 10 RPS × 5 s) confirms the dispatcher, client, and summary work end-to-end. + +**Deferred to the operator (documented in the runbook):** +- The full local warm-`/analyze` 100 RPS run (requires Docker/SRH + backend + `GITHUB_TOKEN`). +- Recording the result (max sustainable RPS, p95, error rate) — lands in `docs/PROGRESS_LOG.md` when run. + +## 9. Exit criteria + +- [ ] `backend/loadtest/run.py` exists: open-loop dispatcher, warmup, parameterized CLI, summary + threshold check, correct exit code. 
+- [ ] Pure helpers (`percentile`, `summarize`, `evaluate_thresholds`) unit-tested in `backend/tests/loadtest/test_stats.py`; all pass. +- [ ] `backend/loadtest/README.md` runbook complete: SRH setup, prime-then-measure, ramp/interpret, point-at-deploy. +- [ ] Light `/health` sanity run succeeds (recorded in PROGRESS_LOG). +- [ ] `ruff check` + `ruff format --check` clean; backend suite green (existing + new stats tests). +- [ ] Docs ritual + version bump to 0.9.6; PLAN v0.9.6 row flipped ✅; tag `v0.9.6` + release. + +## 10. Out of scope + +- Running the full 100 RPS validation (operator step; hardware-gated). +- Frontend / `/u/[username]` page load testing (conflates Next render with API capacity). +- Cold-ingest (GitHub-hitting) capacity testing (would need GitHub mocking; different test). +- Distributed/multi-machine load generation (single-box open-loop is enough at 100 RPS). +- A CI-gating load test (the harness *can* gate via exit code, but wiring it into CI needs a hosted target + cache — out of scope here). +- Any application code change (cache shims, test-only modes). The existing rate-limiter-skip behavior is relied upon as-is. + +## 11. Implementation ordering + +1. **Pure stats helpers + their tests** (`percentile`, `summarize`, `evaluate_thresholds`) — TDD, no network. Commit. +2. **Open-loop dispatcher + CLI + httpx client** wiring around the helpers. Commit. +3. **Runbook** (`backend/loadtest/README.md`). Commit. +4. **Sanity run** against `/health`; record in PROGRESS_LOG. **Docs ritual + version bump + ship.** + +**Reversibility:** the harness is additive (new `loadtest/` dir + tests); it changes no product code, so reverting the slice removes only the tooling. 
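The load-model decision in §2 (open-loop rather than closed-loop) can be illustrated with a toy calculation. This is only a sketch and not part of the harness: the single-worker server, the 50 ms service time, the 10 ms dispatch interval, and the function names `closed_loop` and `open_loop` are all assumptions made for illustration. It uses integer milliseconds and no network or sleeping, so the numbers are exact.

```python
"""Toy comparison of closed-loop vs open-loop load generation.

Assumptions (illustrative only, not taken from the harness): one server
worker, a fixed service time per request, and an unbounded queue.
All times are integer milliseconds; nothing here sleeps or does I/O.
"""


def closed_loop(service_ms: int, window_ms: int) -> tuple[int, int]:
    """Send the next request only after the previous response arrives.

    Returns (requests_sent, worst_latency_ms).
    """
    now_ms, sent = 0, 0
    while now_ms < window_ms:
        now_ms += service_ms  # the generator blocks on the response
        sent += 1
    # Nothing ever queues, so every request sees exactly the service time.
    return sent, service_ms


def open_loop(service_ms: int, interval_ms: int, window_ms: int) -> tuple[int, int]:
    """Send on a fixed schedule regardless of outstanding responses.

    Returns (requests_sent, worst_latency_ms).
    """
    server_free_ms = 0  # when the single worker next becomes idle
    worst_latency_ms = 0
    sent = 0
    for arrival_ms in range(0, window_ms, interval_ms):
        start_ms = max(arrival_ms, server_free_ms)  # wait behind earlier requests
        server_free_ms = start_ms + service_ms
        worst_latency_ms = max(worst_latency_ms, server_free_ms - arrival_ms)
        sent += 1
    return sent, worst_latency_ms


if __name__ == "__main__":
    # Target: one request every 10 ms (100 RPS) for 1 s against a 50 ms server.
    print("closed-loop:", closed_loop(50, 1000))
    print("open-loop:  ", open_loop(50, 10, 1000))
```

Under these assumptions the closed-loop generator sends only 20 requests in the window, each at a flat 50 ms, so the report looks healthy even though the target rate was never reached. The open-loop generator sends all 100 and the worst latency grows to 4010 ms, which is the saturation signal the spec wants to surface. The toy queue is unbounded; the harness described above instead caps in-flight requests and counts skipped ticks as `dropped`.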
diff --git a/frontend/package.json b/frontend/package.json index c5d8efb..a261c64 100644 --- a/frontend/package.json +++ b/frontend/package.json @@ -1,6 +1,6 @@ { "name": "frontend", - "version": "0.9.5", + "version": "0.9.6", "private": true, "scripts": { "dev": "next dev", diff --git a/frontend/src/app/page.tsx b/frontend/src/app/page.tsx index 1d47093..efe5837 100644 --- a/frontend/src/app/page.tsx +++ b/frontend/src/app/page.tsx @@ -23,7 +23,7 @@ export default function Home() { transition={{ delay: 0.2, duration: 0.5 }} className="rounded-full border border-white/10 bg-white/5 px-3 py-1 text-xs font-medium uppercase tracking-wider text-muted-foreground" > - Deterministic engineering reports · v0.9.5 + Deterministic engineering reports · v0.9.6

diff --git a/frontend/src/components/results-view.tsx b/frontend/src/components/results-view.tsx index e491277..e92b6cb 100644 --- a/frontend/src/components/results-view.tsx +++ b/frontend/src/components/results-view.tsx @@ -352,7 +352,7 @@ export function ResultsView({