agjs · agjs · Jul 5, 2026 · Jul 3, 2026 · Jul 3, 2026 · Jul 3, 2026
diff --git a/.gitignore b/.gitignore
@@ -27,3 +27,4 @@ models.json
 /vite.config.ts
 /components.json
 /src/
+__pycache__/
diff --git a/apps/docs/astro.config.mjs b/apps/docs/astro.config.mjs
@@ -205,6 +205,7 @@ export default defineConfig({
           label: "Reference",
           items: [
             { label: "Commands", link: "/reference/commands/" },
+            { label: "Input editor", link: "/reference/input-editor/" },
             { label: "Rule catalog", link: "/reference/rules-catalog/" },
             { label: "Roadmap", link: "/reference/roadmap/" },
           ],

diff --git a/apps/docs/src/content/docs/agent/model-agent.mdx b/apps/docs/src/content/docs/agent/model-agent.mdx
@@ -20,12 +20,12 @@ One approved task can involve many agent cycles until the gate passes or tsforge
 | Group | Tools | When |
 | --- | --- | --- |
 | Core | `read`, `run`, `edit`, `create` | always |
-| Line edits | `edit_lines` | when hashline is enabled |
+| Line edits | `edit_lines` | always (line-number edits with hash verification) |
+| Script | `script` | always (programmatic tool calling — batch multi-step tool use in one program); withhold with `TSFORGE_NO_SCRIPT=1` for eval |
 | Navigation | `search`, `symbol_search`, `find_references`, `type_at`, `diagnostics`, `rename_symbol`, `move_file`, `organize_imports` | existing-code repos |
-| Git context | `git_context` | existing-code repos (read-only: diff/log/blame/show to scope a change); `TSFORGE_NO_GIT_TOOL=1` to withhold |
+| Git context | `git_context` | existing-code repos (read-only: diff/log/blame/show to scope a change) |
 | Web | `scaffold_web`, `scaffold_ui`, `scaffold_routes`, `add_dependency` | web builds |
-| Web research | `package_info`, `package_docs`, `web_fetch`, `web_search`, `web_browse` | when `TSFORGE_WEB=1` (no required API keys or paid browser/search service) |
-| Control | `yield_status` | end turn with a summary |
+| Web research | `package_info`, `package_docs`, `web_fetch`, `web_search`, `web_browse` | when **Web tools** is on in `/config` (no required API keys or paid browser/search service) |
 
 On greenfield specs, navigation tools are often withheld so the model focuses on creating files instead of exploring an empty tree. See [TypeScript language server](/lsp/typescript-server/).
 

diff --git a/apps/docs/src/content/docs/cli/interactive.mdx b/apps/docs/src/content/docs/cli/interactive.mdx
@@ -28,7 +28,7 @@ Most users run `tsforge` and stay in the interactive session.
 | `--no-gate` | skip auto gate detection |
 | `--web` | pre-scaffold web stack + web gate on first build message |
 | `--browser <html>` | append headless render check to gate |
-| `--plan` | force plan mode on (already the default for interactive sessions) |
+| `--plan` | force plan mode on for an interactive session — plan is the default anyway, so this only matters to override a repo that configured an autonomous `policy.mode`; ignored by one-shot/headless |
 | `--continue` / `-c` | resume latest saved session for this dir |
 | `--resume <id>` | resume a specific session |
 | `--log` | append JSONL event stream to `~/.tsforge/logs/` |
@@ -41,18 +41,21 @@ Model endpoint overrides: `TSFORGE_BASE_URL`, `TSFORGE_MODEL` — see [Environme
 | --- | --- |
 | `/help` | list commands |
 | `/plan` | toggle plan mode (on by default) |
+| `/config` | settings hub — model (switch/add), mode, gate, editable scope, and tools (web, TDD); each with a description + live value |
 | `/gate <cmd>` | set gate command (`/gate` alone clears) |
 | `/files <globs>` | set editable scope |
 | `/review [base]` | review your current change (logic, regressions, edge cases) |
 | `/map [status\|forget]` | build a structural map of the repo to prime the agent |
+| `/trace [logfile]` | summarize a `--log` run (calls, policy decisions, gate verdicts, turns-to-green) |
+| `/setup` | infer + write project conventions (the setup wizard) |
 | `/model [name]` | list models or switch active model |
 | `/sessions` | list saved sessions |
 | `/compact` | summarize conversation to free context |
 | `/clear` | reset conversation (keeps workspace + gate) |
 | `/cost` | rough token estimate |
 | `/metrics` | token totals + generation rate (tok/s) this session |
 | `/memory` | show learned failure→fix lessons (`/memory forget` clears them) |
-| `/exit` | quit |
+| `/exit` | quit (`/quit` is an alias) |
 
 Anything else is sent to the agent. While it runs, type to **steer** the next turn. Ctrl-C interrupts the current run.
 

diff --git a/apps/docs/src/content/docs/cli/plan-mode.mdx b/apps/docs/src/content/docs/cli/plan-mode.mdx
@@ -13,7 +13,7 @@ Plan mode is a safety rail for ambiguous work. The model can **read** your repo
 - When the plan looks right, reply **`approve`**, **`go`**, or **`lgtm`** — the model implements it
 - Web builds also accept **`yes`** / **`ok`** at the design checkpoint
 
-There is no disable *flag*: it's a mode you cycle with Shift+Tab. (`tsforge --plan` still forces it on for a one-off launch.)
+There is no disable *flag*: it's a mode you cycle with Shift+Tab. (`tsforge --plan` forces plan mode on for an interactive session even in a repo that configured an autonomous `policy.mode` — one-shot and headless runs are autonomous regardless.)
 
 ## What the model can do in plan mode
 

diff --git a/apps/docs/src/content/docs/eval/ab-testing.mdx b/apps/docs/src/content/docs/eval/ab-testing.mdx
@@ -3,17 +3,17 @@ title: A/B testing
 description: Run feature sweeps, compare edit mechanisms, and land defaults from measured wins.
 ---
 
-Compare [stream rules (TTSR)](/uplift/ttsr/), [hashline](/uplift/hashline/), and [write diagnostics](/uplift/write-diagnostics/) settings across benchmark runs before changing feature defaults. See [Big picture](/big-picture/) for what each feature does.
+Compare feature settings across benchmark runs before changing a default. The sweep harness A/Bs any **tool-availability dimension** — whether the model is offered a given tool — by toggling the env var behind it per run. See [Big picture](/big-picture/) for what each feature does.
 
-## Feature flags
+## Sweepable dimensions
 
-| Variable | Default | Disable |
+| Dimension | On → | Off → |
 | --- | --- | --- |
-| `TSFORGE_TTSR` | ON | `=0` |
-| `TSFORGE_HASHLINE` | ON | `=0` |
-| `TSFORGE_LSP_WRITE_FEEDBACK` | ON | `=0` |
+| `git` | `git_context` available | `TSFORGE_NO_GIT_TOOL=1` |
+| `script` | `script` tool available | `TSFORGE_NO_SCRIPT=1` |
+| `web` | web research tools available (`TSFORGE_WEB=1`) | `TSFORGE_WEB=0` or unset (the default for eval runs) |
 
-Full flag reference: [Environment variables](/reference/flags/).
+Core uplifts ([TTSR](/uplift/ttsr/), [hashline](/uplift/hashline/), [write diagnostics](/uplift/write-diagnostics/)) are always on and no longer sweepable — they landed as defaults from earlier sweeps. Full flag reference: [Environment variables](/reference/flags/).
 
 :::note
 Running a sweep drives a real model, so you need an OpenAI-compatible endpoint (the default is local qwen at `http://localhost:8000/v1`; override with `TSFORGE_BASE_URL`/`TSFORGE_MODEL`/`TSFORGE_API_KEY`). The corpus, analysis, and report tooling below ship with the repo and are exercised by the test suite, but the runs themselves need a model.
@@ -42,29 +42,29 @@ A greenfield seed is regenerated from scratch (the sweep deletes the task's file
 
 `bun run eval:sweep` accepts `TSFORGE_FEATURE_VARIANTS` — a comma-separated list of dimensions to sweep (cartesian product).
 
-### Hashline on/off
+### script on/off
 
 ```bash
-TSFORGE_SEED=math \
+TSFORGE_SEED=checkout \
 TSFORGE_TEMPS=0 \
 TSFORGE_REPEATS=2 \
-TSFORGE_FEATURE_VARIANTS=hashline \
+TSFORGE_FEATURE_VARIANTS=script \
 bun run eval:sweep
 ```
 
-Creates four runs: `math-hashline=on-t0-...` and `math-hashline=off-t0-...` (two repeats each).
+Creates four runs: `checkout-script=on-t0-...` and `checkout-script=off-t0-...` (two repeats each).
 
-### TTSR × hashline
+### git × script
 
 ```bash
-TSFORGE_SEED=orders \
+TSFORGE_SEED=fix-regression \
 TSFORGE_TEMPS=0.5 \
 TSFORGE_REPEATS=3 \
-TSFORGE_FEATURE_VARIANTS=ttsr,hashline \
+TSFORGE_FEATURE_VARIANTS=git,script \
 bun run eval:sweep
 ```
 
-Runs `3 repeats × 2 temps × 4 variants = 24` runs with IDs like `orders-ttsr=on,hashline=off-t0.5-...`.
+Produces `3 repeats × 1 temp × 4 variants = 12` runs with IDs like `fix-regression-git=on,script=off-t0.5-...`.
 
 ### git_context on/off
 
@@ -86,7 +86,7 @@ Each run directory contains `run.log` (human transcript) and `result.json` (stru
 
 ```bash
 # newest sweep under evals/runs, comparing every variant to the all-off baseline
-TSFORGE_BASELINE="ttsr=off,hashline=off temp=0" bun run eval:report
+TSFORGE_BASELINE="git=off,script=off temp=0" bun run eval:report
 
 # or point at a specific sweep file
 bun run eval:report evals/runs/sweep-math-20260613-120000.json
@@ -97,8 +97,8 @@ It prints the table and writes it next to the sweep JSON as `….report.md`:
 ```
 | Variant | Runs | Pass | 95% CI | Cycles | Ms | Quality | vs baseline |
 | --- | --- | --- | --- | --- | --- | --- | --- |
-| ttsr=off,hashline=off temp=0 | 10 | 60% | 31%–83% | 6.1 | 41000 | 3.8 | baseline |
-| ttsr=on,hashline=on temp=0   | 10 | 90% | 60%–98% | 4.7 | 33000 | 4.2 | +30% (z=2.13) * |
+| git=off,script=off temp=0 | 10 | 60% | 31%–83% | 6.1 | 41000 | 3.8 | baseline |
+| git=on,script=on temp=0   | 10 | 90% | 60%–98% | 4.7 | 33000 | 4.2 | +30% (z=2.13) * |
 ```
 
 Wilson intervals (not naive ±) keep the bounds sane at small N, and the z-test tells you whether a pass-rate gap is signal or noise — the bar for "measured wins" before flipping a default.
@@ -109,23 +109,21 @@ Pass rate tells you *how often* a variant failed; the **failure breakdown** tell
 
 ```
 ### Failure breakdown
-- ttsr=off,hashline=off temp=0: type-error×3, no-progress×1
-- ttsr=on,hashline=on temp=0: type-error×1
+- git=off,script=off temp=0: type-error×3, no-progress×1
+- git=on,script=on temp=0: type-error×1
 ```
 
 Each failed run is classified from its event stream into one of: `type-error`, `lint-rule`, `hallucinated-import`, `tool-malformed`, `edit-reject`, `degeneration`, `no-progress`, `build-fail`, `browser-fail`, `route-phantom`, or `timeout`. This turns a sweep from "feature X passes more" into "feature X eliminates the `type-error` failures" — pointing at the next rule, prompt, or fixer to build. The same classifier powers the `failure class` line in [`cli-metrics`](/observability/metrics/) for a single `--log` run.
 
 ## Compare edit mechanisms
 
-After a sweep, use `bun run eval:benchmark` to compare edit tool performance:
+`bun run eval:benchmark` reports edit-tool performance across a set of run directories — useful for spotting how `edit` vs `edit_lines` behave, stale-anchor recovery rates, and token cost across models or seeds:
 
 ```bash
-bun run eval:benchmark \
-  evals/money-hashline=on-t0-* \
-  evals/money-hashline=off-t0-*
+bun run eval:benchmark evals/checkout-*
 ```
 
-Output table compares variants on:
+Output table compares runs on:
 
 | Metric | Meaning |
 | --- | --- |
@@ -141,8 +139,7 @@ Output table compares variants on:
 ```bash
 bun run eval:benchmark \
   --json evals/comparison.json \
-  evals/money-hashline=on-t0-* \
-  evals/money-hashline=off-t0-*
+  evals/checkout-*
 ```
 
 ## Run artifacts
@@ -154,10 +151,10 @@ Each run directory contains:
 
 ```json
 {
-  "seed": "money",
-  "runId": "money-hashline=on-t0-20260612-120000-1",
+  "seed": "checkout",
+  "runId": "checkout-script=on-t0-20260612-120000-1",
   "temperature": 0,
-  "features": { "TSFORGE_HASHLINE": "1" },
+  "features": { "TSFORGE_NO_SCRIPT": "0" },
   "status": "done",
   "cycles": 5,
   "ms": 42000,
@@ -183,11 +180,11 @@ Each run directory contains:
 
 ## How to read results
 
-**Edit success** — if `hashline=on` has higher `edit_lines` success than `hashline=off` `edit` success, hashline is reducing rejections.
+**Edit success** — a higher `edit_lines` success rate than the plain `edit` success rate means the hashline mechanism is reducing stale-anchor failures.
 
-**Stale recovery** — non-zero recovery counts on hashline-on runs show 3-way merge is active; correlate with pass rate.
+**Stale recovery** — non-zero recovery counts show the 3-way merge is active; correlate with pass rate.
 
-**Turns to green** — lower on feature-on variants means less loop churn.
+**Turns to green** — lower on a variant means less loop churn.
 
 **Token efficiency** — smaller `mean args (bytes)` at similar success rate is better.
 

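Review note on the `95% CI` column in `ab-testing.mdx`: the sample report rows (60% and 90% pass over 10 runs) can be checked against the standard Wilson score formula. The sketch below is not the repo's report code. The names `wilsonInterval` and `pct` are made up for illustration, and it assumes the report uses z = 1.96.

```typescript
// Wilson score interval for a binomial pass rate.
// Assumption: z = 1.96 (roughly 95% coverage); the report's actual z is not shown in the diff.
function wilsonInterval(
  passes: number,
  runs: number,
  z = 1.96,
): { low: number; high: number } {
  if (runs <= 0) throw new RangeError("runs must be positive");
  const p = passes / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  // The center is pulled toward 0.5, which keeps bounds sane at small N.
  const center = (p + z2 / (2 * runs)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs))) / denom;
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}

// Hypothetical formatting helper: whole-number percentages, as in the table.
const pct = (x: number): string => `${Math.round(x * 100)}%`;

const baseline = wilsonInterval(6, 10);
const variant = wilsonInterval(9, 10);
console.log(`${pct(baseline.low)}-${pct(baseline.high)}`); // prints "31%-83%"
console.log(`${pct(variant.low)}-${pct(variant.high)}`); // prints "60%-98%"
```

Both results match the intervals in the sample table, so the documented bounds are consistent with a Wilson interval at N = 10.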
diff --git a/apps/docs/src/content/docs/guardrails/rule-packs.mdx b/apps/docs/src/content/docs/guardrails/rule-packs.mdx
@@ -21,14 +21,15 @@ These load without waiting for a dependency match:
 
 | ID | What it covers |
 | --- | --- |
+| `generic-ts` | Core TypeScript safety rules for every project (the bundled ESLint safety config) |
 | `env-access` | Validated env access, no `process.exit` in libraries |
 | `module-boundaries` | Layering, no React in services |
 | `code-flow` | Deterministic time/random, early returns |
 | `comment-hygiene` | No narration, PR refs, or historical comments |
 | `security` | Command injection, ReDoS, DOM XSS, silent catch blocks, no tokens in storage |
 | `runtime-boundaries` | Open redirects, SSRF fetches, prototype pollution, webhook verify, upload limits |
 
-`generic-ts` is a detection label only — strict TypeScript comes from `tsc` and the bundled ESLint config.
+`generic-ts` runs on every project alongside `tsc`; stack detection layers framework-specific packs (`react`, `elysia`, `nextjs`, …) on top.
 
 ## Pack list
 

diff --git a/apps/docs/src/content/docs/integrations/web-tools.mdx b/apps/docs/src/content/docs/integrations/web-tools.mdx
@@ -3,7 +3,7 @@ title: Web research (no API keys)
 description: "Opt-in web_fetch, web_search, package_info, package_docs, and web_browse tools — no paid search/browser API, no required service key."
 ---
 
-Set `TSFORGE_WEB=1` to give the agent read-only internet research tools. They're built for **no required API keys and no paid vendor coupling**: npm metadata comes from the configured registry, search defaults to DuckDuckGo's keyless HTML endpoint, pages are extracted locally, and browser rendering uses local Playwright/Chromium when available. Off by default, so a run without the flag has no network reach beyond your model endpoint.
+Interactive sessions get read-only internet research tools **on by default** (an assistant that can't look things up is silly); toggle them under **Web tools** in [`/config`](/cli/interactive/). They're built for **no required API keys and no paid vendor coupling**: npm metadata comes from the configured registry, search defaults to DuckDuckGo's keyless HTML endpoint, pages are extracted locally, and browser rendering uses local Playwright/Chromium when available. One-shot and eval runs stay **off** unless you set `TSFORGE_WEB=1`, so headless sweeps have no network reach beyond your model endpoint.
 
 ```bash
 TSFORGE_WEB=1 tsforge "update the deprecated API call — check the library's current docs"
@@ -29,13 +29,13 @@ For current TypeScript/library work, ask the agent to search the official host f
 Check the current TanStack Query docs before changing this hook. Use domain-scoped web search if needed.
 ```
 
-## Why opt-in
+## When they're active
 
-The tools are read-only and offline-safe, but web access is still more reach than the agent has by default — so it's a deliberate flag, not an always-on capability. Under a policy mode that denies `network` (e.g. `ci`), the tools are unavailable even with the flag set. See [Permissions & policy](/guardrails/policy/).
+The tools are read-only and offline-safe. Interactive sessions enable them by default, but one-shot and eval runs stay offline unless you opt in, so headless sweeps do not depend on live web results. Under a policy mode that denies `network` (e.g. `ci`), the tools are unavailable even with `TSFORGE_WEB=1` set. See [Permissions & policy](/guardrails/policy/).
 
 | Env var | Default | Effect |
 | --- | --- | --- |
-| `TSFORGE_WEB` | off | enable keyless research tools (`=1`) |
+| `TSFORGE_WEB` | on interactive, off one-shot/eval | force keyless research tools on (`=1`) or off (`=0`) |
 | `TSFORGE_NPM_REGISTRY` | npm registry | registry used by `package_info` / `package_docs` |
 | `TSFORGE_SEARXNG_URL` | unset | route `web_search` to a SearXNG instance you already run |
 | `TSFORGE_WEB_SEARCH_BACKEND` | auto | `duckduckgo` or `searxng`; `searxng` fails closed if no SearXNG URL is set |

diff --git a/apps/docs/src/content/docs/loop/gate-floor.mdx b/apps/docs/src/content/docs/loop/gate-floor.mdx
@@ -111,6 +111,16 @@ If the session scaffolds a new browser app (`scaffold_web` or `tsforge --web`),
 
 Details: [Web scaffolding](/scaffold/web/).
 
+### Staged gate progress & failures
+
+The web gate runs as **named stages** rather than one opaque `&&` chain. Each stage prints a `━━ <label> ━━` banner and streams its output live; a passing stage prints `✓ <label>`. On the first failure the runner prints
+
+```
+✗ <label> FAILED (exit N)
+```
+
+and **stops** — later stages don't run, and the failing stage's exit code is preserved. So when a build goes red, both you and the agent's feedback loop see *which* stage broke (`vite build`, `typecheck`, `lint`, `type-aware lint`, `stub check`, `format`, `tests`, or `browser smoke`) instead of a wall of interleaved output. The core (non-web) gate is short enough that it stays a plain command chain.
+
 Add a one-off page render check with `--browser path/to/index.html`.
 
 ### Accessibility, screenshots, and a perf budget

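Review note on the staged gate section in `gate-floor.mdx`: the behavior it describes (a banner per stage, stop at the first failure, preserve the failing exit code) can be sketched as below. This is an illustration, not the tsforge runner. `Stage`, `runStages`, and the commands passed in are hypothetical.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical shape: a display label plus the shell command for that stage.
interface Stage {
  label: string;
  command: string;
}

// Runs stages in order and stops at the first failure.
// Returns 0 when every stage passes, otherwise the failing stage's exit code.
function runStages(
  stages: Stage[],
  log: (line: string) => void = console.log,
): number {
  for (const stage of stages) {
    log(`━━ ${stage.label} ━━`);
    // stdio "inherit" streams the stage's output live to the terminal.
    const result = spawnSync(stage.command, { shell: true, stdio: "inherit" });
    // status is null when the process was killed by a signal; treat that as failure.
    const code = result.status ?? 1;
    if (code !== 0) {
      log(`✗ ${stage.label} FAILED (exit ${code})`);
      return code; // later stages never run
    }
    log(`✓ ${stage.label}`);
  }
  return 0;
}
```

A caller might pass something like `runStages([{ label: "typecheck", command: "tsc --noEmit" }, { label: "tests", command: "bun test" }])` and hand the return value to `process.exit`. Those two commands are examples only, not the documented stage list.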
diff --git a/apps/docs/src/content/docs/loop/greenfield.mdx b/apps/docs/src/content/docs/loop/greenfield.mdx
@@ -54,10 +54,6 @@ Each role can run on its own model (names from your [models.json](/inference/mod
 tsforge run kanban "build a kanban board"
 ```
 
-## Contract negotiation (experimental)
-
-Set `TSFORGE_CONTRACT=1` to make the generator and evaluator agree a **build contract** for each feature *before* building — the generator proposes "I'll build X, verified by Y" and the evaluator pushes back until it's concrete. The agreed contract then anchors the implementation, and the negotiation is saved to `.tsforge/greenfield/contracts/<feature>.md`. Off by default — it's unproven and adds model calls.
-
 ## Unattended runs & scheduling
 
 Greenfield runs are long and headless-friendly. There's no built-in scheduler — wire one with your OS:

diff --git a/apps/docs/src/content/docs/loop/spec-runner.mdx b/apps/docs/src/content/docs/loop/spec-runner.mdx
@@ -29,7 +29,7 @@ Outputs include per-task status (`done`, `stuck`, interrupted) and a final pass/
 ```bash
 bun run eval:spec
 
-TSFORGE_SEED=money TSFORGE_FEATURE_VARIANTS=hashline \
+TSFORGE_SEED=money TSFORGE_FEATURE_VARIANTS=script \
   bun run eval:sweep
 ```
 

diff --git a/apps/docs/src/content/docs/loop/validation.mdx b/apps/docs/src/content/docs/loop/validation.mdx
@@ -39,7 +39,7 @@ tsforge's primary stop is **lack of progress, not a raw turn count**. Two guards
 - **Same-error persistence** — if one specific error (the same `file` + `rule`) survives **5 consecutive** fix cycles, tsforge stops, even if _other_ errors are changing around it. The stop names the blocker: `stuck on no-explicit-any in src/views/Foo/index.tsx after 5 attempts (last: …)`. Interactively, you get that diagnosis and the prompt back — the session stays alive, so you can re-steer.
 - **Whole-set stall** — a coarser net: the entire error set unchanged for 6 cycles.
 
-The **turn cap** is only a runaway backstop now. Interactive sessions ride a high ceiling (≈250 turns) so long, productive back-and-forth is never cut off; headless/eval runs keep a real cap (40, or 180 for web builds) since no human is present to intervene.
+The **turn cap** is only a runaway backstop now. Interactive sessions ride a high ceiling (≈250 turns) so long, productive back-and-forth is never cut off; headless/eval runs keep a real cap (40, or 400 for web builds) since no human is present to intervene.
 
 When the gate fails, tsforge sends structured errors (file, line, rule name, message) back to the model, not a generic failure blob. That is what makes repair workable.
 

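Review note on the same-error persistence guard described in the `validation.mdx` context lines: a minimal sketch of the idea, assuming "survives 5 consecutive fix cycles" means the same `file` + `rule` pair appears in five cycles in a row. `GateError` and `PersistenceGuard` are hypothetical names, not tsforge internals.

```typescript
// Hypothetical minimal error shape: the guard keys on file + rule only.
interface GateError {
  file: string;
  rule: string;
}

class PersistenceGuard {
  private streaks = new Map<string, number>();
  private readonly limit: number;

  constructor(limit = 5) {
    this.limit = limit;
  }

  // Feed the errors from one fix cycle. Returns the first error whose
  // consecutive-cycle streak has reached the limit, or null to keep going.
  observe(errors: GateError[]): GateError | null {
    const next = new Map<string, number>();
    let stuck: GateError | null = null;
    for (const e of errors) {
      const key = `${e.file}\u0000${e.rule}`;
      if (next.has(key)) continue; // count each file + rule once per cycle
      const streak = (this.streaks.get(key) ?? 0) + 1;
      next.set(key, streak);
      if (streak >= this.limit && stuck === null) stuck = e;
    }
    // Errors absent from this cycle are dropped, so their streak resets.
    this.streaks = next;
    return stuck;
  }
}
```

Because streaks are tracked per key, other errors changing around the blocker do not reset its count, which matches the "even if other errors are changing around it" wording.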
diff --git a/apps/docs/src/content/docs/lsp/typescript-server.mdx b/apps/docs/src/content/docs/lsp/typescript-server.mdx
@@ -27,9 +27,9 @@ Offered when tsforge detects real code to explore (existing repo, resumed sessio
 | `move_file` | move/rename a file and rewrite every importer |
 | `organize_imports` | sort and clean imports |
 
-Disable navigation tools: `TSFORGE_NO_LSP_TOOLS=1`. Disable write feedback: `TSFORGE_LSP_WRITE_FEEDBACK=0`.
+Navigation and write feedback (instant per-file type diagnostics after each edit) are always on for real work; navigation can be withheld for eval/headless runs with `TSFORGE_NO_LSP_TOOLS=1`.
 
-On existing repos the model is also offered `git_context` — read-only, structured access to history and diffs (scope a fix to what changed). It is git-backed, not part of the language server, so `TSFORGE_NO_LSP_TOOLS` does not affect it; disable it with `TSFORGE_NO_GIT_TOOL=1`. See [Git context](/reference/flags/#git-context).
+On existing repos the model is also offered `git_context` — read-only, structured access to history and diffs (scope a fix to what changed). It is git-backed, not part of the language server, so `TSFORGE_NO_LSP_TOOLS` does not affect it; withhold it for eval/headless runs with `TSFORGE_NO_GIT_TOOL=1`. See [Git context](/reference/flags/#git-context).
 
 ## Safe auto-fixes