orban · Mar 5, 2026
diff --git a/README.md b/README.md
@@ -35,6 +35,7 @@ PreCompact (pre_compact.py)      SessionStart (reload.py, nudge.py)
 - Turn count, compaction count, compaction history
 - Git context — branch, status, recent commits
 - Read tracking — last 5 unique file reads
+- Hook telemetry — per-component invocation counters and extraction attempt/run counts
 
 **Semi-automatically** (via markers in assistant output):
 - Objective, hypothesis, next actions, open questions — parsed from `<!-- relay:field value -->` HTML comments at the end of responses
@@ -84,10 +85,35 @@ All commands are accessed via `/relay`:
 | `/relay` or `/relay status` | Show current workspace state |
 | `/relay sync` | Update all semantic fields (objective, status, hypothesis, next_actions, open_questions) |
 | `/relay pack` | Manually trigger a pack cycle |
+| `/relay costs` | Show per-component estimated cost attribution from invocation telemetry |
 | `/relay objective <text>` | Set or show the workspace objective |
 | `/relay forget <path-or-id>` | Remove an artifact from the registry and working set |
 | `/relay reset` | Delete `.agent-workspace/` and start fresh |
 
+## Cost attribution estimator
+
+Relay records invocation counters for `post_tool_use`, `pack`, `pre_compact`, `reload`, `nudge`, and `extract_state` under `STATE.json.metrics.component_invocations`. Each pack cycle computes `metrics.last_cost_report` using per-invocation rates from `config.cost_model`.
+
+Example interpretation:
+- `subtotal = invocations * unit_cost` per component
+- `total_estimated_cost = sum(subtotals)`
+- extraction attempts/runs are tracked separately to show how often extraction was considered vs actually spawned
+
+You can tune rates in `STATE.json`:
+
+```json
+{
+  "config": {
+    "cost_model": {
+      "pack": 0.005,
+      "extract_state": 0.02
+    }
+  }
+}
+```
+
+These values are modeled internal estimates for relative attribution, not measured cloud billing data.
+
 ## Hooks
 
 Defined in `hooks/hooks.json`:

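The cost attribution arithmetic described in the README hunk above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `pack.py`: the function name `estimate_costs` and every report key other than `total_estimated_cost` are assumptions, the rates are the defaults from `init_workspace.py`, and the invocation counts are made up.

```python
def estimate_costs(invocations: dict, cost_model: dict) -> dict:
    """Hypothetical helper: subtotal = invocations * unit_cost per component."""
    rows = {}
    for component, count in invocations.items():
        # Assumption: a component with no configured rate contributes nothing.
        unit_cost = cost_model.get(component, 0.0)
        rows[component] = {
            "invocations": count,
            "unit_cost": unit_cost,
            "subtotal": round(count * unit_cost, 6),
        }
    # total_estimated_cost = sum(subtotals)
    total = round(sum(row["subtotal"] for row in rows.values()), 6)
    return {"components": rows, "total_estimated_cost": total}


# Default rates from init_workspace.py; the counts are illustrative only.
rates = {"post_tool_use": 0.001, "pack": 0.003, "extract_state": 0.01}
counts = {"post_tool_use": 40, "pack": 4, "extract_state": 2}
report = estimate_costs(counts, rates)
print(report["total_estimated_cost"])
```

With these counts, `post_tool_use` dominates the total even though it has the lowest unit rate, which is the kind of relative signal the estimator is meant to surface.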
diff --git a/commands/relay.md b/commands/relay.md
@@ -1,7 +1,7 @@
 ---
 name: relay
 description: View and manage the relay workspace state (.agent-workspace/)
-argument-hint: "[status|sync|reset|pack|forget|objective]"
+argument-hint: "[status|sync|reset|pack|costs|forget|objective]"
 ---
 
 # /relay — Workspace state manager
@@ -58,6 +58,22 @@ Do this by running: `python3 ${CLAUDE_PLUGIN_ROOT}/hooks/scripts/pack.py` with a
 
 After packing, show: "Packed. Turn [N], [X] artifacts, [Y] in working set."
 
+## /relay costs
+
+Show estimated operational cost attribution by relay component.
+
+1. Read `.agent-workspace/STATE.json`
+2. If `metrics.last_cost_report` exists, use it; otherwise compute a report by applying `config.cost_model` rates to `metrics.component_invocations`.
+3. Present a compact table with columns:
+   - Component
+   - Invocations
+   - Unit cost
+   - Subtotal
+4. Show total estimated cost and extraction attempts/runs (`metrics.extract_state_attempts` / `metrics.extract_state_runs`).
+5. Include this assumption line: "Modeled per-invocation estimates; not measured billing data."
+
+If workspace doesn't exist, say: "No relay workspace in this project. One will be created automatically when you start working."
+
 ## /relay forget <path-or-id>
 
 Remove an artifact from the working set (and optionally from the registry).

diff --git a/docs/plans/experiment-results.md b/docs/plans/experiment-results.md
@@ -0,0 +1,139 @@
+---
+title: Artifact tracking A/B experiment results
+date: 2026-03-04
+---
+
+# Artifact tracking A/B experiment results
+
+## Experiment design
+
+Two-phase experiment testing whether artifact tracking in RELOAD.md helps agents recover after compaction.
+
+### Phase 1: Baseline (no compaction)
+
+Both conditions implement a bookmark feature from scratch. Measures whether artifacts affect normal task performance.
+
+### Phase 2: Simulated post-compaction recovery
+
+Both conditions start from identical state (bookmarks done, 317 tests passing) with a RELOAD.md injected. Must implement a second feature (likes). Measures re-orientation behavior.
+
+## Setup
+
+- **Codebase:** `/tmp/relay-stress-test` (288 tests, 3,400 lines across models/services/routes/tests)
+- **Model:** Sonnet (`claude --print --model sonnet --dangerously-skip-permissions`)
+- **Output:** `stream-json` for tool call analysis
+- **Condition A:** Full RELOAD.md (Working Set table, Recently Consulted Files, File Snapshots ref, dirty artifacts, 4 After Resuming steps)
+- **Condition B:** Stripped RELOAD.md (only objective, hypothesis, next actions, history, branch, 2 After Resuming steps)
+
+## Phase 1 results: Baseline (bookmark feature, no compaction)
+
+| Metric | A (full) | B (stripped) | Delta |
+|--------|----------|--------------|-------|
+| Total tool calls | 48 | 42 | A +6 |
+| Read calls | 23 | 21 | A +2 |
+| Glob/Grep | 1 | 1 | 0 |
+| Write calls | 4 | 4 | 0 |
+| Edit calls | 10 | 7 | A +3 |
+| Bash calls | 1 | 1 | 0 |
+| Tests passing | 320 (288+32) | 314 (288+26) | A +6 |
+| Read order | app -> models -> services -> routes -> tests -> utils | same | identical |
+| Compaction triggered | No | No | — |
+
+**Phase 1 finding:** No meaningful difference in a single run. Both conditions explored identically. Differences (A +6 tests, +3 edits) are within expected LLM variance for n=1, though it's also possible richer context nudged A toward more thorough output.
+
+## Phase 2 results: Simulated post-compaction recovery (likes feature)
+
+Starting state: Bookmarks fully implemented, 317 tests passing. RELOAD.md placed in `.agent-workspace/derived/`. Prompt says "you were compacted, read RELOAD.md, continue with likes."
+
+### Re-orientation behavior
+
+| Metric | A (full) | B (stripped) | Delta |
+|--------|----------|--------------|-------|
+| Total tool calls | 38 | 34 | A +4 |
+| Read calls | 10 | 12 | **B +2** |
+| Glob calls | 1 | 0 | A +1 |
+| Grep calls | 0 | 0 | 0 |
+| Write calls | 4 | 4 | 0 |
+| Edit calls | 10 | 5 | A +5 |
+| Bash calls (incl. grep/ls) | 4 | 5 | **B +1** |
+| **Reads before first edit** | **10** | **12** | **B +2** |
+| First edited file | like.py | like.py | same |
+| All tests pass | 349 | 353 | both pass |
+
+### Tool call sequences (first 15)
+
+**Condition A (full RELOAD.md):**
+```
+1. Read RELOAD.md
+2. Read bookmark.py          ← from Working Set
+3. Read bookmark_service.py  ← from Working Set
+4. Read bookmarks.py         ← from Working Set
+5. Read test_bookmarks.py    ← from Working Set
+6. Read app.py               ← from Working Set
+7. Read post_routes.py       ← from Recently Consulted
+8. Read conftest.py          ← from Recently Consulted
+9. Read post_service.py      ← from Recently Consulted
+10. Bash(grep "like")        ← search existing code
+11. Glob tests/*.py
+12-13. Bash(grep)
+14-15. TodoWrite             ← start implementing
+```
+
+**Condition B (stripped RELOAD.md):**
+```
+1. Read RELOAD.md
+2. Read bookmark.py          ← knew from prompt context
+3. Read bookmark_service.py
+4. Read bookmarks.py
+5. Read test_bookmarks.py
+6. Read app.py
+7. Read db.py                ← EXTRA: exploring unknown file
+8. Read post_routes.py
+9. Bash(ls)                  ← EXTRA: directory listing to orient
+10. Bash(ls utils/ && ls)    ← EXTRA: more exploration
+11. Bash(grep "bookmark")    ← EXTRA: searching for patterns
+12. Read post_routes.py      ← RE-READ (already read at step 8)
+13. Read conftest.py
+14. Read post_service.py
+15. Bash(grep "like")
+```
+
+### Key observations
+
+All observations below are from a single run per condition. They're directional signals, not conclusions.
+
+1. **Both conditions found the same files.** Neither got "lost." The prompt mentioned bookmarks were done, so both knew to look at bookmark files. This hint gave Condition B a significant leg up it wouldn't have in a real compaction scenario.
+
+2. **B did more exploratory work.** Two extra `ls` calls (steps 9-10), one `grep` for bookmark patterns (step 11), and a re-read of `post_routes.py` (step 12). This is consistent with the re-orientation cost of not having the Working Set and Recently Consulted lists, though a single run can't rule out normal variance.
+
+3. **B read a file A didn't.** B read `db.py` (step 7), presumably trying to understand the data layer since it had no artifact context telling it which files mattered.
+
+4. **A went straight from reads to implementation.** After reading the 8 files listed in RELOAD.md (5 from Working Set, 3 from Recently Consulted), A did targeted grep searches and started writing. B needed directory listings first.
+
+5. **A used more Edits (10 vs 5).** A was more confident editing existing files, while B created more from scratch. This might indicate artifact context gave A better knowledge of existing file structure, or it might be normal LLM variance.
+
+6. **Delta: 2 extra reads + 1 extra Bash call for B** (per the table above). Falls below the experiment's threshold of "3+ more turns" or "5+ more search calls." The difference is real but small. With n=1, this delta could easily swing to 0 or 5 in another run.
+
+## Decision
+
+**Modify.** Based on this single-run experiment, artifact tracking appears to help slightly but not enough to justify the full ~900-byte overhead. These are informed bets, not proven conclusions.
+
+Recommendations (confidence noted):
+- **Keep:** Working Set file paths — A's direct navigation vs B's exploratory `ls` calls is the strongest signal in the data (medium confidence)
+- **Keep:** Recently Consulted Files — same reasoning, prevents redundant exploration (medium confidence)
+- **Drop:** File Snapshots reference — neither condition used it; agents Read files directly (high confidence, consistent across both phases)
+- **Drop:** Dirty artifacts list in Git Context — neither condition ran `git status` post-compaction (medium confidence, n=1)
+- **Simplify:** After Resuming to 2 steps, TASKLOG + relay sync (high confidence, the other steps were never observed in use)
+
+Estimated savings: ~400 bytes of the ~900-byte overhead. Keep the useful path lists, drop the rarely-used snapshot references.
+
+These recommendations could be wrong. The experiment likely underestimates artifact tracking's value because simulated compaction is easier than real compaction (see caveats). If anything, err toward keeping more context rather than less.
+
+## Caveats
+
+These caveats all push in the same direction: the experiment likely underestimates the value of artifact tracking. Every limitation made it easier for Condition B to succeed without artifact context.
+
+1. **N=1 per condition.** LLM non-determinism means these are directional signals, not statistically significant results. The observed delta of 2 extra reads could easily be 0 or 5 in another run.
+2. **Simulated, not real, compaction.** The agent was told "you were compacted" but never actually had in-context history to lose. Real compaction means the agent loses working memory it built up across dozens of turns — a harder recovery problem than starting fresh with a RELOAD.md. This is the biggest limitation: the experiment tests "does a file list help a fresh agent?" not "does a file list help an agent that just lost its memory?"
+3. **Task prompt provided hints.** The prompt said "bookmarks are done, do likes" which told both conditions exactly which prior feature to reference as a pattern. Both immediately navigated to `bookmark.py`. Without that hint, Condition B would have needed to discover the codebase structure from scratch — exactly the scenario artifact tracking is designed for.
+4. **Small codebase.** At 3,400 lines, the entire codebase fits easily in context. For larger codebases where you can't read everything, artifact context would matter more.
diff --git a/hooks/scripts/init_workspace.py b/hooks/scripts/init_workspace.py
@@ -36,6 +36,29 @@
     "turn_count": 0,
     "last_marker_turn": 0,
     "last_extraction_turn": 0,
+    "metrics": {
+        "component_invocations": {
+            "post_tool_use": 0,
+            "pack": 0,
+            "pre_compact": 0,
+            "reload": 0,
+            "nudge": 0,
+            "extract_state": 0,
+        },
+        "extract_state_attempts": 0,
+        "extract_state_runs": 0,
+        "last_cost_report": None,
+    },
+    "config": {
+        "cost_model": {
+            "post_tool_use": 0.001,
+            "pack": 0.003,
+            "pre_compact": 0.002,
+            "reload": 0.001,
+            "nudge": 0.001,
+            "extract_state": 0.01,
+        }
+    },
 }
 
 EXEC_PACKET_TEMPLATE = """\
@@ -122,8 +145,8 @@ def ensure_workspace(cwd: str) -> Path:
 def load_state(ws: Path) -> dict:
     state_file = ws / "STATE.json"
     if state_file.exists():
-        return json.loads(state_file.read_text())
-    return copy.deepcopy(INITIAL_STATE)
+        return ensure_state_schema(json.loads(state_file.read_text()))
+    return ensure_state_schema(copy.deepcopy(INITIAL_STATE))
 
 
 def save_state(ws: Path, state: dict):
@@ -146,6 +169,92 @@ def read_hook_stdin() -> dict:
         return {}
 
 
+CONFIG_DEFAULTS = {
+    "stale_session_threshold": 5,
+    "staleness_mild": 6,
+    "staleness_loud": 16,
+    "sync_reminder_interval": 8,
+    "compaction_warn_buffer": 15,
+    "compaction_urgent_buffer": 5,
+    "compaction_fresh_threshold": 3,
+    "extraction_interval": 5,
+    "max_recently_read": 20,
+    "cost_model": {
+        "post_tool_use": 0.001,
+        "pack": 0.003,
+        "pre_compact": 0.002,
+        "reload": 0.001,
+        "nudge": 0.001,
+        "extract_state": 0.01,
+    },
+}
+
+
+def ensure_state_schema(state: dict) -> dict:
+    """Backfill required keys for older STATE.json shapes."""
+    if not isinstance(state.get("workspace"), dict):
+        state["workspace"] = copy.deepcopy(INITIAL_STATE["workspace"])
+
+    metrics = state.get("metrics")
+    if not isinstance(metrics, dict):
+        metrics = {}
+        state["metrics"] = metrics
+
+    invocations = metrics.get("component_invocations")
+    if not isinstance(invocations, dict):
+        invocations = {}
+        metrics["component_invocations"] = invocations
+
+    for component, default_count in INITIAL_STATE["metrics"]["component_invocations"].items():
+        invocations.setdefault(component, default_count)
+
+    metrics.setdefault("extract_state_attempts", 0)
+    metrics.setdefault("extract_state_runs", 0)
+    metrics.setdefault("last_cost_report", None)
+
+    config = state.get("config")
+    if not isinstance(config, dict):
+        config = {}
+        state["config"] = config
+
+    cost_model = config.get("cost_model")
+    if not isinstance(cost_model, dict):
+        cost_model = {}
+        config["cost_model"] = cost_model
+    for component, default_rate in CONFIG_DEFAULTS["cost_model"].items():
+        cost_model.setdefault(component, default_rate)
+
+    return state
+
+
+def get_config(state: dict) -> dict:
+    """Return config with user overrides merged over defaults.
+
+    Users can set overrides in STATE.json under the "config" key, e.g.:
+        {"config": {"stale_session_threshold": 10}}
+    """
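+    overrides = state.get("config")
+    overrides = overrides if isinstance(overrides, dict) else {}  # tolerate a malformed config
+    merged = {**CONFIG_DEFAULTS, **overrides}
+    user_rates = overrides.get("cost_model")
+    user_rates = user_rates if isinstance(user_rates, dict) else {}  # tolerate null/non-dict rates
+    merged["cost_model"] = {**CONFIG_DEFAULTS["cost_model"], **user_rates}
+    return merged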
+
+
+def record_component_invocation(state: dict, component: str) -> int:
+    """Increment a component invocation counter and return the updated count."""
+    ensure_state_schema(state)
+    metrics = state["metrics"]
+    invocations = metrics["component_invocations"]
+    try:
+        current = int(invocations.get(component, 0))
+    except (TypeError, ValueError):
+        current = 0
+    invocations[component] = current + 1
+    return invocations[component]
+
+
 MAX_ERROR_LOG_ENTRIES = 50
 
 

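`get_config` rebuilds `cost_model` separately because a top-level dict merge is shallow: an override that sets only one rate would replace the whole nested dict and drop the other defaults. The sketch below shows the difference using a trimmed-down, hypothetical `DEFAULTS` dict; it is an illustration of the merge behavior, not the plugin's code.

```python
# Hypothetical, trimmed-down stand-in for CONFIG_DEFAULTS.
DEFAULTS = {
    "stale_session_threshold": 5,
    "cost_model": {"pack": 0.003, "extract_state": 0.01},
}
overrides = {"cost_model": {"pack": 0.005}}  # user tunes a single rate

# Shallow merge: the nested override replaces the default dict wholesale.
shallow = {**DEFAULTS, **overrides}

# Nested merge: default rates first, then the user's rates on top.
nested = {**DEFAULTS, **overrides}
nested["cost_model"] = {**DEFAULTS["cost_model"], **overrides["cost_model"]}

print(shallow["cost_model"])  # extract_state is missing here
print(nested["cost_model"])   # both rates present, pack overridden
```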
diff --git a/hooks/scripts/nudge.py b/hooks/scripts/nudge.py
@@ -11,17 +11,17 @@
 from pathlib import Path
 
 sys.path.insert(0, str(Path(__file__).parent))
-from init_workspace import load_state, read_hook_stdin
+from init_workspace import load_state, save_state, read_hook_stdin, get_config, record_component_invocation
 
 WORKSPACE_DIR = ".agent-workspace"
-STALE_SESSION_THRESHOLD = 5
 
 
 def build_nudge_message(state: dict | None) -> str | None:
     """Build a nudge message based on workspace state. Returns None if no nudge needed."""
     if state is None:
         return None
 
+    cfg = get_config(state)
     wk = state.get("workspace", {})
     has_objective = bool(wk.get("objective"))
     has_artifacts = len(state.get("artifacts", {})) > 0
@@ -30,7 +30,7 @@ def build_nudge_message(state: dict | None) -> str | None:
     turns_stale = max(0, turn_count - semantic_turn)
 
     # Check staleness (even if objective is set)
-    if has_objective and turns_stale > STALE_SESSION_THRESHOLD:
+    if has_objective and turns_stale > cfg["stale_session_threshold"]:
         return (
             f"Relay: workspace state is {turns_stale} turns stale from a previous session. "
             "Run /relay sync to update objective, next_actions, and open_questions."
@@ -74,6 +74,7 @@ def main():
 
     try:
         state = load_state(ws)
+        record_component_invocation(state, "nudge")
         msg = build_nudge_message(state)
 
         marker_tip = (
@@ -98,6 +99,7 @@ def main():
                 "additionalContext": context,
             }
         }
+        save_state(ws, state)
         print(json.dumps(output))
     except Exception as exc:
         try: