26 changes: 26 additions & 0 deletions README.md
@@ -35,6 +35,7 @@ PreCompact (pre_compact.py) SessionStart (reload.py, nudge.py)
- Turn count, compaction count, compaction history
- Git context — branch, status, recent commits
- Read tracking — last 5 unique file reads
- Hook telemetry — per-component invocation counters and extraction attempt/run counts

**Semi-automatically** (via markers in assistant output):
- Objective, hypothesis, next actions, open questions — parsed from `<!-- relay:field value -->` HTML comments at the end of responses
@@ -84,10 +85,35 @@ All commands are accessed via `/relay`:
| `/relay` or `/relay status` | Show current workspace state |
| `/relay sync` | Update all semantic fields (objective, status, hypothesis, next_actions, open_questions) |
| `/relay pack` | Manually trigger a pack cycle |
| `/relay costs` | Show per-component estimated cost attribution from invocation telemetry |
| `/relay objective <text>` | Set or show the workspace objective |
| `/relay forget <path-or-id>` | Remove an artifact from the registry and working set |
| `/relay reset` | Delete `.agent-workspace/` and start fresh |

## Cost attribution estimator

Relay records invocation counters for `post_tool_use`, `pack`, `pre_compact`, `reload`, `nudge`, and `extract_state` in `STATE.json` under `metrics.component_invocations`. Each pack cycle computes `metrics.last_cost_report` by applying the per-invocation rates in `config.cost_model` to those counters.

Example interpretation:
- `subtotal = invocations * unit_cost` per component
- `total_estimated_cost = sum(subtotals)`
- extraction attempts/runs are tracked separately to show how often extraction was considered vs actually spawned
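
The arithmetic above can be sketched in a few lines. This is a minimal illustration, not Relay's actual implementation: the function name `estimate_costs` and the key names of the returned report are assumptions, and only the counter and rate shapes come from the description above.

```python
def estimate_costs(invocations: dict, cost_model: dict) -> dict:
    """Apply per-invocation rates to counters (hypothetical helper)."""
    components = {}
    for name, count in invocations.items():
        unit_cost = cost_model.get(name, 0.0)  # assumption: unknown components cost nothing
        components[name] = {
            "invocations": count,
            "unit_cost": unit_cost,
            "subtotal": round(count * unit_cost, 6),
        }
    total = round(sum(c["subtotal"] for c in components.values()), 6)
    return {"components": components, "total_estimated_cost": total}


report = estimate_costs(
    {"post_tool_use": 40, "pack": 4, "extract_state": 2},
    {"post_tool_use": 0.001, "pack": 0.003, "extract_state": 0.01},
)
print(report["total_estimated_cost"])  # 0.072
```

Rounding each subtotal keeps the printed figures free of floating-point noise; whether Relay rounds at all is not stated here.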

You can tune rates in `STATE.json`:

```json
{
  "config": {
    "cost_model": {
      "pack": 0.005,
      "extract_state": 0.02
    }
  }
}
```

These rates are modeled internal estimates meant for relative attribution between components; they are not measured billing data.
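
A partial override like the one above only needs to name the components whose rates change. The sketch below shows that overlay under the assumption that overrides are merged key by key over the defaults, as `get_config` in `hooks/scripts/init_workspace.py` in this PR does; `DEFAULT_COST_MODEL` and `merged_cost_model` are illustrative names, not part of Relay.

```python
# Default rates copied from CONFIG_DEFAULTS["cost_model"] in this PR.
DEFAULT_COST_MODEL = {
    "post_tool_use": 0.001,
    "pack": 0.003,
    "pre_compact": 0.002,
    "reload": 0.001,
    "nudge": 0.001,
    "extract_state": 0.01,
}


def merged_cost_model(state: dict) -> dict:
    """Overlay user rates from STATE.json on the defaults (hypothetical helper)."""
    config = state.get("config")
    overrides = config.get("cost_model") if isinstance(config, dict) else None
    if not isinstance(overrides, dict):
        overrides = {}  # ignore a missing or malformed cost_model
    return {**DEFAULT_COST_MODEL, **overrides}


state = {"config": {"cost_model": {"pack": 0.005, "extract_state": 0.02}}}
rates = merged_cost_model(state)
print(rates["pack"], rates["nudge"])  # 0.005 0.001
```

Components that are not listed in the override, such as `nudge`, keep their default rate.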

## Hooks

Defined in `hooks/hooks.json`:
18 changes: 17 additions & 1 deletion commands/relay.md
@@ -1,7 +1,7 @@
---
name: relay
description: View and manage the relay workspace state (.agent-workspace/)
-argument-hint: "[status|sync|reset|pack|forget|objective]"
+argument-hint: "[status|sync|reset|pack|costs|forget|objective]"
---

# /relay — Workspace state manager
@@ -58,6 +58,22 @@ Do this by running: `python3 ${CLAUDE_PLUGIN_ROOT}/hooks/scripts/pack.py` with a

After packing, show: "Packed. Turn [N], [X] artifacts, [Y] in working set."

## /relay costs

Show estimated operational cost attribution by relay component.

1. Read `.agent-workspace/STATE.json`
2. If `metrics.last_cost_report` exists, use it; otherwise compute a report by applying `config.cost_model` rates to `metrics.component_invocations`.
3. Present a compact table with columns:
- Component
- Invocations
- Unit cost
- Subtotal
4. Show total estimated cost and extraction attempts/runs (`metrics.extract_state_attempts` / `metrics.extract_state_runs`).
5. Include this assumption line: "Modeled per-invocation estimates; not measured billing data."

If the workspace doesn't exist, say: "No relay workspace in this project. One will be created automatically when you start working."
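
The steps above can be sketched as a small renderer. The shape of `metrics.last_cost_report` is not shown in this excerpt, so this sketch covers only the fallback path in step 2 (rates applied to counters); `render_cost_table` and its output layout are assumptions made for illustration.

```python
def render_cost_table(state: dict) -> str:
    """Render the fallback cost table from counters and rates (illustrative only)."""
    metrics = state.get("metrics", {})
    rates = state.get("config", {}).get("cost_model", {})
    lines = [
        "| Component | Invocations | Unit cost | Subtotal |",
        "|-----------|-------------|-----------|----------|",
    ]
    total = 0.0
    for name, count in metrics.get("component_invocations", {}).items():
        unit = rates.get(name, 0.0)  # assumption: a missing rate counts as zero
        subtotal = count * unit
        total += subtotal
        lines.append(f"| {name} | {count} | {unit:.3f} | {subtotal:.3f} |")
    attempts = metrics.get("extract_state_attempts", 0)
    runs = metrics.get("extract_state_runs", 0)
    lines.append(f"Total estimated cost: {total:.3f}")
    lines.append(f"Extraction attempts/runs: {attempts}/{runs}")
    lines.append("Modeled per-invocation estimates; not measured billing data.")
    return "\n".join(lines)


example_state = {
    "metrics": {
        "component_invocations": {"pack": 4, "nudge": 2},
        "extract_state_attempts": 3,
        "extract_state_runs": 1,
    },
    "config": {"cost_model": {"pack": 0.003, "nudge": 0.001}},
}
table = render_cost_table(example_state)
print(table)
```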

## /relay forget <path-or-id>

Remove an artifact from the working set (and optionally from the registry).
139 changes: 139 additions & 0 deletions docs/plans/experiment-results.md
@@ -0,0 +1,139 @@
---
title: Artifact tracking A/B experiment results
date: 2026-03-04
---

# Artifact tracking A/B experiment results

## Experiment design

Two-phase experiment testing whether artifact tracking in RELOAD.md helps agents recover after compaction.

### Phase 1: Baseline (no compaction)

Both conditions implement a bookmark feature from scratch. Measures whether artifacts affect normal task performance.

### Phase 2: Simulated post-compaction recovery

Both conditions start from identical state (bookmarks done, 317 tests passing) with a RELOAD.md injected. Must implement a second feature (likes). Measures re-orientation behavior.

## Setup

- **Codebase:** `/tmp/relay-stress-test` (288 tests, 3,400 lines across models/services/routes/tests)
- **Model:** Sonnet (`claude --print --model sonnet --dangerously-skip-permissions`)
- **Output:** `stream-json` for tool call analysis
- **Condition A:** Full RELOAD.md (Working Set table, Recently Consulted Files, File Snapshots ref, dirty artifacts, 4 After Resuming steps)
- **Condition B:** Stripped RELOAD.md (only objective, hypothesis, next actions, history, branch, 2 After Resuming steps)
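
Tool call counts like the ones reported below can be tallied from the `stream-json` output. The sketch assumes one JSON event per line and that tool calls appear as `tool_use` blocks inside `assistant` messages; that event shape is an assumption about the CLI output, not something this document specifies, and `count_tool_calls` is a hypothetical helper.

```python
import json
from collections import Counter


def count_tool_calls(lines):
    """Tally tool_use blocks by tool name from stream-json lines (assumed shape)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON event
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        message = event.get("message")
        content = message.get("content", []) if isinstance(message, dict) else []
        for block in content if isinstance(content, list) else []:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                counts[block.get("name", "unknown")] += 1
    return counts


# Hand-written sample events; real transcripts will carry more fields.
sample = [
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "tool_use", "name": "Read", "input": {"file_path": "app.py"}}]}}),
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "text", "text": "Looking at the routes next."},
        {"type": "tool_use", "name": "Bash", "input": {"command": "ls"}}]}}),
    json.dumps({"type": "result", "subtype": "success"}),
]
print(dict(count_tool_calls(sample)))  # {'Read': 1, 'Bash': 1}
```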

## Phase 1 results: Baseline (bookmark feature, no compaction)

| Metric | A (full) | B (stripped) | Delta |
|--------|----------|--------------|-------|
| Total tool calls | 48 | 42 | A +6 |
| Read calls | 23 | 21 | A +2 |
| Glob/Grep | 1 | 1 | 0 |
| Write calls | 4 | 4 | 0 |
| Edit calls | 10 | 7 | A +3 |
| Bash calls | 1 | 1 | 0 |
| Tests passing | 320 (288+32) | 314 (288+26) | A +6 |
| Read order | app -> models -> services -> routes -> tests -> utils | same | identical |
| Compaction triggered | No | No | — |

**Phase 1 finding:** No meaningful difference in a single run. Both conditions explored identically. Differences (A +6 tests, +3 edits) are within expected LLM variance for n=1, though it's also possible richer context nudged A toward more thorough output.

## Phase 2 results: Simulated post-compaction recovery (likes feature)

Starting state: Bookmarks fully implemented, 317 tests passing. RELOAD.md placed in `.agent-workspace/derived/`. Prompt says "you were compacted, read RELOAD.md, continue with likes."

### Re-orientation behavior

| Metric | A (full) | B (stripped) | Delta |
|--------|----------|--------------|-------|
| Total tool calls | 38 | 34 | A +4 |
| Read calls | 10 | 12 | **B +2** |
| Glob calls | 1 | 0 | A +1 |
| Grep calls | 0 | 0 | 0 |
| Write calls | 4 | 4 | 0 |
| Edit calls | 10 | 5 | A +5 |
| Bash calls (incl. grep/ls) | 4 | 5 | **B +1** |
| **Reads before first edit** | **10** | **12** | **B +2** |
| First edited file | like.py | like.py | same |
| All tests pass | 349 | 353 | both pass |

### Tool call sequences (first 15)

**Condition A (full RELOAD.md):**
```
1. Read RELOAD.md
2. Read bookmark.py ← from Working Set
3. Read bookmark_service.py ← from Working Set
4. Read bookmarks.py ← from Working Set
5. Read test_bookmarks.py ← from Working Set
6. Read app.py ← from Working Set
7. Read post_routes.py ← from Recently Consulted
8. Read conftest.py ← from Recently Consulted
9. Read post_service.py ← from Recently Consulted
10. Bash(grep "like") ← search existing code
11. Glob tests/*.py
12-13. Bash(grep)
14-15. TodoWrite ← start implementing
```

**Condition B (stripped RELOAD.md):**
```
1. Read RELOAD.md
2. Read bookmark.py ← knew from prompt context
3. Read bookmark_service.py
4. Read bookmarks.py
5. Read test_bookmarks.py
6. Read app.py
7. Read db.py ← EXTRA: exploring unknown file
8. Read post_routes.py
9. Bash(ls) ← EXTRA: directory listing to orient
10. Bash(ls utils/ && ls) ← EXTRA: more exploration
11. Bash(grep "bookmark") ← EXTRA: searching for patterns
12. Read post_routes.py ← RE-READ (already read at step 8)
13. Read conftest.py
14. Read post_service.py
15. Bash(grep "like")
```
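
Redundant reads in a sequence like the ones above can be flagged mechanically. The following sketch transcribes Condition B's first 15 calls from the listing and reports any `Read` of a path that was already read; `find_rereads` is a hypothetical helper written for this illustration, not part of the experiment tooling.

```python
# First 15 tool calls for Condition B, transcribed from the listing above.
CONDITION_B = [
    ("Read", "RELOAD.md"), ("Read", "bookmark.py"), ("Read", "bookmark_service.py"),
    ("Read", "bookmarks.py"), ("Read", "test_bookmarks.py"), ("Read", "app.py"),
    ("Read", "db.py"), ("Read", "post_routes.py"), ("Bash", "ls"),
    ("Bash", "ls utils/ && ls"), ("Bash", 'grep "bookmark"'),
    ("Read", "post_routes.py"), ("Read", "conftest.py"),
    ("Read", "post_service.py"), ("Bash", 'grep "like"'),
]


def find_rereads(calls):
    """Return (step, path, first_step) for each Read of an already-read path."""
    first_seen = {}
    rereads = []
    for step, (tool, target) in enumerate(calls, start=1):
        if tool != "Read":
            continue
        if target in first_seen:
            rereads.append((step, target, first_seen[target]))
        else:
            first_seen[target] = step
    return rereads


print(find_rereads(CONDITION_B))  # [(12, 'post_routes.py', 8)]
```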

### Key observations

All observations below are from a single run per condition. They're directional signals, not conclusions.

1. **Both conditions found the same files.** Neither got "lost." The prompt mentioned bookmarks were done, so both knew to look at bookmark files. This hint gave Condition B a significant leg up it wouldn't have in a real compaction scenario.

2. **B did more exploratory work.** Two extra `ls` calls (steps 9-10), one `grep` for bookmark patterns (step 11), and a re-read of `post_routes.py` (step 12). This is consistent with the re-orientation cost of not having the Working Set and Recently Consulted lists, though a single run can't rule out normal variance.

3. **B read a file A didn't.** B read `db.py` (step 7), presumably trying to understand the data layer since it had no artifact context telling it which files mattered.

4. **A went straight from reads to implementation.** After reading RELOAD.md and the 8 files it listed (5 from Working Set, 3 from Recently Consulted), A did targeted grep searches and started writing. B needed directory listings first.

5. **A used more Edits (10 vs 5).** A was more confident editing existing files, while B created more from scratch. This might indicate artifact context gave A better knowledge of existing file structure, or it might be normal LLM variance.

6. **Delta: 2 extra reads + 1 extra Bash call for B** (per the table above; B's Bash calls included two exploratory `ls` calls). Falls below the experiment's threshold of "3+ more turns" or "5+ more search calls." The difference is real but small. With n=1, this delta could easily swing to 0 or 5 in another run.

## Decision

**Modify.** Based on this single-run experiment, artifact tracking appears to help slightly but not enough to justify the full ~900-byte overhead. These are informed bets, not proven conclusions.

Recommendations (confidence noted):
- **Keep:** Working Set file paths — A's direct navigation vs B's exploratory `ls` calls is the strongest signal in the data (medium confidence)
- **Keep:** Recently Consulted Files — same reasoning, prevents redundant exploration (medium confidence)
- **Drop:** File Snapshots reference — neither condition used it; agents Read files directly (high confidence, consistent across both phases)
- **Drop:** Dirty artifacts list in Git Context — neither condition ran `git status` post-compaction (medium confidence, n=1)
- **Simplify:** After Resuming to 2 steps, TASKLOG + relay sync (high confidence, the other steps were never observed in use)

Estimated savings: ~400 bytes of the ~900-byte overhead. Keep the useful path lists, drop the rarely-used snapshot references.

These recommendations could be wrong. The experiment likely underestimates artifact tracking's value because simulated compaction is easier than real compaction (see caveats). If anything, err toward keeping more context rather than less.

## Caveats

These caveats all push in the same direction: the experiment likely underestimates the value of artifact tracking. Every limitation made it easier for Condition B to succeed without artifact context.

1. **N=1 per condition.** LLM non-determinism means these are directional signals, not statistically significant results. The observed delta of 2 extra reads could easily be 0 or 5 in another run.
2. **Simulated, not real, compaction.** The agent was told "you were compacted" but never actually had in-context history to lose. Real compaction means the agent loses working memory it built up across dozens of turns — a harder recovery problem than starting fresh with a RELOAD.md. This is the biggest limitation: the experiment tests "does a file list help a fresh agent?" not "does a file list help an agent that just lost its memory?"
3. **Task prompt provided hints.** The prompt said "bookmarks are done, do likes" which told both conditions exactly which prior feature to reference as a pattern. Both immediately navigated to `bookmark.py`. Without that hint, Condition B would have needed to discover the codebase structure from scratch — exactly the scenario artifact tracking is designed for.
4. **Small codebase.** At 3,400 lines, the entire codebase fits easily in context. For larger codebases where you can't read everything, artifact context would matter more.
113 changes: 111 additions & 2 deletions hooks/scripts/init_workspace.py
@@ -36,6 +36,29 @@
    "turn_count": 0,
    "last_marker_turn": 0,
    "last_extraction_turn": 0,
    "metrics": {
        "component_invocations": {
            "post_tool_use": 0,
            "pack": 0,
            "pre_compact": 0,
            "reload": 0,
            "nudge": 0,
            "extract_state": 0,
        },
        "extract_state_attempts": 0,
        "extract_state_runs": 0,
        "last_cost_report": None,
    },
    "config": {
        "cost_model": {
            "post_tool_use": 0.001,
            "pack": 0.003,
            "pre_compact": 0.002,
            "reload": 0.001,
            "nudge": 0.001,
            "extract_state": 0.01,
        }
    },
}

EXEC_PACKET_TEMPLATE = """\
@@ -122,8 +145,8 @@ def ensure_workspace(cwd: str) -> Path:
def load_state(ws: Path) -> dict:
    state_file = ws / "STATE.json"
    if state_file.exists():
-        return json.loads(state_file.read_text())
-    return copy.deepcopy(INITIAL_STATE)
+        return ensure_state_schema(json.loads(state_file.read_text()))
+    return ensure_state_schema(copy.deepcopy(INITIAL_STATE))


def save_state(ws: Path, state: dict):
@@ -146,6 +169,92 @@ def read_hook_stdin() -> dict:
return {}


CONFIG_DEFAULTS = {
    "stale_session_threshold": 5,
    "staleness_mild": 6,
    "staleness_loud": 16,
    "sync_reminder_interval": 8,
    "compaction_warn_buffer": 15,
    "compaction_urgent_buffer": 5,
    "compaction_fresh_threshold": 3,
    "extraction_interval": 5,
    "max_recently_read": 20,
    "cost_model": {
        "post_tool_use": 0.001,
        "pack": 0.003,
        "pre_compact": 0.002,
        "reload": 0.001,
        "nudge": 0.001,
        "extract_state": 0.01,
    },
}


def ensure_state_schema(state: dict) -> dict:
    """Backfill required keys for older STATE.json shapes."""
    if not isinstance(state.get("workspace"), dict):
        state["workspace"] = copy.deepcopy(INITIAL_STATE["workspace"])

    metrics = state.get("metrics")
    if not isinstance(metrics, dict):
        metrics = {}
        state["metrics"] = metrics

    invocations = metrics.get("component_invocations")
    if not isinstance(invocations, dict):
        invocations = {}
        metrics["component_invocations"] = invocations

    for component, default_count in INITIAL_STATE["metrics"]["component_invocations"].items():
        invocations.setdefault(component, default_count)

    metrics.setdefault("extract_state_attempts", 0)
    metrics.setdefault("extract_state_runs", 0)
    metrics.setdefault("last_cost_report", None)

    config = state.get("config")
    if not isinstance(config, dict):
        config = {}
        state["config"] = config

    cost_model = config.get("cost_model")
    if not isinstance(cost_model, dict):
        cost_model = {}
        config["cost_model"] = cost_model
    for component, default_rate in CONFIG_DEFAULTS["cost_model"].items():
        cost_model.setdefault(component, default_rate)

    return state


def get_config(state: dict) -> dict:
    """Return config with user overrides merged over defaults.

    Users can set overrides in STATE.json under the "config" key, e.g.:
        {"config": {"stale_session_threshold": 10}}
    """
    overrides = state.get("config", {})
    merged = {**CONFIG_DEFAULTS, **overrides}
    merged["cost_model"] = {
        **CONFIG_DEFAULTS["cost_model"],
        **(overrides.get("cost_model", {}) if isinstance(overrides, dict) else {}),
    }
    return merged


def record_component_invocation(state: dict, component: str) -> int:
    """Increment a component invocation counter and return the updated count."""
    ensure_state_schema(state)
    metrics = state["metrics"]
    invocations = metrics["component_invocations"]
    try:
        current = int(invocations.get(component, 0))
    except (TypeError, ValueError):
        current = 0
    invocations[component] = current + 1
    return invocations[component]


MAX_ERROR_LOG_ENTRIES = 50


8 changes: 5 additions & 3 deletions hooks/scripts/nudge.py
@@ -11,17 +11,17 @@
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent))
-from init_workspace import load_state, read_hook_stdin
+from init_workspace import load_state, save_state, read_hook_stdin, get_config, record_component_invocation

WORKSPACE_DIR = ".agent-workspace"
-STALE_SESSION_THRESHOLD = 5


def build_nudge_message(state: dict | None) -> str | None:
    """Build a nudge message based on workspace state. Returns None if no nudge needed."""
    if state is None:
        return None

    cfg = get_config(state)
    wk = state.get("workspace", {})
    has_objective = bool(wk.get("objective"))
    has_artifacts = len(state.get("artifacts", {})) > 0
@@ -30,7 +30,7 @@ def build_nudge_message(state: dict | None) -> str | None:
    turns_stale = max(0, turn_count - semantic_turn)

    # Check staleness (even if objective is set)
-    if has_objective and turns_stale > STALE_SESSION_THRESHOLD:
+    if has_objective and turns_stale > cfg["stale_session_threshold"]:
        return (
            f"Relay: workspace state is {turns_stale} turns stale from a previous session. "
            "Run /relay sync to update objective, next_actions, and open_questions."
@@ -74,6 +74,7 @@ def main():

    try:
        state = load_state(ws)
        record_component_invocation(state, "nudge")
        msg = build_nudge_message(state)

        marker_tip = (
@@ -98,6 +99,7 @@ def main():
                "additionalContext": context,
            }
        }
        save_state(ws, state)
        print(json.dumps(output))
    except Exception as exc:
        try: