A single Go binary you drop between your Coding Agent harness and your OpenAI-compatible model server. It measures — and, with
--pin, protects — the server-side KV Cache your harness keeps silently invalidating.
If you self-host a model (llama.cpp, vLLM) and drive it with a Coding Agent harness like Claude Code, Cursor, or opencode, you've paid this tax: the harness re-renders a tool result or compacts context, your message array changes at message 3, and the inference server's KV Cache invalidates from that point on — so it silently reprocesses 30k+ tokens every single turn. @CreativelyBankrupt has been pointing at exactly this prefix-cache fragility; the r/LocalLLaMA "checkpoints" thread and bespoke agents like Hmbown/CodeWhale work around it one harness at a time. CachePin is the portable version of that idea: a harness-neutral proxy that sits in front of any OpenAI-compatible server, shows you the exact mutation boundary, and pins requests back to append-only form so the cache survives. No agent fork, no model lock-in — point OPENAI_BASE_URL at it and keep working.
Your coding agent points OPENAI_BASE_URL at CachePin instead of the model server. Inside the proxy boundary, the session tracker content-hashes every message and computes the longest common prefix against the canonical history — that boundary is exactly where the upstream prefix cache stops being valid. The metrics unit emits preserved-prefix %, reprocessed tokens, and the mutation index per turn; with --pin the reconciler rewrites a mutated request back to append-only form so the server-side KV Cache survives. Streaming /v1/chat/completions responses are relayed chunk-by-chunk, so the harness can't tell the proxy is there.
- Quickstart (10 minutes)
- Demo
- What you'll see
- How it works
- Configuration
- Benchmark
- vs CodeWhale
- Roadmap
- Contributing
- License
- Share this
# 1. install the single binary
go install github.com/SuperMarioYL/cachepin/cmd/cachepin@latest
# 2. point it at your OpenAI-compatible server (llama.cpp, vLLM, ...)
cachepin --upstream http://localhost:8080 # listens on :8089
# 3. tell your coding agent to talk through CachePin
export OPENAI_BASE_URL=http://localhost:8089

Use your coding agent exactly as before. CachePin prints one line per turn; nothing else changes. When you want the cache protected instead of just measured, restart with --pin.
The VHS tape records the happy path: start CachePin in measure-only mode, watch a mutated turn reprocess ~31k tokens, then restart with --pin and watch the same turn drop back to zero.
A clean, append-only session reuses the whole prefix:
turn 12 | prefix preserved 100% | 0 tokens reprocessed
When the harness rewrites history, CachePin names the exact boundary:
turn 13 | prefix preserved 41% | ~31k tokens reprocessed | MUTATION at msg[3]
Add --pin and the same mutated turn is reconciled to append-only form before it reaches the server, so the KV Cache survives and the reprocessed count drops back toward zero.
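The README does not spell out how reconciliation works, so the following is only a minimal sketch of one plausible reading: when a mutated request is at least as long as the canonical history, restore the canonical messages and append only the new tail. The `Message` type and the `pin` function are hypothetical names, and how CachePin treats inserted, deleted, or compacted messages is not stated here.

```go
package main

import "fmt"

// Message is a simplified, hypothetical chat message.
type Message struct{ Role, Content string }

// pin returns the message array to forward upstream.
// Assumption: a mutated request that is at least as long as the canonical
// history is an in-place rewrite, so the canonical prefix is restored and
// only the messages past it are appended.
func pin(canonical, incoming []Message) (forward []Message, pinned bool) {
	lcp := 0
	for lcp < len(canonical) && lcp < len(incoming) && canonical[lcp] == incoming[lcp] {
		lcp++
	}
	if lcp == len(canonical) {
		return incoming, false // already a prefix-extension, nothing to do
	}
	if len(incoming) < len(canonical) {
		return incoming, false // shorter history (e.g. compaction): this sketch does not guess
	}
	forward = append(append([]Message(nil), canonical...), incoming[len(canonical):]...)
	return forward, true
}

func main() {
	canonical := []Message{
		{"system", "You are a coding agent."},
		{"user", "Fix the failing test."},
		{"assistant", "Running the test suite."},
		{"tool", "FAIL: TestParse (0.01s)"},
	}
	// The harness re-renders the tool result at msg[3] and adds a new turn.
	incoming := append([]Message(nil), canonical...)
	incoming[3].Content = "FAIL: TestParse"
	incoming = append(incoming, Message{"user", "Try again."})

	forward, pinned := pin(canonical, incoming)
	fmt.Printf("pinned=%v len=%d msg[3]=%q\n", pinned, len(forward), forward[3].Content)
}
```

The trade-off in this reading is that the model keeps seeing the original rendering of msg[3] rather than the harness's rewritten one, which is what keeps the prefix byte-identical for the server.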
Machine-readable output (--ndjson):

{"ts":"2026-05-29T12:00:00Z","session_id":"a1b2c3","turn":13,"preserved_prefix_pct":41.0,"reprocessed_tokens":31000,"total_tokens":52000,"mutated":true,"mutation_index":3,"prev_len":24,"incoming_len":26,"lcp":3}

One JSON object per line: the same stream that the benchmark and any dashboard you build consume.
The core primitive is a canonical append-only session history plus one contract: every forwarded request's message array must be a prefix-extension of it. CachePin content-hashes each message, computes the longest common prefix against the canonical history, and that boundary is exactly where the server's prefix cache stops being valid.
harness ──HTTP──▶ proxy ──▶ session tracker ──▶ metrics ──▶ stdout / NDJSON
                    │              │
                    │        pin/reconcile (when --pin)
                    ▼
      upstream model server (llama.cpp / vLLM / API)
One binary, one process, standard library only — no containers, no Kubernetes, no model-specific tokenizer. Streaming /v1/chat/completions responses (SSE) pass through chunk-by-chunk, so the harness can't tell CachePin is there.
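The prefix comparison described above can be sketched in a few lines of Go. This is an illustration, not CachePin's actual code: the `Message` type, the choice of SHA-256, and the function names are assumptions, and real requests carry more fields (tool calls, structured content) that would also need to go into the hash.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Message is a simplified, hypothetical chat message.
type Message struct {
	Role    string
	Content string
}

// hashMessage returns a content hash of one message. The zero byte
// separates role from content so ("a", "bc") and ("ab", "c") differ.
func hashMessage(m Message) [32]byte {
	return sha256.Sum256([]byte(m.Role + "\x00" + m.Content))
}

// longestCommonPrefix counts how many leading messages of incoming
// match the canonical history, compared by hash.
func longestCommonPrefix(canonical, incoming []Message) int {
	n := len(canonical)
	if len(incoming) < n {
		n = len(incoming)
	}
	for i := 0; i < n; i++ {
		if hashMessage(canonical[i]) != hashMessage(incoming[i]) {
			return i
		}
	}
	return n
}

func main() {
	canonical := []Message{
		{"system", "You are a coding agent."},
		{"user", "Fix the failing test."},
		{"assistant", "Running the test suite."},
		{"tool", "FAIL: TestParse (0.01s)"},
	}
	// The harness re-renders the tool result at msg[3] and adds a new turn.
	incoming := append([]Message(nil), canonical...)
	incoming[3].Content = "FAIL: TestParse"
	incoming = append(incoming, Message{"user", "Try again."})

	lcp := longestCommonPrefix(canonical, incoming)
	// A request is a prefix-extension only if the whole canonical history matches.
	fmt.Printf("lcp=%d mutated=%v\n", lcp, lcp < len(canonical))
}
```

A long-running tracker would presumably store the canonical hashes once per session rather than rehashing the history on every turn; the sketch recomputes them to stay short.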
CachePin is configured entirely by flags — no config file.
| Flag | Type | Default | Meaning |
|---|---|---|---|
| --upstream | string | (required) | Base URL of the OpenAI-compatible model server, e.g. http://localhost:8080 |
| --listen | string | :8089 | Address CachePin's proxy binds to |
| --pin | bool | false | Reconcile mutated requests to append-only form so the upstream KV Cache survives |
| --ndjson | string | (off) | Path to also write one machine-readable metrics object per turn |
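For readers curious how a flags-only configuration like this maps onto Go's standard `flag` package, here is a hedged sketch. The flag names and defaults come from the table; the `config` struct, the `parseFlags` helper, and the error wording are hypothetical and not taken from the CachePin source.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// config holds the four documented flags (hypothetical struct).
type config struct {
	upstream string
	listen   string
	pin      bool
	ndjson   string
}

// parseFlags parses args (without the program name) into a config.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("cachepin", flag.ContinueOnError)
	fs.StringVar(&c.upstream, "upstream", "", "base URL of the OpenAI-compatible model server")
	fs.StringVar(&c.listen, "listen", ":8089", "address the proxy binds to")
	fs.BoolVar(&c.pin, "pin", false, "reconcile mutated requests to append-only form")
	fs.StringVar(&c.ndjson, "ndjson", "", "path to also write per-turn metrics as NDJSON")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	// The flag package has no notion of required flags, so check by hand.
	if c.upstream == "" {
		return c, errors.New("--upstream is required")
	}
	return c, nil
}

func main() {
	c, err := parseFlags([]string{"--upstream", "http://localhost:8080", "--pin"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", c)
}
```

Go's `flag` package accepts both `-pin` and `--pin`, which is why the double-dash spelling used throughout this README works without extra code.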
Reproduce the before/after chart yourself — it replays a fixed 50-turn transcript whose harness rewrites an early message every turn, once without pinning and once with:
go run ./bench -turns 50 -out chart.csv

It writes turn,reprocessed_no_pin,reprocessed_pin,cumulative_no_pin,cumulative_pin as CSV and prints a savings summary to stderr. The whole point: the curve that climbs linearly without --pin goes flat with it.
Honest positioning — CachePin is a shim, not a competing agent.
| | CachePin | Hmbown/CodeWhale |
|---|---|---|
| Harness-neutral (works with Claude Code, Cursor, opencode) | ✓ | ✗ (is its own agent) |
| Full coding-agent experience (planning, tools, edits) | ✗ (proxy only) | ✓ |
| Pins KV Cache across any OpenAI-compatible server | ✓ | partial (its own model path) |
| Drop-in: keep your current agent | ✓ | ✗ (you switch agents) |
| Measures the exact mutation boundary | ✓ | — |
If you want a batteries-included agent, CodeWhale is the better answer. If you want to keep the agent you already use and just stop burning the cache, that's CachePin.
- m1 (proxy passthrough): transparent OpenAI-compatible reverse proxy with SSE streaming; the harness can't tell it's there.
- m2 (track & report): per-session canonical-history tracker emitting preserved-prefix %, reprocessed tokens, and mutation events per turn.
- m3 (pin & bench): --pin reconciliation that keeps the upstream KV Cache alive, plus the reproducible 50-turn benchmark.
- Future: protocol spec for harness ↔ server append-only context; ecosystem docs links.
Issues and PRs welcome — file an issue describing your harness + server combo and the mutation you're seeing, and attach the --ndjson output if you can. It makes the boundary obvious.
MIT © supermario_leo.
CachePin — the harness-neutral proxy that keeps your Coding Agent's KV Cache alive across turns. Self-hosting llama.cpp/vLLM and reprocessing 30k tokens every turn? Point OPENAI_BASE_URL at it. Go, 10-min drop-in. https://github.com/SuperMarioYL/cachepin
