GitHub - SuperMarioYL/cachepin: Harness-neutral proxy that keeps your coding agent's KV cache alive across turns


A single Go binary you drop between your coding agent harness and your OpenAI-compatible model server. It measures the server-side KV cache your harness keeps silently invalidating and, with --pin, protects it.

Why now

If you self-host a model (llama.cpp, vLLM) and drive it with a coding agent harness like Claude Code, Cursor, or opencode, you've paid this tax: the harness re-renders a tool result or compacts context, your message array changes at message 3, and the inference server's KV cache invalidates from that point on, so it silently reprocesses 30k+ tokens every single turn. @CreativelyBankrupt has been pointing at exactly this prefix-cache fragility; the r/LocalLLaMA "checkpoints" thread and bespoke agents like Hmbown/CodeWhale work around it one harness at a time.

CachePin is the portable version of that idea: a harness-neutral proxy that sits in front of any OpenAI-compatible server, shows you the exact mutation boundary, and pins requests back to append-only form so the cache survives. No agent fork, no model lock-in: point OPENAI_BASE_URL at it and keep working.

Architecture

Architecture diagram: a coding-agent harness sends OpenAI-compatible requests to the CachePin proxy. The session tracker content-hashes each message to find the longest-common-prefix mutation boundary, pin mode reconciles mutated turns to append-only form, metrics are emitted per turn, and requests are forwarded to the upstream model server, whose KV cache survives.

Your coding agent points OPENAI_BASE_URL at CachePin instead of the model server. Inside the proxy boundary, the session tracker content-hashes every message and computes the longest common prefix against the canonical history — that boundary is exactly where the upstream prefix cache stops being valid. The metrics unit emits preserved-prefix %, reprocessed tokens, and the mutation index per turn; with --pin the reconciler rewrites a mutated request back to append-only form so the server-side KV Cache survives. Streaming /v1/chat/completions responses are relayed chunk-by-chunk, so the harness can't tell the proxy is there.
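The boundary computation described above can be sketched in a few lines of Go. This is a minimal illustration, not CachePin's actual code: the `Message` type, the SHA-256 choice, and the `hashMessage` and `mutationBoundary` names are all assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Message is a simplified chat message (hypothetical type). Real
// OpenAI-compatible messages can carry more fields, such as tool calls.
type Message struct {
	Role    string
	Content string
}

// hashMessage returns a hex SHA-256 digest over role and content.
// The zero byte keeps ("a", "bc") distinct from ("ab", "c").
func hashMessage(m Message) string {
	sum := sha256.Sum256([]byte(m.Role + "\x00" + m.Content))
	return hex.EncodeToString(sum[:])
}

// mutationBoundary returns how many leading messages of incoming match
// the canonical history, comparing content hashes pairwise.
func mutationBoundary(canonical, incoming []Message) int {
	n := len(canonical)
	if len(incoming) < n {
		n = len(incoming)
	}
	for i := 0; i < n; i++ {
		if hashMessage(canonical[i]) != hashMessage(incoming[i]) {
			return i
		}
	}
	return n
}

func main() {
	canonical := []Message{
		{"system", "You are a coding agent."},
		{"user", "List the files."},
		{"assistant", "Calling ls."},
		{"tool", "main.go\ngo.mod"},
	}
	// The harness re-rendered the tool result at index 3 and added a turn.
	incoming := []Message{
		canonical[0], canonical[1], canonical[2],
		{"tool", "go.mod\nmain.go"},
		{"user", "Open main.go."},
	}
	lcp := mutationBoundary(canonical, incoming)
	fmt.Printf("lcp=%d mutated=%v\n", lcp, lcp < len(canonical))
}
```

A request counts as mutated in this sketch whenever the returned index is smaller than the length of the canonical history; an index equal to that length means the request is a clean prefix-extension.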

Quickstart (10 minutes)

# 1. install the single binary
go install github.com/SuperMarioYL/cachepin/cmd/cachepin@latest

# 2. point it at your OpenAI-compatible server (llama.cpp, vLLM, ...)
cachepin --upstream http://localhost:8080      # listens on :8089

# 3. tell your coding agent to talk through CachePin
export OPENAI_BASE_URL=http://localhost:8089

Use your coding agent exactly as before. CachePin prints one line per turn; nothing else changes. When you want the cache protected instead of just measured, restart with --pin.

Demo

The VHS tape records the happy path: start CachePin in measure-only mode, watch a mutated turn reprocess ~31k tokens, then restart with --pin and watch the same turn drop back to zero.

What you'll see

A clean, append-only session reuses the whole prefix:

turn 12 | prefix preserved 100% | 0 tokens reprocessed

When the harness rewrites history, CachePin names the exact boundary:

turn 13 | prefix preserved 41% | ~31k tokens reprocessed | MUTATION at msg[3]
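One plausible way to derive the numbers on such a line is sketched below. It is an assumption, not the project's documented formula: tokens are estimated at roughly 4 characters each (CachePin ships no model-specific tokenizer, but its real estimator is not described here), and the percentage is measured over the previously cached history so that a clean append-only turn reports 100% and 0 tokens. The `estimateTokens` and `turnMetrics` names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// estimateTokens is a rough, tokenizer-free guess of about 4 characters
// per token. ASSUMPTION: this heuristic is for illustration only.
func estimateTokens(s string) int {
	return (len(s) + 3) / 4
}

// turnMetrics takes the contents of the canonical (already cached)
// history and the number of leading messages that still match (lcp).
// It returns the preserved-prefix percentage, the estimated tokens the
// server must reprocess, and the estimated total cached tokens.
func turnMetrics(canonical []string, lcp int) (preservedPct float64, reprocessed, total int) {
	if lcp < 0 {
		lcp = 0
	}
	if lcp > len(canonical) {
		lcp = len(canonical)
	}
	for i, content := range canonical {
		t := estimateTokens(content)
		total += t
		if i >= lcp {
			reprocessed += t // cached tokens past the boundary are lost
		}
	}
	if total == 0 {
		return 100, 0, 0
	}
	preservedPct = 100 * float64(total-reprocessed) / float64(total)
	return preservedPct, reprocessed, total
}

func main() {
	// Hypothetical history: a large system prompt, two short messages,
	// then a long tool result that the harness later rewrote.
	canonical := []string{
		strings.Repeat("s", 400),
		strings.Repeat("u", 40),
		strings.Repeat("a", 40),
		strings.Repeat("t", 520),
	}
	lcp := 3
	pct, reprocessed, _ := turnMetrics(canonical, lcp)
	fmt.Printf("prefix preserved %.0f%% | %d tokens reprocessed | MUTATION at msg[%d]\n",
		pct, reprocessed, lcp)
}
```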

Add --pin and the same mutated turn is reconciled to append-only form before it reaches the server, so the KV Cache survives and the reprocessed count drops back toward zero.
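To make "reconciled to append-only form" concrete, here is one possible policy, sketched under loud assumptions: the canonical history is kept byte-for-byte, and only the messages the harness added beyond the previous length are appended. The README does not specify CachePin's actual reconciliation rules, so the `reconcile` function and its fallback behaviour are hypothetical.

```go
package main

import "fmt"

// reconcile returns the message list to forward upstream in pin mode.
// canonical is the append-only history already sent upstream; incoming
// is what the harness just sent. Messages are plain strings for brevity.
//
// ASSUMPTION: messages past len(canonical) are treated as the genuinely
// new turn content. This is an illustrative policy, not CachePin's own.
func reconcile(canonical, incoming []string) []string {
	lcp := 0
	for lcp < len(canonical) && lcp < len(incoming) && canonical[lcp] == incoming[lcp] {
		lcp++
	}
	// Already a prefix-extension, or the harness shortened the array
	// (for example by compacting): forward unchanged in this sketch.
	if lcp == len(canonical) || len(incoming) <= len(canonical) {
		return incoming
	}
	out := make([]string, 0, len(incoming))
	out = append(out, canonical...)                 // keep cached prefix intact
	out = append(out, incoming[len(canonical):]...) // append only the new tail
	return out
}

func main() {
	canonical := []string{"sys", "u1", "a1", "tool:v1"}
	incoming := []string{"sys", "u1", "a1", "tool:v2", "u2"} // msg[3] rewritten
	fmt.Println(reconcile(canonical, incoming))
}
```

The trade-off in this policy is that the model keeps seeing the original rendering of the rewritten message; whether that is acceptable depends on what the harness changed.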

Machine-readable output (--ndjson)

{"ts":"2026-05-29T12:00:00Z","session_id":"a1b2c3","turn":13,"preserved_prefix_pct":41.0,"reprocessed_tokens":31000,"total_tokens":52000,"mutated":true,"mutation_index":3,"prev_len":24,"incoming_len":26,"lcp":3}

One JSON object per line — the same stream the benchmark and any dashboard you build consume.
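A consumer of that stream could look roughly like the following. The struct fields mirror the keys in the sample line above; the `TurnMetric` type and `parseNDJSON` function are names invented for this sketch, and the field types are inferred from that single sample rather than from a published schema.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// TurnMetric mirrors the keys of the sample NDJSON line (types inferred).
type TurnMetric struct {
	TS                 string  `json:"ts"`
	SessionID          string  `json:"session_id"`
	Turn               int     `json:"turn"`
	PreservedPrefixPct float64 `json:"preserved_prefix_pct"`
	ReprocessedTokens  int     `json:"reprocessed_tokens"`
	TotalTokens        int     `json:"total_tokens"`
	Mutated            bool    `json:"mutated"`
	MutationIndex      int     `json:"mutation_index"`
	PrevLen            int     `json:"prev_len"`
	IncomingLen        int     `json:"incoming_len"`
	LCP                int     `json:"lcp"`
}

// parseNDJSON decodes one JSON object per non-empty line.
func parseNDJSON(data string) ([]TurnMetric, error) {
	var out []TurnMetric
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var m TurnMetric
		if err := json.Unmarshal([]byte(line), &m); err != nil {
			return nil, fmt.Errorf("parse metrics line: %w", err)
		}
		out = append(out, m)
	}
	return out, sc.Err()
}

// sample is the line shown in the README.
const sample = `{"ts":"2026-05-29T12:00:00Z","session_id":"a1b2c3","turn":13,"preserved_prefix_pct":41.0,"reprocessed_tokens":31000,"total_tokens":52000,"mutated":true,"mutation_index":3,"prev_len":24,"incoming_len":26,"lcp":3}`

func main() {
	metrics, err := parseNDJSON(sample + "\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, m := range metrics {
		fmt.Printf("turn %d: lcp=%d mutated=%v reprocessed=%d\n",
			m.Turn, m.LCP, m.Mutated, m.ReprocessedTokens)
	}
}
```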

How it works

The core primitive is a canonical append-only session history plus one contract: every forwarded request's message array must be a prefix-extension of it. CachePin content-hashes each message, computes the longest common prefix against the canonical history, and that boundary is exactly where the server's prefix cache stops being valid.

harness ──HTTP──▶ proxy ──▶ session tracker ──▶ metrics ──▶ stdout / NDJSON
                    │              │
                    │        pin/reconcile (when --pin)
                    ▼
             upstream model server (llama.cpp / vLLM / API)

One binary, one process, standard library only — no containers, no Kubernetes, no model-specific tokenizer. Streaming /v1/chat/completions responses (SSE) pass through chunk-by-chunk, so the harness can't tell CachePin is there.
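Chunk-by-chunk SSE relay is achievable with the Go standard library alone. The sketch below shows one way to do it with `httputil.ReverseProxy` and a negative `FlushInterval`; it is an assumption that CachePin is built this way, and the `newPassthrough` and `relayDemo` names and the fake upstream are made up for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// newPassthrough builds a reverse proxy that forwards every request to
// upstream and flushes each response write to the client immediately.
func newPassthrough(upstream string) (http.Handler, error) {
	target, err := url.Parse(upstream)
	if err != nil {
		return nil, err
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.FlushInterval = -1 // negative value: flush after every write
	return proxy, nil
}

// relayDemo runs a fake SSE upstream behind the proxy and returns the
// body a client receives through it.
func relayDemo() (string, error) {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/event-stream")
		for _, chunk := range []string{"data: {\"delta\":\"hel\"}\n\n", "data: [DONE]\n\n"} {
			io.WriteString(w, chunk)
			if f, ok := w.(http.Flusher); ok {
				f.Flush()
			}
		}
	}))
	defer upstream.Close()

	handler, err := newPassthrough(upstream.URL)
	if err != nil {
		return "", err
	}
	front := httptest.NewServer(handler)
	defer front.Close()

	resp, err := http.Post(front.URL+"/v1/chat/completions", "application/json",
		strings.NewReader(`{"stream":true}`))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := relayDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(body)
}
```

A measuring proxy would additionally need to read the request body to inspect the message array before forwarding; that part is omitted here.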

Configuration

CachePin is configured entirely by flags — no config file.

Flag	Type	Default	Meaning
`--upstream`	string	(required)	Base URL of the OpenAI-compatible model server, e.g. `http://localhost:8080`
`--listen`	string	`:8089`	Address CachePin's proxy binds to
`--pin`	bool	`false`	Reconcile mutated requests to append-only form so the upstream KV Cache survives
`--ndjson`	string	(off)	Path to also write one machine-readable metrics object per turn
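For readers wiring up something similar, the table maps directly onto Go's standard `flag` package, which accepts both `-pin` and `--pin` spellings. The `Config` struct and `parseFlags` function below are hypothetical names for this sketch; only the flag names, types, and defaults come from the table.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// Config holds the four documented flags (hypothetical struct).
type Config struct {
	Upstream string
	Listen   string
	Pin      bool
	NDJSON   string
}

// parseFlags parses args using the names and defaults from the table
// and rejects a missing --upstream.
func parseFlags(args []string) (Config, error) {
	var c Config
	fs := flag.NewFlagSet("cachepin", flag.ContinueOnError)
	fs.StringVar(&c.Upstream, "upstream", "", "base URL of the OpenAI-compatible model server")
	fs.StringVar(&c.Listen, "listen", ":8089", "address the proxy binds to")
	fs.BoolVar(&c.Pin, "pin", false, "reconcile mutated requests to append-only form")
	fs.StringVar(&c.NDJSON, "ndjson", "", "path to also write per-turn metrics as NDJSON")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	if c.Upstream == "" {
		return Config{}, errors.New("--upstream is required")
	}
	return c, nil
}

func main() {
	// A real binary would pass os.Args[1:]; fixed args keep this runnable.
	cfg, err := parseFlags([]string{"--upstream", "http://localhost:8080", "--pin"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("upstream=%s listen=%s pin=%v ndjson=%q\n",
		cfg.Upstream, cfg.Listen, cfg.Pin, cfg.NDJSON)
}
```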

Benchmark

Reproduce the before/after chart yourself — it replays a fixed 50-turn transcript whose harness rewrites an early message every turn, once without pinning and once with:

go run ./bench -turns 50 -out chart.csv

It writes turn,reprocessed_no_pin,reprocessed_pin,cumulative_no_pin,cumulative_pin as CSV and prints a savings summary to stderr. The whole point: the curve that climbs linearly without --pin goes flat with it.
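If you want to post-process the CSV yourself, a small reader along these lines would do. The column names are the ones listed above; the `savings` function is a hypothetical helper, and the rows in `main` are made-up numbers for illustration, not benchmark results.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// savings reads the benchmark CSV and returns the final cumulative
// reprocessed-token counts plus the percentage saved by pinning.
func savings(data string) (noPin, pin int, pct float64, err error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return 0, 0, 0, err
	}
	if len(rows) < 2 {
		return 0, 0, 0, errors.New("no data rows")
	}
	// Locate columns by header name rather than by position.
	col := map[string]int{}
	for i, name := range rows[0] {
		col[name] = i
	}
	iNo, okNo := col["cumulative_no_pin"]
	iPin, okPin := col["cumulative_pin"]
	if !okNo || !okPin {
		return 0, 0, 0, errors.New("missing cumulative columns")
	}
	last := rows[len(rows)-1]
	if noPin, err = strconv.Atoi(last[iNo]); err != nil {
		return 0, 0, 0, err
	}
	if pin, err = strconv.Atoi(last[iPin]); err != nil {
		return 0, 0, 0, err
	}
	if noPin > 0 {
		pct = 100 * float64(noPin-pin) / float64(noPin)
	}
	return noPin, pin, pct, nil
}

func main() {
	// Illustrative rows only; run the benchmark for real numbers.
	data := "turn,reprocessed_no_pin,reprocessed_pin,cumulative_no_pin,cumulative_pin\n" +
		"1,1000,0,1000,0\n" +
		"2,1200,0,2200,0\n" +
		"3,1400,100,3600,100\n"
	noPin, pin, pct, err := savings(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("no pin: %d, pin: %d, saved: %.1f%%\n", noPin, pin, pct)
}
```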

vs CodeWhale

Honest positioning — CachePin is a shim, not a competing agent.

Capability	CachePin	Hmbown/CodeWhale
Harness-neutral (works with Claude Code, Cursor, opencode)	✓	✗ (is its own agent)
Full coding-agent experience (planning, tools, edits)	✗ (proxy only)	✓
Pins KV Cache across any OpenAI-compatible server	✓	partial (its own model path)
Drop-in: keep your current agent	✓	✗ (you switch agents)
Measures the exact mutation boundary	✓	—

If you want a batteries-included agent, CodeWhale is the better answer. If you want to keep the agent you already use and just stop burning the cache, that's CachePin.

Roadmap

m1 — proxy passthrough: transparent OpenAI-compatible reverse proxy with SSE streaming; the harness can't tell it's there.
m2 — track & report: per-session canonical-history tracker emitting preserved-prefix %, reprocessed tokens, and mutation events per turn.
m3 — pin & bench: --pin reconciliation that keeps the upstream KV Cache alive, plus the reproducible 50-turn benchmark.
Future: protocol spec for harness ↔ server append-only context; ecosystem docs links.

Contributing

Issues and PRs welcome — file an issue describing your harness + server combo and the mutation you're seeing, and attach the --ndjson output if you can. It makes the boundary obvious.
