You correct Claude Code the same way every single time. And every single time, it forgets. What if it didn't?
A learning layer that turns your corrections into the rules, memory, and skills your agent actually keeps.
No fine-tuning. No cloud. No weights touched. Just your real corrections and endorsements, compounding into the agent's instructions, gated by one human review.
Every coding session you teach your agent something: "don't put !important inline", "lead with the conclusion", "this codebase uses pnpm not npm". Then the session ends and all of it evaporates. Next week you type the same correction again.
Tools that fix this usually reach for fine-tuning (expensive, opaque, irreversible) or a giant always-on prompt (slow, and it drowns the signal). Both are the wrong size for "remember that I prefer X."
Treat learning as two cheap capture streams feeding one deliberate judgment:
- Watch what you do. A hook records tool-call patterns; a small observer model distills repeated ones into confidence-weighted instincts. (This engine is ECC, vendored here.)
- Listen to the conversation, both sides. Session-end skills capture the moments that matter, and not just yours: when you corrected Claude, when Claude caught itself, when you endorsed a move, when a framing landed. Correction and delight are each captured in two directions, from you and from Claude.
Both streams just dump candidates to a queue. Nothing becomes durable until you run /evolve, the single review surface that clusters every signal, routes each one through four gates, and writes whichever artifact fits: a rule, a memory entry, or a skill (preferring the cheapest one that does the job).
flowchart LR
A["What you DO<br/>(tool calls)"] -->|observer hook| Q1[("instincts")]
B["What's SAID<br/>(corrections + delight,<br/>from you AND from Claude)"] -->|session-end capture| Q2[("pending-evolve<br/>queue")]
Q1 -.read.-> E
Q2 -.read.-> E
E{{"/evolve<br/>4 gates, you approve once"}}
E --> R["rule<br/>(cheap)"]
E --> M["memory"]
E --> S["skill<br/>(expensive)"]
E --> X["defer"]
style E fill:#ffe0b2,stroke:#e65100,color:#1a1a1a
style B fill:#e8f4f8,stroke:#4a90a4,color:#143a47
style A fill:#e8f4f8,stroke:#4a90a4,color:#143a47
The phrase to remember: capture is cheap and automatic, judgment is deliberate and rare.
Most self-evolving agents either retrain weights or auto-rewrite one prompt. This one is built on a few opinions the others don't share:
- Capture and judgment are separated by design. Two always-on streams notice things cheaply. One human-triggered surface decides. Your review burden batches into a single moment instead of a nag after every session.
- Both polarities, both directions. It captures corrections and endorsements, and not just yours: when Claude catches its own mistake (self-correction) or gets an aha from your framing, that counts too. Claude can read its own reasoning trace, so it flags patterns you never even noticed. Most systems only ever learn from explicit user feedback.
- Routing is cost-aware. A
ruleis a few lines of markdown loaded only when relevant. Askillcosts system-prompt tokens in every session forever./evolveprefers the cheapest reversible artifact and only escalates when it must:rule < memory < skill < agent. - Conflict is a veto, not an average. One real counterexample refutes "this is globally true" and blocks promotion, no matter how many times the pattern agreed elsewhere. A learning system that only counts agreement quietly grows a self-contradictory ruleset.
- Confidence moves on three axes. Recurrence raises it, weekly decay lowers it, semantic conflict vetoes it. Stale lessons fade on their own.
- The eval corpus is free. Every accept/reject/defer you make at
/evolveis logged. That log is the labeled dataset for grading whether the capture skills are too noisy. No separate eval harness.
See docs/COMPARISON-OTHER-SYSTEMS.md for an honest table against Hermes, Sepo, the OpenAI cookbook, the self-evolving-agents survey, and EvoMap (including where the ideas overlap). In the survey's taxonomy (arXiv 2507.21046), this project sits at Memory/Tools, evolved inter-test-time, from textual feedback, in a single-agent coding setting: it deliberately avoids the heaviest routes (evolving weights or architecture) because only reversible behavioral instructions are worth letting an agent rewrite about itself.
git clone https://github.com/annexiao/claude-code-self-evolution.git
cd claude-code-self-evolution
bash install.sh --dry-run # see exactly what it will copy
bash install.sh # install into ~/.claude/Then two manual steps the installer prints for you:
- Merge
config/settings.example.jsoninto~/.claude/settings.json(wires the tool-call capture hook). - (optional) Schedule weekly confidence decay (launchd on macOS, cron on Linux).
Heads up: the observer is on by default. Once the hook is wired, the first tool call in a project lazy-starts a background daemon that calls Claude Haiku to distill your tool-call patterns into instincts. That is the point of the tool, but it does spend Haiku tokens. To run with only the conversation-capture half, set
"enabled": falseunderobserverin~/.claude/skills/continuous-learning-v2/config.json.
Requirements: Claude Code, the claude CLI on your PATH (the observer calls Claude Haiku), git, python3, and jq (for the review script).
| Moment | What happens | Who triggers it |
|---|---|---|
| During a session | The observer hook quietly records tool-call patterns | Automatic |
| End of a session | Run /save-session, and the capture skills scan the conversation for corrections + endorsements, dumping candidates to the queue |
You (/save-session) |
| Every few weeks | Run /evolve. It clusters both streams, proposes routes, shows one table, you approve |
You (/evolve) |
| Weekly, in the background | Unused instincts slowly decay in confidence | Automatic |
The chain at session end (/save-session runs these in order):
/save-session
-> correction-capture (you corrected Claude, or Claude self-corrected)
-> delight-capture (you endorsed a move, or a framing reframed someone)
All three are dump-only: they write candidate files and exit. They never interrupt you and never promote anything durable. That is /evolve's job, later, with your sign-off.
| Path | What it is | Origin |
|---|---|---|
skills/correction-capture/ |
Captures corrections, two directions (you to Claude, Claude to itself) | This project |
skills/delight-capture/ |
Captures endorsements + reframings, two directions | This project |
commands/evolve.md |
The cost-aware, four-gate /evolve judgment surface |
This project (rewrite of ECC's) |
commands/save-session.md |
Saves session state and fires the capture chain. If you already have a save-session, the installer injects the chain into it instead | This project |
scripts/apply-instinct-decay.py |
Deterministic weekly confidence decay | This project |
scripts/verify-pending-evolve.sh |
Queue health check, run by /evolve |
This project |
scripts/review-evolve-signals.sh |
Aggregates the decision log into capture-skill quality signals | This project |
engine/continuous-learning-v2/ |
The instinct/observer engine (the tool-call stream) | ECC, vendored |
docs/ARCHITECTURE.md |
The full single-source map, with diagrams | This project |
- docs/ARCHITECTURE.md: the complete map. Capture layer, distill layer, judgment layer, output sinks, the four gates, the eval loop, and the six design philosophies. Start here if you want to understand or fork it.
- docs/COMPARISON-WITH-ECC.md: exactly what this adds on top of ECC, and why.
- docs/COMPARISON-OTHER-SYSTEMS.md: how it sits among other self-evolving agent projects.
Built on ECC (everything-claude-code) by Affaan Mustafa (MIT). The instinct/observer engine is entirely ECC's design; this project adds the conversation-capture streams and the cost-aware, conflict-vetoing judgment layer on top. If you only want the tool-call learning engine, use ECC directly.