A multi-agent AI development pipeline for Claude Code. Catches bugs that "all tests passing" does not.
Built from research on 20+ academic papers and 50+ open-source repos, and validated on a 12-module production integration. Mutation testing showed that our TDD tests caught only 52% of potential bugs in the worst module (80% on average). After adopting this pipeline, the average rose to 91% and the worst module reached 95%.
Every code module goes through the same pipeline, numbered from step 0 to step 7:
0. SPECIFY → AI drafts acceptance criteria + edge cases + constraints
1. architect → Design doc + threat model (constrained by spec)
2. gatekeeper → Validate design + spec completeness
3. builder → TDD implementation (tests derived from spec)
4. VERIFY → Quality gates → attacker + reviewer → fix → Stryker → fix (loop)
5. cross-check → Different model family reviews (Gemini CLI)
6. gatekeeper → Final go/no-go with completion checklist
7. commit
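The ordering above, including the fix loop inside step 4, can be sketched as a small driver. This is only an illustration of the control flow: `run_pipeline`, the two callbacks, and the `max_fix_rounds` limit are hypothetical names that do not appear in the templates, and the real pipeline is driven by Claude Code agents rather than Python functions.

```python
from typing import Callable, List

# Step names mirror the list above; the labels in parentheses are added here
# only to tell the two gatekeeper passes apart.
STEPS = [
    "SPECIFY", "architect", "gatekeeper (design)", "builder",
    "VERIFY", "cross-check", "gatekeeper (final)", "commit",
]


def run_pipeline(
    run_step: Callable[[str], bool],
    fix: Callable[[], None],
    max_fix_rounds: int = 3,  # assumed limit, not taken from the source
) -> List[str]:
    """Run the steps in order and return a log of what happened.

    run_step(name) returns True when the step passes.
    VERIFY is retried after each fix, up to max_fix_rounds times.
    Any step that still fails stops the pipeline before commit.
    """
    log: List[str] = []
    for name in STEPS:
        passed = run_step(name)
        rounds = 0
        while name == "VERIFY" and not passed and rounds < max_fix_rounds:
            fix()
            rounds += 1
            log.append(f"fix round {rounds}")
            passed = run_step(name)
        if not passed:
            log.append(f"{name}: FAILED")
            return log
        log.append(f"{name}: ok")
    return log
```

The point of the sketch is that a failure never falls through to `commit`: only VERIFY loops, and every other step is a hard gate.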
5 specialized Claude Code agents, each with a specific role and specific tools:
| Agent | Role | Model |
|---|---|---|
| architect | Design + threat model. Read-only. | opus |
| builder | TDD implementation. Fingerprints neighbors. | sonnet |
| attacker | Adversarial chaos testing. Tries to break it. | opus |
| reviewer | Pattern compliance via neighbor-diff. Read-only. | sonnet |
| gatekeeper | Go/no-go decisions. Read-only. | opus |
See SETUP-GUIDE.md for step-by-step installation.
See presentation.html for the full evidence — research findings, mutation testing results, A/B experiments, and the specific incidents that justified each pipeline step.
├── README.md # You're here
├── SETUP-GUIDE.md # Engineer setup (15 min)
├── presentation.html # Evidence + rationale (CTO pitch)
├── LEARNINGS.md # Full research narrative
├── templates/
│   ├── CLAUDE.md # Pipeline template, adapt for your project
│   ├── agents/ # 5 agent definitions, genericized
│   │   ├── architect.md
│   │   ├── builder.md
│   │   ├── attacker.md
│   │   ├── reviewer.md
│   │   └── gatekeeper.md
│   └── prompts/
│       └── cross-validator.md
├── stryker/
│   ├── stryker.config.mjs # Reference Stryker config
│   └── run-stryker.sh # Helper script
└── research/ # Research artifacts
    └── PRESENTATION.md # Markdown source of presentation
Templates have [ADAPT] markers where you plug in project-specific details. The pipeline structure stays the same — only the domain knowledge changes.
See SETUP-GUIDE.md for details on what to adapt vs keep as-is.
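When adapting the templates, it can help to list every marker that is still unfilled. The following is a minimal sketch that assumes the marker appears as the literal text `[ADAPT]`; the helper name `find_adapt_markers` and the sample template text are hypothetical and are not copied from `templates/CLAUDE.md`.

```python
from typing import List, Tuple

MARKER = "[ADAPT]"  # assumed literal form of the marker


def find_adapt_markers(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) for every line containing the marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if MARKER in line
    ]


# Hypothetical template excerpt, used only to demonstrate the helper.
sample = """# Pipeline
Test command: [ADAPT] your test runner
Run the pipeline for every module.
Neighbor modules: [ADAPT] paths to compare against
"""

for number, line in find_adapt_markers(sample):
    print(f"line {number}: {line}")
```

To check a real file, read it with `pathlib.Path(...).read_text()` and pass the result in; an empty list means nothing is left to adapt.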
| Metric | Before | After |
|---|---|---|
| Mutation score (worst module) | 52% | 95% |
| Mutation score (average) | 80% | 91% |
| Bugs caught by reviewer | 0 | 1 CRITICAL (rate limiter misuse) |
| Bugs caught by Stryker | 0 | 4 modules below 50% exposed |
All numbers come from our own codebase, not from benchmarks.
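For readers new to mutation testing, the scores in the table are the percentage of valid mutants that the tests detect. The sketch below assumes the definition Stryker documents (detected = killed + timeout; valid = detected + survived + no coverage). The function name and the mutant counts are hypothetical, chosen only so the arithmetic lands on the 52% and 95% figures; they are not the actual counts from our modules.

```python
def mutation_score(killed: int, timeout: int, survived: int, no_coverage: int) -> float:
    """Percentage of valid mutants detected by the test suite.

    Mutants that fail to compile or crash at runtime are excluded before
    this calculation, so they do not appear as parameters.
    """
    detected = killed + timeout
    valid = detected + survived + no_coverage
    if valid == 0:
        raise ValueError("no valid mutants to score")
    return 100.0 * detected / valid


# Hypothetical counts for a weakly tested module: 52 of 100 mutants detected.
before = mutation_score(killed=50, timeout=2, survived=40, no_coverage=8)

# Hypothetical counts after strengthening the tests: 95 of 100 detected.
after = mutation_score(killed=93, timeout=2, survived=5, no_coverage=0)

print(f"before: {before:.0f}%  after: {after:.0f}%")
```

A surviving mutant is a code change that no test noticed, which is why a suite can be fully green and still score low.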