An experiment in structured AI development — specs, quality gates, and parallel agents applied to Claude Code.
I use Claude Code every day. It's great for writing code, but shipping software — coordinating parallel agents, testing, merging, recovering from failures — needs more structure. set-core is what I built around it: give it a markdown spec, it decomposes into independent changes, dispatches parallel agents in git worktrees, runs quality gates on each, and merges the results.
Every change goes through OpenSpec — a structured workflow (proposal → design → spec → tasks → code → verify) that gives agents contracts to work against instead of prompts to interpret. Quality gates are deterministic: exit codes, not LLM judgment.
Built with set-core, using set-core. This project was developed using its own orchestration pipeline.
This is an early alpha — a working experiment, not a finished product. It works well enough that I use it daily, but expect rough edges. The sentinel auto-recovers from most of them. See
docs/release/alpha-release.mdfor the current state, known issues, and what's not yet implemented.Open source (MIT) — in case it's useful to someone else too.
Read the FAQ — how SET compares to Cursor, Devin, Kiro, Copilot, Augment Intent, and others. Honest comparison — what SET does well, where others are ahead.
| Repository | Description |
|---|---|
| set-core | Core orchestration engine, web module, dashboard, CLI tools |
| set-spec-capture | Capture specs from any source (web, PDF, conversation) |
| set-voice-agent-delivery | Voice agent project type — Soniox STT, Google TTS, spec-driven customer interaction |
An agent working on a change — debugging, testing, fixing:
Input: A markdown spec + a v0.app design export (or any Next.js-shaped design source)
Orchestration: Phased execution, dependency DAG, quality gates on every change
Visibility: See where every minute goes — agent sessions, LLM calls, tool executions, sub-agents, gates — all placed on a real-time axis. Click any implementing span to drill into its per-tool breakdown.
Activity timeline (top) — every change as a row, every gate and LLM call placed in time. Drilldown (below) — one implementing span expanded into per-tool, per-LLM-call, per-sub-agent breakdown with the longest operations called out.
Output: Running application — built entirely from the spec, faithful to the design source
Spec + v0 design → parallel agents → quality gates (including design-fidelity) → working app. Zero intervention.
spec.md ──► digest ──► decompose ──► parallel agents ──► verify ──► merge ──► done
What's actually happening under the hood
spec.md + design source (v0.app export → manifest, or design-snapshot.md)
│
▼
┌───────────────────────────────────────────────────────────┐
│ Sentinel (autonomous supervisor) │
│ ├─ digests spec into requirements + domain summaries │
│ ├─ decomposes into independent changes (DAG) │
│ ├─ dispatches each to its own git worktree │
│ ├─ monitors progress, restarts on crash │
│ ├─ merges verified results back to main │
│ └─ auto-replans until full spec coverage │
│ │
│ Per change: │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Ralph Loop │ │
│ │ ├─ OpenSpec artifacts (proposal → design → code) │ │
│ │ ├─ iterative implementation with tests │ │
│ │ ├─ progress-based trend detection │ │
│ │ └─ auto-pause on stall or budget limit │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Quality gates (per change, before merge): │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Jest/Vitest → Build → Playwright E2E │ │
│ │ → Code Review → Spec Coverage → Smoke Test │ │
│ │ (gate profiles: per-change-type configuration) │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Across all agents: │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Memory Layer │ │
│ │ ├─ 5-layer hooks inject context per tool │ │
│ │ ├─ agents learn from each other's work │ │
│ │ └─ conventions survive across sessions │ │
│ └──────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘
│
▼
merged, tested, done
| Feature | Description | |
|---|---|---|
| ⚙️ | Full Pipeline | Spec to merged code — digest, decompose, dispatch, verify, merge — hands-off. Guide |
| 🛡️ | Quality Gates | Test, build, E2E, code review, spec coverage, and smoke — deterministic, not LLM-judged. Guide |
| 🧠 | Persistent Memory | Hook-driven cross-session recall — agents learn from each other. Infrastructure saves, not voluntary. Guide |
| 📊 | Web Dashboard | Real-time monitoring — orchestration state, agents, tokens, issues, learnings. Guide |
| 📋 | OpenSpec Workflow | Structured artifact flow (proposal → design → spec → tasks → code) minimizes hallucination. Guide |
| 🔧 | Self-Healing | Issue pipeline: detect → investigate → fix → verify. The sentinel diagnoses before it acts. Guide |
| 🧩 | Plugin System | Project-type plugins add domain rules, gates, templates, and conventions. Docs |
| 🎨 | Design Bridge | v0.app export → manifest → per-change design slice → Tailwind tokens + shell components injected into every agent's context. Guide |
| 📈 | Cross-Run Learnings | Review findings and gate failures become rules for the next run. The system gets better with use. Dashboard |
| 🔁 | Account Manager | Manage multiple Claude Code accounts — register, monitor usage, manually switch. Docs |
set-core is battle-tested through consumer projects — E2E runs that exercise the full pipeline from spec to merged app. During these runs, agents discover and fix real problems: build gate failures, middleware bugs, test flakiness. These fixes are framework-level insights that belong in set-core, not just in the consumer project.
set-core (framework) consumer project (E2E run)
│ │
├── set-project init ───────────────────►│ deploy rules, templates, gates
│ │
│ ├── agents build features
│ ├── gates catch failures
│ ├── ISS pipeline creates fixes ◄── valuable!
│ │
│◄── set-harvest ───────────────────────┤ review + adopt framework fixes
│ │
├── update planning rules, templates │
├── set-project init ───────────────────►│ redeploy with improvements
After every E2E run, harvest the fixes:
set-harvest # scan all registered projects
set-harvest --project craftbrew-run-20260320-1445 # scan single project
set-harvest --dry-run # preview without updating stateThe harvest tool scans ISS fix commits chronologically, classifies them as framework-relevant or project-specific, and suggests where to adopt them (planning rules, templates, or core code). Each fix is reviewed interactively — no auto-adoption.
Custom projects: The harvest command runs from the set-core repo, not the consumer project. After an orchestration run completes on any registered project, switch to the set-core directory and run set-harvest. This is a manual step — the system cannot auto-adopt fixes without human review.
cd /path/to/set-core # switch to set-core repo
set-harvest # review all registered projects
set-harvest --project my-app # review single projectActive development priorities:
| Direction | Goal | Status |
|---|---|---|
| Divergence reduction | Eliminate remaining nondeterminism through template optimization, scaffold testing, and configuration distribution across core → module → scaffold → project layers | Measurably reduced for simple projects; complex projects still improving — tracked across paired E2E runs |
| Build time optimization | Reduce gate pipeline wall clock time — parallel gate execution, incremental builds, cached test results between changes | Currently sequential (Jest → Build → E2E → Review); exploring parallel gates where safe |
| Session context reuse | Reuse conversation context across Ralph Loop iterations and between related changes — reduce cold-start token overhead | Currently each iteration starts fresh; investigating warm-start from previous iteration's state |
| Memory optimization | Smarter recall — relevance scoring, dedup, consolidation. Lite-mode hooks for low-context sessions | Dedup, consolidation, and SET_MEMORY_HOOKS=lite mode operational; learning-to-rule conversion deferred |
| Gate intelligence | Per-change-type gate profiles that adapt based on historical pass rates and failure patterns | Gate profiles operational; adaptive thresholds planned |
| Merge conflict prevention | Proactive detection of cross-cutting file conflicts before they happen — schedule conflicting changes sequentially | Phase ordering works; file-level conflict prediction in research |
See docs/learn/journey.md for the full development history and docs/learn/lessons-learned.md for production insights driving these priorities.
AI agents are nondeterministic — run the same prompt twice, get different results. The experiment here: does adding structure (specs, gates, templates) make the output converge?
| Challenge | Our Approach | Result |
|---|---|---|
| Output divergence | 3-layer template system — templates lock structure, agents focus on logic | 83–87% structural convergence across paired runs (report) |
| Hallucination | OpenSpec workflow — structured artifacts with requirements + acceptance criteria | Agents implement against spec, not imagination |
| Quality roulette | Programmatic gates — exit codes, not LLM judgment. 7 gate types | Deterministic pass/fail |
| Spec drift | Coverage tracking — verifies "does it satisfy the spec?" not just "do tests pass?" | Auto-replan when coverage < 100% |
| Failure recovery | Issue pipeline — detailed investigation before any fix. No guessing. | 30-second recovery, not hours |
| Agent amnesia | Hook-driven memory — shared across worktrees, survives sessions | Zero voluntary saves → 100% capture via hooks |
| Framework reliability | E2E scaffold testing — the orchestrator tests itself | 30+ runs across 4 project scaffolds |
Structural convergence across paired runs: 83% minishop, 87% micro-web. Schema equivalence at 100%, convention compliance at 100%. The remaining divergence is stylistic, not structural. Not perfect, but converging. Full data →
Writing a good spec takes effort. That's intentional — the quality of the output depends on the quality of the input. The trade-off SET makes: upfront structure for reliable results.
A good spec for SET includes:
- Data model — entities, fields, relationships, enums (becomes the Prisma schema)
- Page layouts — sections, column counts, component names (not vague descriptions)
- Design tokens — exact hex colors, font families, spacing values (or use v0.app +
set-design-importto pull a working Tailwind/shadcn export) - Auth & roles — protected routes, user roles, registration flow
- Seed data — realistic names and content, not "Product 1"
- i18n — locales, framework, URL structure (if multilingual)
- Business requirements — what the user should be able to do, with acceptance criteria
Use /set:write-spec in Claude Code to generate a structured spec interactively — it detects your project type, asks targeted questions per section, and integrates with your design source (v0.app export or Figma). Works for web apps, APIs, CLI tools, and any project type.
The project type templates handle the rest — framework boilerplate, build config, test setup, linting rules. You focus on what to build. The templates ensure how it gets built is consistent and deterministic.
The better the spec, the better the result. Agents working from a detailed spec produce dramatically better output than agents working from a conversation.
See the Writing Specs guide for the full methodology, the CraftBrew scaffold for a production-quality example with Figma design integration, and the MiniShop spec for a minimal but complete example.
git clone ssh://git@git.setcode.dev:2222/root/set-core.git
cd set-core && ./install.shAfter install, the web dashboard starts automatically as a background service (launchd on macOS, systemd on Linux). Open http://localhost:7400 — you should see the manager page.
Before setting up your own project, see the full pipeline in action. Open a Claude Code session from the set-core directory and type:
run a micro-web E2E test
Claude will scaffold a project, register it with the manager, and validate the gate pipeline. Then tell Claude to start the sentinel — the orchestration runs through the manager API at http://localhost:7400.
Watch the dashboard as it progresses — you'll see phases, gate results, token usage, and the final application. The micro-web test builds a simple 5-page site (home, about, blog, contact) in ~20 minutes.
When the orchestration completes, tell Claude:
start the application that was just built
Claude will install dependencies, start the dev server, and open the app in your browser.
For a more complex test, try the MiniShop — a full e-commerce app (products, cart, admin panel, auth) built from a detailed spec with Figma design:
run a minishop E2E test
Now that you've seen how it works, set up orchestration for your own project:
cd ~/my-project
set-project init --project-type web --template nextjsWrite your spec — use /set:write-spec in Claude Code for an interactive guide that detects your project and asks the right questions:
/set:write-spec
Sync your design (optional but recommended) — pull a v0.app export and let set-core scan it for shell components, tokens, and hygiene issues:
set-design-import --git <v0-repo-url> --ref main --scaffold .
set-design-hygiene # report mock arrays, hardcoded strings, broken routesThe design-fidelity gate runs on every UI change and blocks merge if the agent's output diverges from the v0 export's component structure or tokens.
Start the orchestration — use /set:start in Claude Code, or start from the dashboard:
/set:start docs/spec.md
See docs/guide/writing-specs.md for the complete spec-writing methodology, docs/guide/design-integration.md for the design source → agent pipeline, and docs/guide/quick-start.md for the full setup walkthrough.
Core orchestration:
| Component | Technology |
|---|---|
| Agent runtime | Claude Code (Anthropic) |
| Workflow | OpenSpec — spec-driven artifact pipeline |
| Isolation | Git worktrees — real branches, real merges |
| Engine | Python, FastAPI, uvicorn |
| Dashboard | React, TypeScript, Tailwind CSS |
| Memory | shodh-memory (RocksDB + vector embeddings) |
| Design bridge | v0.app export → set-design-import → manifest + per-change design slice → agent context, with design-fidelity gate at merge |
Tooling ecosystem:
| Tool | Purpose |
|---|---|
| set-spec-capture | Capture specs from any source (web, PDF, conversation) |
| set-design-import | Pull a v0.app export, generate the design manifest, optionally run hygiene scan |
| set-design-hygiene | Scan a design export for 9 antipatterns (mock arrays, hardcoded strings, broken routes, …) |
| set-run-logs | Forensic CLI for completed orchestration runs — events, gate timing, agent decisions |
| set-e2e-report | Generate benchmark reports from orchestration runs |
| set-router | Manage multiple Claude Code accounts — register, switch, monitor usage (docs) |
Built-in modules add domain-specific technology (Next.js, Prisma, Playwright for web; Soniox STT, Google TTS for voice). See Plugins.
set-core is a framework with a plugin system. The core orchestration engine is open source. Project types — domain-specific rules, templates, and conventions — can be public or private.
The web project type (Next.js, Prisma, Playwright) ships built-in and is validated through synthetic E2E orchestration runs that simulate real development environments — reproducible, measurable, tracked.
Custom project types in development include voice agent delivery (Soniox TTS/STT with spec-driven customer interaction), and others not yet public. The plugin architecture lets anyone create their own domain-specific type with custom gates, templates, and conventions.
| Metric | Value |
|---|---|
| Commits | 1,870 (across 88 days) |
| Capability specs | 429 |
| Active changes | 5 in flight, 424 archived |
| Codebase | 109K LOC (89K Python, 20K TypeScript) + Shell, specs, docs, templates |
| Built-in modules | web (Next.js + Prisma + Playwright), example (reference plugin) |
| MiniShop benchmark | 6/6 merged, 0 interventions, 1h 45m |
| Latest milestones | v0.app design pipeline, design-fidelity gate, fix-iss circuit-breaker, forensics CLI, USD cost metrics |
Worth every hour. Full journey, benchmarks, and lessons: docs/learn/journey.md
Single-agent was the start. Orchestration is the present. Enterprise is preparing.
Systems like SET can do the work of a full development team — given the right specification and properly developed project types. Period.
This is not the future. This is the present. The sooner we move in this direction, the sooner we'll see what software development actually becomes — instead of clinging to the assumption that manual development or even manual code review should remain the default.
Claude Code is extraordinarily capable. But we can't expect it to guess what we haven't specified. When the model fills in gaps we left empty, we call it "hallucination" — but most of the time, it's underspecification on our side. The frustration of repeated errors, unexpected behavior, inconsistent output — 90% of this comes from insufficient context, not model limitation.
And even with detailed, multi-hundred-page specifications, we cannot guarantee that the model treats every character with equal priority during implementation. This is precisely why systems like SET exist: to enforce, verify, and check that the model understood and executed what we intended. OpenSpec structures the work. Quality gates verify the output. The sentinel investigates before it fixes. No guessing.
Banks, government, defense, regulated industries — they can't use cloud-hosted models today. But this will change. On-premise models, secure multi-tenant systems, hybrid architectures — the infrastructure is coming. Some organizations are already preparing.
The responsible thing for every enterprise is to prepare now — before the models arrive on their infrastructure. Learn orchestration patterns. Build project types for your domain. Develop the muscle memory for spec-driven development. Don't wait for the infrastructure to be ready; be ready when it arrives.
We built set-core to solve our own problems. The web project type is public and battle-tested. But the real power comes from domain-specific project types — fintech with IDOR rules, healthcare with HIPAA compliance, e-commerce with payment flow gates.
Model providers (Anthropic included) will build orchestration into their platforms — we welcome that. These middleware layers are destined to become first-party features. But that doesn't mean we shouldn't build them ourselves: this is how we shape the tools to our needs and accumulate knowledge that vendor-generic solutions can't provide.
Start now. There will be bugs. But this is a self-healing system — the sentinel detects, investigates, and fixes. The more people use it, the faster it improves.
| Section | Contents |
|---|---|
| Guide | Quick start, writing specs, design integration, orchestration, sentinel, worktrees, OpenSpec, memory, dashboard |
| Reference | CLI tools, configuration, architecture, plugins |
| Learn | How it works, development journey, benchmarks, lessons learned |
| Research | Dated deep-dives: token optimization, cache tier analysis, divergence studies, framework comparisons |
| Examples | MiniShop walkthrough, first project setup |
| Deep Dive | 18-chapter technical reference covering every pipeline stage |
| Contributing | Dev setup, testing, plugin development, code style |
MIT — See LICENSE for details.
Website: setcode.dev · Source: git.setcode.dev/root/set-core







