Make AI coding agents prove they are done. The agent can only propose done; a check reads your real output files and only then grants complete.
AI coding agents are goal-driven. Give Codex or Claude Code a goal and it optimizes hard toward the main line — build the page, fix the bug, produce the run. On longer tasks it often skips the user-visible details that were implied, scattered through the thread, or never written as a test.
The goal is not the acceptance criteria.
Example — "add a monthly sales report page" can end with a page that exists and tests that pass, while the CSV export is missing, the chart has one data point, the title still says "Untitled", and the empty state is broken. The agent honestly believes it's done. That's the problem.
agent-completion-gate turns "done" into an external acceptance check. The agent can only propose candidate_complete; a protected gate reads the real artifacts and grants complete only when a human-written acceptance manifest passes. Plain files + one Python script — no service, no account, no lock-in.
The gate does not infer what the user meant; a human distills the acceptance criteria into the manifest, and the gate prevents the agent from self-certifying against anything less. It complements your tests and CI — it checks user-visible acceptance surfaces teams rarely unit-test (a missing export, a degenerate chart, a renamed run), not code correctness.
OpenSpec helps you define what to build before coding; agent-completion-gate checks whether the finished artifacts satisfy acceptance before the agent can call the task done.
Watch the gate read real files and disagree with an agent that says "done":
pip install pyyaml
git clone https://github.com/zhjai/agent-completion-gate && cd agent-completion-gate
sh examples/minimal-project/run.sh
The everyday case: "add a monthly sales report page". The agent reports candidate_complete both times; only the real artifacts differ:
===== BEFORE — agent did the headline task, missed the details (expect BLOCKED) =====
FAIL report_has_multiple_points: rows points=1 (min 2)
FAIL csv_export_present: file exports/monthly.csv exists=False
-> BLOCKED (exit 1). The agent could NOT call this done.
===== AFTER — agent fixed the real artifacts (expect COMPLETE-OK) =====
PASS report_has_multiple_points: rows points=3 (min 2)
PASS csv_export_present: file exports/monthly.csv exists=True
-> COMPLETE-OK (exit 0).
More: examples/run.sh (overstep / blocked / granted), examples/diff_demo.sh (catch a worker under-reporting what it touched), examples/swanlab/ (the real ML incident that motivated this kit).
cd your-project
npx skills add zhjai/agent-completion-gate -g -a claude-code # or -a codex, cursor, … any host
Then you (a human) scaffold the gate into your repo with one command, no manual copying:
# the engine + scaffolder live in the repo (the skill teaches the procedure; it doesn't ship the engine)
git clone https://github.com/zhjai/agent-completion-gate /tmp/acg
cd your-project && sh /tmp/acg/scripts/init.sh --dest .
It creates gate/ (the engine + an empty, passable manifest), control/surface_inventory.yaml, state/, .github/workflows/completion-gate.yml, and a CODEOWNERS example. Idempotent; never clobbers your edited specs without --force. (Prefer typing one line to your agent? Ask it to "set up the completion gate"; the completion-gate-init skill runs this same script. The script is the source of truth.)
Fresh install is intentionally permissive. Empty specs pass: at this point the gate only stops the agent from self-declaring complete; it does not yet know your project's artifacts. To make it useful you add at least one surface and one check.
1 — Define what "done" means (the human distills intent into checks; the gate doesn't infer it). Edit control/surface_inventory.yaml:
surfaces:
  - id: report
    user_visible: true
    paths: ["artifacts/report.json"]
…and gate/acceptance_manifest.yaml:
checks:
  - id: report_has_multiple_points
    surface: report
    type: min_series_points
    artifact: "artifacts/report.json"
    series: "rows"
    min_points: 2
review_items: []
Built-in check types: file_exists, config_not_disabled, min_series_points, max_chart_count, identity_in_name (extend run_machine_check() for your own). A fuller worked spec: examples/swanlab/.
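To make the check types concrete, here is a minimal sketch of what file_exists and min_series_points could look like. It is not the code inside run_machine_check(): the function names are hypothetical, and it assumes the artifact is a JSON object whose series key (here "rows") maps to a list. The message strings mirror the demo output above.

```python
import json
from pathlib import Path


def check_file_exists(repo: Path, artifact: str) -> tuple[bool, str]:
    """Hypothetical stand-in for a `file_exists` check."""
    exists = (repo / artifact).is_file()
    return exists, f"file {artifact} exists={exists}"


def check_min_series_points(
    repo: Path, artifact: str, series: str, min_points: int
) -> tuple[bool, str]:
    """Hypothetical stand-in for a `min_series_points` check on a JSON artifact."""
    try:
        data = json.loads((repo / artifact).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing or unparseable artifact: fail closed rather than guess.
        return False, f"{series} unreadable artifact {artifact}"
    # Assumption: the series is a top-level key holding a list of points.
    points = data.get(series) if isinstance(data, dict) else None
    count = len(points) if isinstance(points, list) else 0
    return count >= min_points, f"{series} points={count} (min {min_points})"


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / "artifacts").mkdir()
        report = {"rows": [{"month": "2024-01", "total": 10}]}
        (repo / "artifacts" / "report.json").write_text(json.dumps(report))
        # One data point against min_points=2, and no CSV export on disk.
        print(check_min_series_points(repo, "artifacts/report.json", "rows", 2))
        print(check_file_exists(repo, "exports/monthly.csv"))
```

Both checks read files on disk and nothing else, which is the point: the agent's own status report never enters the decision.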
2 — Run it locally:
printf 'status: candidate_complete\ntouched_surfaces: [report]\nreview_queue: []\n' > state/completion_candidate.yaml
python3 -E gate/check_acceptance.py --manifest gate/acceptance_manifest.yaml \
  --inventory control/surface_inventory.yaml --candidate state/completion_candidate.yaml --repo .
Missing the data points → BLOCKED. Once the real artifact is right → COMPLETE-OK.
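If you would rather drive the local run from Python, the two shell steps above could be wrapped roughly as follows. This is a sketch: the helper names are hypothetical, the flags are the ones shown in the shell command, and treating exit code 0 as COMPLETE-OK is an assumption taken from the demo output.

```python
import subprocess
from pathlib import Path


def write_candidate(repo: Path, touched: list[str]) -> Path:
    """Write the worker's self-report; it only ever claims candidate_complete."""
    candidate = repo / "state" / "completion_candidate.yaml"
    candidate.parent.mkdir(parents=True, exist_ok=True)
    surfaces = ", ".join(touched)
    candidate.write_text(
        "status: candidate_complete\n"
        f"touched_surfaces: [{surfaces}]\n"
        "review_queue: []\n",
        encoding="utf-8",
    )
    return candidate


def gate_command(candidate: Path) -> list[str]:
    """Build the same command line as the shell example (paths relative to the repo root)."""
    return [
        "python3", "-E", "gate/check_acceptance.py",
        "--manifest", "gate/acceptance_manifest.yaml",
        "--inventory", "control/surface_inventory.yaml",
        "--candidate", candidate.as_posix(),
        "--repo", ".",
    ]


def run_gate(repo: Path, touched: list[str]) -> bool:
    """Return True only if the gate exits 0 (assumed here to mean COMPLETE-OK)."""
    candidate = write_candidate(repo, touched)
    result = subprocess.run(gate_command(candidate.relative_to(repo)), cwd=repo)
    return result.returncode == 0
```

Calling run_gate(Path("."), ["report"]) from a scaffolded project root would reproduce the shell example. Anything other than a clean exit is treated as not done.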
3 — Make it the authority. The scaffolded .github/workflows/completion-gate.yml runs this on every PR. Mark the verify-completion job a required status check, and CODEOWNERS-protect gate/, control/, and the workflow (see the generated .github/CODEOWNERS.completion-gate.example). Now complete means exactly one thing: that check is green. Trust model + the agent Stop-hook option: integrations/README.md.
The completion-audit skill instructs the agent: at task wrap-up, write state/completion_candidate.yaml (status: candidate_complete, plus the surfaces it touched), then run the gate. The agent can reach at most candidate_complete — only the external verifier (CI / a hook) ever writes complete. If the gate blocks, it fixes the real artifacts and re-audits. Your loop: do the work → audit completion → CI verdict → fix blocked reasons or merge.
in_progress ──► candidate_complete ──►(EXTERNAL verifier)──► complete
     │                                        └─► blocked
     └────────► blocked (needs-review / unknown surface / missing evidence)
The worker can only reach candidate_complete or blocked. Only an external verifier writes complete. needs-review == blocked (not an annotation the agent can set and move past). The kit ships the check, the contract, and the wiring: check_acceptance.py returns a verdict; gate/verify_completion.sh enforces the state machine around it (rejects a worker that wrote complete itself; grants only on a clean pass); integrations/ attaches it as CI / hook. Full contract: STATE_MACHINE.md.
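The transition rules in the diagram can be sketched as a small table keyed by who is acting. This is an illustration only, not the logic in gate/verify_completion.sh: the State and Actor names are hypothetical, only the transitions drawn above are encoded, and mapping an unrecognized status to blocked is an assumption in the fail-closed spirit. STATE_MACHINE.md remains the authoritative contract.

```python
from enum import Enum


class State(Enum):
    IN_PROGRESS = "in_progress"
    CANDIDATE_COMPLETE = "candidate_complete"
    BLOCKED = "blocked"
    COMPLETE = "complete"


class Actor(Enum):
    WORKER = "worker"
    VERIFIER = "external_verifier"


# (actor, from, to) triples taken from the diagram; everything else is rejected.
ALLOWED = {
    (Actor.WORKER, State.IN_PROGRESS, State.CANDIDATE_COMPLETE),
    (Actor.WORKER, State.IN_PROGRESS, State.BLOCKED),
    (Actor.VERIFIER, State.CANDIDATE_COMPLETE, State.COMPLETE),
    (Actor.VERIFIER, State.CANDIDATE_COMPLETE, State.BLOCKED),
}


def transition(actor: Actor, current: State, target: State) -> State:
    """Return the new state, or raise if this actor may not make this move."""
    if (actor, current, target) not in ALLOWED:
        raise PermissionError(
            f"{actor.value} may not move {current.value} -> {target.value}"
        )
    return target


def normalize_status(raw: str) -> State:
    """Map a status string to a state; needs-review counts as blocked."""
    if raw == "needs-review":
        return State.BLOCKED
    try:
        return State(raw)
    except ValueError:
        # Assumption: an unrecognized status fails closed.
        return State.BLOCKED


if __name__ == "__main__":
    state = transition(Actor.WORKER, State.IN_PROGRESS, State.CANDIDATE_COMPLETE)
    try:
        transition(Actor.WORKER, state, State.COMPLETE)  # self-certification
    except PermissionError as err:
        print("rejected:", err)
    print(transition(Actor.VERIFIER, state, State.COMPLETE).value)
```

Keying the table on the actor makes the central rule explicit: no row lets the worker reach complete.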
- A rule is advisory — a goal rationalizes past it.
- A skill can be skipped — the agent chooses not to invoke it.
- Memory records belief, not verified truth.
- Only a gate the agent can't edit, on a path it can't skip, reading artifacts it can't fake reliably stops "looks done but isn't."
OpenSpec — planning before coding (agree on what to build)
agent-lessonbook — capture corrections & drift lessons during the work
agent-completion-gate — acceptance before "done"
agent-lessonbook is an optional companion that captures process lessons during execution. This gate is standalone — it never reads lessonbook (or any memory) at runtime; it reads only its own --manifest/--inventory. Only a human may translate a recurring lesson into the gate's protected manifest.
Hardened across multiple heterogeneous (Codex × Claude) review rounds — each invariant closed a reproduced bypass. "External + fail-closed under a trusted base branch + runner", not "unbypassable":
- Gate + manifest + inventory are protected (read-only, outside the agent-writable workspace, maintained only through human/CI-reviewed changes). check_acceptance.py --agent-writable-root DIR enforces this at runtime.
- Inspect real artifacts, never run_state.
- Unknowns fail closed: a touched user-visible surface with no passing check → blocked. The touched_surfaces list is a worker self-report; use --strict-surfaces or --diff-base <ref> / --touched to derive it from the real git diff instead of trusting the worker.
- One canonical completion signal (the gate's verdict); chat / PR / dashboard derive from it, never become an independent "complete".
- Artifact content is hostile data, not instructions: deterministic checks first; an LLM verifier treats artifacts as untrusted.
- Hermetic execution: the gate runs as python3 -E (ignores PYTHON* env / repo-planted yaml.py), and CI runs it from the trusted base branch so a PR can't edit the gate that judges it.
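The "unknowns fail closed" rule and the distrust of the worker's self-report can be sketched as follows. This is an illustration, not the gate's implementation. The function names and the shape of check_results are hypothetical. Matching inventory paths as glob patterns is an assumption. The changed-path list is assumed to come from something like git diff --name-only.

```python
from fnmatch import fnmatch


def touched_from_paths(inventory: dict, changed_paths: list[str]) -> list[str]:
    """Derive touched surface ids from changed files instead of a self-report."""
    touched = []
    for surface in inventory["surfaces"]:
        patterns = surface.get("paths", [])
        if any(fnmatch(p, pat) for p in changed_paths for pat in patterns):
            touched.append(surface["id"])
    return touched


def under_reported(reported: list[str], derived: list[str]) -> list[str]:
    """Surfaces the diff shows were touched but the worker did not list."""
    return sorted(set(derived) - set(reported))


def verdict(inventory: dict, check_results: list[dict], touched: list[str]):
    """Block on unknown surfaces, uncovered surfaces, and failing checks."""
    surfaces = {s["id"]: s for s in inventory["surfaces"]}
    reasons = []
    for sid in touched:
        surface = surfaces.get(sid)
        if surface is None:
            reasons.append(f"unknown surface: {sid}")
            continue
        if not surface.get("user_visible", False):
            continue
        results = [r for r in check_results if r["surface"] == sid]
        if not results:
            reasons.append(f"no check covers touched surface: {sid}")
        elif not all(r["passed"] for r in results):
            reasons.append(f"failing check on surface: {sid}")
    return ("BLOCKED" if reasons else "COMPLETE-OK"), reasons


if __name__ == "__main__":
    inventory = {"surfaces": [
        {"id": "report", "user_visible": True, "paths": ["artifacts/report.json"]},
    ]}
    derived = touched_from_paths(inventory, ["artifacts/report.json", "README.md"])
    print(under_reported([], derived))     # the worker reported nothing
    print(verdict(inventory, [], derived))  # touched, but no check ran
    print(verdict(inventory, [{"surface": "report", "passed": True}], derived))
```

The default outcome is BLOCKED: a surface only passes when the inventory names it and a check covering it succeeded.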
- scripts/init.sh: scaffold the gate into your project (the authoritative setup path).
- STATE_MACHINE.md: the completion contract (states, transitions, wiring).
- integrations/README.md: CI / agent-hook / pre-push wiring + the trust model.
- examples/: runnable examples: minimal-project/ (everyday web task), run.sh, diff_demo.sh, diff_rename_test.sh, swanlab/ (the ML incident).
- CHANGELOG.md · self-tests in tests/.
v0.3.1 preview. MIT. Agent-agnostic, file-based, fail-closed. Optional companion: agent-lessonbook.