A pattern reference for governed multi-agent engineering delivery: scoped work packages, handoff queues, isolated execution workspaces, review gates, and durable merged state.
This repository ships the pattern reference: documentation, templates, and one synthetic worked walkthrough. Executable orchestration is not in scope at v1.
- A pattern reference for a governed multi-agent engineering-delivery workflow.
- Documentation (architecture, operating model, lifecycle, vocabulary, private-vs-public separation, expanded long-form architecture under docs/architecture/) plus four reusable templates plus one synthetic end-to-end walkthrough.
- A supporting repository in a four-repo portfolio. The three flagship repositories named under Adjacent repositories carry their own technical depth.
- Not the exact internal operating system used in any private work. The repo is a public-safe re-derivation of the pattern, not a snapshot of any real run-state.
- Not a deployed platform or runtime. No customer deployment claim.
- Not a managed service or proprietary SaaS.
- Not a generalized multi-tenant agent platform.
- Not a workshop, training course, adoption field-kit, or facilitator artifact.
- Not a flagship proof repository with captured-run evidence.
This README is the orientation document. The longer architecture coverage and component-level documentation live under docs/.
- docs/architecture/overview.md: long-form architecture entry point; the seven major surfaces and how they fit together.
- docs/architecture/state-machine.md: the work-package lifecycle, legal state transitions, and the runner's enforcement points.
- docs/architecture/runner-modes.md: CLI and API entry points; the DryRun / Shadow / Canary / Live promotion ladder.
- docs/architecture/handoff-queue.md: append-only transport semantics, queue-entry shape, return contract, queue discipline rules.
- docs/architecture/candidate-writes-and-canon-gate.md: the candidate-to-canon path, Canon Gate policy boundary, commissioning, rollback.
- docs/architecture/substrate.md: transactional store + vector retrieval + typed graph engine + local-model evaluator layer at the architectural level.
- docs/architecture/audit-and-replay.md: append-only audit chain, replay determinism, anti-replay, rollback, override.
- docs/architecture/identity-and-security.md: principal model, capability bundles, tenant boundaries, fail-closed defaults, secrets hygiene.
- docs/architecture/human-ownership.md: the five seats of human ownership; why local models recommend rather than commission.
- docs/architecture/cloud-portable-shape.md: local-first execution; cloud-portable target; what changes, what stays the same.
- docs/architecture/glossary.md: every architecture term defined in plain workflow language.
- docs/diagrams/: professional Mermaid diagram sources used across the architecture documents.
- docs/operating-model.md, docs/lifecycle-and-state.md, docs/vocabulary.md, docs/private-vs-public.md: the v1 framing documents, preserved.
- templates/: the v1 templates for work packages, handoff messages, return artifacts, and review responses.
- examples/oss-library-maintenance/: the v1 synthetic worked walkthrough.
If you only have time to read a few files, start with docs/architecture/overview.md, then this README's North Star and Architecture in prose sections below.
agentic-ops is not a Markdown-first workflow with scripts attached later. The pattern's finish line is a code-driven state machine with Markdown and YAML control surfaces around it.
The substrate the pattern describes is a governed agentic operating layer: a way to run serious agent work with durable state, scoped work packages, isolated execution, governed memory writes, review gates, audit trails, and human ownership. Agents do not live only in chat. They work through a system that knows who they are, what they are allowed to read, what they are allowed to write, which state they are in, what proof they owe, and who must review the result before anything becomes canon.
The durable model:
flowchart TB
humanOwner["Human owner / review authority"]
queue["Handoff queue: work package + onboarding"]
identity["Scoped agent identity: role + tools + allowed surfaces"]
workspace["Isolated execution workspace: branch + target repo"]
sm["Code-driven state machine"]
cli["CLI path"]
api["API path"]
outputs["Candidate writes / proof artifacts / review requests"]
canonGate["Canon Gate: policy + redaction + provenance"]
model["Local model evaluator: scoring + comparison + commissioning support"]
reviewGate["Review gate authority"]
canonState["Commissioned canon: accepted state only"]
humanOwner --> queue --> identity --> workspace --> sm
sm --> cli
sm --> api
cli --> outputs
api --> outputs
outputs --> canonGate
canonGate --> model
model --> reviewGate
reviewGate --> canonState
style humanOwner fill:#0f172a,stroke:#38bdf8
style queue fill:#0f172a,stroke:#38bdf8
style identity fill:#0f172a,stroke:#38bdf8
style workspace fill:#172554,stroke:#60a5fa
style sm fill:#172554,stroke:#60a5fa
style cli fill:#172554,stroke:#60a5fa
style api fill:#172554,stroke:#60a5fa
style outputs fill:#172554,stroke:#60a5fa
style canonGate fill:#3f1d1d,stroke:#f97316
style model fill:#3f1d1d,stroke:#f97316
style reviewGate fill:#3f1d1d,stroke:#f97316
style canonState fill:#123524,stroke:#22c55e
The control files matter. Markdown and YAML make the work inspectable, governable, and human-readable. But they are not the runtime engine. Code owns the execution state machine, bootstrap, harness, substrate writes, API behavior, retry and stop behavior, audit, and validation.
The architecture starts with a simple premise: serious agent work needs an operating surface, not just a chat window. Chat is useful for instruction and collaboration, but it is a weak system of record. The pattern in this repository describes the operating substrate around the work: every agent gets a work package, a role, a handoff queue entry, a scoped execution workspace, an authorized read/write surface, a proof obligation, and a review path. The human operator and the review authority remain responsible for direction and acceptance. The agents do bounded work. The state machine makes that work repeatable.
The control plane in this pattern lives in a coordination summary repository: the place where the review authority books work, tracks status, records review verdicts, keeps the work-package registry, updates the backlog, and maintains the project board. The control plane does not own implementation code. It owns coordination, evidence pointers, accepted summaries, and state transitions. This separation matters because mixing implementation payload into the control surface produces an oversized PR where governance and implementation compete for the same review attention. Keeping the coordination repository summary-only prevents that drift.
The handoff queue is the transport layer between the review authority and the agents. It is not chat, email, instant messaging, or a vague inbox. A handoff queue entry tells an agent who it is, which work package it owns, which files define the assignment, what state the entry is in, where the expected return belongs, and what the next gate is. A correct agent does not improvise from memory when the queue entry exists. It pulls the queue, reads the entry, opens the work-package and onboarding files, checks the current state, and either returns scope confirmation, executes under a review-gate go, handles a revision, files a status hold, or stops. This is the core anti-drift mechanism for multi-agent work.
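To make the queue-entry contract above concrete, here is a minimal sketch of an entry and of how an agent might pick its next step from the entry state. The field names, the state labels, and the `next_action` helper are all hypothetical illustrations, not the repository's template shape; the authoritative description is in docs/architecture/handoff-queue.md.

```python
from dataclasses import dataclass
from enum import Enum


class EntryState(Enum):
    # Hypothetical labels mirroring the outcomes named in the prose above.
    SCOPE_CONFIRMATION_REQUESTED = "scope_confirmation_requested"
    GO = "go"
    REVISION_REQUIRED = "revision_required"
    HOLD = "hold"
    CLOSED = "closed"


@dataclass(frozen=True)
class QueueEntry:
    agent_role: str                    # who the agent is
    work_package_id: str               # which work package it owns
    assignment_files: tuple[str, ...]  # files that define the assignment
    state: EntryState                  # current state of the entry
    return_path: str                   # where the expected return belongs
    next_gate: str                     # what the next gate is


def next_action(entry: QueueEntry) -> str:
    """Pick the next step from the entry state alone, never from chat memory."""
    actions = {
        EntryState.SCOPE_CONFIRMATION_REQUESTED: "return_scope_confirmation",
        EntryState.GO: "execute",
        EntryState.REVISION_REQUIRED: "handle_revision",
        EntryState.HOLD: "file_status_hold",
    }
    # Any state without an explicit action (for example CLOSED) means stop.
    return actions.get(entry.state, "stop")
```

Defaulting to "stop" for any unlisted state keeps the sketch in line with the fail-closed posture the rest of the pattern describes.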
The state machine is the executable center. It enforces what a human or review authority would otherwise have to enforce manually: queue-entry discovery, agent-identity selection, workspace provisioning, work-package loading, state-transition validation, stop-condition handling, review-request routing, and audit capture. The state machine is what turns a CLI-driven operating model into a runnable system. Without it, the system depends on humans repeatedly saying "check the queue" and agents interpreting that correctly. With it, the review authority can book work and the runner can trigger the correct scoped agent under the correct state.
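State-transition validation, one of the enforcement points listed above, can be sketched as a lookup table plus a guard. The state names and the table below are assumptions made for illustration; the legal transitions the pattern actually defines are documented in docs/architecture/state-machine.md.

```python
class IllegalTransition(Exception):
    """Raised when a requested state change is not in the transition table."""


# Hypothetical lifecycle states; the authoritative list is in the lifecycle docs.
LEGAL_TRANSITIONS = {
    "booked": {"scope_confirmed", "on_hold"},
    "scope_confirmed": {"executing", "on_hold"},
    "executing": {"in_review", "on_hold"},
    "in_review": {"accepted", "rejected", "blocked", "revision_required"},
    "revision_required": {"in_review"},
}


def transition(current: str, target: str, audit_log: list) -> str:
    """Validate a state change and record it; refuse anything not listed."""
    if target not in LEGAL_TRANSITIONS.get(current, set()):
        raise IllegalTransition(f"{current} -> {target} is not allowed")
    audit_log.append((current, target))  # audit capture on every accepted move
    return target
```

A rejected transition raises before anything is written to the audit log, so the log only ever contains moves the table allowed.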
The worker agent is intentionally constrained. An agent is not an autonomous actor with blanket repo access. It receives a role, a work package, an execution workspace, a branch, owned surfaces, forbidden surfaces, expected evidence tier, validation commands, and a stop gate. The agent may implement, research, migrate, or review only inside that boundary. If the work package says scope confirmation first, the agent stops after scope confirmation. If a review-gate go lands, the agent executes. If review returns revision required, the agent fixes only the named findings. This discipline lets multiple agents run in parallel without turning the repo into an uncontrolled shared scratchpad.
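One way to picture the owned-versus-forbidden boundary is a path check that runs before any write. This is only a sketch: the `may_write` helper and the glob-style surface patterns are assumptions, and the pattern does not prescribe how surfaces are expressed.

```python
from fnmatch import fnmatchcase


def may_write(path: str, owned: list[str], forbidden: list[str]) -> bool:
    """Allow a write only inside owned surfaces and never inside forbidden ones."""
    if any(fnmatchcase(path, pattern) for pattern in forbidden):
        return False  # forbidden surfaces always win over owned ones
    # Anything that matches no owned surface is denied by default.
    return any(fnmatchcase(path, pattern) for pattern in owned)


# Hypothetical surfaces for a single work package.
owned_surfaces = ["src/parser/*"]
forbidden_surfaces = ["src/parser/generated/*"]
```

With these example surfaces, a write to src/parser/lexer.py is allowed, while writes to src/parser/generated/tables.py or docs/index.md are refused.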
The implementation plane is the adopter's target repository — the home for code, proof trees, research roots, runtime evidence, and detailed return artifacts. The implementation plane should be allowed to be detailed and technical. It should contain enough evidence that a future reviewer can understand what was built, what was proven, what was not proven, and what must happen next. The control plane should receive links and commissioned summaries, not the full payload.
The CLI and API are two entry points into the same operating model. The CLI is the local operator surface: dispatching, checking the queue, applying verdicts, running substrate checks, and reviewing work. The API path is the programmatic entry point for the same actions. They must not become two different systems. A command that starts a runner, reads a queue entry, writes a candidate, or requests review should call the same underlying state-machine logic whether invoked from the CLI or through an API.
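The parity requirement can be illustrated with two thin entry points over one dispatch function. Everything named here (`request_review`, `dispatch`, `cli_main`, `api_handle`, the `runner` program name, and the returned dictionary shape) is hypothetical; the point of the sketch is only that neither entry point carries logic of its own.

```python
import argparse


def request_review(work_package_id: str) -> dict:
    """Shared operation (hypothetical): route a work package to review."""
    return {"work_package_id": work_package_id, "state": "in_review"}


# Every governed operation is registered once and reused by all entry points.
OPERATIONS = {"request-review": request_review}


def dispatch(action: str, work_package_id: str) -> dict:
    """Single code path behind both the CLI and the API."""
    if action not in OPERATIONS:
        raise ValueError(f"unknown action: {action}")
    return OPERATIONS[action](work_package_id)


def cli_main(argv: list[str]) -> dict:
    """CLI entry point: parse arguments, then delegate."""
    parser = argparse.ArgumentParser(prog="runner")
    parser.add_argument("action", choices=sorted(OPERATIONS))
    parser.add_argument("work_package_id")
    args = parser.parse_args(argv)
    return dispatch(args.action, args.work_package_id)


def api_handle(payload: dict) -> dict:
    """API entry point: unpack the request body, then delegate."""
    return dispatch(payload["action"], payload["work_package_id"])
```

Because both wrappers only translate input and call `dispatch`, a behavior change made in one place reaches both surfaces at once.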
The substrate is the memory and data plane. A transactional store owns transactional truth: candidate envelopes, run metadata, review status, policy versions, audit pointers, decisions, and consistent state. A vector store owns semantic retrieval over approved or candidate-indexed material, with tenant and policy filters. A first-party typed graph engine owns relationships, traversal, provenance, graph audit, and graph algorithms. These stores work together because agent work is not a single blob of text. It has records, semantic meaning, relationships, owners, provenance, review status, and time.
The candidate-write path is the safe bridge between agent work and durable memory. Agents write candidate records as they work, but those candidates are not canon. A candidate carries the principal, source, work package, policy version, evidence, redaction state, graph and vector projections, and review status. Local models and evaluators can help classify, score, compare, summarize, and explain candidate records. They can help commission, but they do not get to silently promote. Canon promotion belongs behind the Canon Gate and the accepted human review path.
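A candidate record and an advisory evaluator step might look roughly like the following. The field names are drawn from the list in the paragraph above, but the concrete types, the `attach_recommendation` helper, and the note format are assumptions for illustration only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Candidate:
    principal: str
    source: str
    work_package_id: str
    policy_version: str
    evidence: tuple[str, ...]
    redaction_passed: bool
    review_status: str = "candidate"        # a fresh write is never canon
    evaluator_notes: tuple[str, ...] = ()   # advisory output from local models


def attach_recommendation(candidate: Candidate, score: float, note: str) -> Candidate:
    """Record an evaluator's advice without touching review_status."""
    advisory = f"score={score:.2f}: {note}"
    return replace(candidate, evaluator_notes=candidate.evaluator_notes + (advisory,))
```

The evaluator can only add notes; there is deliberately no code path in this sketch through which it changes `review_status`.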
The Canon Gate is the policy boundary. It should fail closed when identity is missing, scope is wrong, evidence is insufficient, redaction fails, replay is detected, tenant boundaries are unclear, policy versions do not match, local-model confidence is insufficient, or human review is required. The Canon Gate is not just a function call; it is the system's answer to the question "when is an agent-produced thing allowed to become trusted state?" If the answer is unclear, the candidate stays a candidate.
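The fail-closed conditions listed above can be sketched as a function that collects every reason to refuse and commissions only when none remain. The `GateInput` fields, the `MIN_CONFIDENCE` threshold, and the failure labels are hypothetical; a real gate would draw them from policy rather than constants.

```python
from dataclasses import dataclass
from typing import Optional

MIN_CONFIDENCE = 0.8  # hypothetical threshold, not a value from the pattern


@dataclass(frozen=True)
class GateInput:
    principal: Optional[str]
    in_scope: bool
    evidence_count: int
    redaction_passed: bool
    replay_detected: bool
    tenant: Optional[str]
    candidate_policy_version: str
    active_policy_version: str
    model_confidence: float
    human_approved: bool


def gate_failures(g: GateInput) -> list[str]:
    """Return every reason the candidate must stay a candidate."""
    checks = [
        (not g.principal, "identity missing"),
        (not g.in_scope, "scope wrong"),
        (g.evidence_count < 1, "evidence insufficient"),
        (not g.redaction_passed, "redaction failed"),
        (g.replay_detected, "replay detected"),
        (not g.tenant, "tenant boundary unclear"),
        (g.candidate_policy_version != g.active_policy_version, "policy version mismatch"),
        (g.model_confidence < MIN_CONFIDENCE, "model confidence insufficient"),
        (not g.human_approved, "human review required"),
    ]
    return [reason for failed, reason in checks if failed]


def may_commission(g: GateInput) -> bool:
    """Fail closed: any single failure blocks promotion."""
    return not gate_failures(g)
```

Returning the full list of failures, rather than stopping at the first, gives the reviewer a complete picture of why a candidate was held.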
The local-model layer exists because memory tools that only append Markdown are not enough. Machine learning can help evaluate and commission memory: detecting duplicates, scoring evidence, comparing candidate facts, identifying disagreement, summarizing context, and surfacing risk. Local models should support judgment and review. They should not erase human ownership. The right pattern is model-assisted review, not model-owned truth.
The first-party graph engine is part of the pattern's core, not a placeholder. It should be built for agentic workloads from the beginning: typed nodes and edges, audit-aware writes, policy-aware traversal, replayable changes, bounded algorithms, tenant and principal checks, and configurable graph-analysis suites. The goal is not merely to store relationships. The goal is to make graph reasoning safe enough for agents to use during real work, while still preserving reviewability, provenance, and human accountability.
The review system is what keeps speed from becoming recklessness. Agents can work in parallel, but the review authority decides whether the result is accepted, rejected, blocked, or sent back for revision. Review artifacts must name what changed, what was validated, what was not touched, what evidence supports the claim, what risks remain, and what the next gate is. A review verdict is part of the system state, not a chat opinion.
The public surface — this repository — is a reflection of the pattern, not the source of truth. Public material explains the pattern safely; it does not leak private workflow details, customer or employer terms, raw discovery content, or unreviewed run state. The public material describes the architecture in a credible, reusable way.
The enterprise direction is the same pattern at larger scale. A person or team should be able to work with agents through a handoff queue, scoped work packages, governed memory, safe automation, review queues, and durable audit. A department should be able to capture workflow knowledge through natural conversation, sanitize it, identify safe automation opportunities, distinguish augmentation from replacement, and preserve human ownership. An organization should be able to introduce agents without losing control of identity, data boundaries, policy, auditability, or accountability. That is why this pattern is bigger than a local script runner.
The result the pattern aims for: a system where agents make humans faster and more accurate without pretending humans are optional. Agents prepare the work, surface context, check consistency, draft artifacts, run proofs, and write candidates. Humans decide, approve, correct, escalate, and own the outcome. The architecture exists to make that relationship durable.
The pattern carries a structural operating rule that adopters honor across every work package: implementation merges land in the implementation repository first, before the coordination summary repository records the lane as complete.
The coordination summary repository receives only:
- work-package summaries
- review verdicts
- links to implementation PRs and merge commits
- tracker, backlog, and project-board state
- commissioned documentation summaries
It does not receive full implementation code, proof trees, captured runtime artifacts, or research dumps.
This rule exists because mixing implementation payload into the coordination surface produces a control-plane PR that becomes too large and too mixed to review. Keeping coordination summary-only is the operating discipline that lets the control plane stay readable as the system scales.
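As a small illustration of the merge-order rule, a coordination-side check could refuse to mark a lane complete until an implementation merge SHA is recorded and the summary contains only the permitted kinds of content. The kind labels and the `can_record_lane_complete` helper are hypothetical names invented for this sketch.

```python
from typing import Optional

# Hypothetical labels for the five kinds of content the summary repo accepts.
ALLOWED_SUMMARY_KINDS = {
    "work_package_summary",
    "review_verdict",
    "implementation_link",
    "tracker_state",
    "documentation_summary",
}


def can_record_lane_complete(merge_sha: Optional[str], summary_kinds: set[str]) -> bool:
    """Require the implementation merge first, and a summary-only payload."""
    if not merge_sha:
        return False  # implementation has not merged yet
    # Any kind outside the allowed set (code, proof trees, dumps) blocks the record.
    return summary_kinds <= ALLOWED_SUMMARY_KINDS
```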
The pattern is not a narrow work-package runner. It is a governed agentic substrate that can start locally, support real agent work, and later map to cloud, team, and enterprise operation without changing the core operating model.
| Workstream | Required outcome |
|---|---|
| Repo-driven self-configuration | Agents read the repo, identify their role, queue entry, work package, allowed surfaces, stop conditions, and required proof shape without relying on chat as authority. |
| Bootstrap and harness | A repeatable startup path provisions the local runtime, loads configuration, checks dependencies, creates and validates execution workspaces, and enters the state machine. |
| CLI / API parity | The same governed operations are available through CLI and API. The CLI does not become a separate shadow system with different behavior. |
| Handoff-queue watcher and runner | The review authority books a queue entry and the system triggers the correct scoped agent using the right identity and credentials without manual "check the queue" prompts. |
| Local substrate | Local services are started, checked, stopped, and validated as a real substrate rather than a loose collection of files. |
| Data stores | A transactional store for source-of-truth records, a vector store for semantic retrieval, and a first-party graph engine for typed relationships, traversal, and provenance. |
| First-party graph engine | A purpose-built graph engine for agentic workloads: secure from inception, audit-aware, replayable, policy-aware, designed for graph algorithms that support agent work. |
| Candidate memory / write path | Agents write candidate records first; candidate records carry provenance, principal, policy version, evidence, local-model evaluator context, and review state. |
| Canon Gate | No candidate becomes commissioned canon unless it passes policy, evidence, redaction, identity, anti-replay, and human-review controls. |
| Local models and commissioning | Local-model evaluator support helps score, compare, classify, summarize, or commission candidates — but does not bypass review authority. |
| Human-in-the-loop ownership | Humans own judgment, approval, escalation, and accountability. Agents surface, prepare, compare, and execute bounded work. |
| Security and identity | Tenant boundaries, principal binding, least privilege, redaction, anti-replay, audit chain, override logging, and fail-closed defaults are first-class. |
| Testing and validation | Local command proof, runtime proof, stress tests, abuse tests, replay tests, rollback tests, redaction tests, and human-review escalation tests all matter. |
| Cloud portability | Local-first does not mean local-only. The architecture maps to cloud execution without rewriting the operating model. |
| Public-safe reflection | Public surface material comes after the private implementation is real and sanitized. Public-facing material does not leak private names, private workflows, or raw run state. |
The pattern preserves a clear separation of responsibilities across four planes.
flowchart LR
subgraph controlPlane["Control plane: coordination summary repo"]
humanOwner["Human owner / review authority"]
queue["Handoff queue"]
registry["Work-package registry"]
tracker["Tracker / backlog / board"]
end
subgraph runtimePlane["Execution plane: runtime"]
runner["Agent runner"]
machine["State machine"]
agent["Scoped worker agent"]
workspace["Isolated execution workspace"]
end
subgraph implPlane["Implementation plane: implementation repo"]
code["Implementation code"]
proofs["Proof trees"]
research["Research roots"]
handoff["Return artifact"]
end
subgraph reviewPlane["Review and integration"]
implpr["Implementation PR"]
verdict["Queue review verdict"]
coordpr["Coordination summary PR"]
fold["Tracker / backlog / registry / board fold"]
end
humanOwner --> queue
humanOwner --> registry
humanOwner --> tracker
queue --> runner
registry --> runner
runner --> machine
machine --> agent
agent --> workspace
workspace --> code
workspace --> proofs
workspace --> research
code --> implpr
proofs --> implpr
research --> implpr
handoff --> coordpr
implpr --> verdict
coordpr --> verdict
verdict --> fold
fold --> tracker
style humanOwner fill:#0f172a,stroke:#38bdf8
style queue fill:#0f172a,stroke:#38bdf8
style registry fill:#0f172a,stroke:#38bdf8
style tracker fill:#0f172a,stroke:#38bdf8
style runner fill:#172554,stroke:#60a5fa
style machine fill:#172554,stroke:#60a5fa
style agent fill:#172554,stroke:#60a5fa
style workspace fill:#172554,stroke:#60a5fa
style code fill:#312e81,stroke:#a5b4fc
style proofs fill:#312e81,stroke:#a5b4fc
style research fill:#312e81,stroke:#a5b4fc
style handoff fill:#312e81,stroke:#a5b4fc
style implpr fill:#3f1d1d,stroke:#f97316
style verdict fill:#3f1d1d,stroke:#f97316
style coordpr fill:#3f1d1d,stroke:#f97316
style fold fill:#3f1d1d,stroke:#f97316
The state machine is the center of gravity. The handoff queue is transport. The work package is scope. The execution workspace is isolation. The PR is review and merge. The substrate is runtime memory and evidence. The review authority is the acceptance gate.
No worker infers permission from chat when the queue entry says otherwise. No implementation lane claims completion because a coordination summary PR merged. No candidate memory becomes canon because an agent wrote it. No public surface becomes the dumping ground for private implementation state.
The full state machine and the per-component depth live under docs/architecture/.
| Path | Purpose today |
|---|---|
| README.md | This file: pattern anchor + documentation index |
| docs/architecture/ | Long-form per-component architecture: overview, state machine, runner modes, handoff queue, candidate writes and Canon Gate, substrate, audit and replay, identity and security, human ownership, cloud-portable shape, glossary |
| docs/diagrams/ | Mermaid sources for the architecture diagrams used across docs/architecture/ and this README |
| docs/architecture.md | The v1 seven-component workflow model and lifecycle diagram |
| docs/operating-model.md | The four-role table, scopes, and handoff rules |
| docs/lifecycle-and-state.md | The state-transition graph and per-state required artifacts |
| docs/vocabulary.md | Vocabulary used across the pattern, plus extension guidance for adopters |
| docs/private-vs-public.md | What stays in an adopter's coordination repository vs what becomes adopter-public |
| templates/work-package-template.md | A bounded unit-of-work template with the eight required sections |
| templates/handoff-message-template.md | The handoff-queue message shape |
| templates/return-artifact-template.md | The worker return-artifact shape |
| templates/review-response-template.md | The review-response shape and verdict vocabulary |
| examples/oss-library-maintenance/ | One synthetic end-to-end walkthrough applying the pattern to a fictional open-source library maintenance scenario |
| ROADMAP.md | What v1 ships, what does NOT ship at v1, and v1.1+ candidates |
The pattern uses an honest evidence ladder. Adopters describe what they have proven using a defined classification, not by overclaiming.
| Evidence class | Meaning | Acceptable claim |
|---|---|---|
| Source trace | Repo files, work-package specs, prior accepted artifacts, and PR state were inspected. | "The current repo says X." |
| Static proof | A diff, inventory, collision map, or structural check proves a bounded property. | "This migration plan is additions-only and names the collision." |
| Local command proof | A local command ran and logs were captured. | "This harness passed locally under these inputs." |
| Runtime proof | Services actually started or a substrate path executed. | "The transactional store / vector store / graph path ran for this scenario." |
| Stress / abuse proof | Negative paths, replay, duplicate write, tenant boundary, rollback, and redaction cases ran. | "This path fails closed under the tested abuse cases." |
| Merge proof | The implementation PR merged and the merge SHA is recorded. | "This implementation is integrated into the target repo." |
| Not proven | The lane names the gap and does not overclaim. | "Cloud runtime is not proven in this lane." |
These categories do not collapse. A static plan is not runtime proof. A local stub is not a local model. A coordination summary merge is not target-repo code integration. A candidate write is not canon.
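The non-collapsing rule can be sketched as a check that each claim names the exact evidence class it needs and that the lane actually holds that class. The `Evidence` enum mirrors the table above; the `unsupported_claims` helper and the example claims are assumptions for illustration.

```python
from enum import Enum


class Evidence(Enum):
    SOURCE_TRACE = "source trace"
    STATIC_PROOF = "static proof"
    LOCAL_COMMAND_PROOF = "local command proof"
    RUNTIME_PROOF = "runtime proof"
    STRESS_ABUSE_PROOF = "stress / abuse proof"
    MERGE_PROOF = "merge proof"


def unsupported_claims(claims: dict[str, Evidence], held: set[Evidence]) -> list[str]:
    """List claims whose required evidence class the lane does not hold.

    Matching is exact: holding one class never stands in for another.
    """
    return [claim for claim, required in claims.items() if required not in held]


# Hypothetical lane: a static plan and a local run, but no runtime execution.
held_evidence = {Evidence.STATIC_PROOF, Evidence.LOCAL_COMMAND_PROOF}
lane_claims = {
    "migration plan is additions-only": Evidence.STATIC_PROOF,
    "graph path ran for this scenario": Evidence.RUNTIME_PROOF,
}
```

In this example the runtime claim would be reported as unsupported, which is the cue to move it into the lane's "Not proven" row rather than leave it in the body.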
The pattern asks adopters to write major PR bodies, READMEs, work-package returns, and proof summaries to a reviewer-complete bar. The standard is closer to a technical advisory or architecture review than a normal short changelog.
A reviewer-complete artifact is one where a reviewer can read the artifact and answer every load-bearing question without needing chat context, side-channel files, or a separate explanation. If the reviewer has to ask "what did this actually do?", "where is the evidence for that claim?", "what's not proven here?", or "what's the next step?", the artifact is not reviewer-complete. The body answers all of those upfront.
The standard expects:
- Long-form prose. Not bullet-only changelogs. Reviewers need orientation, not just a list of touched files. The executive judgment section is prose; the architecture-in-prose section is prose; the non-claims section is prose. State tables and evidence registers complement the prose; they do not replace it.
- State tables. A status table at the top, a runtime / source anchor table, a current-state truth table, a risk and residual register. These give a reviewer the structured anchors they need to verify drift between PR body, queue, and repo state.
- Professional diagrams. When architecture, data flow, state flow, graph flow, queue flow, or review flow is central to the artifact, include a GitHub-renderable Mermaid diagram. Use conservative syntax that renders reliably; visually clever but brittle diagrams that fail to render are worse than no diagram. Follow the diagram with prose explaining what each component owns.
- Evidence registers with specific pointers. "Tests pass" is not evidence. Name the command, the captured log, the expected output, and the interpretation. Pointers to commands, logs, fixtures, PR URLs, and merge SHAs are what let a reviewer verify rather than trust.
- Explicit non-claims. State what the artifact does not prove. Non-claims are not defensive writing; they are how the system avoids drift. A reviewer who sees an explicit "this does not authorize unreviewed canon promotion" knows the lane has been thought through.
- Public / private hygiene. State what must not move to public surfaces. Run hygiene scans against blocked external-model literals, private-source identifiers, customer or employer terms, internal control-plane vocabulary, AI-attribution trailers, and IP-sensitive filenames.
- Reviewer-complete context. Everything a reviewer needs to evaluate the lane in isolation is in the artifact. Chat-context dependencies are findings; the artifact stands on its own.
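The hygiene scan mentioned in the list above could be as simple as a line-by-line pattern check. The two patterns below are placeholders invented for this sketch; an adopter would maintain its own blocklist covering the categories named in the hygiene bullet.

```python
import re

# Hypothetical blocklist; real patterns are adopter-specific.
BLOCKED_PATTERNS = {
    "private-source-identifier": re.compile(r"\bINTERNAL-[A-Z0-9]+\b"),
    "ai-attribution-trailer": re.compile(r"^Generated-by:", re.IGNORECASE),
}


def hygiene_findings(text: str) -> list[tuple[int, str]]:
    """Return (line number, rule name) for every blocked pattern found."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in BLOCKED_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule))
    return findings
```

Reporting the line number alongside the rule name gives the reviewer a specific pointer, in keeping with the evidence-register bullet.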
The standard scales with the lane. A minor doc fix does not need 600 lines of body. A substrate change, a Canon Gate change, a graph-engine change, a security boundary change, a major migration, a validation tier landing — these all need the full long-form treatment.
There is nothing to run. To use the pattern: read docs/operating-model.md and docs/lifecycle-and-state.md for the v1 framing, then walk through the long-form docs/architecture/ tree for component depth, then read examples/oss-library-maintenance/ for a complete one-cycle illustration.
The three sibling canonical repositories are out of scope here. Cross-references are descriptive only; this repository does not import or deploy them.
- production-rag-eval-harness: retrieval-quality evaluation (the home for retrieval implementation depth).
- agent-runtime-observability: governed agent runtime (the home for agent-runtime implementation depth).
- aws-bedrock-iac-reference: AWS / Bedrock infrastructure-as-code reference (the home for cloud-side architecture depth).