From 1686ca93d42a1ed5cb4294781d5c879db63e9cec Mon Sep 17 00:00:00 2001
From: Ramesh Ayyagari <rayyagari2@gmail.com>
Date: Thu, 28 May 2026 13:50:11 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20Sprint=203=20public=20docs=20=E2=80=94?=
 =?UTF-8?q?=20task-risk-profile,=20d1-d4-scoring,=20execution-substrates,?=
 =?UTF-8?q?=20competitive-landscape;=20fix=20D3=20naming?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 calibration/d1-d4-rubric.md   |   6 +-
 docs/competitive-landscape.md | 127 +++++++++++++++
 docs/d1-d4-scoring.md         | 248 +++++++++++++++++++++++++++++
 docs/execution-substrates.md  | 184 ++++++++++++++++++++++
 docs/glossary.md              |   2 +-
 docs/task-risk-profile.md     | 285 ++++++++++++++++++++++++++++++++++
 6 files changed, 848 insertions(+), 4 deletions(-)
 create mode 100644 docs/competitive-landscape.md
 create mode 100644 docs/d1-d4-scoring.md
 create mode 100644 docs/execution-substrates.md
 create mode 100644 docs/task-risk-profile.md

diff --git a/calibration/d1-d4-rubric.md b/calibration/d1-d4-rubric.md
index 585cf76..224fd27 100644
--- a/calibration/d1-d4-rubric.md
+++ b/calibration/d1-d4-rubric.md
@@ -7,7 +7,7 @@ weighted dimensions, each scored 0-25:
 |-----------------|----------------------------------------------------|-----|
 | **D1 Correctness**  | Did the agent produce correct output on first attempt? | 25  |
 | **D2 Observability** | Did the agent emit enough telemetry, logs, and intermediate state to verify what it did? | 25  |
-| **D3 Compliance** | Did the agent follow the rules of the role scope, approvals, manifest use? | 25  |
+| **D3 Policy** | Did the agent follow the rules of the role: scope, approvals, manifest use? | 25  |
 | **D4 Recurrence** | Did the agent repeat a prior mistake from its own failure library? | 25  |
 | **Total**       |                                                    | 100 |
 
@@ -38,7 +38,7 @@ Acceptable:
 ```
 D1 Correctness:  25 7/7 acceptance criteria met on first QA attempt, no rework
 D2 Observability: 22 bulletin entries at all 4 phase transitions, one missing handoff log
-D3 Compliance:   25 pre-spawn protocol followed; manifest matches ACs
+D3 Policy:       25 pre-spawn protocol followed; manifest matches ACs
 D4 Recurrence:   25 no known pattern repeated; novel task class
 Total:           97 → STANDARD (confidence band: LOW, n=6)
 ```
@@ -152,7 +152,7 @@ being trustworthy.
 
 ---
 
-## D3 Compliance
+## D3 Policy
 
 **What it measures:** did the agent follow the rules of its role?
 
diff --git a/docs/competitive-landscape.md b/docs/competitive-landscape.md
new file mode 100644
index 0000000..0104262
--- /dev/null
+++ b/docs/competitive-landscape.md
@@ -0,0 +1,127 @@
+# Competitive Landscape
+
+A short note on where AWF sits relative to the four systems it is most
+often confused with. The pattern across all four: each operates at a
+different layer of the agent stack from AWF, and its relationship to
+AWF is more often complementary than competing.
+
+For the layering model that grounds this document, see
+[`docs/execution-substrates.md`](execution-substrates.md).
+
+---
+
+## Maggy
+
+| Field | Value |
+|---|---|
+| **Category** | AI engineering command centre and execution substrate. |
+| **Relationship to AWF** | Execution layer beneath the authority layer. AWF authorizes over Maggy the same way it authorizes over Claude Code, Codex or Cursor. |
+| **AWF distinction** | Maggy answers "how is this work driven". AWF answers "who is allowed to do this work". |
+
+A buyer can adopt Maggy for engineering execution workflow and AWF for
+cross-runtime authorization. The two compose: Maggy drives the work,
+AWF decides whether Maggy is allowed to drive it on behalf of a given
+trust subject for a given task class.
+
+Maggy is part of the substrate roster AWF designed adapters for. An
+adapter ships post-Sprint 3 on pilot demand. Sprint 3 covers Maggy at
+the design-document level only.
+
+---
+
+## Microsoft AGT
+
+| Field | Value |
+|---|---|
+| **Category** | Runtime security enforcement layer. |
+| **Relationship to AWF** | Different layer. AGT is permission-check infrastructure at the moment of action. AWF is authority-record infrastructure across sessions and substrates. |
+| **AWF distinction** | AGT decides whether the next tool call is permitted, now, deterministically. AWF decides whether the trust subject behind the agent has earned authority over this class of work, over time. |
+
+AGT and AWF are complementary, not substitutes. AGT enforces a static
+policy at the moment of execution. AWF enforces a trust trajectory
+across sessions: a trust subject can be permitted by AGT to run a
+query yet `BLOCKED` by AWF because that trust subject has not
+accumulated enough D1 to D4 evidence on `db_migration` to earn the
+required tier.
+
+A production deployment with both layers gets the strengths of each:
+deterministic permission checks at action time from AGT and earned
+authority across sessions from AWF. Neither layer's job is the other's.
+
+For more on how AGT fits the broader governance stack, see
+[`docs/architecture/three-layer-stack.md`](architecture/three-layer-stack.md).
+
+---
+
+## Superlog
+
+| Field | Value |
+|---|---|
+| **Category** | Application observability for AI applications. |
+| **Relationship to AWF** | Different domain. Superlog instruments the application; AWF governs authority over the agents that produced the application's behaviour. |
+| **AWF distinction** | Superlog answers "what did the application do at runtime". AWF answers "was the agent that built this allowed to build it". |
+
+Superlog occupies the same conceptual slot for AI applications that
+Datadog or Honeycomb occupy for traditional services. It is consumed
+*after* an agent has shipped code. AWF is consumed *before* the agent
+runs, with the feedback loop closing only when telemetry from the
+running application feeds back into the agent's trust signal via
+Process Intelligence (Sprint 4).
+
+The two systems can compose. Superlog's runtime signal becomes input
+to AWF's trust-update logic. The signals flow in one direction, from
+runtime back to authority. There is no overlap at the authority layer
+itself.
+
+---
+
+## Pentagon
+
+| Field | Value |
+|---|---|
+| **Category** | Agent team workspace and execution layer. |
+| **Relationship to AWF** | Execution layer, similar to Maggy. AWF authorizes over Pentagon, not against it. |
+| **AWF distinction** | Pentagon answers "where does an agent team coordinate, share state and hand off work". AWF answers "is this team's chosen runtime authorized for this class of work". |
+
+Pentagon is a workspace product. Multiple agents and their human
+operators live inside it and pass work back and forth. The runtime
+authorization question still applies: when a Pentagon-resident agent
+attempts a task class, *that agent's trust subject* is what AWF scores
+authority against. Pentagon is the surface. AWF is the authority
+record behind the surface.
+
+A buyer can adopt Pentagon for cross-agent collaboration and AWF for
+the authority layer that decides what those agents are allowed to do.
+Same complementary pattern as the others.
+
+---
+
+## The general shape
+
+The four systems above are confused with AWF for the same reason. Each
+operates in or near the agent stack. Each speaks the language of
+governance or accountability. Each ships an artifact that *looks* like
+an authority decision from a distance.
+
+Up close they are not. The test is the question each system answers.
+
+| System | Question it answers |
+|---|---|
+| Maggy, Pentagon, Claude Code, Codex, Cursor | Can this work be done? And if so, how? |
+| Microsoft AGT | Is this specific tool call permitted, right now? |
+| Superlog | What did the application do at runtime? |
+| AWF | Has the agent's trust subject earned authority over this class of work, on this substrate? |
+
+Different question, different system. AWF composes with all of them.
+It replaces none.
+
+---
+
+## Related
+
+- `docs/execution-substrates.md`: the layering model that places each
+  of the systems above on its appropriate layer.
+- `docs/architecture/three-layer-stack.md`: runtime governance vs
+  scheduled automation vs behavioural accountability. The framing AGT
+  fits inside.
+- AWF Sprint Plan v4.4.2, Competitive Landscape: Maggy.
diff --git a/docs/d1-d4-scoring.md b/docs/d1-d4-scoring.md
new file mode 100644
index 0000000..618de84
--- /dev/null
+++ b/docs/d1-d4-scoring.md
@@ -0,0 +1,248 @@
+# D1 to D4: Trust Scoring Dimensions
+
+D1, D2, D3 and D4 are the four dimensions used to score a single agent
+session. They are not tiers. A dimension score measures one session's
+behaviour. A trust tier is an authority level the trust subject has
+earned over many sessions.
+
+This document exists to lock that distinction. Earlier AWF documentation
+sometimes blurred it: the dimension D3 was occasionally referred to as
+the same kind of thing as the tier RESTRICTED. It is not. v4.4.2
+separates them permanently.
+
+> **D1 to D4 score a session. Trust tiers authorize a future session.**
+
+That sentence is the whole point of this document. Everything below is
+the structure that makes it true.
+
+---
+
+## What each is
+
+| Concept | Type | Scale | Lifecycle |
+|---|---|---|---|
+| D1 to D4 dimensions | Per-session measurement | Each 0 to 25; sum 0 to 100 | Computed once per session, after QA closes |
+| Trust tier | Per-(subject, task class, runtime) authority level | One of `PROVISIONAL`, `RESTRICTED`, `STANDARD`, `HIGH` | Persisted across sessions; updated as D1 to D4 evidence accumulates |
+
+The dimensions describe *what happened*. The tier describes *what the
+trust subject is allowed to do next*. The authorization decision at the
+start of a session reads the tier. It does not recompute the dimension
+scores.
+
+A session that scores 100 on D1 to D4 does not, on its own, promote a
+trust subject from `STANDARD` to `HIGH`. Promotion is gated by the
+confidence band, which is a function of session count. A perfect score
+from a single session (n=1) is provisional evidence at best.
+
+---
+
+## D1 Correctness
+
+**What it measures.** Did the agent produce correct output on the first
+QA attempt?
+
+The signal is *first attempt*. An agent that needed three QA rounds to
+land a task is materially different from one that landed it cleanly,
+even if the final diff is identical.
+
+**Evidence inputs.**
+- QA verdict (pass / pass_with_notes / fail).
+- Per-acceptance-criterion pass or fail.
+- Number of QA rounds before pass.
+- Post-merge defects traceable to the session.
+
+**Raises D1.** Zero rework. Every acceptance criterion green on the
+first QA pass. No structural fixes during the session.
+
+**Lowers D1.** Rework loops. Acceptance criteria initially missed.
+Structural changes during fix-up. Post-merge defects attributable to
+the session.
+
+**Hard-stop at 0.** The output, taken at face value, would have caused
+production harm. A SQL change that would corrupt data. An auth check
+silently removed. The score reflects what the agent produced, not what
+the safety net caught.
+
+---
+
+## D2 Observability
+
+**What it measures.** Did the agent emit enough telemetry, logs and
+intermediate state for an observer to reconstruct the session from logs
+alone?
+
+D2 is the dimension that protects against silent execution. An agent
+that produces correct output without telemetry is not trustworthy at
+scale. The next time it produces wrong output, nobody will know until
+production breaks.
+
+**Evidence inputs.**
+- Bulletin entries at every phase transition.
+- Tool-use audit log entries.
+- Handoff logs between agent roles.
+- Task manifest completion.
+- Any traces left in the artifacts.
+
+**Raises D2.** A continuous timeline from spawn to QA close. Every tool
+use accounted for. Every handoff explicit. No silent regions.
+
+**Lowers D2.** Missing phase transitions. Tool uses with no audit-log
+record. Reviewer has to infer what the agent did.
+
+**Hard-stop at 0.** Falsified telemetry. A bulletin entry that claims a
+state the artifact contradicts. This triggers automatic demotion to
+`RESTRICTED` regardless of any other dimension score. Promotion back
+requires at least five clean sessions plus a second-scorer review. The
+framework's audit story collapses if the audit log itself can lie.
+
+---
+
+## D3 Policy
+
+**What it measures.** Did the agent operate within the rules of its
+role?
+
+D3 covers scope respect, approval gates, manifest discipline and any
+role-specific rule (a backend agent never modifying frontend files;
+a QA agent never editing the artifact under review).
+
+> **Naming note.** Earlier AWF documentation referred to this dimension
+> as "D3 Compliance". v4.4.2 standardises on **D3 Policy** as the
+> canonical shorthand. The meaning is unchanged. The long form is
+> "policy compliance" and the schema field remains `d3`.
+
+**Evidence inputs.**
+- Pre-spawn protocol completion.
+- Files in scope versus files actually modified.
+- Approval gates fired versus gates expected.
+- Lock acquisitions matched to file edits.
+- Override flag usage (each use is a deduction).
+
+**Raises D3.** Zero violations. Every hook passed legitimately. Every
+required approval obtained before the action.
+
+**Lowers D3.** Scope drifts. Missed approval gates. Edits to locked
+files without the correct procedure. Override flags used to push past
+checks the agent could have respected.
+
+**Hard-stop at 0.** Any of the following:
+- Hook bypass with an override marker the operator did not authorize.
+- Unauthorized commit. Agent committed code without operator approval where approval was required.
+- Editing files outside declared scope without surfacing the change.
+
+Hard-stop triggers immediate reviewer escalation.
+
+---
+
+## D4 Recurrence
+
+**What it measures.** Did the agent repeat a known failure pattern from
+its own failure library?
+
+D4 turns failure memory from an archive into a behaviour shaper. An
+agent that makes a novel mistake costs one failure record. An agent
+that repeats a known failure costs trust.
+
+**Evidence inputs.**
+- Failure library state at session start.
+- Pre-task retrieval log: the `FAILURE-LIB` bulletin entries the orchestrator writes before spawn.
+- Post-session diff: did this session create a recurrence?
+- Failure record taxonomy class.
+
+**Raises D4.** No known pattern repeated. Or: agent came close to a
+known pattern and self-corrected, with the catch visible in the
+bulletin.
+
+**Lowers D4.** Repeated a pattern that was in the failure library and
+discoverable. Repeated a pattern that was specifically named in the
+session's instructions.
+
+**Hard-stop at 0.** The repeated pattern was explicitly listed in the
+session's instruction file. The agent had every signal and ignored it.
+A failure library entry update is mandatory: the existing pattern entry
+must be promoted (`recurrenceCount` incremented, prevention rule
+re-evaluated).
+
+---
+
+## How dimensions become tiers
+
+A session score is one data point. A trust tier is a position the trust
+subject occupies for the next session's authorization decision.
+
+The path from dimension scores to tier is governed by two things at
+once:
+
+1. **The session-score total.** D1 + D2 + D3 + D4, summed 0 to 100.
+2. **The confidence band.** A function of session count.
+
+A single 95/100 session does not promote a trust subject from
+`STANDARD` to `HIGH`. A 95/100 average across many sessions with a
+high confidence band does.
+
+Demotion is more responsive. A hard-stop on D2 or D3 demotes
+immediately, on the basis that one event of falsified telemetry or
+unauthorized commit changes the trust calculation regardless of prior
+session count.
+
+The full tier-progression rules live in the autonomy gates and trust
+scoring documentation. This document's job is the upstream one: making
+sure the dimensions that feed those rules are unambiguous.
+
+---
+
+## Trust tiers, briefly
+
+Authority order (lowest to highest):
+
+```
+PROVISIONAL  <  RESTRICTED  <  STANDARD  <  HIGH
+```
+
+- **PROVISIONAL.** Unproven. Observe and propose only.
+- **RESTRICTED.** Known limits. Executes low-risk work with controls.
+- **STANDARD.** Baseline execution authority. The default working tier.
+- **HIGH.** Expanded authority for proven trust subjects.
+
+Tiers are keyed on `(trust_subject_id, task_class, runtime_provider)`.
+A trust subject can be `HIGH` for UI refactor work on Cursor while
+remaining `PROVISIONAL` for database migrations on Codex. Authority is
+earned per task class and per runtime, not in aggregate.
+
+The Runtime Authorization Decision consults the tier. It does not
+recompute the dimension scores. The decision runs fast precisely
+because the trust capability profile already encodes the history.
+
+---
+
+## What this means in practice
+
+When you read a Trust Capability Profile, you are reading a record of
+authority. When you read a Trust Score, you are reading the D1 to D4
+evidence that fed into that authority. The two are linked but they are
+not the same artifact.
+
+When you read a Runtime Authorization Decision, the field
+`current_trust_tier` is the authority. The "D1 to D4 evidence summary"
+section is the supporting trail.
+
+A reviewer auditing a decision works in this order:
+1. Was the right tier consulted?
+2. Was the right risk lane derived?
+3. Did the comparison rule (tier vs required-tier) fire correctly?
+4. Are the dimension scores backing the tier plausible against the
+   linked sessions?
+
+Dimensions feed tiers. Tiers feed decisions. Decisions feed audit. The
+shape moves in one direction.
+
+---
+
+## Related
+
+- `schemas/v1/trust-score.schema.json`: per-session score envelope.
+- `schemas/v1/trust_capability_profile.schema.json`: earned tier per (subject, task class, runtime).
+- `schemas/v1/runtime_authorization_decision.schema.json`: the per-task authorization output.
+- `calibration/d1-d4-rubric.md`: the score-band rubric and procedural scoring guide.
+- `docs/task-risk-profile.md`: the input that decides what tier the decision *requires*.
+- AWF Sprint Plan v4.4.2, Terminology section.
diff --git a/docs/execution-substrates.md b/docs/execution-substrates.md
new file mode 100644
index 0000000..8635709
--- /dev/null
+++ b/docs/execution-substrates.md
@@ -0,0 +1,184 @@
+# Execution Substrates and the Authority Layer
+
+AWF is not an execution platform. AWF does not spawn agents, route
+subagents, generate code, run tests or capture tool-call telemetry.
+Those concerns belong to *execution substrates*. AWF is the authority
+layer above them.
+
+This document explains the distinction, why it matters and how the
+three layers fit together.
+
+---
+
+## The two questions
+
+Execution and authority answer different questions. Conflating them
+collapses important structure.
+
+> **Execution substrates answer: "Can the agent do the work?"**
+>
+> **AWF answers: "Has the agent earned authority to do the work?"**
+
+A substrate can be perfectly capable of running a database migration.
+Capability does not imply authority. AWF's job is to decide whether the
+trust subject mapped to that substrate has accumulated enough D1 to D4
+evidence on that task class to be authorized. If not, AWF either
+rejects the work, escalates for human approval or recommends a different
+substrate that has earned the authority.
+
+These are different layers of the system. They run as separate code and
+produce different artifacts. Capability is observed at execution time.
+Authority is consulted at authorization time, which happens before
+execution begins.
+
+---
+
+## The three-layer stack
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│  Authority layer: AWF / TrustPlane                          │
+│  Decides who is allowed to do what, under which controls,   │
+│  and records the decision in the audit log.                 │
+└─────────────────────────────────────────────────────────────┘
+                            ▲
+                            │ authorizes
+                            ▼
+┌─────────────────────────────────────────────────────────────┐
+│  Execution substrate layer                                  │
+│  Maggy, Claude Code, Codex, Cursor, Devin, LangGraph.       │
+│  Drives the work. Spawns agents, routes subagents, captures │
+│  events. Answers "can this be done."                        │
+└─────────────────────────────────────────────────────────────┘
+                            ▲
+                            │ executes via
+                            ▼
+┌─────────────────────────────────────────────────────────────┐
+│  Runtime / model layer                                      │
+│  Claude, GPT, Gemini, Qwen.                                 │
+│  The underlying LLM that produces tokens.                   │
+└─────────────────────────────────────────────────────────────┘
+```
+
+Each layer is independent. AWF is model-agnostic and substrate-agnostic.
+A substrate can swap its underlying model without AWF noticing. AWF
+authorizes the substrate, not the model.
+
+---
+
+## What an execution substrate is
+
+An execution substrate is any system that takes a task description and
+drives it to completion through one or more agents. The shape varies.
+Some are CLIs (Claude Code, Codex, Aider). Some are IDE integrations
+(Cursor, Continue). Some are end-to-end engineering platforms (Devin,
+Maggy). Some are agent runtime frameworks (LangGraph, AutoGen).
+
+What they have in common:
+
+- They spawn the agents that produce the artifact.
+- They are the source of tool-call telemetry.
+- They define the local execution model: how subagents are dispatched, how state is tracked, how the session terminates.
+- They can be observed and authorized over by an external authority layer.
+
+What they are not:
+
+- They are not the authority over their own agents at scale. Each
+  substrate has its own internal policy surface (Claude Code's
+  permission modes, Cursor's safety settings, Devin's approval rules).
+  None of them produce a cross-substrate authorization decision keyed
+  on what *this trust subject* has earned across *all* the substrates
+  it has worked through.
+
+That last gap is the one AWF fills.
+
+---
+
+## What AWF is
+
+AWF is the authority layer above substrates. For every task, it decides
+whether the candidate runtime has earned the right to do the work. The
+artifacts:
+
+- **Task Risk Profile.** A per-task score on five dimensions, summing
+  to a composite that maps to a risk lane.
+- **Trust Capability Profile.** A persisted record per
+  `(trust_subject_id, task_class, runtime_provider)` of earned authority,
+  evidence strength and session count.
+- **Runtime Authorization Decision.** The per-task output. `AUTHORIZED`,
+  `SUPERVISED` (authorized with controls) or `BLOCKED`, with an audit
+  event recording why.
+
+AWF does not generate the code. The substrate does. AWF decides whether
+the substrate is allowed to. The substrate then executes (or does not),
+and the result feeds back into the trust capability profile that gates
+the next decision.
+
+The shape: authorize, execute, observe, update authority. Every loop
+moves the trust signal forward.
+
+---
+
+## Why this layering matters
+
+**For buyers.** Adopting Maggy, Cursor or Devin does not displace AWF.
+Those are the execution substrates AWF was designed to authorize over.
+A buyer who has standardised on Cursor for engineering work still
+needs an answer to "which trust subject is allowed to invoke Cursor on
+this task class and under which controls". AWF is the layer that
+answers it.
+
+**For builders.** A new substrate enters the market every quarter. AWF
+absorbs that change rate through an adapter pattern. The substrate's
+events get translated into AWF's canonical audit event shape, the
+substrate's identity becomes a `trust_subject_id` and the substrate
+gets a row per task class in the trust capability profile. The
+authority layer does not need to know the substrate's internal model
+of agents.
+
+**For regulators.** Per-event `user_id` carried on every audit event
+answers the question regulators actually ask: which human authorized
+the agent to do this? Substrates produce the events. The authority
+layer enforces that the events are correctly shaped, joined and
+retained.
+
+---
+
+## Layer boundaries in practice
+
+A few rules that keep the boundaries clean.
+
+**AWF does not call substrate-internal APIs to make decisions.** The
+authority decision is made on AWF's own state: the
+`trust_capability_profile` table, the `task_risk_profile` for the
+request and policy config. The substrate's internal trust signals, if
+any, are not consulted at authorization time. They may feed in later
+via Process Intelligence (Sprint 4), but only as one input among many.
+
+**Substrates do not write to `trust_scores` directly.** A substrate
+emits events. The Eval/Telemetry Service is the only writer to
+canonical trust tables. This rule is the same one AWF applies to its
+own internal components.
+
+**The runtime/model layer is not addressable.** AWF does not produce
+authorizations against Claude or GPT directly. It produces
+authorizations against the substrate that runs the model. If a
+substrate switches its underlying model, the trust history travels
+with the substrate, not with the model.
+
+---
+
+## Related
+
+- `docs/architecture/three-layer-stack.md`: a different three-layer
+  lens, oriented around governance concerns (runtime governance,
+  scheduled automation, behavioural accountability). Complementary to
+  the authority-of-substrate stack described here.
+- `docs/architecture/four-plane-model.md`: the four-plane operating
+  model AWF runs inside (workforce, autonomy, control, automation).
+- `docs/task-risk-profile.md`: the per-task input to authorization.
+- `docs/d1-d4-scoring.md`: the per-session evidence that updates trust
+  tiers.
+- `docs/competitive-landscape.md`: where individual substrates sit
+  relative to AWF.
+- AWF Sprint Plan v4.4.2, Strategic Refinement and Three-Layer Stack.
diff --git a/docs/glossary.md b/docs/glossary.md
index 6fded27..bcf038c 100644
--- a/docs/glossary.md
+++ b/docs/glossary.md
@@ -111,7 +111,7 @@ state transitions sufficiently for the session to be reconstructed.
 D2 = 0 (falsified telemetry) is a categorical demotion to PROBATION.
 Defined in [concepts/trust-scoring.md](concepts/trust-scoring.md).
 
-**D3 Compliance**
+**D3 Policy**
 Third trust-scoring dimension. Measures whether the agent operated within
 policy: no hook bypass, no unauthorized commits, no out-of-scope writes.
 Defined in [concepts/trust-scoring.md](concepts/trust-scoring.md).
diff --git a/docs/task-risk-profile.md b/docs/task-risk-profile.md
new file mode 100644
index 0000000..0a93bc2
--- /dev/null
+++ b/docs/task-risk-profile.md
@@ -0,0 +1,285 @@
+# Task Risk Profile
+
+The Task Risk Profile scores a task on five dimensions before any
+authorization decision runs. The five scores sum to a composite (5 to 25)
+that maps deterministically to a risk lane. The risk lane drives the
+required trust tier in the Runtime Authorization Decision.
+
+The same task carries the same risk lane regardless of which execution
+substrate is being authorized. A database migration is a database
+migration whether the candidate runtime is Cursor, Codex or Claude Code.
+
+Schema: `schemas/v1/task_risk_profile.schema.json`.
+Source reference: AWF Sprint Plan v4.4.2, Task Risk Profile Model.
+
+---
+
+## The five dimensions
+
+Each dimension is scored from 1 to 5. Higher is riskier. Every score
+should carry a one-line rationale. The schema permits omitting it, but
+the audit value of a rationale-free score is near zero.
+
+### Code complexity
+
+How intricate is the change?
+
+| Score | Anchor |
+|---|---|
+| 1 | Single-file edit. One function, one config value, one CSS class. |
+| 2 | A handful of files inside one module. No cross-cutting concerns. |
+| 3 | Multi-file change within one service. Touches an internal contract. |
+| 4 | Cross-service change. Coordinated edits in two or more services. |
+| 5 | Multi-service refactor. Cuts across service boundaries and shared libraries. |
+
+**Raises the score.** Code generation that spans previously-unrelated
+modules. Changes that require parallel edits to keep types or schemas
+consistent. Rewrites of glue code that several services depend on.
+
+**Lowers the score.** Edits confined to a leaf module, a presentational
+component or a feature flag. Local refactors where the blast radius of
+a mistake is bounded by the file.
+
+**Edge cases.**
+- A small diff that changes a public function signature consumed across services scores 3 at minimum, not 1. Surface area matters more than diff size.
+- A multi-file rename driven by a deterministic codemod scores 2, not 4. Cognitive load is what this dimension measures, not line count.
+
+### Blast radius
+
+How many users or systems are affected if this change fails in production?
+
+| Score | Anchor |
+|---|---|
+| 1 | Isolated. Internal tooling, dev-only script, sandbox environment. |
+| 2 | Single feature surface inside a logged-in product area. |
+| 3 | Whole product area. All users of one feature or all instances of one job. |
+| 4 | Cross-feature. Several product areas or a shared platform component. |
+| 5 | Production-wide. Every request, every tenant, every customer. |
+
+**Raises the score.** Shared platform code (auth, payments, billing,
+observability). Changes touching the data path used by every request.
+Rollouts that cannot be feature-flagged.
+
+**Lowers the score.** Changes guarded by a feature flag at zero rollout.
+Changes scoped to a single tenant. Changes to a job that runs only when
+explicitly invoked.
+
+**Edge cases.**
+- A change shipped behind a flag at 0% still has a blast radius if the flag itself is misconfigured. Score the worst-case rollout, not the planned one.
+- An internal admin tool that touches production data is not score 1, even if only operators see it. The "users" axis includes the systems those operators control.
+
+### Security and data sensitivity
+
+What is the exposure to secrets, PII or auth surfaces?
+
+| Score | Anchor |
+|---|---|
+| 1 | No secrets, no PII, no auth surface. Purely presentational. |
+| 2 | Touches identifiers but no credentials. Display name, IDs, public metadata. |
+| 3 | Carries customer identifiers in transit. Touches request paths that may include PII. |
+| 4 | Touches identity tables, password reset flows, token issuance. Adjacent to auth. |
+| 5 | Direct auth surface or PII path. Changes credential validation, session checks or stored secrets. |
+
+**Raises the score.** Any change to authn or authz code. New database
+columns that may hold tokens, hashes or PII. New external integrations
+that require credentials.
+
+**Lowers the score.** Display-only changes. Visual refactors with no
+data dependency. Changes to internal analytics where no PII is in scope.
+
+**Edge cases.**
+- Removing a check is at least as sensitive as adding one. A diff that deletes an authentication step scores 5 regardless of how small the patch is.
+- Logging changes are not exempt. A new log line that captures PII is a sensitivity event even though the surface looks innocuous.
+
+### Concurrency and state risk
+
+How likely is this change to introduce race conditions, lock contention or migration-time inconsistency?
+
+| Score | Anchor |
+|---|---|
+| 1 | No state mutation. Pure functions, presentational components. |
+| 2 | Local state mutation with no contention. Single-writer paths. |
+| 3 | Shared in-memory state. Cache writes, distributed counters, queue producers. |
+| 4 | Schema change with a backfill or a coordinated deploy across consumers. |
+| 5 | Live schema change under read and write load. Foreign-key dependents in scope. |
+
+**Raises the score.** Any DDL on a table too large to backfill within
+acceptable latency. Lock orderings that differ from existing transactional
+paths. Any change that requires multiple services to deploy together.
+
+**Lowers the score.** Backwards-compatible additions (new optional column,
+new endpoint). Pure-read changes. Changes that can be rolled back by
+deleting a single deployment artifact.
+
+**Edge cases.**
+- A column add that is technically backwards-compatible can still score 4 if the application immediately starts writing to it without the read path tolerating nulls. The risk lives in the write coordination, not the DDL.
+- Cache invalidation changes look small but cross the concurrency axis. Score them at 3 minimum.
+
+### Business criticality
+
+What are the revenue, contract or regulatory consequences of breakage?
+
+| Score | Anchor |
+|---|---|
+| 1 | Sandbox or experimental product. No paying customers, no SLA. |
+| 2 | Internal tools used by employees. No external SLA. |
+| 3 | Customer-facing feature not on the revenue path. Marketing site, in-app help. |
+| 4 | Customer-facing feature adjacent to the revenue path. Account settings, integrations, dashboards consumed by paying customers. |
+| 5 | Revenue path or regulated surface. Checkout, payment, login, compliance reporting. |
+
+**Raises the score.** Anything documented in a customer contract. Anything
+under regulatory scope (PCI, HIPAA, SOX). Anything that, broken for an
+hour, would page an executive.
+
+**Lowers the score.** Internal-only surfaces with no contractual SLA.
+Clearly experimental features ring-fenced behind a beta flag.
+
+**Edge cases.**
+- A low-traffic feature with a contractual SLA scores 4 or 5. Volume is not the test; consequence is.
+- A feature that does not generate revenue today but is part of a signed enterprise commitment scores at the enterprise-commitment level, not the today-revenue level.
+
+---
+
+## Composite score and risk lane
+
+The composite is the sum of the five dimension scores. It ranges from
+5 (every dimension at 1) to 25 (every dimension at 5).
+
+| Composite | Risk lane | Authorization rule |
+|---|---|---|
+| 5 to 9 | **Low** | `RESTRICTED` tier sufficient |
+| 10 to 14 | **Medium** | `STANDARD` tier required |
+| 15 to 19 | **High** | `STANDARD` tier plus controls required |
+| 20 to 25 | **Critical** | `HIGH` tier plus human approval required |
+
+The mapping is deterministic. Two scorers who agree on the five
+dimension scores produce the same risk lane. A risk profile that
+records a lane inconsistent with its composite is a scoring error, not
+a judgement call.
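+
+The mapping can be sketched in a few lines of Python. This is an
+illustration, not the production classifier: the dimension keys and the
+function names `composite_score`, `risk_lane` and `lane_is_consistent`
+are assumptions and do not come from the schema.
+
+```python
+# Illustrative sketch; key and function names are assumptions, not the schema's.
+DIMENSIONS = (
+    "code_complexity",
+    "blast_radius",
+    "security_sensitivity",
+    "concurrency_risk",
+    "business_criticality",
+)
+
+# Inclusive lower bound of each lane, highest first.
+LANE_FLOORS = ((20, "Critical"), (15, "High"), (10, "Medium"), (5, "Low"))
+
+
+def composite_score(scores: dict[str, int]) -> int:
+    """Sum the five dimension scores after checking each is in 1..5."""
+    missing = set(DIMENSIONS) - set(scores)
+    if missing:
+        raise ValueError(f"missing dimensions: {sorted(missing)}")
+    for name in DIMENSIONS:
+        if not 1 <= scores[name] <= 5:
+            raise ValueError(f"{name} must be 1..5, got {scores[name]}")
+    return sum(scores[name] for name in DIMENSIONS)
+
+
+def risk_lane(composite: int) -> str:
+    """Map a composite in 5..25 to its risk lane."""
+    if not 5 <= composite <= 25:
+        raise ValueError(f"composite must be 5..25, got {composite}")
+    for floor, lane in LANE_FLOORS:
+        if composite >= floor:
+            return lane
+    raise AssertionError("unreachable")
+
+
+def lane_is_consistent(scores: dict[str, int], recorded_lane: str) -> bool:
+    """True when the recorded lane matches the lane derived from the scores."""
+    return risk_lane(composite_score(scores)) == recorded_lane
+```
+
+A check like `lane_is_consistent` is one way to catch a profile whose
+recorded lane disagrees with its composite before it reaches an
+authorization decision.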
+
+---
+
+## Worked examples
+
+The four examples below match the JSON examples published in
+`schemas/v1/task_risk_profile.schema.json`. Each illustrates the lane
+its composite falls into.
+
+### Low lane: a single-button colour change
+
+| Field | Value |
+|---|---|
+| Task class | `ui_refactor` |
+| Code complexity | 1. Single-file CSS class swap on one button. |
+| Blast radius | 2. Signed-in dashboard header only. |
+| Security and data sensitivity | 1. No secrets, no PII, no auth. |
+| Concurrency and state risk | 1. Presentational, no state mutation. |
+| Business criticality | 2. Internal dashboard, off the revenue path. |
+| **Composite** | **7** |
+| **Risk lane** | **Low** |
+
+Authorized on `RESTRICTED` or higher. Suitable for substrates with even
+minimal D1 to D4 history on UI work.
+
+### Medium lane: API versioning header
+
+| Field | Value |
+|---|---|
+| Task class | `api_change` |
+| Code complexity | 3. New versioned endpoint and routing-layer edits in two services. |
+| Blast radius | 3. All API consumers see the new version negotiation header. |
+| Security and data sensitivity | 2. No new auth surface; OAuth flow unchanged. |
+| Concurrency and state risk | 2. No schema change; backwards-compatible. |
+| Business criticality | 3. Customer-facing API but not on the checkout path. |
+| **Composite** | **13** |
+| **Risk lane** | **Medium** |
+
+Requires `STANDARD` tier on the candidate substrate.
+
+### High lane: cross-service API contract change
+
+| Field | Value |
+|---|---|
+| Task class | `api_change` |
+| Code complexity | 4. Cross-service contract change touching three consumers. |
+| Blast radius | 4. Every third-party integration that consumes this endpoint. |
+| Security and data sensitivity | 3. Endpoint carries customer identifiers, no auth secrets. |
+| Concurrency and state risk | 3. Coordinated deploy across consumers required. |
+| Business criticality | 3. Customer-facing API on the integration path. |
+| **Composite** | **17** |
+| **Risk lane** | **High** |
+
+Requires `STANDARD` plus controls. Typical controls: contract-test
+gating, staged consumer rollout, partner notification.
+
+### Critical lane: production DB migration on the revenue-path table
+
+| Field | Value |
+|---|---|
+| Task class | `db_migration` |
+| Code complexity | 4. Schema change with backfill and a `CONCURRENTLY` index build. |
+| Blast radius | 5. Production-wide. Primary OLTP table for the revenue path. |
+| Security and data sensitivity | 4. Touches identity table; column carries hashed credentials. |
+| Concurrency and state risk | 4. Live schema change under load with foreign-key dependents. |
+| Business criticality | 4. Failure blocks checkout and login simultaneously. |
+| **Composite** | **21** |
+| **Risk lane** | **Critical** |
+
+Requires `HIGH` tier plus human approval. No substrate is authorized
+for Critical lane work without explicit human approval, regardless of
+trust tier.
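+
+The authorization rules used in the four examples can be sketched as a
+single check. The sketch assumes the tier ordering `RESTRICTED` <
+`STANDARD` < `HIGH` and assumes a higher tier also satisfies a lower
+requirement. The `Tier` enum, `is_authorized` and its keyword arguments
+are hypothetical names chosen for illustration.
+
+```python
+from enum import IntEnum
+
+
+class Tier(IntEnum):
+    # Assumed ordering: RESTRICTED < STANDARD < HIGH.
+    RESTRICTED = 1
+    STANDARD = 2
+    HIGH = 3
+
+
+# Minimum tier per lane, read from the authorization table.
+MIN_TIER = {
+    "Low": Tier.RESTRICTED,
+    "Medium": Tier.STANDARD,
+    "High": Tier.STANDARD,
+    "Critical": Tier.HIGH,
+}
+
+
+def is_authorized(lane: str, tier: Tier, *, controls_in_place: bool = False,
+                  human_approved: bool = False) -> bool:
+    """Apply the lane's authorization rule to a candidate substrate."""
+    if tier < MIN_TIER[lane]:
+        return False
+    if lane == "High" and not controls_in_place:
+        return False  # High lane needs controls on top of the tier.
+    if lane == "Critical" and not human_approved:
+        return False  # Critical lane always needs human approval.
+    return True
+```
+
+Under these assumptions a `HIGH` tier substrate is still refused
+Critical lane work until `human_approved` is true, which mirrors the
+rule that no trust tier substitutes for explicit approval.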
+
+---
+
+## Scoring guidance
+
+These rules prevent the most common forms of scoring drift.
+
+**Score from worst case, not planned case.** A change rolled out behind
+a flag at 0% can still ship at 100% by accident. Score the rollout
+shape that exists, not the rollout shape that is intended.
+
+**Treat reductions in safety the same as additions.** Removing a check,
+a log line or an alarm is at least as risky as adding one. The
+direction of change does not lower any dimension.
+
+**Cross-service work scores at least 3 on code complexity.** A small
+diff that crosses a service boundary carries the coordination risk of
+every service it touches.
+
+**Concurrency risk lives in coordination, not in DDL.** Two
+backwards-compatible additions deployed in the wrong order can still
+corrupt data. Score the deployment dance, not just the schema.
+
+**If you cannot defend a dimension's score in one sentence, you have
+not scored it.** Every dimension should carry a rationale. A score
+without a rationale will not survive an audit.
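+
+Two of these rules are mechanical enough to lint. The sketch below
+assumes each dimension is recorded as a dict with `score` and
+`rationale` keys. That shape, the `cross_service` flag and the
+`lint_profile` name are assumptions of this example, not part of the
+schema.
+
+```python
+def lint_profile(dimensions: dict[str, dict], *,
+                 cross_service: bool = False) -> list[str]:
+    """Return a list of problems; an empty list means these checks pass."""
+    problems = []
+    # Every dimension should carry a non-empty rationale.
+    for name, entry in dimensions.items():
+        if not str(entry.get("rationale", "")).strip():
+            problems.append(f"{name}: missing rationale")
+    # Cross-service work scores at least 3 on code complexity.
+    complexity = dimensions.get("code_complexity", {}).get("score", 0)
+    if cross_service and complexity < 3:
+        problems.append("code_complexity: cross-service work scores at least 3")
+    return problems
+```
+
+The other rules (worst-case rollout, removed safety checks, deployment
+ordering) depend on judgement about the change itself and are left to
+the scorer.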
+
+---
+
+## A note on Sprint 3 examples
+
+The risk profiles published with Sprint 3 are deterministic simulations.
+The dimension scores in each example were chosen to illustrate the
+mapping logic and to exercise each lane at least once. They were not
+extracted from production task telemetry.
+
+Production computation of the Task Risk Profile will replace the current
+rule-based classifier incrementally as the `trust_subjects` and
+`trust_capability_profiles` tables go live. Until then, the published
+examples are the only Task Risk Profiles in the system, and they are
+illustrative.
+
+This is worth surfacing because the Runtime Authorization Decision
+examples reference these risk profiles by ID. The decisions are real
+machinery exercised against simulated inputs. The machinery does not
+change when the inputs become real.
+
+---
+
+## Related
+
+- `schemas/v1/task_risk_profile.schema.json`: schema definition and JSON examples.
+- `docs/d1-d4-scoring.md`: the trust scoring dimensions that gate the comparison between risk lane and required tier.
+- `docs/execution-substrates.md`: three-layer stack; why the same task carries the same risk lane regardless of substrate.
+- AWF Sprint Plan v4.4.2, Task Risk Profile Model.