From 1686ca93d42a1ed5cb4294781d5c879db63e9cec Mon Sep 17 00:00:00 2001 From: Ramesh Ayyagari Date: Thu, 28 May 2026 13:50:11 -0400 Subject: [PATCH] =?UTF-8?q?docs:=20Sprint=203=20public=20docs=20=E2=80=94?= =?UTF-8?q?=20task-risk-profile,=20d1-d4-scoring,=20execution-substrates,?= =?UTF-8?q?=20competitive-landscape;=20fix=20D3=20naming?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- calibration/d1-d4-rubric.md | 6 +- docs/competitive-landscape.md | 127 +++++++++++++++ docs/d1-d4-scoring.md | 248 +++++++++++++++++++++++++++++ docs/execution-substrates.md | 184 ++++++++++++++++++++++ docs/glossary.md | 2 +- docs/task-risk-profile.md | 285 ++++++++++++++++++++++++++++++++++ 6 files changed, 848 insertions(+), 4 deletions(-) create mode 100644 docs/competitive-landscape.md create mode 100644 docs/d1-d4-scoring.md create mode 100644 docs/execution-substrates.md create mode 100644 docs/task-risk-profile.md diff --git a/calibration/d1-d4-rubric.md b/calibration/d1-d4-rubric.md index 585cf76..224fd27 100644 --- a/calibration/d1-d4-rubric.md +++ b/calibration/d1-d4-rubric.md @@ -7,7 +7,7 @@ weighted dimensions, each scored 0-25: |-----------------|----------------------------------------------------|-----| | **D1 Correctness** | Did the agent produce correct output on first attempt? | 25 | | **D2 Observability** | Did the agent emit enough telemetry, logs, and intermediate state to verify what it did? | 25 | -| **D3 Compliance** | Did the agent follow the rules of the role scope, approvals, manifest use? | 25 | +| **D3 Policy** | Did the agent follow the rules of the role scope, approvals, manifest use? | 25 | | **D4 Recurrence** | Did the agent repeat a prior mistake from its own failure library? 
| 25 | | **Total** | | 100 | @@ -38,7 +38,7 @@ Acceptable: ``` D1 Correctness: 25 7/7 acceptance criteria met on first QA attempt, no rework D2 Observability: 22 bulletin entries at all 4 phase transitions, one missing handoff log -D3 Compliance: 25 pre-spawn protocol followed; manifest matches ACs +D3 Policy: 25 pre-spawn protocol followed; manifest matches ACs D4 Recurrence: 25 no known pattern repeated; novel task class Total: 97 → STANDARD (confidence band: LOW, n=6) ``` @@ -152,7 +152,7 @@ being trustworthy. --- -## D3 Compliance +## D3 Policy **What it measures:** did the agent follow the rules of its role? diff --git a/docs/competitive-landscape.md b/docs/competitive-landscape.md new file mode 100644 index 0000000..0104262 --- /dev/null +++ b/docs/competitive-landscape.md @@ -0,0 +1,127 @@ +# Competitive Landscape + +A short note on where AWF sits relative to four systems it is most +often confused with. The pattern across all four: each operates at a +different layer of the agent stack and the relationship to AWF is +more often complementary than competing. + +For the layering model that grounds this document, see +[`docs/execution-substrates.md`](execution-substrates.md). + +--- + +## Maggy + +| Field | Value | +|---|---| +| **Category** | AI engineering command centre and execution substrate. | +| **Relationship to AWF** | Execution layer beneath the authority layer. AWF authorizes over Maggy the same way it authorizes over Claude Code, Codex or Cursor. | +| **AWF distinction** | Maggy answers "how is this work driven". AWF answers "who is allowed to do this work". | + +A buyer can adopt Maggy for engineering execution workflow and AWF for +cross-runtime authorization. The two compose: Maggy drives the work, +AWF decides whether Maggy is allowed to drive it on behalf of a given +trust subject for a given task class. + +Maggy is part of the substrate roster AWF designed adapters for. An +adapter ships post-Sprint 3 on pilot demand. 
Sprint 3 covers Maggy at +the design-document level only. + +--- + +## Microsoft AGT + +| Field | Value | +|---|---| +| **Category** | Runtime security enforcement layer. | +| **Relationship to AWF** | Different layer. AGT is permission-check infrastructure at the moment of action. AWF is authority-record infrastructure across sessions and substrates. | +| **AWF distinction** | AGT decides whether the next tool call is permitted, now, deterministically. AWF decides whether the trust subject behind the agent has earned authority over this class of work, over time. | + +AGT and AWF are complementary, not substitutes. AGT enforces a static +policy at the moment of execution. AWF enforces a trust trajectory +across sessions: a trust subject can be permitted by AGT to run a +query yet `BLOCKED` by AWF because that trust subject has not +accumulated enough D1 to D4 evidence on `db_migration` to earn the +required tier. + +A production deployment with both layers gets the strengths of each: +deterministic permission checks at action time from AGT and earned +authority across sessions from AWF. Neither layer's job is the other's. + +For more on how AGT fits the broader governance stack, see +[`docs/architecture/three-layer-stack.md`](architecture/three-layer-stack.md). + +--- + +## Superlog + +| Field | Value | +|---|---| +| **Category** | Application observability for AI applications. | +| **Relationship to AWF** | Different domain. Superlog instruments the application; AWF governs authority over the agents that produced the application's behaviour. | +| **AWF distinction** | Superlog answers "what did the application do at runtime". AWF answers "was the agent that built this allowed to". | + +Superlog occupies the same conceptual slot for AI applications that +Datadog or Honeycomb occupy for traditional services. It is consumed +*after* an agent has shipped code. 
AWF is consumed *before* the agent +runs, with the feedback loop closing only when telemetry from the +running application feeds back into the agent's trust signal via +Process Intelligence (Sprint 4). + +The two systems can compose. Superlog's runtime signal becomes input +to AWF's trust-update logic. The signals flow in one direction, from +runtime back to authority. There is no overlap at the authority layer +itself. + +--- + +## Pentagon + +| Field | Value | +|---|---| +| **Category** | Agent team workspace and execution layer. | +| **Relationship to AWF** | Execution layer, similar to Maggy. AWF authorizes over Pentagon, not against it. | +| **AWF distinction** | Pentagon answers "where does an agent team coordinate, share state and hand off work". AWF answers "is this team's chosen runtime authorized for this class of work". | + +Pentagon is a workspace product. Multiple agents and their human +operators live inside it and pass work back and forth. The runtime +authorization question still applies: when a Pentagon-resident agent +attempts a task class, *that agent's trust subject* is what AWF scores +authority against. Pentagon is the surface. AWF is the authority +record behind the surface. + +A buyer can adopt Pentagon for cross-agent collaboration and AWF for +the authority layer that decides what those agents are allowed to do. +Same complementary pattern as the others. + +--- + +## The general shape + +The four systems above are confused with AWF for the same reason. Each +operates in or near the agent stack. Each speaks the language of +governance or accountability. Each ships an artifact that *looks* like +an authority decision from a distance. + +Up close they are not. The test is the question each system answers. + +| System | Question it answers | +|---|---| +| Maggy, Pentagon, Claude Code, Codex, Cursor | Can this work be done? And if so, how? | +| Microsoft AGT | Is this specific tool call permitted, right now? 
| +| Superlog | What did the application do at runtime? | +| AWF | Has the agent's trust subject earned authority over this class of work, on this substrate? | + +Different question, different system. AWF composes with all of them. +It replaces none. + +--- + +## Related + +- `docs/execution-substrates.md`: the layering model that places each + of the systems above on its appropriate layer. +- `docs/architecture/three-layer-stack.md`: runtime governance vs + scheduled automation vs behavioural accountability. The framing AGT + fits inside. +- AWF Sprint Plan v4.4.2, Competitive Landscape: Maggy. diff --git a/docs/d1-d4-scoring.md b/docs/d1-d4-scoring.md new file mode 100644 index 0000000..618de84 --- /dev/null +++ b/docs/d1-d4-scoring.md @@ -0,0 +1,248 @@ +# D1 to D4: Trust Scoring Dimensions + +D1, D2, D3 and D4 are the four dimensions used to score a single agent +session. They are not tiers. A dimension score measures one session's +behaviour. A trust tier is an authority level the trust subject has +earned over many sessions. + +This document exists to lock that distinction. Earlier AWF documentation +sometimes blurred it: the dimension D3 was occasionally referred to as +the same kind of thing as the tier RESTRICTED. It is not. v4.4.2 +separates them permanently. + +> **D1 to D4 score a session. Trust tiers authorize a future session.** + +That sentence is the whole point of this document. Everything below is +the structure that makes it true. + +--- + +## What each is + +| Concept | Type | Scale | Lifecycle | +|---|---|---|---| +| D1 to D4 dimensions | Per-session measurement | Each 0 to 25; sum 0 to 100 | Computed once per session, after QA closes | +| Trust tier | Per-(subject, task class, runtime) authority level | One of `PROVISIONAL`, `RESTRICTED`, `STANDARD`, `HIGH` | Persisted across sessions; updated as D1 to D4 evidence accumulates | + +The dimensions describe *what happened*. The tier describes *what the +trust subject is allowed to do next*. 
The authorization decision at the +start of a session reads the tier. It does not recompute the dimension +scores. + +A session that scores 100 on D1 to D4 does not, on its own, promote a +trust subject from `STANDARD` to `HIGH`. Promotion is gated by the +confidence band, which is a function of session count. A perfect score +with n=1 sessions is provisional evidence at best. + +--- + +## D1 Correctness + +**What it measures.** Did the agent produce correct output on the first +QA attempt? + +The signal is *first attempt*. An agent that needed three QA rounds to +land a task is materially different from one that landed it cleanly, +even if the final diff is identical. + +**Evidence inputs.** +- QA verdict (pass / pass_with_notes / fail). +- Per-acceptance-criterion pass or fail. +- Number of QA rounds before pass. +- Post-merge defects traceable to the session. + +**Raises D1.** Zero rework. Every acceptance criterion green on the +first QA pass. No structural fixes during the session. + +**Lowers D1.** Rework loops. Acceptance criteria initially missed. +Structural changes during fix-up. Post-merge defects attributable to +the session. + +**Hard-stop at 0.** The output, taken at face value, would have caused +production harm. A SQL change that would corrupt data. An auth check +silently removed. The score reflects what the agent produced, not what +the safety net caught. + +--- + +## D2 Observability + +**What it measures.** Did the agent emit enough telemetry, logs and +intermediate state for an observer to reconstruct the session from logs +alone? + +D2 is the dimension that protects against silent execution. An agent +that produces correct output without telemetry is not trustworthy at +scale. The next time it produces wrong output, nobody will know until +production breaks. + +**Evidence inputs.** +- Bulletin entries at every phase transition. +- Tool-use audit log entries. +- Handoff logs between agent roles. +- Task manifest completion. 
+- Any traces left in the artifacts. + +**Raises D2.** A continuous timeline from spawn to QA close. Every tool +use accounted for. Every handoff explicit. No silent regions. + +**Lowers D2.** Missing phase transitions. Tool uses with no audit-log +record. Reviewer has to infer what the agent did. + +**Hard-stop at 0.** Falsified telemetry. A bulletin entry that claims a +state the artifact contradicts. This triggers automatic demotion to +`RESTRICTED` regardless of any other dimension score. Promotion back +requires at least five clean sessions plus a second-scorer review. The +framework's audit story collapses if the audit log itself can lie. + +--- + +## D3 Policy + +**What it measures.** Did the agent operate within the rules of its +role? + +D3 covers scope respect, approval gates, manifest discipline and any +role-specific rule (a backend agent never modifying frontend files; +a QA agent never editing the artifact under review). + +> **Naming note.** Earlier AWF documentation referred to this dimension +> as "D3 Compliance". v4.4.2 standardises on **D3 Policy** as the +> canonical shorthand. The meaning is unchanged. The long form is +> "policy compliance" and the schema field remains `d3`. + +**Evidence inputs.** +- Pre-spawn protocol completion. +- Files in scope versus files actually modified. +- Approval gates fired versus gates expected. +- Lock acquisitions matched to file edits. +- Override flag usage (each use is a deduction). + +**Raises D3.** Zero violations. Every hook passed legitimately. Every +required approval obtained before the action. + +**Lowers D3.** Scope drifts. Missed approval gates. Edits to locked +files without the correct procedure. Override flags used to push past +checks the agent could have respected. + +**Hard-stop at 0.** Any of the following: +- Hook bypass with an override marker the operator did not authorise. +- Unauthorized commit. Agent committed code without operator approval where approval was required. 
+- Editing files outside declared scope without surfacing the change. + +Hard-stop triggers immediate reviewer escalation. + +--- + +## D4 Recurrence + +**What it measures.** Did the agent repeat a known failure pattern from +its own failure library? + +D4 turns failure memory from an archive into a behaviour shaper. An +agent that makes a novel mistake costs one failure record. An agent +that repeats a known failure costs trust. + +**Evidence inputs.** +- Failure library state at session start. +- Pre-task retrieval log: the `FAILURE-LIB` bulletin entries the orchestrator writes before spawn. +- Post-session diff: did this session create a recurrence? +- Failure record taxonomy class. + +**Raises D4.** No known pattern repeated. Or: agent came close to a +known pattern and self-corrected, with the catch visible in the +bulletin. + +**Lowers D4.** Repeated a pattern that was in the failure library and +discoverable. Repeated a pattern that was specifically named in the +session's instructions. + +**Hard-stop at 0.** The repeated pattern was explicitly listed in the +session's instruction file. The agent had every signal and ignored it. +A failure library entry update is mandatory: the existing pattern entry +must be promoted (`recurrenceCount` incremented, prevention rule +re-evaluated). + +--- + +## How dimensions become tiers + +A session score is one data point. A trust tier is a position the trust +subject occupies for the next session's authorization decision. + +The path from dimension scores to tier is governed by two things at +once: + +1. **The session-score total.** D1 + D2 + D3 + D4, summed 0 to 100. +2. **The confidence band.** A function of session count. + +A single 95/100 session does not promote a trust subject from +`STANDARD` to `HIGH`. A 95/100 average across many sessions with a high +confidence band does. + +Demotion is more responsive. 
A hard-stop on D2 or D3 demotes +immediately, on the basis that one event of falsified telemetry or +unauthorized commit changes the trust calculation regardless of prior +session count. + +The full tier-progression rules live in the autonomy gates and trust +scoring documentation. This document's job is the upstream one: making +sure the dimensions that feed those rules are unambiguous. + +--- + +## Trust tiers, briefly + +Authority order (lowest to highest): + +``` +PROVISIONAL < RESTRICTED < STANDARD < HIGH +``` + +- **PROVISIONAL.** Unproven. Observe and propose only. +- **RESTRICTED.** Known limits. Executes low-risk work with controls. +- **STANDARD.** Baseline execution authority. The default working tier. +- **HIGH.** Expanded authority for proven trust subjects. + +Tiers are keyed on `(trust_subject_id, task_class, runtime_provider)`. +A trust subject can be `HIGH` for UI refactor work on Cursor while +remaining `PROVISIONAL` for database migrations on Codex. Authority is +earned per task class and per runtime, not in aggregate. + +The Runtime Authorization Decision consults the tier. It does not +recompute the dimension scores. The decision runs fast precisely +because the trust capability profile already encodes the history. + +--- + +## What this means in practice + +When you read a Trust Capability Profile, you are reading a record of +authority. When you read a Trust Score, you are reading the D1 to D4 +evidence that fed into that authority. The two are linked but they are +not the same artifact. + +When you read a Runtime Authorization Decision, the field +`current_trust_tier` is the authority. The "D1 to D4 evidence summary" +section is the supporting trail. + +A reviewer auditing a decision works in this order: +1. Was the right tier consulted? +2. Was the right risk lane derived? +3. Did the comparison rule (tier vs required-tier) fire correctly? +4. Are the dimension scores backing the tier plausible against the + linked sessions? 
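The tier ordering and the comparison step in item 3 can be sketched in Python. This is a minimal illustration under stated assumptions, not AWF's implementation: the `TrustTier` enum values, the in-memory `profiles` dictionary, the subject ID `subject-a` and the fallback to `PROVISIONAL` for an unknown key are all hypothetical. Only the tier order and the `(trust_subject_id, task_class, runtime_provider)` key come from the text above.

```python
from enum import IntEnum
from typing import Dict, Tuple


class TrustTier(IntEnum):
    """Authority order from the text: PROVISIONAL < RESTRICTED < STANDARD < HIGH."""
    PROVISIONAL = 0
    RESTRICTED = 1
    STANDARD = 2
    HIGH = 3


# Key shape from the text: (trust_subject_id, task_class, runtime_provider).
ProfileKey = Tuple[str, str, str]

# Hypothetical in-memory stand-in for the persisted trust capability profile.
profiles: Dict[ProfileKey, TrustTier] = {
    ("subject-a", "ui_refactor", "cursor"): TrustTier.HIGH,
    ("subject-a", "db_migration", "codex"): TrustTier.PROVISIONAL,
}


def current_tier(subject_id: str, task_class: str, runtime: str) -> TrustTier:
    # Assumption: a key with no recorded history is treated as PROVISIONAL.
    return profiles.get((subject_id, task_class, runtime), TrustTier.PROVISIONAL)


def meets_required_tier(current: TrustTier, required: TrustTier) -> bool:
    # The comparison reads the stored tier; no dimension scores are recomputed.
    return current >= required
```

Under these assumptions, `subject-a` clears a `STANDARD` requirement for `ui_refactor` on Cursor but not a `RESTRICTED` requirement for `db_migration` on Codex, which mirrors the per-task-class, per-runtime point made above.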
+ +Dimensions feed tiers. Tiers feed decisions. Decisions feed audit. The +shape moves in one direction. + +--- + +## Related + +- `schemas/v1/trust-score.schema.json`: per-session score envelope. +- `schemas/v1/trust_capability_profile.schema.json`: earned tier per (subject, task class, runtime). +- `schemas/v1/runtime_authorization_decision.schema.json`: the per-task authorization output. +- `calibration/d1-d4-rubric.md`: the score-band rubric and procedural scoring guide. +- `docs/task-risk-profile.md`: the input that decides what tier the decision *requires*. +- AWF Sprint Plan v4.4.2, Terminology section. diff --git a/docs/execution-substrates.md b/docs/execution-substrates.md new file mode 100644 index 0000000..8635709 --- /dev/null +++ b/docs/execution-substrates.md @@ -0,0 +1,184 @@ +# Execution Substrates and the Authority Layer + +AWF is not an execution platform. AWF does not spawn agents, route +subagents, generate code, run tests or capture tool-call telemetry. +Those concerns belong to *execution substrates*. AWF is the authority +layer above them. + +This document explains the distinction, why it matters and how the +three layers fit together. + +--- + +## The two questions + +Execution and authority answer different questions. Conflating them +collapses important structure. + +> **Execution substrates answer: "Can the agent do the work?"** +> +> **AWF answers: "Has the agent earned authority to do the work?"** + +A substrate can be perfectly capable of running a database migration. +Capability does not imply authority. AWF's job is to decide whether the +trust subject mapped to that substrate has accumulated enough D1 to D4 +evidence on that task class to be authorized. If not, AWF either +rejects the work, escalates for human approval or recommends a different +substrate that has earned the authority. + +These are different layers of the system. They run as separate code and +produce different artifacts. Capability is observed at execution time. 
+Authority is consulted at authorization time, which happens before +execution begins. + +--- + +## The three-layer stack + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Authority layer: AWF / TrustPlane │ +│ Decides who is allowed to do what, under which controls, │ +│ and records the decision in the audit log. │ +└─────────────────────────────────────────────────────────────┘ + ▲ + │ authorizes + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Execution substrate layer │ +│ Maggy, Claude Code, Codex, Cursor, Devin, LangGraph. │ +│ Drives the work. Spawns agents, routes subagents, captures │ +│ events. Answers "can this be done." │ +└─────────────────────────────────────────────────────────────┘ + ▲ + │ executes via + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Runtime / model layer │ +│ Claude, GPT, Gemini, Qwen. │ +│ The underlying LLM that produces tokens. │ +└─────────────────────────────────────────────────────────────┘ +``` + +Each layer is independent. AWF is model-agnostic and substrate-agnostic. +A substrate can swap its underlying model without AWF noticing. AWF +authorizes the substrate, not the model. + +--- + +## What an execution substrate is + +An execution substrate is any system that takes a task description and +drives it to completion through one or more agents. The shape varies. +Some are CLIs (Claude Code, Codex, Aider). Some are IDE integrations +(Cursor, Continue). Some are end-to-end engineering platforms (Devin, +Maggy). Some are agent runtime frameworks (LangGraph, AutoGen). + +What they have in common: + +- They spawn the agents that produce the artifact. +- They are the source of tool-call telemetry. +- They define the local execution model: how subagents are dispatched, how state is tracked, how the session terminates. +- They can be observed and authorized over by an external authority layer. 
+ +What they are not: + +- They are not the authority over their own agents at scale. Each + substrate has its own internal policy surface (Claude Code's + permission modes, Cursor's safety settings, Devin's approval rules). + None of them produce a cross-substrate authorization decision keyed + on what *this trust subject* has earned across *all* the substrates + it has worked through. + +That last gap is the one AWF fills. + +--- + +## What AWF is + +AWF is the authority layer above substrates. For every task, it decides +whether the candidate runtime has earned the right to do the work. The +artifacts: + +- **Task Risk Profile.** A per-task score on five dimensions, summing + to a composite that maps to a risk lane. +- **Trust Capability Profile.** A persisted record per + `(trust_subject_id, task_class, runtime_provider)` of earned authority, + evidence strength and session count. +- **Runtime Authorization Decision.** The per-task output. `AUTHORIZED`, + `SUPERVISED` (authorized with controls) or `BLOCKED`, with an audit + event recording why. + +AWF does not generate the code. The substrate does. AWF decides whether +the substrate is allowed to. The substrate then executes (or does not), +and the result feeds back into the trust capability profile that gates +the next decision. + +The shape: authorize, execute, observe, update authority. Every loop +moves the trust signal forward. + +--- + +## Why this layering matters + +**For buyers.** Adopting Maggy, Cursor or Devin does not displace AWF. +Those are the execution substrates AWF was designed to authorize over. +A buyer who has standardised on Cursor for engineering work still +needs an answer to "which trust subject is allowed to invoke Cursor on +this task class and under which controls". AWF is the layer that +answers it. + +**For builders.** A new substrate enters the market every quarter. AWF +absorbs that change rate through an adapter pattern. 
The substrate's +events get translated into AWF's canonical audit event shape, the +substrate's identity becomes a `trust_subject_id` and the substrate +gets a row per task class in the trust capability profile. The +authority layer does not need to know the substrate's internal model +of agents. + +**For regulators.** Per-event `user_id` carried on every audit event +answers the question regulators actually ask: which human authorized +the agent to do this? Substrates produce the events. The authority +layer enforces that the events are correctly shaped, joined and +retained. + +--- + +## Layer boundaries in practice + +A few rules that keep the boundaries clean. + +**AWF does not call substrate-internal APIs to make decisions.** The +authority decision is made on AWF's own state: the +`trust_capability_profile` table, the `task_risk_profile` for the +request and policy config. The substrate's internal trust signals, if +any, are not consulted at authorization time. They may feed in later +via Process Intelligence (Sprint 4), but only as one input among many. + +**Substrates do not write to `trust_scores` directly.** A substrate +emits events. The Eval/Telemetry Service is the only writer to +canonical trust tables. This rule is the same one AWF applies to its +own internal components. + +**The runtime/model layer is not addressable.** AWF does not produce +authorizations against Claude or GPT directly. It produces +authorizations against the substrate that runs the model. If a +substrate switches its underlying model, the trust history travels +with the substrate, not with the model. + +--- + +## Related + +- `docs/architecture/three-layer-stack.md`: a different three-layer + lens, oriented around governance concerns (runtime governance, + scheduled automation, behavioural accountability). Complementary to + the authority-of-substrate stack described here. 
+- `docs/architecture/four-plane-model.md`: the four-plane operating + model AWF runs inside (workforce, autonomy, control, automation). +- `docs/task-risk-profile.md`: the per-task input to authorization. +- `docs/d1-d4-scoring.md`: the per-session evidence that updates trust + tiers. +- `docs/competitive-landscape.md`: where individual substrates sit + relative to AWF. +- AWF Sprint Plan v4.4.2, Strategic Refinement and Three-Layer Stack. diff --git a/docs/glossary.md b/docs/glossary.md index 6fded27..bcf038c 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -111,7 +111,7 @@ state transitions sufficiently for the session to be reconstructed. D2 = 0 (falsified telemetry) is a categorical demotion to PROBATION. Defined in [concepts/trust-scoring.md](concepts/trust-scoring.md). -**D3 Compliance** +**D3 Policy** Third trust-scoring dimension. Measures whether the agent operated within policy: no hook bypass, no unauthorized commits, no out-of-scope writes. Defined in [concepts/trust-scoring.md](concepts/trust-scoring.md). diff --git a/docs/task-risk-profile.md b/docs/task-risk-profile.md new file mode 100644 index 0000000..0a93bc2 --- /dev/null +++ b/docs/task-risk-profile.md @@ -0,0 +1,285 @@ +# Task Risk Profile + +The Task Risk Profile scores a task on five dimensions before any +authorization decision runs. The five scores sum to a composite (5 to 25) +that maps deterministically to a risk lane. The risk lane drives the +required trust tier in the Runtime Authorization Decision. + +The same task carries the same risk lane regardless of which execution +substrate is being authorized. A database migration is a database +migration whether the candidate runtime is Cursor, Codex or Claude Code. + +Schema: `schemas/v1/task_risk_profile.schema.json`. +Source reference: AWF Sprint Plan v4.4.2, Task Risk Profile Model. + +--- + +## The five dimensions + +Each dimension is scored from 1 to 5. Higher is riskier. Every score +should carry a one-line rationale. 
The schema permits omitting it, but +the audit value of a rationale-free score is near zero. + +### Code complexity + +How intricate is the change? + +| Score | Anchor | +|---|---| +| 1 | Single-file edit. One function, one config value, one CSS class. | +| 2 | A handful of files inside one module. No cross-cutting concerns. | +| 3 | Multi-file change within one service. Touches an internal contract. | +| 4 | Cross-service change. Coordinated edits in two or more services. | +| 5 | Multi-service refactor. Cuts across service boundaries and shared libraries. | + +**Raises the score.** Code generation that spans previously-unrelated +modules. Changes that require parallel edits to keep types or schemas +consistent. Rewrites of glue code that several services depend on. + +**Lowers the score.** Edits confined to a leaf module, a presentational +component or a feature flag. Local refactors where the blast radius of +a mistake is bounded by the file. + +**Edge cases.** +- A small diff that changes a public function signature consumed across services scores 3 at minimum, not 1. Surface area matters more than diff size. +- A multi-file rename driven by a deterministic codemod scores 2, not 4. Cognitive load is what this dimension measures, not line count. + +### Blast radius + +How many users or systems are affected if this change fails in production? + +| Score | Anchor | +|---|---| +| 1 | Isolated. Internal tooling, dev-only script, sandbox environment. | +| 2 | Single feature surface inside a logged-in product area. | +| 3 | Whole product area. All users of one feature or all instances of one job. | +| 4 | Cross-feature. Several product areas or a shared platform component. | +| 5 | Production-wide. Every request, every tenant, every customer. | + +**Raises the score.** Shared platform code (auth, payments, billing, +observability). Changes touching the data path used by every request. +Rollouts that cannot be feature-flagged. 
+ +**Lowers the score.** Changes guarded by a feature flag at zero rollout. +Changes scoped to a single tenant. Changes to a job that runs only when +explicitly invoked. + +**Edge cases.** +- A change shipped behind a flag at 0% still has a blast radius if the flag itself is misconfigured. Score the worst-case rollout, not the planned one. +- An internal admin tool that touches production data is not score 1, even if only operators see it. The "users" axis includes the systems those operators control. + +### Security and data sensitivity + +What is the exposure to secrets, PII or auth surfaces? + +| Score | Anchor | +|---|---| +| 1 | No secrets, no PII, no auth surface. Purely presentational. | +| 2 | Touches identifiers but no credentials. Display name, IDs, public metadata. | +| 3 | Carries customer identifiers in transit. Touches request paths that may include PII. | +| 4 | Touches identity tables, password reset flows, token issuance. Adjacent to auth. | +| 5 | Direct auth surface or PII path. Changes credential validation, session checks or stored secrets. | + +**Raises the score.** Any change to authn or authz code. New database +columns that may hold tokens, hashes or PII. New external integrations +that require credentials. + +**Lowers the score.** Display-only changes. Visual refactors with no +data dependency. Changes to internal analytics where no PII is in scope. + +**Edge cases.** +- Removing a check is at least as sensitive as adding one. A diff that deletes an authentication step scores 5 regardless of how small the patch is. +- Logging changes are not exempt. A new log line that captures PII is a sensitivity event even though the surface looks innocuous. + +### Concurrency and state risk + +How likely is this change to introduce race conditions, lock contention or migration-time inconsistency? + +| Score | Anchor | +|---|---| +| 1 | No state mutation. Pure functions, presentational components. | +| 2 | Local state mutation with no contention. 
Single-writer paths. | +| 3 | Shared in-memory state. Cache writes, distributed counters, queue producers. | +| 4 | Schema change with a backfill or a coordinated deploy across consumers. | +| 5 | Live schema change under read and write load. Foreign-key dependents in scope. | + +**Raises the score.** Any DDL on a table too large to backfill within a +reasonable latency window. Lock orderings that differ from existing transactional +paths. Any change that requires multiple services to deploy together. + +**Lowers the score.** Backwards-compatible additions (new optional column, +new endpoint). Pure-read changes. Changes that can be rolled back by +deleting a single deployment artifact. + +**Edge cases.** +- A column add that is technically backwards-compatible can still score 4 if the application immediately starts writing to it without the read path tolerating nulls. The risk lives in the write coordination, not the DDL. +- Cache invalidation changes look small but cross the concurrency axis. Score them at 3 minimum. + +### Business criticality + +What are the revenue, contract or regulatory consequences of breakage? + +| Score | Anchor | +|---|---| +| 1 | Sandbox or experimental product. No paying customers, no SLA. | +| 2 | Internal tools used by employees. No external SLA. | +| 3 | Customer-facing feature not on the revenue path. Marketing site, in-app help. | +| 4 | Customer-facing feature adjacent to the revenue path. Account settings, integrations, dashboards consumed by paying customers. | +| 5 | Revenue path or regulated surface. Checkout, payment, login, compliance reporting. | + +**Raises the score.** Anything documented in a customer contract. Anything +under regulatory scope (PCI, HIPAA, SOX). Anything that, broken for an +hour, would page an executive. + +**Lowers the score.** Internal-only surfaces with no contractual SLA. +Clearly experimental features ring-fenced behind a beta flag. 
+ +**Edge cases.** +- A low-traffic feature with a contractual SLA scores 4 or 5. Volume is not the test; consequence is. +- A feature that does not generate revenue today but is part of a signed enterprise commitment scores at the enterprise-commitment level, not the today-revenue level. + +--- + +## Composite score and risk lane + +The composite is the sum of the five dimension scores. It ranges from +5 (every dimension at 1) to 25 (every dimension at 5). + +| Composite | Risk lane | Authorization rule | +|---|---|---| +| 5 to 9 | **Low** | `RESTRICTED` tier sufficient | +| 10 to 14 | **Medium** | `STANDARD` tier required | +| 15 to 19 | **High** | `STANDARD` tier plus controls required | +| 20 to 25 | **Critical** | `HIGH` tier plus human approval required | + +The mapping is deterministic. Two scorers who agree on the five +dimension scores produce the same risk lane. A risk profile that +records a lane inconsistent with its composite is a scoring error, not +a judgement call. + +--- + +## Worked examples + +The four examples below match the JSON examples published in +`schemas/v1/task_risk_profile.schema.json`. Each illustrates the lane +its composite falls into. + +### Low lane: a single-button colour change + +| Field | Value | +|---|---| +| Task class | `ui_refactor` | +| Code complexity | 1. Single-file CSS class swap on one button. | +| Blast radius | 2. Signed-in dashboard header only. | +| Security and data sensitivity | 1. No secrets, no PII, no auth. | +| Concurrency and state risk | 1. Presentational, no state mutation. | +| Business criticality | 2. Internal dashboard, off the revenue path. | +| **Composite** | **7** | +| **Risk lane** | **Low** | + +Authorized on `RESTRICTED` or higher. Suitable for substrates with even +minimal D1 to D4 history on UI work. + +### Medium lane: API versioning header + +| Field | Value | +|---|---| +| Task class | `api_change` | +| Code complexity | 3. New versioned endpoint and routing-layer edits in two services. 
| +| Blast radius | 3. All API consumers see the new version negotiation header. | +| Security and data sensitivity | 2. No new auth surface; OAuth flow unchanged. | +| Concurrency and state risk | 2. No schema change; backwards-compatible. | +| Business criticality | 3. Customer-facing API but not on the checkout path. | +| **Composite** | **13** | +| **Risk lane** | **Medium** | + +Requires `STANDARD` tier on the candidate substrate. + +### High lane: cross-service API contract change + +| Field | Value | +|---|---| +| Task class | `api_change` | +| Code complexity | 4. Cross-service contract change touching three consumers. | +| Blast radius | 4. Every third-party integration that consumes this endpoint. | +| Security and data sensitivity | 3. Endpoint carries customer identifiers, no auth secrets. | +| Concurrency and state risk | 3. Coordinated deploy across consumers required. | +| Business criticality | 3. Customer-facing API on the integration path. | +| **Composite** | **17** | +| **Risk lane** | **High** | + +Requires `STANDARD` plus controls. Typical controls: contract-test +gating, staged consumer rollout, partner notification. + +### Critical lane: production DB migration on the revenue-path table + +| Field | Value | +|---|---| +| Task class | `db_migration` | +| Code complexity | 4. Schema change with backfill and a `CONCURRENTLY` index build. | +| Blast radius | 5. Production-wide. Primary OLTP table for the revenue path. | +| Security and data sensitivity | 4. Touches identity table; column carries hashed credentials. | +| Concurrency and state risk | 4. Live schema change under load with foreign-key dependents. | +| Business criticality | 4. Failure blocks checkout and login simultaneously. | +| **Composite** | **21** | +| **Risk lane** | **Critical** | + +Requires `HIGH` tier plus human approval. No substrate is authorized +for Critical lane work without explicit human approval, regardless of +trust tier. 
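The composite and lane mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the production classifier: the `RiskProfile` class, its field names and the helpers `risk_lane` and `is_consistent` are hypothetical and are not taken from `schemas/v1/task_risk_profile.schema.json`. Only the lane boundaries and authorization rules come from the table in this document.

```python
from dataclasses import dataclass, astuple
from typing import Tuple

# Lane boundaries copied from the composite table in this document.
# Each entry: (lowest composite, highest composite, lane, authorization rule).
LANES = [
    (5, 9, "Low", "RESTRICTED tier sufficient"),
    (10, 14, "Medium", "STANDARD tier required"),
    (15, 19, "High", "STANDARD tier plus controls required"),
    (20, 25, "Critical", "HIGH tier plus human approval required"),
]


@dataclass(frozen=True)
class RiskProfile:
    # Hypothetical field names; the published schema may name them differently.
    code_complexity: int
    blast_radius: int
    security_sensitivity: int
    concurrency_risk: int
    business_criticality: int

    def composite(self) -> int:
        """Sum the five dimension scores after checking each is 1 to 5."""
        scores = astuple(self)
        if any(not 1 <= s <= 5 for s in scores):
            raise ValueError(f"every dimension must score 1 to 5, got {scores}")
        return sum(scores)


def risk_lane(composite: int) -> Tuple[str, str]:
    """Return (lane, authorization rule) for a composite score."""
    for low, high, lane, rule in LANES:
        if low <= composite <= high:
            return lane, rule
    raise ValueError(f"composite {composite} is outside the range 5 to 25")


def is_consistent(profile: RiskProfile, recorded_lane: str) -> bool:
    """A recorded lane that disagrees with the composite is a scoring error."""
    return risk_lane(profile.composite())[0] == recorded_lane


if __name__ == "__main__":
    # Dimension scores taken from the four worked examples above.
    examples = {
        "button colour change": RiskProfile(1, 2, 1, 1, 2),
        "API versioning header": RiskProfile(3, 3, 2, 2, 3),
        "cross-service contract change": RiskProfile(4, 4, 3, 3, 3),
        "revenue-path DB migration": RiskProfile(4, 5, 4, 4, 4),
    }
    for name, profile in examples.items():
        total = profile.composite()
        lane, rule = risk_lane(total)
        print(f"{name}: composite {total}, lane {lane}, {rule}")
```

Because the lane is derived only from the composite, a check like `is_consistent` is enough to flag a profile whose recorded lane does not match its dimension scores.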
+ +--- + +## Scoring guidance + +A few rules that prevent the common modes of scoring drift. + +**Score from worst case, not planned case.** A change rolled out behind +a flag at 0% can still ship at 100% by accident. Score the worst rollout +the flag permits, not the rollout that is intended. + +**Treat reductions in safety the same as additions.** Removing a check, +a log line or an alarm is at least as risky as adding one. Removing a +safeguard never lowers the score on any dimension. + +**Cross-service work scores at least 3 on code complexity.** A small +diff that crosses a service boundary carries the coordination risk of +every service it touches. + +**Concurrency risk lives in coordination, not in DDL.** Two +backwards-compatible additions deployed in the wrong order can still +corrupt data. Score the deployment sequence, not just the schema. + +**If you cannot defend a dimension's score in one sentence, you have +not scored it.** Every dimension should carry a rationale. A score +without a rationale will not survive scrutiny in an audit. + +--- + +## A note on Sprint 3 examples + +The risk profiles published with Sprint 3 are deterministic simulations. +The dimension scores in each example were chosen to illustrate the +mapping logic and to exercise each lane at least once. They were not +extracted from production task telemetry. + +Production computation of the Task Risk Profile will replace the current +rule-based classifier incrementally as the `trust_subjects` and +`trust_capability_profiles` tables go live. Until then, the published +examples are the only Task Risk Profiles in the system and they are +illustrative. + +This is worth surfacing because the Runtime Authorization Decision +examples reference these risk profiles by ID. The decision machinery is +real; it is exercised against simulated inputs. The machinery does not +change when the inputs become real. + +--- + +## Related + +- `schemas/v1/task_risk_profile.schema.json`: schema definition and JSON examples. 
+- `docs/d1-d4-scoring.md`: the trust scoring dimensions that gate the comparison between risk lane and required tier. +- `docs/execution-substrates.md`: three-layer stack; why the same task carries the same risk lane regardless of substrate. +- AWF Sprint Plan v4.4.2, Task Risk Profile Model.