From 99aa57d692373254d3edecd7767bb5a7ae25f125 Mon Sep 17 00:00:00 2001
From: Diego Andres Rabaioli
Date: Mon, 29 Jun 2026 12:11:35 +0200
Subject: [PATCH 1/5] Add deterministic prompt seam-contract checks (Tier 1, #23)

CDD's slash-commands are agentic prompts whose steps hand artifacts to
each other; a one-sided edit can silently strand a downstream
prompt-driven step. Add scripts/prompt-seam-check.sh
(+ prompt-seam-whitelist.txt), extending the proven
command-drift-check.sh pattern with four grep-only, no-LLM seam
contracts:

- /cdd-* references resolve to an existing command file (catches rename
  fallout like #27/#31); known non-commands are whitelisted.
- the gh_issue_NN branch token is produced (cdd-next-step) and consumed
  (cdd-pre-pr -> Closes #NN) in agreement.
- backticked file paths in the command files + CLAUDE.md + README resolve.
- each cdd-*.md keeps its load-bearing headings.

CDD-repo-only (not shipped in the template). Wired into CI
(template-smoke.yml), /cdd-pre-pr (cdd-only section + checklist), the
engineering-practices enforced list, CLAUDE.md, and the roadmap
(Phase 11: Tier-1 done; Tiers 2-3 recorded as deferred, per the #23
investigation).

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .claude/commands/cdd-pre-pr.md              |  14 +++
 .github/workflows/template-smoke.yml        |   3 +
 CLAUDE.md                                   |   8 +-
 doc/knowledge_base/engineering-practices.md |   3 +-
 doc/knowledge_base/roadmap.md               |   3 +
 scripts/prompt-seam-check.sh                | 118 ++++++++++++++++++++
 scripts/prompt-seam-whitelist.txt           |  21 ++++
 7 files changed, 167 insertions(+), 3 deletions(-)
 create mode 100755 scripts/prompt-seam-check.sh
 create mode 100644 scripts/prompt-seam-whitelist.txt

diff --git a/.claude/commands/cdd-pre-pr.md b/.claude/commands/cdd-pre-pr.md
index f0393fe..e117627 100644
--- a/.claude/commands/cdd-pre-pr.md
+++ b/.claude/commands/cdd-pre-pr.md
@@ -105,6 +105,20 @@ The script renders the template through `bootstrap-cdd-project.sh --stage` with
 If the script exits 0, report "no drift" and continue. If it reports divergence, present each diff to the user; for each, the user decides whether to reconcile the repo copy, reconcile the template copy, or record a justified exception (a whitelist entry or a `cdd-only` fence). Apply fixes only on user approval. Do not auto-edit either tree from this step.
 
 When presenting the step 8 checklist, append a `- [ ] Command-set drift clean` line to it.
+
+## Prompt-seam checks (CDD repo only)
+
+Also specific to the CDD repo: deterministic seam-contract checks over the repo's own prompts (the slash-commands and the docs around them), guarding against a one-sided edit silently stranding a downstream prompt-driven step.
+
+```bash
+./scripts/prompt-seam-check.sh
+```
+
+It verifies four seams with grep only (no LLM, no API key): every `/cdd-*` reference across the repo's markdown resolves to an existing command file (known non-commands are whitelisted in `scripts/prompt-seam-whitelist.txt`); the `gh_issue_NN` branch token produced in `cdd-next-step.md` is still consumed (turned into a `Closes #NN` line) in `cdd-pre-pr.md`; backticked file paths in the command files, `CLAUDE.md`, and `README.md` resolve to real files; and each `cdd-*.md` still carries its load-bearing headings. CI runs it on every PR via `template-smoke.yml`.
+
+If the script exits 0, report "prompt seams clean" and continue. If it reports a broken seam, present each one to the user; for each, the user decides whether to fix the reference/heading/path or record a justified exception (a whitelist entry). Apply fixes only on user approval.
+
+When presenting the step 8 checklist, append a `- [ ] Prompt seams clean` line to it.
 
 ## 8. Summary
 
diff --git a/.github/workflows/template-smoke.yml b/.github/workflows/template-smoke.yml
index 64133fb..daea853 100644
--- a/.github/workflows/template-smoke.yml
+++ b/.github/workflows/template-smoke.yml
@@ -25,6 +25,9 @@ jobs:
       - name: Command-set drift check (repo commands vs rendered template)
         run: ./scripts/command-drift-check.sh
 
+      - name: Prompt-seam check (command refs, branch token, paths, sections)
+        run: ./scripts/prompt-seam-check.sh
+
       - name: Worktree-helper install smoke (against a throwaway HOME)
         run: ./scripts/install-smoke-assert.sh
 
diff --git a/CLAUDE.md b/CLAUDE.md
index 92a9886..758505b 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -46,6 +46,10 @@ bash -n demo/setup.sh demo/teardown.sh demo/lib.sh
 # Command-set drift: repo .claude/commands/ vs the rendered template.
 ./scripts/command-drift-check.sh
 
+# Prompt-seam contracts: /cdd-* refs resolve, gh_issue_NN token agrees across
+# producer/consumer, backticked paths resolve, commands keep load-bearing headings.
+./scripts/prompt-seam-check.sh
+
 # Worktree-helper install: run `cdd-worktree.sh install` against a throwaway HOME.
 ./scripts/install-smoke-assert.sh
 
@@ -60,7 +64,7 @@ rm -rf /tmp/cdd-demo-smoke
 demo/setup.sh mdr_demo_99 --base /tmp/cdd-demo-smoke --local-only
 ```
 
-The `template-smoke` GitHub Actions workflow runs the same checks on every PR: shellcheck, the command-set drift check, the worktree-helper install smoke, the end-to-end smoke, and the demo seed-overlay step.
+The `template-smoke` GitHub Actions workflow runs the same checks on every PR: shellcheck, the command-set drift check, the prompt-seam check, the worktree-helper install smoke, the end-to-end smoke, and the demo seed-overlay step.
 
 When `/cdd-pre-pr` runs in this repo, the "build / format / lint / test" gates collapse into the checks above plus a doc reconciliation pass.
 
@@ -79,7 +83,7 @@ When `/cdd-pre-pr` runs in this repo, the "build / format / lint / test" gates c
 | `demo/` | Demo / dogfooding subsystem (third artifact) |
 | `demo/seed/` | Filled-in "Markdown Renderer" project content (not template) |
 | `demo/{setup,teardown}.sh` | Create/teardown demo & dogfood instances; `lib.sh` shared |
-| `scripts/` | Template smoke assertions + install smoke + command-set drift check (with whitelists) |
+| `scripts/` | Template smoke assertions + install smoke + command-set drift check + prompt-seam check (with whitelists) |
 | `.github/workflows/` | CI: `template-smoke.yml` runs the bootstrap end-to-end |
 | `.claude/commands/` | This repo's own slash commands |
 | `tools/` | Bootstrap script + the canonical shared worktree helper (`cdd-worktree.sh`, self-installing) |
diff --git a/doc/knowledge_base/engineering-practices.md b/doc/knowledge_base/engineering-practices.md
index 09dc70d..baa2231 100644
--- a/doc/knowledge_base/engineering-practices.md
+++ b/doc/knowledge_base/engineering-practices.md
@@ -17,6 +17,7 @@ There is no unit-test suite; behaviour is exercised by integration-style smoke a
 
 - `bash -n` over all shell scripts (syntax).
 - `./scripts/command-drift-check.sh` — repo `.claude/commands/` vs the rendered template, plus the handoff-schema and worktree-helper assertions.
+- `./scripts/prompt-seam-check.sh` — deterministic seam contracts between the repo's own prompts: `/cdd-*` references resolve to a command file, the `gh_issue_NN` branch token is produced and consumed in agreement, backticked file paths resolve, and each command keeps its load-bearing headings.
 - End-to-end bootstrap smoke: `tools/bootstrap-cdd-project.sh` into a tmpdir + `scripts/template-smoke-assert.sh` (clean, link-valid tree).
 - Demo seed-overlay smoke: `demo/setup.sh … --local-only`.
 
@@ -24,7 +25,7 @@ New behaviour in a script or the bootstrap path ships with the relevant smoke or
 
 ## Continuous integration — Enforced
 
-`.github/workflows/template-smoke.yml` runs shellcheck, the command-drift check, the end-to-end bootstrap smoke, and the demo seed-overlay step on every PR.
+`.github/workflows/template-smoke.yml` runs shellcheck, the command-drift check, the prompt-seam check, the end-to-end bootstrap smoke, and the demo seed-overlay step on every PR.
 
 ## Lint & format — Enforced (lint); Expected (format)
 
diff --git a/doc/knowledge_base/roadmap.md b/doc/knowledge_base/roadmap.md
index d55abac..8db8b6e 100644
--- a/doc/knowledge_base/roadmap.md
+++ b/doc/knowledge_base/roadmap.md
@@ -147,6 +147,9 @@ Elevate the two under-guarded founding objectives — instilling engineering bes
 - [ ] Objective-3 standing channel: a recurring mechanism that routes a discovered improvement into the roadmap/conventions (not a reintroduced standing log). — §6 known gap; design deferred.
 - [ ] Reinforce objective 2 at bootstrap: a required bootstrap-phase task and/or checklist, once the `/cdd-pre-pr` mechanism is proven.
 - [ ] Objective-1 mechanizations: codify when `/cdd-merge-base` is recommended/auto-triggered; consider a mechanical gate-honored check.
+- [x] Deterministic prompt seam-contract checks (Tier 1; issue #23). `scripts/prompt-seam-check.sh` (+ `scripts/prompt-seam-whitelist.txt`) pins four seams between the workflow's own prompts: every `/cdd-*` reference resolves to a command file (catches rename fallout like #27/#31), the `gh_issue_NN` branch token is produced (`cdd-next-step.md`) and consumed (`cdd-pre-pr.md`) in agreement, backticked file paths resolve, and each `cdd-*.md` keeps its load-bearing headings. Wired into CI (`template-smoke.yml`), `/cdd-pre-pr`, and the engineering-practices enforced list. CDD-repo-only (not shipped in the template); extends the proven `command-drift-check.sh` pattern. A recurring objective-1 reliability guardrail. Verdict + tier breakdown: issue #23 comment.
+- [ ] Tier 2 (deferred): generalize the seam checks into a standalone "prompt lint" framework — a packaging decision once Tier 1 has proven itself (issue #23 comment).
+- [ ] Tier 3 (deferred): behavioral / LLM-as-judge evals (promptfoo et al.) are the wrong fit for a six-checkpoint human-in-the-loop workflow — a headless eval would test a degraded, checkpoint-stripped workflow, and it needs API budget and tolerates flakiness. The behavioral safety net is instead elevating `demo/` to a periodic **human-driven** dogfood run — a *process* practice, not a CI job (issue #23 comment).
 
 **Milestone:** all three founding objectives are named commitments in §1, each with at least one recurring guardrail or a tracked plan to add one.
 
diff --git a/scripts/prompt-seam-check.sh b/scripts/prompt-seam-check.sh
new file mode 100755
index 0000000..b02ec14
--- /dev/null
+++ b/scripts/prompt-seam-check.sh
@@ -0,0 +1,118 @@
+#!/usr/bin/env bash
+# Deterministic seam-contract checks for the CDD repo's own prompts (Tier 1; issue #23).
+#
+# CDD's slash-commands are agentic prompts whose steps hand artifacts to each other.
+# Producer and consumer must agree on each artifact's shape, and a one-sided edit can
+# silently strand a downstream step. This script pins those seams with grep/diff only —
+# no LLM, no API key, no flakiness — the same proven shape as command-drift-check.sh.
+#
+# It is a CDD-repo-only check (the meta-project guarding its own command set/docs); it is
+# not shipped in template/ and does not run in downstream projects' CI. See the #23
+# investigation comment for the verdict and the deferred Tier 2/3 follow-ups.
+#
+# Checks:
+#   1. Command-name resolution — every `/cdd-*` reference across the repo's markdown
+#      resolves to an existing .claude/commands/cdd-*.md, or is a whitelisted non-command
+#      (shell helper, marker path, retired name) in scripts/prompt-seam-whitelist.txt.
+#   2. Branch-token contract — the gh_issue_NN token produced in cdd-next-step.md is
+#      consumed (-> Closes #NN) in cdd-pre-pr.md; both sides must still name it.
+#   3. Path-existence linter — backticked repo-relative file paths in the command files,
+#      CLAUDE.md, and README.md resolve to real files (whitelist covers downstream paths).
+#   4. Required-section presence — each cdd-*.md still carries its load-bearing headings,
+#      so an edit can't silently drop one.
+set -euo pipefail
+
+REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+cd "$REPO_ROOT"
+
+REPO_CMDS=".claude/commands"
+WHITELIST="scripts/prompt-seam-whitelist.txt"
+
+fail=0
+note() { echo " $*" >&2; fail=1; }
+
+whitelisted() {
+  grep -vE '^[[:space:]]*(#|$)' "$WHITELIST" | grep -qxF -- "$1"
+}
+
+# --- Check 1: command-name resolution ----------------------------------------
+mapfile -t md_files < <(git ls-files --cached --others --exclude-standard '*.md')
+mapfile -t cmd_tokens < <(grep -hoE '/cdd-[a-z][a-z0-9-]*' "${md_files[@]}" | sort -u)
+
+for tok in "${cmd_tokens[@]}"; do
+  name="${tok#/}"
+  [[ -f "$REPO_CMDS/$name.md" ]] && continue
+  whitelisted "$tok" && continue
+  note "dangling command reference $tok — no $REPO_CMDS/$name.md and not whitelisted:"
+  grep -rnoE "$tok"'([^a-z0-9-]|$)' "${md_files[@]}" | sed 's/^/ /' >&2 || true
+done
+
+# --- Check 2: branch-token / issue-token contract ----------------------------
+NEXT="$REPO_CMDS/cdd-next-step.md"
+PRE="$REPO_CMDS/cdd-pre-pr.md"
+grep -qF 'gh_issue_NN_' "$NEXT" \
+  || note "branch-token producer broken: $NEXT no longer names the gh_issue_NN_ token"
+grep -qF 'gh_issue_NN' "$PRE" \
+  || note "branch-token consumer broken: $PRE no longer matches the gh_issue_NN branch token"
+grep -qF 'Closes #NN' "$PRE" \
+  || note "branch-token consumer broken: $PRE no longer turns the token into a Closes #NN line"
+
+# --- Check 3: path-existence linter ------------------------------------------
+# Backticked tokens that look like a repo-relative path (contain '/', end in a known
+# extension, no placeholders/globs/home/vars/brace-expansion) must resolve to a real file.
+for f in "$REPO_CMDS"/cdd-*.md CLAUDE.md README.md; do
+  while IFS= read -r p; do
+    [[ -e "$p" ]] && continue
+    whitelisted "$p" && continue
+    note "broken path reference in $f: \`$p\`"
+  done < <(grep -oE '`[^`]+`' "$f" \
+    | sed -E 's/^`//; s/`$//' \
+    | grep -E '/' \
+    | grep -E '\.(md|sh|ya?ml|txt|json|png)$' \
+    | grep -vE '[<>*~$ {}]')
+done
+
+# --- Check 4: required-section presence per command --------------------------
+# Curated load-bearing headings (## lines, matched whole-line). Not the full set —
+# the seam-critical steps whose silent removal would break a downstream prompt.
+require_headings() {
+  local file="$1"; shift
+  local h
+  for h in "$@"; do
+    grep -qxF -- "$h" "$file" || note "missing required heading in $file: $h"
+  done
+}
+
+require_headings "$REPO_CMDS/cdd-next-step.md" \
+  '## 0. Mode: roadmap-driven, intent-driven, or issue-driven' \
+  '## 5. Draft the handoff' \
+  '## 7. Write the handoff file' \
+  '## 8. Print the next command'
+require_headings "$REPO_CMDS/cdd-pre-pr.md" \
+  '## 1. Identify changes' \
+  '## 2. Build & QA' \
+  '## 8. Summary' \
+  '## 9. Commit reconciliation edits' \
+  '## 10. Open PR (optional)'
+require_headings "$REPO_CMDS/cdd-merge-base.md" \
+  '## 3. Dry-run conflict assessment' \
+  '## 5. Perform the merge' \
+  '## 8. Summary'
+require_headings "$REPO_CMDS/cdd-process-pr.md" \
+  '## 4. Triage (the retained checkpoint)' \
+  '## 7. Commit and push'
+require_headings "$REPO_CMDS/cdd-bootstrap.md" \
+  '## 1. Guided discovery' \
+  '## 6. Scaffold the project (one bootstrap invocation)'
+require_headings "$REPO_CMDS/cdd-quick-create.md" \
+  '## 1. Scope check (the gate)' \
+  '## 4. Write the deliverable (files-first)'
+require_headings "$REPO_CMDS/cdd-retrofit.md" \
+  '## 3. Install mode' \
+  '## 4. Upgrade mode'
+
+if [[ "$fail" -ne 0 ]]; then
+  echo "prompt-seam check: FAILED (see above)" >&2
+  exit 1
+fi
+echo "prompt-seam check: clean"
diff --git a/scripts/prompt-seam-whitelist.txt b/scripts/prompt-seam-whitelist.txt
new file mode 100644
index 0000000..cb2017b
--- /dev/null
+++ b/scripts/prompt-seam-whitelist.txt
@@ -0,0 +1,21 @@
+# Tokens that look like a /cdd-* slash command or a repo file path but are
+# intentionally NOT one. Consulted by scripts/prompt-seam-check.sh.
+# Exact-line match; lines starting with # and blank lines are ignored.
+
+# --- Not slash commands (check 1: command-name resolution) ---
+# The sourced shell helper (tools/cdd-worktree.sh), written without a leading
+# slash precisely because it is not a slash command.
+/cdd-worktree
+# Baseline marker file path (.claude/cdd-baseline), not a command.
+/cdd-baseline
+# Smoke-test temp dirs (/tmp/cdd-smoke, /tmp/cdd-demo-smoke), not commands.
+/cdd-smoke
+/cdd-demo-smoke
+# Retired command name, kept only in roadmap completion annotations that
+# document the /cdd-merge-main -> /cdd-merge-base rename (#31). Historical.
+/cdd-merge-main
+
+# --- Downstream paths absent from the CDD repo (check 3: path existence) ---
+# Produced by /cdd-bootstrap in a downstream project; the CDD repo's own
+# overview lives at doc/architecture/overview.md instead.
+doc/knowledge_base/project-overview.md

From 071267933843e34e14d8c3368e423ca9706bf688 Mon Sep 17 00:00:00 2001
From: Diego Andres Rabaioli
Date: Mon, 29 Jun 2026 13:19:26 +0200
Subject: [PATCH 2/5] Fix SC2016 in prompt-seam-check.sh; reconcile architecture overview (#23)

Pre-PR reconciliation:
- prompt-seam-check.sh tripped shellcheck SC2016 on the deliberately
  single-quoted grep/sed backtick patterns in the path-existence check,
  which would have failed the template-smoke CI shellcheck step. Added a
  `# shellcheck disable=SC2016` directive before the compound command.
- doc/architecture/overview.md: note the prompt-seam check in the
  scripts/ tree comment and add a sibling paragraph describing it.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 doc/architecture/overview.md | 4 +++-
 scripts/prompt-seam-check.sh | 3 +++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/doc/architecture/overview.md b/doc/architecture/overview.md
index 844f4cb..dd9529d 100644
--- a/doc/architecture/overview.md
+++ b/doc/architecture/overview.md
@@ -29,7 +29,7 @@ Changes flow process-first, template-second. A PR that touches the process doc b
 │   ├── architecture/ # how this repo is structured
 │   ├── features/ # what this repo provides
 │   └── knowledge_base/ # process doc, roadmap, engineering practices, decisions
-├── scripts/ # template smoke assertions + command-set drift check (with whitelists)
+├── scripts/ # template smoke assertions + command-set drift check + prompt-seam check (with whitelists)
 ├── template/ # copy-paste material for new projects
 └── tools/
     ├── bootstrap-cdd-project.sh # non-interactive bootstrap for new projects
@@ -42,6 +42,8 @@ The process doc references the template by example (it describes what a CLAUDE.m
 
 The CDD repo's own `.claude/commands/` and `template/.claude/commands/` are conceptually the same files, with the repo's own copy free to drift if it needs CDD-specific behaviour. Unintended drift is a defect, and is checked mechanically: `scripts/command-drift-check.sh` (run by CI and `/cdd-pre-pr`) renders the template via the bootstrap script's stage mode with this repo's own identifiers and diffs the result against `.claude/commands/`, so substitution differences cancel out and only real divergence surfaces. Justified exceptions are either whole one-sided files listed in `scripts/command-drift-whitelist.txt` or CDD-meta sections of shared files fenced between `` / `` markers in the repo copy. The script also rejects `cdd-only` markers appearing in the template itself, where they would be stripped from both sides of the diff and hide real drift. Three commands are deliberately one-sided: `/cdd-retrofit` (`.claude/commands/cdd-retrofit.md`), `/cdd-bootstrap` (`.claude/commands/cdd-bootstrap.md`), and `/cdd-quick-create` (`.claude/commands/cdd-quick-create.md`) live only in the CDD repo — `/cdd-retrofit` installs CDD into an existing project or upgrades one already on CDD, `/cdd-bootstrap` scaffolds a new greenfield one, and `/cdd-quick-create` produces a lightweight one-off deliverable — all operating *on* a target from a CDD-repo session, so the template ships no copy of any of them. `/cdd-retrofit` and `/cdd-bootstrap` share the bootstrap pipeline; `/cdd-quick-create` uses neither it nor `template/`, because a one-off has no template. See [Bootstrap & retrofit](bootstrap-and-retrofit.md) for the shared pipeline.
 
+A sibling guard of the same family, `scripts/prompt-seam-check.sh` (also run by CI and `/cdd-pre-pr`), pins the seam contracts *between* the repo's own prompts with grep only — no LLM, no API key. It verifies that every `/cdd-*` reference across the repo's markdown resolves to a command file, that the `gh_issue_NN` branch token produced in `cdd-next-step.md` is still consumed (as a `Closes #NN` line) in `cdd-pre-pr.md`, that backticked repo-relative file paths resolve, and that each `cdd-*.md` keeps its load-bearing headings. Justified non-resolving tokens (shell helpers, marker paths, retired names, downstream-only paths) live in `scripts/prompt-seam-whitelist.txt`. Like the drift check it is CDD-repo-only and not shipped in the template.
+
 ## Open structural questions
 
 - Whether per-project-type variants live as parallel template directories, as a single template with a variant flag, or as post-bootstrap transformation scripts. Deferred until there is enough usage to compare across project types.
diff --git a/scripts/prompt-seam-check.sh b/scripts/prompt-seam-check.sh
index b02ec14..3ae300e 100755
--- a/scripts/prompt-seam-check.sh
+++ b/scripts/prompt-seam-check.sh
@@ -61,6 +61,9 @@ grep -qF 'Closes #NN' "$PRE" \
 # Backticked tokens that look like a repo-relative path (contain '/', end in a known
 # extension, no placeholders/globs/home/vars/brace-expansion) must resolve to a real file.
 for f in "$REPO_CMDS"/cdd-*.md CLAUDE.md README.md; do
+  # SC2016: the single quotes below are deliberate — the grep/sed patterns match
+  # literal backtick characters in the markdown; no shell expansion is wanted.
+  # shellcheck disable=SC2016
   while IFS= read -r p; do
     [[ -e "$p" ]] && continue
     whitelisted "$p" && continue

From 67328d7c10fd39197fb3e2348cd85c04aefcc6ff Mon Sep 17 00:00:00 2001
From: Diego Andres Rabaioli
Date: Mon, 29 Jun 2026 14:16:26 +0200
Subject: [PATCH 3/5] Record prompt-seam-check scope in ADR 0002; drop Tier 2/3 from roadmap (#23)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Tier 2 (generalize into a prompt-lint framework) is premature with a
single consumer; Tier 3 (LLM-as-judge evals) is rejected on principle —
a headless eval can only run by stripping the human checkpoints that
define CDD. Neither is planned work, so neither belongs on the roadmap.
The rationale now lives in ADR 0002, and the Tier 1 roadmap line +
architecture ADR index point at it.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 ...e-prompt-seam-checks-deterministic-only.md | 34 +++++++++++++++++++
 doc/architecture/index.md                     |  1 +
 doc/knowledge_base/roadmap.md                 |  4 +--
 3 files changed, 36 insertions(+), 3 deletions(-)
 create mode 100644 doc/architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md

diff --git a/doc/architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md b/doc/architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md
new file mode 100644
index 0000000..1e20b1b
--- /dev/null
+++ b/doc/architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md
@@ -0,0 +1,34 @@
+# 0002: Scope prompt "CI" to deterministic seam checks; reject LLM-as-judge evals
+
+**Status:** Accepted
+
+## Context
+
+Issue #23 asked how to "CI/test" CDD's prompts: on a normal codebase, code is unit-tested, linted, and formatted to keep it reliable; a prompt can't be unit-tested, so what guards against a prompt edit silently breaking the rest of the workflow? (The motivating example: editing `cdd-next-step.md` could change the handoff file's shape and strand the downstream `cdd-worktree` step or the implementation session.)
+
+Investigation split the worry into two failure classes that need different treatment:
+
+1. **Seam / contract drift — *deterministic.*** The workflow's steps hand artifacts to each other (a producer step and a consumer step must agree on the artifact's shape). This is testable with grep/diff — no LLM required. The repo had already discovered the pattern without naming it: `command-drift-check.sh` and `template-smoke-assert.sh` already pin contracts at some seams.
+2. **Behavioral correctness — *non-deterministic.*** "Does the prompt actually make Claude do the right thing?" This is eval / LLM-as-judge territory (promptfoo et al.).
+
+The investigation proposed three tiers (full reasoning in the issue #23 comment):
+
+- **Tier 1** — extend the deterministic seam-pinning to the seams not yet covered (cheap, high value, no API keys, no flakiness).
+- **Tier 2** — generalize Tier 1 into a standalone "prompt lint" framework (a packaging decision).
+- **Tier 3** — behavioral LLM-as-judge evals (expensive, partly inapplicable).
+
+Tier 1 shipped as `scripts/prompt-seam-check.sh` (PR #37). This ADR records the decision on Tiers 2 and 3, which were initially carried on the roadmap as deferred items and are now removed from it — a roadmap is a list of work intended to happen, and neither tier is.
+
+## Decision
+
+**Scope CDD's prompt "CI" to deterministic seam-contract checks (Tier 1). Do not pursue Tier 2 or Tier 3 as planned work; remove both from the roadmap.**
+
+- **Tier 2 (generalize into a "prompt lint" framework) — not planned.** It is low-difficulty (the checks are grep-based; generalizing means lifting the hard-coded seams into a config/manifest-driven form), but premature: there is exactly one consumer today (this repo). Abstracting for a single consumer guesses at the wrong shape. If a second consumer appears, or if maintaining Tier 1 reveals real friction, revisit it then — but it is not roadmap work now.
+- **Tier 3 (behavioral LLM-as-judge evals) — rejected on principle, not merely deferred for difficulty.** Running a CDD command faithfully means an agent loop against a live repo *with six human-in-the-loop checkpoints*. A headless eval can only run by removing those checkpoints — so it would test a degraded, checkpoint-stripped workflow that is not CDD. It also needs API budget in CI and tolerates flakiness, both of which the deterministic checks deliberately avoid. The honest behavioral safety net already exists: the `demo/` subsystem. The realistic behavioral practice is keeping `demo/` as a periodic **human-driven** dogfood run — a *process* practice, not a CI job.
+
+## Consequences
+
+- The reliability story for the workflow's own prompts is, and stays, deterministic: `command-drift-check.sh` + `prompt-seam-check.sh` + the smoke asserts, all grep/diff, no API keys, no flakiness. That is the whole intended surface — there is no pending "behavioral eval" work implied by its absence.
+- The roadmap no longer carries Tier 2 / Tier 3 as open items, so it doesn't misrepresent a rejected approach (Tier 3) or a premature one (Tier 2) as pending work. The rationale for the absence lives here instead.
+- Behavioral confidence is owned by the `demo/` dogfood as a human-driven practice, not by CI. If that practice needs to become a tracked cadence, it belongs with the demo/dogfooding subsystem, not with prompt-CI tooling.
+- Tier 2 remains available as an unplanned future option gated on a concrete trigger (a second consumer or Tier 1 maintenance friction); this ADR would be revisited rather than silently reversed.
diff --git a/doc/architecture/index.md b/doc/architecture/index.md
index 90994df..aa203bb 100644
--- a/doc/architecture/index.md
+++ b/doc/architecture/index.md
@@ -9,3 +9,4 @@ How this repo is structured. This index is a pointer list — the content lives
 - [The demo layer](demo.md) — the third artifact: filled-in seed + create/teardown automation
 - `adr/` — architecture decision records (`adr/0000-template.md` for the format)
   - [`0001-name-and-guard-founding-objectives.md`](adr/0001-name-and-guard-founding-objectives.md) — naming and guarding CDD's two under-guarded founding objectives (engineering practices, self-improvement)
+  - [`0002-scope-prompt-seam-checks-deterministic-only.md`](adr/0002-scope-prompt-seam-checks-deterministic-only.md) — scoping prompt "CI" to deterministic seam checks; why LLM-as-judge evals (and a generalized prompt-lint framework) are not planned work
diff --git a/doc/knowledge_base/roadmap.md b/doc/knowledge_base/roadmap.md
index 8db8b6e..c49df24 100644
--- a/doc/knowledge_base/roadmap.md
+++ b/doc/knowledge_base/roadmap.md
@@ -147,9 +147,7 @@ Elevate the two under-guarded founding objectives — instilling engineering bes
 - [ ] Objective-3 standing channel: a recurring mechanism that routes a discovered improvement into the roadmap/conventions (not a reintroduced standing log). — §6 known gap; design deferred.
 - [ ] Reinforce objective 2 at bootstrap: a required bootstrap-phase task and/or checklist, once the `/cdd-pre-pr` mechanism is proven.
 - [ ] Objective-1 mechanizations: codify when `/cdd-merge-base` is recommended/auto-triggered; consider a mechanical gate-honored check.
-- [x] Deterministic prompt seam-contract checks (Tier 1; issue #23). `scripts/prompt-seam-check.sh` (+ `scripts/prompt-seam-whitelist.txt`) pins four seams between the workflow's own prompts: every `/cdd-*` reference resolves to a command file (catches rename fallout like #27/#31), the `gh_issue_NN` branch token is produced (`cdd-next-step.md`) and consumed (`cdd-pre-pr.md`) in agreement, backticked file paths resolve, and each `cdd-*.md` keeps its load-bearing headings. Wired into CI (`template-smoke.yml`), `/cdd-pre-pr`, and the engineering-practices enforced list. CDD-repo-only (not shipped in the template); extends the proven `command-drift-check.sh` pattern. A recurring objective-1 reliability guardrail. Verdict + tier breakdown: issue #23 comment.
-- [ ] Tier 2 (deferred): generalize the seam checks into a standalone "prompt lint" framework — a packaging decision once Tier 1 has proven itself (issue #23 comment).
-- [ ] Tier 3 (deferred): behavioral / LLM-as-judge evals (promptfoo et al.) are the wrong fit for a six-checkpoint human-in-the-loop workflow — a headless eval would test a degraded, checkpoint-stripped workflow, and it needs API budget and tolerates flakiness. The behavioral safety net is instead elevating `demo/` to a periodic **human-driven** dogfood run — a *process* practice, not a CI job (issue #23 comment).
+- [x] Deterministic prompt seam-contract checks (Tier 1; issue #23). `scripts/prompt-seam-check.sh` (+ `scripts/prompt-seam-whitelist.txt`) pins four seams between the workflow's own prompts: every `/cdd-*` reference resolves to a command file (catches rename fallout like #27/#31), the `gh_issue_NN` branch token is produced (`cdd-next-step.md`) and consumed (`cdd-pre-pr.md`) in agreement, backticked file paths resolve, and each `cdd-*.md` keeps its load-bearing headings. Wired into CI (`template-smoke.yml`), `/cdd-pre-pr`, and the engineering-practices enforced list. CDD-repo-only (not shipped in the template); extends the proven `command-drift-check.sh` pattern. A recurring objective-1 reliability guardrail. Scope decision (deterministic seam checks only; no generalized "prompt lint" framework and no LLM-as-judge evals) recorded in ADR [`0002-scope-prompt-seam-checks-deterministic-only.md`](../architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md); verdict + tier breakdown in the issue #23 comment.
 
 **Milestone:** all three founding objectives are named commitments in §1, each with at least one recurring guardrail or a tracked plan to add one.
 
From a06ba8d4c3168f72a89f39b8cd837fa30adf6538 Mon Sep 17 00:00:00 2001
From: Diego Andres Rabaioli
Date: Mon, 29 Jun 2026 18:59:27 +0200
Subject: [PATCH 4/5] Correct ADR 0002: Tier 3 deferred on cost/ROI, not rejected on principle (#23)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The earlier draft argued LLM-as-judge evals were inapplicable because a headless eval "strips the human checkpoints that define CDD." That objection only holds for a naive full end-to-end autonomous run; for the realistic per-prompt behavioral checks, stubbing the human approval is exactly what an eval does and is legitimate. The honest reasons to defer are practical — cost (mitigable with a paths-filtered CI job), judge-calibration effort, flakiness, and ROI on a single-maintainer repo already covered by the seam checks + human dogfood.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 ...pe-prompt-seam-checks-deterministic-only.md | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/doc/architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md b/doc/architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md
index 1e20b1b..9d3925d 100644
--- a/doc/architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md
+++ b/doc/architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md
@@ -1,4 +1,4 @@
-# 0002: Scope prompt "CI" to deterministic seam checks; reject LLM-as-judge evals
+# 0002: Scope prompt "CI" to deterministic seam checks; defer LLM-as-judge evals
 
 **Status:** Accepted
 
@@ -15,20 +15,26 @@ The investigation proposed three tiers (full reasoning in the issue #23 comment)
 
 - **Tier 1** — extend the deterministic seam-pinning to the seams not yet covered (cheap, high value, no API keys, no flakiness).
 - **Tier 2** — generalize Tier 1 into a standalone "prompt lint" framework (a packaging decision).
-- **Tier 3** — behavioral LLM-as-judge evals (expensive, partly inapplicable).
+- **Tier 3** — behavioral LLM-as-judge evals (expensive; deferred).
 
-Tier 1 shipped as `scripts/prompt-seam-check.sh` (PR #37). This ADR records the decision on Tiers 2 and 3, which were initially carried on the roadmap as deferred items and are now removed from it — a roadmap is a list of work intended to happen, and neither tier is.
+Tier 1 shipped as `scripts/prompt-seam-check.sh` (PR #37). This ADR records the decision on Tiers 2 and 3, which were initially carried on the roadmap as deferred items and are now removed from it — a roadmap lists work intended to happen next, and neither tier is that: Tier 2 is premature, and Tier 3 is deferred on cost and ROI.
 
 ## Decision
 
 **Scope CDD's prompt "CI" to deterministic seam-contract checks (Tier 1). Do not pursue Tier 2 or Tier 3 as planned work; remove both from the roadmap.**
 
 - **Tier 2 (generalize into a "prompt lint" framework) — not planned.** It is low-difficulty (the checks are grep-based; generalizing means lifting the hard-coded seams into a config/manifest-driven form), but premature: there is exactly one consumer today (this repo). Abstracting for a single consumer guesses at the wrong shape. If a second consumer appears, or if maintaining Tier 1 reveals real friction, revisit it then — but it is not roadmap work now.
-- **Tier 3 (behavioral LLM-as-judge evals) — rejected on principle, not merely deferred for difficulty.** Running a CDD command faithfully means an agent loop against a live repo *with six human-in-the-loop checkpoints*. A headless eval can only run by removing those checkpoints — so it would test a degraded, checkpoint-stripped workflow that is not CDD. It also needs API budget in CI and tolerates flakiness, both of which the deterministic checks deliberately avoid. The honest behavioral safety net already exists: the `demo/` subsystem. The realistic behavioral practice is keeping `demo/` as a periodic **human-driven** dogfood run — a *process* practice, not a CI job.
+- **Tier 3 (behavioral LLM-as-judge evals) — deferred on cost, effort, and ROI; *not* rejected on principle.** The realistic form is per-prompt / per-seam behavioral checks (e.g. promptfoo): feed a prompt a fixture, stub the human approval, and have an LLM judge whether the produced artifact — a handoff, a triage, a plan — is correct. Substituting a canned approval for the human is exactly what an eval *does* and is legitimate; it does not make the eval test "a different workflow," because the unit under test is the prompt's behavior between checkpoints, not the checkpoint itself. (An earlier draft of this ADR argued Tier 3 was inapplicable because a headless eval "strips the human checkpoints that define CDD." That objection only holds for a naive *full end-to-end* autonomous run, where the human is part of what makes the final output correct — but no one would build that, so it is not the case against Tier 3. The argument was overstated and is corrected here.) The real reasons to defer are practical:
+  - **Cost** — manageable, not prohibitive: a `paths:`-filtered CI job can run the evals only when the command/doc files actually change, so unchanged prompts cost nothing.
+  - **Effort** — the non-trivial work isn't wiring the tool, it's writing fixtures and *calibrating the judge* so its verdicts agree with human judgment.
+  - **Flakiness** — LLM outputs and judges are non-deterministic, so the signal is noisy and needs threshold-tuning, unlike the crisp pass/fail of the grep checks.
+  - **Maintenance + ROI** — every prompt change must keep the fixtures in sync, and for a single-maintainer repo the deterministic seam checks plus the human `demo/` dogfood already cover the high-frequency failure mode (seam drift). The marginal behavioral bug an eval would catch is rarer, so the ROI isn't there yet.
+
+  The behavioral safety net meanwhile stays the `demo/` subsystem run as a periodic **human-driven** dogfood — a *process* practice, not a CI job. Revisit Tier 3 if behavioral regressions start slipping through that the deterministic checks and the dogfood miss.
 
 ## Consequences
 
 - The reliability story for the workflow's own prompts is, and stays, deterministic: `command-drift-check.sh` + `prompt-seam-check.sh` + the smoke asserts, all grep/diff, no API keys, no flakiness. That is the whole intended surface — there is no pending "behavioral eval" work implied by its absence.
-- The roadmap no longer carries Tier 2 / Tier 3 as open items, so it doesn't misrepresent a rejected approach (Tier 3) or a premature one (Tier 2) as pending work. The rationale for the absence lives here instead.
+- The roadmap no longer carries Tier 2 / Tier 3 as open items, so it doesn't present a premature approach (Tier 2) or a cost-deferred one (Tier 3) as pending work. The rationale for the absence lives here instead.
 - Behavioral confidence is owned by the `demo/` dogfood as a human-driven practice, not by CI. If that practice needs to become a tracked cadence, it belongs with the demo/dogfooding subsystem, not with prompt-CI tooling.
-- Tier 2 remains available as an unplanned future option gated on a concrete trigger (a second consumer or Tier 1 maintenance friction); this ADR would be revisited rather than silently reversed.
+- Both deferred tiers remain available as future options gated on concrete triggers — Tier 2 on a second consumer or Tier 1 maintenance friction, Tier 3 on behavioral regressions slipping past the deterministic checks and the dogfood. This ADR would be revisited rather than silently reversed.
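Reviewer aside: the **Cost** point in the ADR rests on running evals only when prompt files change. A CI system's own path filter is the likely mechanism. The shell sketch below only illustrates the same gating idea locally. The function name `prompts_changed` and the watched path set are assumptions for the example and are not taken from the repo.

```shell
# Hypothetical sketch: decide whether a (not yet existing) eval job should run.
# Reads changed file paths on stdin, one per line, for example from
#   git diff --name-only "$BASE"...HEAD
# Returns 0 if any changed path is a prompt file, 1 otherwise.
prompts_changed() {
  while IFS= read -r path; do
    case $path in
      # Watched set is an assumption for illustration only.
      .claude/commands/*.md|CLAUDE.md|README.md) return 0 ;;
    esac
  done
  return 1
}
```

With this shape, a change that touches only scripts or unrelated docs would skip the eval step entirely, which is the "unchanged prompts cost nothing" property the ADR describes.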
From 31e6e2aec0e9c46f8c86e5ca3df2d609b2e5cc21 Mon Sep 17 00:00:00 2001
From: Diego Andres Rabaioli
Date: Mon, 29 Jun 2026 19:05:47 +0200
Subject: [PATCH 5/5] Compact the Tier 1 roadmap entry (#23)

The four-seam enumeration was duplicated in three places; the roadmap line now points at the script header and engineering-practices.md for detail and keeps only what-landed + wiring + the ADR 0002 scope pointer. No info lost.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 doc/knowledge_base/roadmap.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/knowledge_base/roadmap.md b/doc/knowledge_base/roadmap.md
index c49df24..b907cbc 100644
--- a/doc/knowledge_base/roadmap.md
+++ b/doc/knowledge_base/roadmap.md
@@ -147,7 +147,7 @@ Elevate the two under-guarded founding objectives — instilling engineering bes
 - [ ] Objective-3 standing channel: a recurring mechanism that routes a discovered improvement into the roadmap/conventions (not a reintroduced standing log). — §6 known gap; design deferred.
 - [ ] Reinforce objective 2 at bootstrap: a required bootstrap-phase task and/or checklist, once the `/cdd-pre-pr` mechanism is proven.
 - [ ] Objective-1 mechanizations: codify when `/cdd-merge-base` is recommended/auto-triggered; consider a mechanical gate-honored check.
-- [x] Deterministic prompt seam-contract checks (Tier 1; issue #23). `scripts/prompt-seam-check.sh` (+ `scripts/prompt-seam-whitelist.txt`) pins four seams between the workflow's own prompts: every `/cdd-*` reference resolves to a command file (catches rename fallout like #27/#31), the `gh_issue_NN` branch token is produced (`cdd-next-step.md`) and consumed (`cdd-pre-pr.md`) in agreement, backticked file paths resolve, and each `cdd-*.md` keeps its load-bearing headings. Wired into CI (`template-smoke.yml`), `/cdd-pre-pr`, and the engineering-practices enforced list. CDD-repo-only (not shipped in the template); extends the proven `command-drift-check.sh` pattern. A recurring objective-1 reliability guardrail. Scope decision (deterministic seam checks only; no generalized "prompt lint" framework and no LLM-as-judge evals) recorded in ADR [`0002-scope-prompt-seam-checks-deterministic-only.md`](../architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md); verdict + tier breakdown in the issue #23 comment.
+- [x] Deterministic prompt seam-contract checks (Tier 1; issue #23). `scripts/prompt-seam-check.sh` (+ whitelist) pins four grep-only seams between the workflow's own prompts — the four are enumerated in the script header and `engineering-practices.md`. CDD-repo-only; wired into CI, `/cdd-pre-pr`, and the engineering-practices enforced list. A recurring objective-1 reliability guardrail. Scope decision — deterministic checks only, no generalized "prompt lint" framework and no LLM-as-judge evals — in ADR [`0002-scope-prompt-seam-checks-deterministic-only.md`](../architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md).
 
 **Milestone:** all three founding objectives are named commitments in §1, each with at least one recurring guardrail or a tracked plan to add one.
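Reviewer aside: since this patch drops the seam enumeration from the roadmap line, here is a minimal sketch of what one of the four seams amounts to, namely the `gh_issue_NN` branch token being produced in one prompt and consumed (as `Closes #NN`) in another. It is an illustration under assumptions, not the shipped script. The function name `check_branch_token` and the literal strings it greps for are hypothetical.

```shell
# Hypothetical sketch: both sides of the branch-token seam must still mention it.
# check_branch_token PRODUCER_FILE CONSUMER_FILE
#   PRODUCER_FILE  prompt that creates the gh_issue_NN branch name
#   CONSUMER_FILE  prompt that turns the token into a "Closes #NN" line
check_branch_token() {
  producer=$1
  consumer=$2
  status=0
  # The producer must still emit the token.
  grep -q 'gh_issue_' "$producer" || { echo "producer lost gh_issue_ token: $producer"; status=1; }
  # The consumer must still read the token and still write the closing keyword.
  grep -q 'gh_issue_' "$consumer" || { echo "consumer lost gh_issue_ token: $consumer"; status=1; }
  grep -q 'Closes #' "$consumer" || { echo "consumer lost Closes # line: $consumer"; status=1; }
  return "$status"
}
```

A one-sided edit, such as renaming the token in the producer only, would surface as exactly one failing line here. That is the kind of silent stranding the patch series is guarding against.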