14 changes: 14 additions & 0 deletions .claude/commands/cdd-pre-pr.md
@@ -105,6 +105,20 @@ The script renders the template through `bootstrap-cdd-project.sh --stage` with
If the script exits 0, report "no drift" and continue. If it reports divergence, present each diff to the user; for each, the user decides whether to reconcile the repo copy, reconcile the template copy, or record a justified exception (a whitelist entry or a `cdd-only` fence). Apply fixes only on user approval. Do not auto-edit either tree from this step.

When presenting the step 8 checklist, append a `- [ ] Command-set drift clean` line to it.

## Prompt-seam checks (CDD repo only)

Also specific to the CDD repo: deterministic seam-contract checks over the repo's own prompts (the slash-commands and the docs around them), guarding against a one-sided edit silently stranding a downstream prompt-driven step.

```bash
./scripts/prompt-seam-check.sh
```

It verifies four seams with grep only (no LLM, no API key): every `/cdd-*` reference across the repo's markdown resolves to an existing command file (known non-commands are whitelisted in `scripts/prompt-seam-whitelist.txt`); the `gh_issue_NN` branch token produced in `cdd-next-step.md` is still consumed (turned into a `Closes #NN` line) in `cdd-pre-pr.md`; backticked file paths in the command files, `CLAUDE.md`, and `README.md` resolve to real files; and each `cdd-*.md` still carries its load-bearing headings. CI runs it on every PR via `template-smoke.yml`.
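
The first of those seams can be pictured roughly as follows. This is a minimal sketch, not the contents of `scripts/prompt-seam-check.sh`: the function name `unresolved_cdd_refs`, its three arguments, and the exact regular expression are all assumptions made for illustration, and the real script may well differ.

```bash
#!/usr/bin/env bash
# Sketch only (hypothetical helper, not the real prompt-seam-check.sh).
# Prints every /cdd-* token found in markdown under a root directory that
# neither resolves to <commands_dir>/<name>.md nor appears in the whitelist.
unresolved_cdd_refs() {
  local root commands_dir whitelist ref name
  root="$1"; commands_dir="$2"; whitelist="$3"
  # -r recurse, -h omit file names, -o print only the matched token
  grep -rhoE --include='*.md' '/cdd-[a-z0-9-]+' "$root" | sort -u |
  while IFS= read -r ref; do
    name="${ref#/}"                        # "/cdd-pre-pr" -> "cdd-pre-pr"
    [ -f "$commands_dir/$name.md" ] && continue
    # Whitelist is assumed to hold one exact token per line.
    if [ -f "$whitelist" ] && grep -qxF -- "$ref" "$whitelist"; then
      continue
    fi
    printf '%s\n' "$ref"
  done
}
```

Under these assumptions, a call such as `unresolved_cdd_refs . .claude/commands scripts/prompt-seam-whitelist.txt` prints nothing when every reference resolves, so an empty result corresponds to a clean seam.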

If the script exits 0, report "prompt seams clean" and continue. If it reports a broken seam, present each one to the user; for each, the user decides whether to fix the reference/heading/path or record a justified exception (a whitelist entry). Apply fixes only on user approval.

When presenting the step 8 checklist, append a `- [ ] Prompt seams clean` line to it.
<!-- cdd-only-end -->
## 8. Summary

3 changes: 3 additions & 0 deletions .github/workflows/template-smoke.yml
@@ -25,6 +25,9 @@ jobs:
- name: Command-set drift check (repo commands vs rendered template)
run: ./scripts/command-drift-check.sh

- name: Prompt-seam check (command refs, branch token, paths, sections)
run: ./scripts/prompt-seam-check.sh

- name: Worktree-helper install smoke (against a throwaway HOME)
run: ./scripts/install-smoke-assert.sh

8 changes: 6 additions & 2 deletions CLAUDE.md
@@ -46,6 +46,10 @@ bash -n demo/setup.sh demo/teardown.sh demo/lib.sh
# Command-set drift: repo .claude/commands/ vs the rendered template.
./scripts/command-drift-check.sh

# Prompt-seam contracts: /cdd-* refs resolve, gh_issue_NN token agrees across
# producer/consumer, backticked paths resolve, commands keep load-bearing headings.
./scripts/prompt-seam-check.sh

# Worktree-helper install: run `cdd-worktree.sh install` against a throwaway HOME.
./scripts/install-smoke-assert.sh

@@ -60,7 +64,7 @@ rm -rf /tmp/cdd-demo-smoke
demo/setup.sh mdr_demo_99 --base /tmp/cdd-demo-smoke --local-only
```

The `template-smoke` GitHub Actions workflow runs the same checks on every PR: shellcheck, the command-set drift check, the worktree-helper install smoke, the end-to-end smoke, and the demo seed-overlay step.
The `template-smoke` GitHub Actions workflow runs the same checks on every PR: shellcheck, the command-set drift check, the prompt-seam check, the worktree-helper install smoke, the end-to-end smoke, and the demo seed-overlay step.

When `/cdd-pre-pr` runs in this repo, the "build / format / lint / test" gates collapse into the checks above plus a doc reconciliation pass.

@@ -79,7 +83,7 @@ When `/cdd-pre-pr` runs in this repo, the "build / format / lint / test" gates c
| `demo/` | Demo / dogfooding subsystem (third artifact) |
| `demo/seed/` | Filled-in "Markdown Renderer" project content (not template) |
| `demo/{setup,teardown}.sh` | Create/teardown demo & dogfood instances; `lib.sh` shared |
| `scripts/` | Template smoke assertions + install smoke + command-set drift check (with whitelists) |
| `scripts/` | Template smoke assertions + install smoke + command-set drift check + prompt-seam check (with whitelists) |
| `.github/workflows/` | CI: `template-smoke.yml` runs the bootstrap end-to-end |
| `.claude/commands/` | This repo's own slash commands |
| `tools/` | Bootstrap script + the canonical shared worktree helper (`cdd-worktree.sh`, self-installing) |
@@ -0,0 +1,40 @@
# 0002: Scope prompt "CI" to deterministic seam checks; defer LLM-as-judge evals

**Status:** Accepted

## Context

Issue #23 asked how to "CI/test" CDD's prompts: on a normal codebase, code is unit-tested, linted, and formatted to keep it reliable; a prompt can't be unit-tested, so what guards against a prompt edit silently breaking the rest of the workflow? (The motivating example: editing `cdd-next-step.md` could change the handoff file's shape and strand the downstream `cdd-worktree` step or the implementation session.)

Investigation split the worry into two failure classes that need different treatment:

1. **Seam / contract drift — *deterministic.*** The workflow's steps hand artifacts to each other (a producer step and a consumer step must agree on the artifact's shape). This is testable with grep/diff — no LLM required. The repo had already discovered the pattern without naming it: `command-drift-check.sh` and `template-smoke-assert.sh` already pin contracts at some seams.
2. **Behavioral correctness — *non-deterministic.*** "Does the prompt actually make Claude do the right thing?" This is eval / LLM-as-judge territory (promptfoo et al.).

The investigation proposed three tiers (full reasoning in the issue #23 comment):

- **Tier 1** — extend the deterministic seam-pinning to the seams not yet covered (cheap, high value, no API keys, no flakiness).
- **Tier 2** — generalize Tier 1 into a standalone "prompt lint" framework (a packaging decision).
- **Tier 3** — behavioral LLM-as-judge evals (expensive; deferred).

Tier 1 shipped as `scripts/prompt-seam-check.sh` (PR #37). This ADR records the decision on Tiers 2 and 3, which were initially carried on the roadmap as deferred items and are now removed from it — a roadmap lists work intended to happen next, and neither tier is that: Tier 2 is premature, and Tier 3 is deferred on cost and ROI.

## Decision

**Scope CDD's prompt "CI" to deterministic seam-contract checks (Tier 1). Do not pursue Tier 2 or Tier 3 as planned work; remove both from the roadmap.**

- **Tier 2 (generalize into a "prompt lint" framework) — not planned.** It is low-difficulty (the checks are grep-based; generalizing means lifting the hard-coded seams into a config/manifest-driven form), but premature: there is exactly one consumer today (this repo). Abstracting for a single consumer guesses at the wrong shape. If a second consumer appears, or if maintaining Tier 1 reveals real friction, revisit it then — but it is not roadmap work now.
- **Tier 3 (behavioral LLM-as-judge evals) — deferred on cost, effort, and ROI; *not* rejected on principle.** The realistic form is per-prompt / per-seam behavioral checks (e.g. promptfoo): feed a prompt a fixture, stub the human approval, and have an LLM judge whether the produced artifact — a handoff, a triage, a plan — is correct. Substituting a canned approval for the human is exactly what an eval *does* and is legitimate; it does not make the eval test "a different workflow," because the unit under test is the prompt's behavior between checkpoints, not the checkpoint itself. (An earlier draft of this ADR argued Tier 3 was inapplicable because a headless eval "strips the human checkpoints that define CDD." That objection only holds for a naive *full end-to-end* autonomous run, where the human is part of what makes the final output correct — but no one would build that, so it is not the case against Tier 3. The argument was overstated and is corrected here.) The real reasons to defer are practical:
- **Cost** — manageable, not prohibitive: a `paths:`-filtered CI job can run the evals only when the command/doc files actually change, so unchanged prompts cost nothing.
- **Effort** — the non-trivial work isn't wiring the tool, it's writing fixtures and *calibrating the judge* so its verdicts agree with human judgment.
- **Flakiness** — LLM outputs and judges are non-deterministic, so the signal is noisy and needs threshold-tuning, unlike the crisp pass/fail of the grep checks.
- **Maintenance + ROI** — every prompt change must keep the fixtures in sync, and for a single-maintainer repo the deterministic seam checks plus the human `demo/` dogfood already cover the high-frequency failure mode (seam drift). The marginal behavioral bug an eval would catch is rarer, so the ROI isn't there yet.

The behavioral safety net meanwhile stays the `demo/` subsystem run as a periodic **human-driven** dogfood — a *process* practice, not a CI job. Revisit Tier 3 if behavioral regressions start slipping through that the deterministic checks and the dogfood miss.

## Consequences

- The reliability story for the workflow's own prompts is, and stays, deterministic: `command-drift-check.sh` + `prompt-seam-check.sh` + the smoke asserts, all grep/diff, no API keys, no flakiness. That is the whole intended surface — there is no pending "behavioral eval" work implied by its absence.
- The roadmap no longer carries Tier 2 / Tier 3 as open items, so it doesn't present a premature approach (Tier 2) or a cost-deferred one (Tier 3) as pending work. The rationale for the absence lives here instead.
- Behavioral confidence is owned by the `demo/` dogfood as a human-driven practice, not by CI. If that practice needs to become a tracked cadence, it belongs with the demo/dogfooding subsystem, not with prompt-CI tooling.
- Both deferred tiers remain available as future options gated on concrete triggers — Tier 2 on a second consumer or Tier 1 maintenance friction, Tier 3 on behavioral regressions slipping past the deterministic checks and the dogfood. This ADR would be revisited rather than silently reversed.
1 change: 1 addition & 0 deletions doc/architecture/index.md
@@ -9,3 +9,4 @@ How this repo is structured. This index is a pointer list — the content lives
- [The demo layer](demo.md) — the third artifact: filled-in seed + create/teardown automation
- `adr/` — architecture decision records (`adr/0000-template.md` for the format)
- [`0001-name-and-guard-founding-objectives.md`](adr/0001-name-and-guard-founding-objectives.md) — naming and guarding CDD's two under-guarded founding objectives (engineering practices, self-improvement)
- [`0002-scope-prompt-seam-checks-deterministic-only.md`](adr/0002-scope-prompt-seam-checks-deterministic-only.md) — scoping prompt "CI" to deterministic seam checks; why LLM-as-judge evals (and a generalized prompt-lint framework) are not planned work
4 changes: 3 additions & 1 deletion doc/architecture/overview.md
@@ -29,7 +29,7 @@ Changes flow process-first, template-second. A PR that touches the process doc b
│ ├── architecture/ # how this repo is structured
│ ├── features/ # what this repo provides
│ └── knowledge_base/ # process doc, roadmap, engineering practices, decisions
├── scripts/ # template smoke assertions + command-set drift check (with whitelists)
├── scripts/ # template smoke assertions + command-set drift check + prompt-seam check (with whitelists)
├── template/ # copy-paste material for new projects
└── tools/
├── bootstrap-cdd-project.sh # non-interactive bootstrap for new projects
@@ -42,6 +42,8 @@ The process doc references the template by example (it describes what a CLAUDE.m

The CDD repo's own `.claude/commands/` and `template/.claude/commands/` are conceptually the same files, with the repo's own copy free to drift if it needs CDD-specific behaviour. Unintended drift is a defect, and is checked mechanically: `scripts/command-drift-check.sh` (run by CI and `/cdd-pre-pr`) renders the template via the bootstrap script's stage mode with this repo's own identifiers and diffs the result against `.claude/commands/`, so substitution differences cancel out and only real divergence surfaces. Justified exceptions are either whole one-sided files listed in `scripts/command-drift-whitelist.txt` or CDD-meta sections of shared files fenced between `<!-- cdd-only-begin -->` / `<!-- cdd-only-end -->` markers in the repo copy. The script also rejects `cdd-only` markers appearing in the template itself, where they would be stripped from both sides of the diff and hide real drift. Three commands are deliberately one-sided: `/cdd-retrofit` (`.claude/commands/cdd-retrofit.md`), `/cdd-bootstrap` (`.claude/commands/cdd-bootstrap.md`), and `/cdd-quick-create` (`.claude/commands/cdd-quick-create.md`) live only in the CDD repo — `/cdd-retrofit` installs CDD into an existing project or upgrades one already on CDD, `/cdd-bootstrap` scaffolds a new greenfield one, and `/cdd-quick-create` produces a lightweight one-off deliverable — all operating *on* a target from a CDD-repo session, so the template ships no copy of any of them. `/cdd-retrofit` and `/cdd-bootstrap` share the bootstrap pipeline; `/cdd-quick-create` uses neither it nor `template/`, because a one-off has no template. See [Bootstrap & retrofit](bootstrap-and-retrofit.md) for the shared pipeline.
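
The fence-stripping half of that comparison can be sketched as below. This is an illustration under assumptions, not the logic of `scripts/command-drift-check.sh`: the helper names `strip_cdd_only` and `command_file_drift` are hypothetical, the rendering step through the bootstrap script is left out, and the sketch strips both sides only because the paragraph above says the markers are stripped from both sides of the diff.

```bash
#!/usr/bin/env bash
# Sketch only (hypothetical helpers, not the real command-drift-check.sh).

# Delete every line from a cdd-only-begin marker through the matching
# cdd-only-end marker, inclusive.
strip_cdd_only() {
  sed '/<!-- cdd-only-begin -->/,/<!-- cdd-only-end -->/d' "$1"
}

# Compare a repo command file ($1) with its already-rendered template
# counterpart ($2) after removing fenced sections.
# Returns 0 when the two agree, non-zero (with diff output) when they drift.
command_file_drift() {
  local a b status
  a="$(mktemp)"
  b="$(mktemp)"
  strip_cdd_only "$1" > "$a"
  strip_cdd_only "$2" > "$b"
  diff "$a" "$b"
  status=$?
  rm -f "$a" "$b"
  return "$status"
}
```

Because the fenced block is removed before the comparison, a CDD-only section in the repo copy produces no diff output, while any change outside the fence still surfaces.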

A sibling guard of the same family, `scripts/prompt-seam-check.sh` (also run by CI and `/cdd-pre-pr`), pins the seam contracts *between* the repo's own prompts with grep only — no LLM, no API key. It verifies that every `/cdd-*` reference across the repo's markdown resolves to a command file, that the `gh_issue_NN` branch token produced in `cdd-next-step.md` is still consumed (as a `Closes #NN` line) in `cdd-pre-pr.md`, that backticked repo-relative file paths resolve, and that each `cdd-*.md` keeps its load-bearing headings. Justified non-resolving tokens (shell helpers, marker paths, retired names, downstream-only paths) live in `scripts/prompt-seam-whitelist.txt`. Like the drift check it is CDD-repo-only and not shipped in the template.
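
The producer/consumer agreement on the branch token might look something like the following. It is a sketch under assumptions: the function name `branch_token_seam_ok` is hypothetical, and the literal patterns `gh_issue_` and `Closes #` are guesses at what a grep-only check could match, not the patterns the real script uses.

```bash
#!/usr/bin/env bash
# Sketch only (hypothetical helper, not the real prompt-seam-check.sh).
# Succeeds when the producer file ($1) still mentions the branch token and
# the consumer file ($2) still mentions both the token and the Closes line.
branch_token_seam_ok() {
  grep -q 'gh_issue_' "$1" &&
    grep -q 'gh_issue_' "$2" &&
    grep -q 'Closes #' "$2"
}
```

A one-sided edit, for instance renaming the token in the producer only or dropping the `Closes #NN` instruction from the consumer, makes the function return non-zero, which is the kind of stranded seam the paragraph above describes.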

## Open structural questions

- Whether per-project-type variants live as parallel template directories, as a single template with a variant flag, or as post-bootstrap transformation scripts. Deferred until there is enough usage to compare across project types.
3 changes: 2 additions & 1 deletion doc/knowledge_base/engineering-practices.md
@@ -17,14 +17,15 @@ There is no unit-test suite; behaviour is exercised by integration-style smoke a

- `bash -n` over all shell scripts (syntax).
- `./scripts/command-drift-check.sh` — repo `.claude/commands/` vs the rendered template, plus the handoff-schema and worktree-helper assertions.
- `./scripts/prompt-seam-check.sh` — deterministic seam contracts between the repo's own prompts: `/cdd-*` references resolve to a command file, the `gh_issue_NN` branch token is produced and consumed in agreement, backticked file paths resolve, and each command keeps its load-bearing headings.
- End-to-end bootstrap smoke: `tools/bootstrap-cdd-project.sh` into a tmpdir + `scripts/template-smoke-assert.sh` (clean, link-valid tree).
- Demo seed-overlay smoke: `demo/setup.sh … --local-only`.

New behaviour in a script or the bootstrap path ships with the relevant smoke or assertion extended to cover it.
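
As an illustration of what extending an assertion can mean for the load-bearing-headings seam listed above, here is a minimal sketch. The helper name `missing_headings` and the idea of passing required headings as arguments are assumptions; where the real script keeps its list of headings is not stated here.

```bash
#!/usr/bin/env bash
# Sketch only (hypothetical helper, not the real prompt-seam-check.sh).
# $1 is a command file; every further argument is a heading line that must
# appear in it verbatim. Prints each heading that is missing.
missing_headings() {
  local file heading
  file="$1"
  shift
  for heading in "$@"; do
    # -x whole-line match, -F fixed string (no regex interpretation)
    grep -qxF -- "$heading" "$file" || printf '%s\n' "$heading"
  done
}
```

For example, `missing_headings .claude/commands/cdd-pre-pr.md '## 8. Summary'` prints nothing while that heading exists; covering a new load-bearing heading would then be a matter of adding one more argument.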

## Continuous integration — Enforced

`.github/workflows/template-smoke.yml` runs shellcheck, the command-drift check, the end-to-end bootstrap smoke, and the demo seed-overlay step on every PR.
`.github/workflows/template-smoke.yml` runs shellcheck, the command-drift check, the prompt-seam check, the end-to-end bootstrap smoke, and the demo seed-overlay step on every PR.

## Lint & format — Enforced (lint); Expected (format)

1 change: 1 addition & 0 deletions doc/knowledge_base/roadmap.md
@@ -147,6 +147,7 @@ Elevate the two under-guarded founding objectives — instilling engineering bes
- [ ] Objective-3 standing channel: a recurring mechanism that routes a discovered improvement into the roadmap/conventions (not a reintroduced standing log). — §6 known gap; design deferred.
- [ ] Reinforce objective 2 at bootstrap: a required bootstrap-phase task and/or checklist, once the `/cdd-pre-pr` mechanism is proven.
- [ ] Objective-1 mechanizations: codify when `/cdd-merge-base` is recommended/auto-triggered; consider a mechanical gate-honored check.
- [x] Deterministic prompt seam-contract checks (Tier 1; issue #23). `scripts/prompt-seam-check.sh` (+ whitelist) pins four grep-only seams between the workflow's own prompts — the four are enumerated in the script header and `engineering-practices.md`. CDD-repo-only; wired into CI, `/cdd-pre-pr`, and the engineering-practices enforced list. A recurring objective-1 reliability guardrail. Scope decision — deterministic checks only, no generalized "prompt lint" framework and no LLM-as-judge evals — in ADR [`0002-scope-prompt-seam-checks-deterministic-only.md`](../architecture/adr/0002-scope-prompt-seam-checks-deterministic-only.md).

**Milestone:** all three founding objectives are named commitments in §1, each with at least one recurring guardrail or a tracked plan to add one.
