
tests: missing canaries on compiled prompt semantics — silent regression risk to headline phrasings #47

@antnewman

Description

Summary

@logic-md/core has 325 unit tests for behavioural correctness, 29 conformance fixtures for parser/validator alignment, and (proposed in #46) perf assertions for scaling cost. The fourth layer missing from this stack is semantic canaries on compiled prompt output — small, deterministic tests that pin down the executive phrasings the compiler is supposed to produce, so a refactor that subtly changes prompt semantics fails loudly rather than silently.

This is orthogonal to #46 — perf catches "got slower," canaries catch "doesn't say what it's meant to say anymore."

The regression class this catches

The README's central pitch is that LOGIC.md compilation produces prompts containing specific structures: the "Your output IS the deliverable" execution mandate, the ## Required Output contract injection, the Pre-Response Checklist from quality gates, the confidence-requirement language, the retry-context phrasing.

Today, nothing in the test suite asserts any of those phrasings exist. The compiler tests check that compileStep returns a CompiledStep with a string systemPromptSegment of roughly the right size; they don't check that the string contains the phrasings the project's pitch depends on. Conformance fixtures don't go past the parser. Perf tests (when they land) will only assert speed.

If a refactor of compiler.ts accidentally drops formatStrategyPreamble or rewords the output-schema injection, every existing test stays green and the project's central claim is silently broken. That's the canonical canary-shaped failure mode: correct by every existing measure, semantically regressed in the way that matters most.
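A minimal self-contained illustration of that failure mode (the strings and size thresholds below are invented for this sketch, not taken from the compiler):

```typescript
// Two candidate prompt segments: the intended one and a semantically
// regressed rewrite. Both strings are invented for illustration.
const intended =
  "## Required Output Format\nYour output IS the deliverable. Output must be valid JSON.";
const regressed =
  "## Output\nPlease try to produce something close to the schema if you can.";

// Existing-style check: both are "roughly the right size", so both pass.
const sizeOk = (s: string): boolean => s.length > 40 && s.length < 200;
console.log(sizeOk(intended), sizeOk(regressed)); // true true

// Canary-style check: only the intended segment survives.
const canaryOk = (s: string): boolean => s.includes("## Required Output Format");
console.log(canaryOk(intended), canaryOk(regressed)); // true false
```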

Proposed canary set

Eight to twelve small assertion-style tests, each pinning down one headline behaviour:

// __tests__/canaries.test.ts (sketch)
import { describe, expect, test } from "vitest";
import { compileStep } from "../src/compiler"; // adjust to the real export path
// specWithSchema, specWithGates, etc. are small fixture specs defined in the test file.

describe("compiler canaries — headline executive phrasings", () => {
  test("step with output_schema produces a Required Output section", () => {
    const compiled = compileStep(specWithSchema, "audit", emptyContext);
    expect(compiled.systemPromptSegment).toContain("## Required Output Format");
    expect(compiled.systemPromptSegment).toContain("must be valid JSON");
  });

  test("step with confidence config produces explicit threshold language", () => {
    const compiled = compileStep(specWithConfidence, "audit", emptyContext);
    expect(compiled.systemPromptSegment).toMatch(/minimum confidence of [0-9.]+/);
  });

  test("step with quality gates produces a Pre-Response Checklist", () => {
    const compiled = compileStep(specWithGates, "audit", emptyContext);
    expect(compiled.systemPromptSegment).toContain("## Pre-Response Checklist");
    expect(compiled.systemPromptSegment).toMatch(/^- \[ \]/m);
  });

  test("retry attempt > 1 produces Retry Context section with previous failure", () => {
    const compiled = compileStep(spec, "audit", { ...ctx, attemptNumber: 2, previousFailureReason: "low confidence" });
    expect(compiled.systemPromptSegment).toContain("## Retry Context");
    expect(compiled.systemPromptSegment).toContain("low confidence");
  });

  test("reasoning strategy is named in the strategy preamble", () => {
    const compiled = compileStep({ ...spec, reasoning: { strategy: "react" } }, "audit", emptyContext);
    expect(compiled.systemPromptSegment).toMatch(/using react reasoning/i);
  });

  // ...
});

Suggested coverage:

| Canary | Pins down |
| --- | --- |
| Required Output section | Output-contract injection (the "deliverable" claim) |
| Pre-Response Checklist | Quality-gate surfacing |
| Confidence threshold language | Confidence-config rendering |
| Retry context with prior failure | Retry-loop semantics |
| Strategy preamble | Reasoning-config rendering |
| Branch context with alternatives | Branch-routing transparency |
| Token budget warning | Budget-overflow surfacing |
| Output schema as fenced JSON | Schema serialisation contract |
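The last row is the fiddliest to assert, since the expectation spans a fenced block. A hedged sketch of its core check, run against a hand-written prompt segment rather than real compiler output (every string here is an assumption):

```typescript
// Hypothetical core assertion for the "output schema as fenced JSON" canary.
const fence = "`".repeat(3); // keeps a literal triple-backtick out of this sketch
const segment = [
  "## Required Output Format",
  fence + "json",
  '{ "type": "object", "properties": { "verdict": { "type": "string" } } }',
  fence,
].join("\n");

// The canary only needs: a fenced json block containing a schema keyword.
const fencedSchema = new RegExp(
  fence + 'json\\n[\\s\\S]*?"type":\\s*"object"[\\s\\S]*?' + fence
);
console.log(fencedSchema.test(segment)); // true
```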

A second tier worth considering: fixture-level compiled-output snapshots. For each conformance fixture, a compiled.snapshot.json capturing compileWorkflow output. Catches any change a human reviewer might miss in a PR diff. More noise than the targeted canaries, but broader coverage.

How this sits next to other test layers

| Layer | Catches | Misses |
| --- | --- | --- |
| Unit (existing) | Function-level bugs | Semantic drift in compiled output |
| Conformance fixtures (existing) | Parser/validator regressions | Compiler regressions |
| Perf assertions (#46) | Quadratic regressions | Functional changes that stay fast |
| Canaries (this) | Silent semantic regressions | |

Each catches a different class. None overlap meaningfully.

Cost

Cheap. No LLM calls. No external dependencies. Runs in milliseconds. Lives in packages/core/__tests__/ alongside existing vitest tests. Authoring is the only cost — maybe an afternoon for the first 8-12 canaries.

Suggested fix shape

  1. Single PR adding __tests__/canaries.test.ts with 8-12 deterministic assertions on the headline executive phrasings.
  2. Optional follow-up: snapshot-test integration for conformance fixtures (broader but noisier).

I'd suggest authoring tier 1 first, treating tier 2 as a separate decision once tier 1 is in.

Offer

Happy to PR the first tier. Will draft against current compiler output so the assertions match what the compiler does today — not what I think it should do — so the canaries codify present behaviour as the contract going forward. If any of the phrasings pinned down feel wrong as the long-term contract, easy to adjust.

Holding off on a PR until you've had a chance to look at the proposed canary set above.
