Summary
@logic-md/core has 325 unit tests for behavioural correctness, 29 conformance fixtures for parser/validator alignment, and (proposed in #46) perf assertions for scaling cost. The fourth layer missing from this stack is semantic canaries on compiled prompt output — small, deterministic tests that pin down the executive phrasings the compiler is supposed to produce, so a refactor that subtly changes prompt semantics fails loudly rather than silently.
This is orthogonal to #46 — perf catches "got slower," canaries catch "doesn't say what it's meant to say anymore."
The regression class this catches
The README's central pitch is that LOGIC.md compilation produces prompts containing specific structures: the "Your output IS the deliverable" execution mandate, the ## Required Output contract injection, the Pre-Response Checklist from quality gates, the confidence-requirement language, the retry-context phrasing.
Today, nothing in the test suite asserts any of those phrasings exist. The compiler tests check that compileStep returns a CompiledStep with a string systemPromptSegment of roughly the right size; they don't check that the string contains the phrasings the project's pitch depends on. Conformance fixtures don't go past the parser. Perf tests (when they land) will only assert speed.
If a refactor of compiler.ts accidentally drops formatStrategyPreamble or rewords the output-schema injection, every existing test stays green and the project's central claim is silently broken. That's the canonical canary-shaped failure mode: correct by every existing measure, semantically regressed in the way that matters most.
Proposed canary set
Eight to twelve small assertion-style tests, each pinning down one headline behaviour:
// __tests__/canaries.test.ts (sketch)
describe("compiler canaries — headline executive phrasings", () => {
test("step with output_schema produces a Required Output section", () => {
const compiled = compileStep(specWithSchema, "audit", emptyContext);
expect(compiled.systemPromptSegment).toContain("## Required Output Format");
expect(compiled.systemPromptSegment).toContain("must be valid JSON");
});
test("step with confidence config produces explicit threshold language", () => {
const compiled = compileStep(specWithConfidence, "audit", emptyContext);
expect(compiled.systemPromptSegment).toMatch(/minimum confidence of [0-9.]+/);
});
test("step with quality gates produces a Pre-Response Checklist", () => {
const compiled = compileStep(specWithGates, "audit", emptyContext);
expect(compiled.systemPromptSegment).toContain("## Pre-Response Checklist");
expect(compiled.systemPromptSegment).toMatch(/^- \[ \]/m);
});
test("retry attempt > 1 produces Retry Context section with previous failure", () => {
const compiled = compileStep(spec, "audit", { ...ctx, attemptNumber: 2, previousFailureReason: "low confidence" });
expect(compiled.systemPromptSegment).toContain("## Retry Context");
expect(compiled.systemPromptSegment).toContain("low confidence");
});
test("reasoning strategy is named in the strategy preamble", () => {
const compiled = compileStep({ ...spec, reasoning: { strategy: "react" } }, "audit", emptyContext);
expect(compiled.systemPromptSegment).toMatch(/using react reasoning/i);
});
// ...
});
Suggested coverage:
| Canary |
Pins down |
| Required Output section |
Output-contract injection (the "deliverable" claim) |
| Pre-Response Checklist |
Quality-gate surfacing |
| Confidence threshold language |
Confidence-config rendering |
| Retry context with prior failure |
Retry-loop semantics |
| Strategy preamble |
Reasoning-config rendering |
| Branch context with alternatives |
Branch-routing transparency |
| Token budget warning |
Budget-overflow surfacing |
| Output schema as fenced JSON |
Schema serialisation contract |
A second tier worth considering: fixture-level compiled-output snapshots. For each conformance fixture, a compiled.snapshot.json capturing compileWorkflow output. Catches any change a human reviewer might miss in a PR diff. More noise than the targeted canaries, but broader coverage.
How this sits next to other test layers
| Layer |
Catches |
Misses |
| Unit (existing) |
Function-level bugs |
Semantic drift in compiled output |
| Conformance fixtures (existing) |
Parser/validator regressions |
Compiler regressions |
| Perf assertions (#46) |
Quadratic regressions |
Functional changes that stay fast |
| Canaries (this) |
Silent semantic regressions |
— |
Each catches a different class. None overlap meaningfully.
Cost
Cheap. No LLM calls. No external dependencies. Runs in milliseconds. Lives in packages/core/__tests__/ alongside existing vitest tests. Authoring is the only cost — maybe an afternoon for the first 8-12 canaries.
Suggested fix shape
- Single PR adding
__tests__/canaries.test.ts with 8-12 deterministic assertions on the headline executive phrasings.
- Optional follow-up: snapshot-test integration for conformance fixtures (broader but noisier).
I'd suggest authoring tier 1 first, treating tier 2 as a separate decision once tier 1 is in.
Offer
Happy to PR the first tier. Will draft against current compiler output so the assertions match what the compiler does today — not what I think it should do — so the canaries codify present behaviour as the contract going forward. If any of the phrasings pinned down feel wrong as the long-term contract, easy to adjust.
Holding off on a PR until you've had a chance to look at the proposed canary set above.
Summary
@logic-md/corehas 325 unit tests for behavioural correctness, 29 conformance fixtures for parser/validator alignment, and (proposed in #46) perf assertions for scaling cost. The fourth layer missing from this stack is semantic canaries on compiled prompt output — small, deterministic tests that pin down the executive phrasings the compiler is supposed to produce, so a refactor that subtly changes prompt semantics fails loudly rather than silently.This is orthogonal to #46 — perf catches "got slower," canaries catch "doesn't say what it's meant to say anymore."
The regression class this catches
The README's central pitch is that LOGIC.md compilation produces prompts containing specific structures: the "Your output IS the deliverable" execution mandate, the
## Required Outputcontract injection, thePre-Response Checklistfrom quality gates, the confidence-requirement language, the retry-context phrasing.Today, nothing in the test suite asserts any of those phrasings exist. The compiler tests check that
compileStepreturns aCompiledStepwith a stringsystemPromptSegmentof roughly the right size; they don't check that the string contains the phrasings the project's pitch depends on. Conformance fixtures don't go past the parser. Perf tests (when they land) will only assert speed.If a refactor of
compiler.tsaccidentally dropsformatStrategyPreambleor rewords the output-schema injection, every existing test stays green and the project's central claim is silently broken. That's the canonical canary-shaped failure mode: correct by every existing measure, semantically regressed in the way that matters most.Proposed canary set
Eight to twelve small assertion-style tests, each pinning down one headline behaviour:
Suggested coverage:
A second tier worth considering: fixture-level compiled-output snapshots. For each conformance fixture, a
compiled.snapshot.jsoncapturingcompileWorkflowoutput. Catches any change a human reviewer might miss in a PR diff. More noise than the targeted canaries, but broader coverage.How this sits next to other test layers
Each catches a different class. None overlap meaningfully.
Cost
Cheap. No LLM calls. No external dependencies. Runs in milliseconds. Lives in
packages/core/__tests__/alongside existing vitest tests. Authoring is the only cost — maybe an afternoon for the first 8-12 canaries.Suggested fix shape
__tests__/canaries.test.tswith 8-12 deterministic assertions on the headline executive phrasings.I'd suggest authoring tier 1 first, treating tier 2 as a separate decision once tier 1 is in.
Offer
Happy to PR the first tier. Will draft against current compiler output so the assertions match what the compiler does today — not what I think it should do — so the canaries codify present behaviour as the contract going forward. If any of the phrasings pinned down feel wrong as the long-term contract, easy to adjust.
Holding off on a PR until you've had a chance to look at the proposed canary set above.