
tests: missing canaries on compiled prompt semantics — silent regression risk to headline phrasings #47

@antnewman

Description

Summary

@logic-md/core has 325 unit tests for behavioural correctness, 29 conformance fixtures for parser/validator alignment, and (proposed in #46) perf assertions for scaling cost. The fourth layer missing from this stack is semantic canaries on compiled prompt output — small, deterministic tests that pin down the executive phrasings the compiler is supposed to produce, so a refactor that subtly changes prompt semantics fails loudly rather than silently.

This is orthogonal to #46 — perf catches "got slower," canaries catch "doesn't say what it's meant to say anymore."

The regression class this catches

The README's central pitch is that LOGIC.md compilation produces prompts containing specific structures: the "Your output IS the deliverable" execution mandate, the ## Required Output contract injection, the Pre-Response Checklist from quality gates, the confidence-requirement language, the retry-context phrasing.

Today, nothing in the test suite asserts any of those phrasings exist. The compiler tests check that compileStep returns a CompiledStep with a string systemPromptSegment of roughly the right size; they don't check that the string contains the phrasings the project's pitch depends on. Conformance fixtures don't go past the parser. Perf tests (when they land) will only assert speed.

If a refactor of compiler.ts accidentally drops formatStrategyPreamble or rewords the output-schema injection, every existing test stays green and the project's central claim is silently broken. That's the canonical canary-shaped failure mode: correct by every existing measure, semantically regressed in the way that matters most.
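A minimal self-contained illustration of that failure mode (the strings and size thresholds below are invented for this sketch, not taken from the compiler):

```typescript
// Two candidate prompt segments: the intended one and a semantically
// regressed rewrite. Both strings are invented for illustration.
const intended =
  "## Required Output Format\nYour output IS the deliverable. Output must be valid JSON.";
const regressed =
  "## Output\nPlease try to produce something close to the schema if you can.";

// Existing-style check: both are "roughly the right size", so both pass.
const sizeOk = (s: string): boolean => s.length > 40 && s.length < 200;
console.log(sizeOk(intended), sizeOk(regressed)); // true true

// Canary-style check: only the intended segment survives.
const canaryOk = (s: string): boolean => s.includes("## Required Output Format");
console.log(canaryOk(intended), canaryOk(regressed)); // true false
```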

Proposed canary set

Eight to twelve small assertion-style tests, each pinning down one headline behaviour:

// __tests__/canaries.test.ts (sketch)
import { describe, expect, test } from "vitest";
import { compileStep } from "../src/compiler"; // adjust to the real export path
// specWithSchema, specWithGates, etc. are small fixture specs defined in the test file.

describe("compiler canaries — headline executive phrasings", () => {
  test("step with output_schema produces a Required Output section", () => {
    const compiled = compileStep(specWithSchema, "audit", emptyContext);
    expect(compiled.systemPromptSegment).toContain("## Required Output Format");
    expect(compiled.systemPromptSegment).toContain("must be valid JSON");
  });

  test("step with confidence config produces explicit threshold language", () => {
    const compiled = compileStep(specWithConfidence, "audit", emptyContext);
    expect(compiled.systemPromptSegment).toMatch(/minimum confidence of [0-9.]+/);
  });

  test("step with quality gates produces a Pre-Response Checklist", () => {
    const compiled = compileStep(specWithGates, "audit", emptyContext);
    expect(compiled.systemPromptSegment).toContain("## Pre-Response Checklist");
    expect(compiled.systemPromptSegment).toMatch(/^- \[ \]/m);
  });

  test("retry attempt > 1 produces Retry Context section with previous failure", () => {
    const compiled = compileStep(spec, "audit", { ...ctx, attemptNumber: 2, previousFailureReason: "low confidence" });
    expect(compiled.systemPromptSegment).toContain("## Retry Context");
    expect(compiled.systemPromptSegment).toContain("low confidence");
  });

  test("reasoning strategy is named in the strategy preamble", () => {
    const compiled = compileStep({ ...spec, reasoning: { strategy: "react" } }, "audit", emptyContext);
    expect(compiled.systemPromptSegment).toMatch(/using react reasoning/i);
  });

  // ...
});

Suggested coverage:

| Canary | Pins down |
| --- | --- |
| Required Output section | Output-contract injection (the "deliverable" claim) |
| Pre-Response Checklist | Quality-gate surfacing |
| Confidence threshold language | Confidence-config rendering |
| Retry context with prior failure | Retry-loop semantics |
| Strategy preamble | Reasoning-config rendering |
| Branch context with alternatives | Branch-routing transparency |
| Token budget warning | Budget-overflow surfacing |
| Output schema as fenced JSON | Schema serialisation contract |
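The last row is the fiddliest to assert, since the expectation spans a fenced block. A hedged sketch of its core check, run against a hand-written prompt segment rather than real compiler output (every string here is an assumption):

```typescript
// Hypothetical core assertion for the "output schema as fenced JSON" canary.
const fence = "`".repeat(3); // keeps a literal triple-backtick out of this sketch
const segment = [
  "## Required Output Format",
  fence + "json",
  '{ "type": "object", "properties": { "verdict": { "type": "string" } } }',
  fence,
].join("\n");

// The canary only needs: a fenced json block containing a schema keyword.
const fencedSchema = new RegExp(
  fence + 'json\\n[\\s\\S]*?"type":\\s*"object"[\\s\\S]*?' + fence
);
console.log(fencedSchema.test(segment)); // true
```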

A second tier worth considering: fixture-level compiled-output snapshots. For each conformance fixture, a compiled.snapshot.json capturing compileWorkflow output. Catches any change a human reviewer might miss in a PR diff. More noise than the targeted canaries, but broader coverage.

How this sits next to other test layers

| Layer | Catches | Misses |
| --- | --- | --- |
| Unit (existing) | Function-level bugs | Semantic drift in compiled output |
| Conformance fixtures (existing) | Parser/validator regressions | Compiler regressions |
| Perf assertions (#46) | Quadratic regressions | Functional changes that stay fast |
| Canaries (this) | Silent semantic regressions | |

Each catches a different class. None overlap meaningfully.

Cost

Cheap. No LLM calls. No external dependencies. Runs in milliseconds. Lives in packages/core/__tests__/ alongside existing vitest tests. Authoring is the only cost — maybe an afternoon for the first 8-12 canaries.

Suggested fix shape

  1. Single PR adding __tests__/canaries.test.ts with 8-12 deterministic assertions on the headline executive phrasings.
  2. Optional follow-up: snapshot-test integration for conformance fixtures (broader but noisier).

I'd suggest authoring tier 1 first, treating tier 2 as a separate decision once tier 1 is in.

Offer

Happy to PR the first tier. Will draft against current compiler output so the assertions match what the compiler does today — not what I think it should do — so the canaries codify present behaviour as the contract going forward. If any of the phrasings pinned down feel wrong as the long-term contract, easy to adjust.

Holding off on a PR until you've had a chance to look at the proposed canary set above.
