From aeeee0a71836671655b1a5c639e55efb91fb3a00 Mon Sep 17 00:00:00 2001
From: Rainier Potgieter
Date: Thu, 7 May 2026 13:14:28 +0200
Subject: [PATCH 1/2] chore(benchmarks): publish 2026-05-07 cross-model results

Two independent runs on Claude Sonnet 4.6 (code-review, n=10) and Llama 3.1 70B via Nvidia NIM (3 tasks, n=10) show no measurable quality lift from LOGIC.md on these tasks at this sample size.

- Sonnet 4.6 ceiling effect: control 99/100, treatment 100/100
- Llama 3.1 70B flat to slightly negative after excluding 7 Nvidia connection-drop runs (cleaned means: code-review 98.3 vs 98.9, research-synthesis 94.6 vs 94.0, security-audit 89.0 vs 83.0)

Per-run results, raw harness output, and honest analysis preserved at benchmarks/published/-/. INDEX.md catalogues runs and documents open methodology questions.

Motivates positioning pivot toward structural consistency / audit / governance, anchored in the 2026-05-06 Archon integration test (87% hash agreement under LOGIC.md vs 70% without).
--- benchmarks/.gitignore | 4 +- .../summary.md | 68 + .../2026-05-07-llama-3.1-70b/analysis.md | 131 ++ .../2026-05-07-llama-3.1-70b/results.json | 1534 +++++++++++++++++ .../2026-05-07-llama-3.1-70b/results.md | 80 + benchmarks/published/INDEX.md | 44 + 6 files changed, 1858 insertions(+), 3 deletions(-) create mode 100644 benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md create mode 100644 benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md create mode 100644 benchmarks/published/2026-05-07-llama-3.1-70b/results.json create mode 100644 benchmarks/published/2026-05-07-llama-3.1-70b/results.md create mode 100644 benchmarks/published/INDEX.md diff --git a/benchmarks/.gitignore b/benchmarks/.gitignore index 2d74893..ad971da 100644 --- a/benchmarks/.gitignore +++ b/benchmarks/.gitignore @@ -3,10 +3,8 @@ node_modules/ package-lock.json yarn.lock -# Results +# Ephemeral run output (committed runs live in published/) results/ -results.json -results.md # Logs *.log diff --git a/benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md b/benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md new file mode 100644 index 0000000..a724b5d --- /dev/null +++ b/benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md @@ -0,0 +1,68 @@ +# Benchmark Run Summary — Claude Sonnet 4.6, code-review only (2026-05-07) + +## Setup + +| Field | Value | +|---|---| +| Date | 2026-05-07 | +| Model | `claude-sonnet-4-6` | +| Provider | Anthropic API | +| Task | code-review (only — single-task validation run) | +| Conditions | control (prose prompt), treatment (LOGIC.md compiled prompt) | +| Runs per condition | 10 | +| Total runs | 20 | +| Cost | ~$0.50 | + +## Why no raw JSON + +This run was followed by a Llama 3.1 70B all-tasks run that overwrote `results/results.json` and `results/results.md` (the harness writes to fixed paths, not dated paths). 
This summary captures the headline numbers from the markdown report before it was overwritten. + +For future runs, we recommend either (a) having the harness write to dated filenames by default, or (b) copying results to `published/` before triggering the next run. Tracked as a benchmark-harness improvement. + +## Results + +``` + Control Treatment Diff +code-review 99 ± 1 100 ± 0 +1.0 + (range 97-100) (range 99-100) +``` + +Per-dimension: + +| Dimension | Control | Treatment | +|---|---|---| +| Structured Compliance | 100% ± 0% | 100% ± 0% | +| Describing vs Doing | 4% ± 2% | 2% ± 2% | +| Pipeline Completion | 100% ± 0% | 100% ± 0% | + +## Interpretation + +**Ceiling effect.** Claude Sonnet 4.6 scores 99/100 on the control prompt for this task. There is no headroom for LOGIC.md to add measurable value: treatment can at best go to 100, which it does. The +1 difference is within sampling noise. + +This would be consistent with the hypothesis that LOGIC.md's value (if it has any effect on raw quality) shows on weaker models. Subsequent Llama 3.1 70B testing (see `2026-05-07-llama-3.1-70b/`) showed flat-to-negative results, contradicting that hypothesis. + +**Net: on Sonnet 4.6 code-review at n=10, LOGIC.md is a no-op on quality.** + +## Reproducibility + +```bash +cd benchmarks +export ANTHROPIC_API_KEY=sk-ant-... +export BENCHMARK_MODEL=claude-sonnet-4-6 +node run.mjs --task=code-review +``` + +## Limitations + +- Single task only (code-review), so this is not a complete cross-condition picture for Sonnet. +- n=10 is small; CIs are wide. +- The code-review sample input (`tasks/inputs/code-review-sample.js`) is short and contains obvious vulnerabilities, making the task easy for any capable model. +- A harder code-review fixture might create headroom for LOGIC.md to differentiate; this was not tested. + +## Honest disclosure + +This single-task run cost ~$0.50.
The next planned step (a 3-task Sonnet sweep) was not executed because the preliminary run showed clear ceiling effects and the marginal value of more Sonnet data was low compared to running a free-tier Llama sweep on all 3 tasks. That decision is reflected in `2026-05-07-llama-3.1-70b/`. + +## Files + +- `summary.md` — this document (raw JSON not preserved due to harness overwrite, see "Why no raw JSON" above) diff --git a/benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md b/benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md new file mode 100644 index 0000000..f4e173e --- /dev/null +++ b/benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md @@ -0,0 +1,131 @@ +# Benchmark Run Analysis — Llama 3.1 70B (2026-05-07) + +## Setup + +| Field | Value | +|---|---| +| Date | 2026-05-07 | +| Model | `meta/llama-3.1-70b-instruct` | +| Provider | Nvidia NIM (free tier) | +| Tasks | code-review, research-synthesis, security-audit | +| Conditions | control (prose prompt), treatment (LOGIC.md compiled prompt) | +| Runs per condition | 10 | +| Total runs | 60 | +| Harness | `benchmarks/run.mjs` (commit at run-time) | +| Scoring | 4-dimensional aggregate: structured-compliance (40%), describing-vs-doing (30%), pipeline-completion (20%), quality-gate-compliance (10%) | +| Cost | $0 (Nvidia free tier) | + +## Raw aggregate scores (as reported by harness) + +``` + Control Treatment Diff +code-review 98 ± 2 89 ± 30 -9 +research-synthesis 66 ± 44 85 ± 29 +19 +security-audit 71 ± 36 83 ± 17 +12 +``` + +## Investigation: variance and zero-score runs + +Score ranges spanning 0-100 in several task/condition groups indicated catastrophic outliers.
A diagnostic pass over the per-run records revealed **7 of 60 runs failed with `"Connection error."` from the Nvidia NIM endpoint**, not from the LLM or LOGIC.md: + +| Task | Condition | Connection-drop runs | +|---|---|---| +| code-review | treatment | 1 | +| research-synthesis | control | 3 | +| research-synthesis | treatment | 1 | +| security-audit | control | 2 | +| (everywhere else) | | 0 | + +These are infrastructure failures, not signal. Each scored 0 (no output to score) and dragged the mean of its group down. + +## Cleaned aggregate scores (excluding fatal connection drops) + +``` + Control Treatment Diff +code-review 98.3 (n=10) 98.9 (n=9) +0.6 +research-synthesis 94.6 (n=7) 94.0 (n=9) -0.6 +security-audit 89.0 (n=8) 83.0 (n=10) -6.0 +``` + +## Honest interpretation + +The original "+19 / +12" lifts were artifacts of unequal connection-drop incidence between conditions: control runs suffered more infrastructure failures than treatment runs, dragging control means down. Once corrected: + +- **code-review**: ceiling effect. Both ~98. LOGIC.md adds nothing measurable on a task this easy for a 70B-parameter model. +- **research-synthesis**: flat. Both ~94. LOGIC.md adds nothing measurable. +- **security-audit**: treatment underperforms control by 6 points (89 → 83). Treatment also showed wider range (50-100 vs control's 89-89). + +**On Llama 3.1 70B at n=10, LOGIC.md does not produce measurable quality lift on these tasks.** + +A separate Sonnet 4.6 single-task run earlier the same day showed the same flat result on code-review (control 99, treatment 100). Two independent flat data points. + +## Anomalies worth investigating before re-running + +**1. The 89-89-89-89 pattern on security-audit control.** Every successful control run scored exactly 89, with range 89-89. This is improbable from a stochastic LLM and points to a scoring-system quirk: every run hits the same fixed-magnitude penalty in the rubric, suggesting the scorer applies binary deductions rather than graduated ones.
The penalty calibration may need review before the security-audit results can be trusted as quality signal. + +**2. Strict JSON-schema enum validation.** Example error from a `research-synthesis` control run: + +``` +"errors": ["/sources/3/type: must be equal to one of the allowed values"] +``` + +The output was otherwise valid (structured-compliance 75%, aggregate 88). The scoring system applies the schema-validation error harshly. Real-world consumers might accept the output as good. The scorer rejects it. This contributes variance that doesn't reflect real quality. + +**3. n=10 with high inherent LLM variance.** Confidence intervals are wide. A real but small effect may be invisible at this sample size. n=30 is the minimum recommended in the benchmark MANIFEST for tighter intervals. + +## What this does and does not say about LOGIC.md + +**Does not say**: LOGIC.md is broken, doesn't work, or has no value. + +**Does say**: on these specific tasks, this specific sample size, and this specific scoring rubric, LOGIC.md does not produce measurable quality lift on Llama 3.1 70B. The benchmark was designed to test the README's quality-lift claim. The claim is not supported by this data. + +## What evidence we DO have for LOGIC.md value + +The Archon integration test (2026-05-06, separate experiment, 60 trials) showed: + +| Metric | Without LOGIC.md | With LOGIC.md | +|---|---|---| +| Verdict-agreement (auth-sql-injection) | 50% (5/10) | 100% (10/10) | +| Structural hash agreement (overall) | 70% | 87% | +| Audit trail | Manual reconstruction | Workflow event JSONL out of the box | +| Modifiability | Prose edit, no validation | Structured rule + CLI contract check | + +That experiment measured **structural consistency** rather than **quality lift**, and produced a clean positive result. The two experiments together suggest LOGIC.md's value is in consistency, audit, and modifiability — not in making individual outputs better. 
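The aggregate scores discussed above can be reproduced from the per-run metrics in `results.json`. A minimal sketch, under two assumptions inferred from the data rather than read from the harness source: `describingVsDoing` is a penalty (0 is best, so it contributes as `100 - value`), and when `qualityGateCompliance` is null (control runs) the remaining weights renormalize to sum to 1:

```javascript
// Sketch of the 4-dimensional aggregate, reverse-engineered from results.json.
// Weights per the Setup table: structured-compliance 40%, describing-vs-doing 30%,
// pipeline-completion 20%, quality-gate-compliance 10%.
// ASSUMPTION: describingVsDoing is a penalty; null dimensions are dropped and
// the remaining weights renormalized. This matches the published runs we checked.
function aggregate({ structuredCompliance, describingVsDoing, pipelineCompletion, qualityGateCompliance }) {
  const parts = [
    [0.4, structuredCompliance],
    [0.3, 100 - describingVsDoing],   // penalty dimension inverted
    [0.2, pipelineCompletion],
    [0.1, qualityGateCompliance],
  ].filter(([, value]) => value !== null && value !== undefined);

  const totalWeight = parts.reduce((sum, [w]) => sum + w, 0);
  const score = parts.reduce((sum, [w, v]) => sum + w * v, 0) / totalWeight;
  return Math.round(score);
}

// Spot-checks against published runs:
console.log(aggregate({ structuredCompliance: 100, describingVsDoing: 6, pipelineCompletion: 100, qualityGateCompliance: null }));  // code-review control run 3 → 98
console.log(aggregate({ structuredCompliance: 100, describingVsDoing: 18, pipelineCompletion: 100, qualityGateCompliance: 100 })); // code-review treatment run 6 → 95
console.log(aggregate({ structuredCompliance: 75, describingVsDoing: 0, pipelineCompletion: 100, qualityGateCompliance: null }));  // security-audit control runs → 89
```

Under these assumptions the formula reproduces every non-error run we spot-checked, including the fixed 89 on security-audit control (75/0/100 with renormalized weights rounds to 89), which supports the binary-deduction reading above.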
+ +## Reproducibility + +To re-run this exact configuration: + +```bash +cd benchmarks +unset BENCHMARK_MODEL # default is meta/llama-3.1-70b-instruct +export NVIDIA_API_KEY=nvapi-... +node run.mjs # runs all 3 tasks at 10 runs/condition +``` + +Caveats: Nvidia NIM free tier introduces non-deterministic connection drops. Paid tier or alternative endpoint recommended for cleaner data. + +To analyse with infra failures excluded: + +```bash +node -e " +const r = require('./results/results.json'); +const groups = {}; +r.results.forEach(x => { + if (x.stopReason === 'error' || x.outputLength === 0) return; + const k = x.task + ':' + x.condition; + groups[k] = groups[k] || []; + groups[k].push(x.aggregateScore); +}); +Object.entries(groups).forEach(([k, scores]) => { + const mean = scores.reduce((a,b)=>a+b,0) / scores.length; + console.log(k.padEnd(40), 'n=' + scores.length, '| mean=' + mean.toFixed(1)); +}); +" +``` + +## Files + +- `results.json` — full per-run output as emitted by `run.mjs` +- `results.md` — auto-generated summary (raw, before cleanup) +- `analysis.md` — this document (honest interpretation post-cleanup) diff --git a/benchmarks/published/2026-05-07-llama-3.1-70b/results.json b/benchmarks/published/2026-05-07-llama-3.1-70b/results.json new file mode 100644 index 0000000..29df33e --- /dev/null +++ b/benchmarks/published/2026-05-07-llama-3.1-70b/results.json @@ -0,0 +1,1534 @@ +{ + "results": [ + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 1, + "timestamp": "2026-05-07T07:23:30.798Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 65197, + "tokens": { + "input": 803, + "output": 439 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1991 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 2, + 
"timestamp": "2026-05-07T07:24:47.591Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 76785, + "tokens": { + "input": 803, + "output": 367 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1619 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 3, + "timestamp": "2026-05-07T07:28:41.132Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 6, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 98, + "executionTime": 233539, + "tokens": { + "input": 803, + "output": 418 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1881 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 4, + "timestamp": "2026-05-07T07:29:40.918Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 18, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 94, + "executionTime": 59775, + "tokens": { + "input": 803, + "output": 393 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1757 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 5, + "timestamp": "2026-05-07T07:29:58.158Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 9, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 97, + "executionTime": 17232, + "tokens": { + "input": 803, + "output": 401 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1693 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 6, + "timestamp": "2026-05-07T07:30:25.643Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 17, + "pipelineCompletion": 100, + 
"qualityGateCompliance": null + }, + "aggregateScore": 94, + "executionTime": 27478, + "tokens": { + "input": 803, + "output": 425 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1861 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 7, + "timestamp": "2026-05-07T07:31:22.558Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 56908, + "tokens": { + "input": 803, + "output": 342 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1421 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 8, + "timestamp": "2026-05-07T07:32:00.496Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 37936, + "tokens": { + "input": 803, + "output": 434 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1854 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 9, + "timestamp": "2026-05-07T07:32:26.046Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 25544, + "tokens": { + "input": 803, + "output": 358 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1543 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 10, + "timestamp": "2026-05-07T07:32:54.833Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 28785, + "tokens": { + "input": 803, + "output": 364 + }, + "stopReason": 
"stop", + "errors": [], + "outputLength": 1590 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 1, + "timestamp": "2026-05-07T07:34:46.237Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 111399, + "tokens": { + "input": 1508, + "output": 390 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1608 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 2, + "timestamp": "2026-05-07T07:36:44.204Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 8, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 98, + "executionTime": 117964, + "tokens": { + "input": 1508, + "output": 483 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 2076 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 3, + "timestamp": "2026-05-07T07:38:32.983Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 108777, + "tokens": { + "input": 1508, + "output": 397 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1663 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 4, + "timestamp": "2026-05-07T07:40:56.814Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 143829, + "tokens": { + "input": 1508, + "output": 397 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1638 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": 
"treatment", + "run": 5, + "timestamp": "2026-05-07T07:41:35.173Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 38355, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." + ], + "outputLength": 0 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 6, + "timestamp": "2026-05-07T07:53:00.168Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 18, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 95, + "executionTime": 684992, + "tokens": { + "input": 1508, + "output": 430 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1816 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 7, + "timestamp": "2026-05-07T07:56:00.560Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 10, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 97, + "executionTime": 180391, + "tokens": { + "input": 1508, + "output": 402 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1749 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 8, + "timestamp": "2026-05-07T07:56:37.353Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 36791, + "tokens": { + "input": 1508, + "output": 434 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1939 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 9, + "timestamp": "2026-05-07T07:57:19.384Z", + "metrics": { + "structuredCompliance": 100, + 
"describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 42028, + "tokens": { + "input": 1508, + "output": 354 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1461 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 10, + "timestamp": "2026-05-07T07:57:48.719Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 29333, + "tokens": { + "input": 1508, + "output": 416 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1812 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 1, + "timestamp": "2026-05-07T08:01:29.302Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 4, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 88, + "executionTime": 220549, + "tokens": { + "input": 777, + "output": 1061 + }, + "stopReason": "stop", + "errors": [ + "/sources/3/type: must be equal to one of the allowed values" + ], + "outputLength": 4540 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 2, + "timestamp": "2026-05-07T08:06:47.444Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 8, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 97, + "executionTime": 318141, + "tokens": { + "input": 777, + "output": 1147 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 4850 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 3, + "timestamp": "2026-05-07T08:10:53.434Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + 
"qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 245985, + "tokens": { + "input": 777, + "output": 1459 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 5988 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 4, + "timestamp": "2026-05-07T08:12:14.691Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 81252, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." + ], + "outputLength": 0 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 5, + "timestamp": "2026-05-07T08:13:51.648Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 96954, + "tokens": { + "input": 777, + "output": 1430 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 6206 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 6, + "timestamp": "2026-05-07T08:17:46.103Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 234443, + "tokens": { + "input": 777, + "output": 1442 + }, + "stopReason": "stop", + "errors": [ + "/sources/8/type: must be equal to one of the allowed values" + ], + "outputLength": 5896 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 7, + "timestamp": "2026-05-07T08:19:07.347Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + 
"aggregateScore": 0, + "executionTime": 81243, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." + ], + "outputLength": 0 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 8, + "timestamp": "2026-05-07T08:23:09.927Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 4, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 88, + "executionTime": 242575, + "tokens": { + "input": 777, + "output": 847 + }, + "stopReason": "stop", + "errors": [ + "/sources/3/type: must be equal to one of the allowed values" + ], + "outputLength": 3690 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 9, + "timestamp": "2026-05-07T08:26:45.109Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 215179, + "tokens": { + "input": 777, + "output": 1395 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 5804 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 10, + "timestamp": "2026-05-07T08:33:07.287Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 382176, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." 
+ ], + "outputLength": 0 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 1, + "timestamp": "2026-05-07T08:34:25.877Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 78583, + "tokens": { + "input": 1552, + "output": 935 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 3966 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 2, + "timestamp": "2026-05-07T08:37:19.329Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 173445, + "tokens": { + "input": 1552, + "output": 1248 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 5434 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 3, + "timestamp": "2026-05-07T08:39:25.421Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 4, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 89, + "executionTime": 126086, + "tokens": { + "input": 1552, + "output": 1013 + }, + "stopReason": "stop", + "errors": [ + "/sources/0/type: must be equal to one of the allowed values" + ], + "outputLength": 4405 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 4, + "timestamp": "2026-05-07T08:42:35.511Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 190082, + "tokens": { + "input": 1552, + "output": 943 + }, + "stopReason": "stop", + "errors": [ + "/sources/0/type: must be equal to one of the allowed 
values" + ], + "outputLength": 4092 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 5, + "timestamp": "2026-05-07T08:47:21.335Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 8, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 88, + "executionTime": 285814, + "tokens": { + "input": 1552, + "output": 929 + }, + "stopReason": "stop", + "errors": [ + "/sources/2/type: must be equal to one of the allowed values" + ], + "outputLength": 3986 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 6, + "timestamp": "2026-05-07T08:49:22.241Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 120901, + "tokens": { + "input": 1552, + "output": 885 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 3796 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 7, + "timestamp": "2026-05-07T08:52:12.362Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 170114, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." 
+ ], + "outputLength": 0 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 8, + "timestamp": "2026-05-07T08:54:25.820Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 133455, + "tokens": { + "input": 1552, + "output": 1032 + }, + "stopReason": "stop", + "errors": [ + "/sources/0/type: must be equal to one of the allowed values" + ], + "outputLength": 4317 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 9, + "timestamp": "2026-05-07T08:57:53.407Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 207585, + "tokens": { + "input": 1552, + "output": 1019 + }, + "stopReason": "stop", + "errors": [ + "/sources/0/type: must be equal to one of the allowed values" + ], + "outputLength": 4297 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 10, + "timestamp": "2026-05-07T09:01:47.020Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 3, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 99, + "executionTime": 233610, + "tokens": { + "input": 1552, + "output": 1226 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 5245 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 1, + "timestamp": "2026-05-07T09:06:21.964Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 274911, + "tokens": { + "input": 1124, + "output": 1268 + }, + "stopReason": "stop", + 
"errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5123 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 2, + "timestamp": "2026-05-07T09:09:12.149Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 170173, + "tokens": { + "input": 1124, + "output": 1151 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4560 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 3, + "timestamp": "2026-05-07T09:11:38.172Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 146015, + "tokens": { + "input": 1124, + "output": 1069 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4271 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 4, + "timestamp": "2026-05-07T09:14:12.250Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 154065, + "tokens": { + "input": 1124, + "output": 1171 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4836 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 5, + "timestamp": "2026-05-07T09:15:58.275Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + 
"aggregateScore": 89, + "executionTime": 106011, + "tokens": { + "input": 1124, + "output": 1088 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4249 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 6, + "timestamp": "2026-05-07T09:18:41.235Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 162936, + "tokens": { + "input": 1124, + "output": 1172 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4666 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 7, + "timestamp": "2026-05-07T09:20:04.655Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 83414, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." 
+ ], + "outputLength": 0 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 8, + "timestamp": "2026-05-07T09:21:50.463Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 105777, + "tokens": { + "input": 1124, + "output": 1337 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5374 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 9, + "timestamp": "2026-05-07T09:28:16.308Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 385839, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." 
+ ], + "outputLength": 0 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 10, + "timestamp": "2026-05-07T09:32:06.439Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 230125, + "tokens": { + "input": 1124, + "output": 1263 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4936 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 1, + "timestamp": "2026-05-07T09:34:59.410Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 172964, + "tokens": { + "input": 2220, + "output": 1285 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5181 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 2, + "timestamp": "2026-05-07T09:38:03.706Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 184277, + "tokens": { + "input": 2220, + "output": 1386 + }, + "stopReason": "stop", + "errors": [ + "/remediation_plan/5/priority: must be <= 5" + ], + "outputLength": 5755 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 3, + "timestamp": "2026-05-07T09:39:58.110Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 114400, + "tokens": { + "input": 2220, + "output": 1397 + 
}, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5476 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 4, + "timestamp": "2026-05-07T09:41:58.893Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 120774, + "tokens": { + "input": 2220, + "output": 1096 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4289 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 5, + "timestamp": "2026-05-07T09:44:08.774Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 129878, + "tokens": { + "input": 2220, + "output": 1436 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5878 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 6, + "timestamp": "2026-05-07T09:45:41.243Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 92464, + "tokens": { + "input": 2220, + "output": 1079 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4240 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 7, + "timestamp": "2026-05-07T09:50:54.192Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + 
"qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 312945, + "tokens": { + "input": 2220, + "output": 1267 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5136 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 8, + "timestamp": "2026-05-07T09:53:38.447Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 164246, + "tokens": { + "input": 2220, + "output": 1097 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 4340 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 9, + "timestamp": "2026-05-07T09:55:26.829Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 0 + }, + "aggregateScore": 50, + "executionTime": 108373, + "tokens": { + "input": 2220, + "output": 1118 + }, + "stopReason": "stop", + "errors": [ + "root: must have required property 'vulnerabilities'" + ], + "outputLength": 4806 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 10, + "timestamp": "2026-05-07T09:58:06.640Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 0 + }, + "aggregateScore": 50, + "executionTime": 159801, + "tokens": { + "input": 2220, + "output": 1099 + }, + "stopReason": "stop", + "errors": [ + "root: must have required property 'vulnerabilities'" + ], + "outputLength": 4798 + } + ], + "stats": { + "code-review:meta/llama-3.1-70b-instruct:control": { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "runs": 10, + "aggregateScore": { + "mean": 98, + 
"stddev": 2, + "min": 94, + "max": 100 + }, + "structuredCompliance": { + "mean": 100, + "stddev": 0 + }, + "describingVsDoing": { + "mean": 5, + "stddev": 7 + }, + "pipelineCompletion": { + "mean": 100, + "stddev": 0 + } + }, + "code-review:meta/llama-3.1-70b-instruct:treatment": { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "runs": 10, + "aggregateScore": { + "mean": 89, + "stddev": 30, + "min": 0, + "max": 100 + }, + "structuredCompliance": { + "mean": 90, + "stddev": 30 + }, + "describingVsDoing": { + "mean": 14, + "stddev": 29 + }, + "pipelineCompletion": { + "mean": 90, + "stddev": 30 + } + }, + "research-synthesis:meta/llama-3.1-70b-instruct:control": { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "runs": 10, + "aggregateScore": { + "mean": 66, + "stddev": 44, + "min": 0, + "max": 100 + }, + "structuredCompliance": { + "mean": 63, + "stddev": 42 + }, + "describingVsDoing": { + "mean": 32, + "stddev": 45 + }, + "pipelineCompletion": { + "mean": 70, + "stddev": 46 + } + }, + "research-synthesis:meta/llama-3.1-70b-instruct:treatment": { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "runs": 10, + "aggregateScore": { + "mean": 85, + "stddev": 29, + "min": 0, + "max": 100 + }, + "structuredCompliance": { + "mean": 78, + "stddev": 28 + }, + "describingVsDoing": { + "mean": 12, + "stddev": 30 + }, + "pipelineCompletion": { + "mean": 90, + "stddev": 30 + } + }, + "security-audit:meta/llama-3.1-70b-instruct:control": { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "runs": 10, + "aggregateScore": { + "mean": 71, + "stddev": 36, + "min": 0, + "max": 89 + }, + "structuredCompliance": { + "mean": 60, + "stddev": 30 + }, + "describingVsDoing": { + "mean": 20, + "stddev": 40 + }, + "pipelineCompletion": { + "mean": 80, + "stddev": 40 + } + }, + 
"security-audit:meta/llama-3.1-70b-instruct:treatment": { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "runs": 10, + "aggregateScore": { + "mean": 83, + "stddev": 17, + "min": 50, + "max": 100 + }, + "structuredCompliance": { + "mean": 63, + "stddev": 32 + }, + "describingVsDoing": { + "mean": 0, + "stddev": 0 + }, + "pipelineCompletion": { + "mean": 100, + "stddev": 0 + } + } + } +} \ No newline at end of file diff --git a/benchmarks/published/2026-05-07-llama-3.1-70b/results.md b/benchmarks/published/2026-05-07-llama-3.1-70b/results.md new file mode 100644 index 0000000..e3d1aab --- /dev/null +++ b/benchmarks/published/2026-05-07-llama-3.1-70b/results.md @@ -0,0 +1,80 @@ +# LOGIC.md Benchmark Results + +Generated: 2026-05-07T09:58:06.657Z + +## Summary Statistics + +### code-review (meta/llama-3.1-70b-instruct) - control + +- Runs: 10 +- Aggregate Score: 98 ± 2 (range: 94-100) +- Structured Compliance: 100% ± 0% +- Describing vs Doing: 5% ± 7% (lower is better) +- Pipeline Completion: 100% ± 0% + +### code-review (meta/llama-3.1-70b-instruct) - treatment + +- Runs: 10 +- Aggregate Score: 89 ± 30 (range: 0-100) +- Structured Compliance: 90% ± 30% +- Describing vs Doing: 14% ± 29% (lower is better) +- Pipeline Completion: 90% ± 30% + +### research-synthesis (meta/llama-3.1-70b-instruct) - control + +- Runs: 10 +- Aggregate Score: 66 ± 44 (range: 0-100) +- Structured Compliance: 63% ± 42% +- Describing vs Doing: 32% ± 45% (lower is better) +- Pipeline Completion: 70% ± 46% + +### research-synthesis (meta/llama-3.1-70b-instruct) - treatment + +- Runs: 10 +- Aggregate Score: 85 ± 29 (range: 0-100) +- Structured Compliance: 78% ± 28% +- Describing vs Doing: 12% ± 30% (lower is better) +- Pipeline Completion: 90% ± 30% + +### security-audit (meta/llama-3.1-70b-instruct) - control + +- Runs: 10 +- Aggregate Score: 71 ± 36 (range: 0-89) +- Structured Compliance: 60% ± 30% +- Describing vs Doing: 20% ± 40% (lower is 
better)
+- Pipeline Completion: 80% ± 40%
+
+### security-audit (meta/llama-3.1-70b-instruct) - treatment
+
+- Runs: 10
+- Aggregate Score: 83 ± 17 (range: 50-100)
+- Structured Compliance: 63% ± 32%
+- Describing vs Doing: 0% ± 0% (lower is better)
+- Pipeline Completion: 100% ± 0%
+
+## Key Findings
+
+### code-review:meta/llama-3.1-70b-instruct
+- Control Aggregate Score: 98
+- Treatment Aggregate Score: 89
+- **Difference: -9 (-9.2%)**
+- Control Describing vs Doing: 5%
+- Treatment Describing vs Doing: 14%
+- **Reduction: -9% points**
+
+### research-synthesis:meta/llama-3.1-70b-instruct
+- Control Aggregate Score: 66
+- Treatment Aggregate Score: 85
+- **Difference: +19 (28.8%)**
+- Control Describing vs Doing: 32%
+- Treatment Describing vs Doing: 12%
+- **Reduction: 20% points**
+
+### security-audit:meta/llama-3.1-70b-instruct
+- Control Aggregate Score: 71
+- Treatment Aggregate Score: 83
+- **Difference: +12 (16.9%)**
+- Control Describing vs Doing: 20%
+- Treatment Describing vs Doing: 0%
+- **Reduction: 20% points**
+
diff --git a/benchmarks/published/INDEX.md b/benchmarks/published/INDEX.md
new file mode 100644
index 0000000..94f1caa
--- /dev/null
+++ b/benchmarks/published/INDEX.md
@@ -0,0 +1,44 @@
+# Published Benchmark Runs
+
+This directory holds committed benchmark runs as evidence artifacts. Each run gets a dated subdirectory with raw output (`results.json`, `results.md`) and an interpretation document (`analysis.md` or `summary.md`).
+
+The ephemeral `benchmarks/results/` directory holds the most recent run-in-progress and is gitignored. To preserve a run as evidence, copy it into a dated subdirectory here and commit.
+
+## Index
+
+| Date | Model | Tasks | n/condition | Result |
+|---|---|---|---|---|
+| 2026-05-07 | `claude-sonnet-4-6` | code-review only | 10 | Control 99 / Treatment 100. Ceiling effect; LOGIC.md is a no-op on quality. 
[Details](./2026-05-07-claude-sonnet-4-6-codereview/summary.md) | +| 2026-05-07 | `meta/llama-3.1-70b-instruct` | code-review, research-synthesis, security-audit | 10 | Flat to slightly negative after excluding 7 Nvidia NIM connection drops. [Details](./2026-05-07-llama-3.1-70b/analysis.md) | + +## Headline finding (2026-05-07) + +Two independent runs on different models show **no measurable quality lift from LOGIC.md** on these tasks at n=10. Sonnet 4.6 is ceiling-bound; Llama 3.1 70B is flat to slightly negative once infrastructure failures are stripped. + +This contradicts the original "describing-vs-doing fix" framing in the README and motivated the positioning pivot toward auditability and structural consistency, anchored instead in the [Archon integration test (2026-05-06)](https://github.com/SingularityAI-Dev/logic-md-archon-eval) which showed clean structural-consistency results (87% hash agreement under LOGIC.md vs 70% without). + +## Open methodology questions (deferred for future runs) + +1. **Scoring system rigidity.** Security-audit control runs all scored exactly 89, suggesting the rubric applies fixed-magnitude penalties rather than graduated ones. The `89-89-89-89` pattern is improbable from a stochastic LLM and likely a scorer artifact. + +2. **Strict JSON-schema enum validation.** Outputs that are valid in spirit but use slightly different enum values get scored down harshly. May not reflect real-world acceptance criteria. + +3. **Sample size.** n=10 is below the MANIFEST recommendation of 30. Confidence intervals are wide. Real-but-small effects may be invisible at this sample size. + +4. **Task difficulty.** The current sample inputs (`tasks/inputs/*-sample.{js,txt}`) are short and may be too easy for capable models, eliminating the headroom LOGIC.md needs to differentiate. Harder fixtures might produce different signal. + +5. **Single-temperature.** Temperature 0.7 is hardcoded. Sensitivity analysis across temperatures has not been done. 
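For reference, the exclusion step behind the "cleaned" numbers cited in the index above is mechanical. A minimal JavaScript sketch, using the `stopReason` and `aggregateScore` fields as they appear in this run's `results.json` (the sample values below are invented for illustration, not taken from the run):

```javascript
// Sketch of the "cleaned means" step: runs that died with stopReason "error"
// (the Nvidia NIM connection drops) score 0 in the raw stats, so the analysis
// drops them before averaging instead of zero-filling them.
const runs = [
  { condition: "control",   stopReason: "stop",  aggregateScore: 89 },
  { condition: "control",   stopReason: "error", aggregateScore: 0 }, // connection drop
  { condition: "treatment", stopReason: "stop",  aggregateScore: 90 },
  { condition: "treatment", stopReason: "stop",  aggregateScore: 100 },
];

function cleanedMean(runs, condition) {
  const kept = runs.filter(r => r.condition === condition && r.stopReason !== "error");
  if (kept.length === 0) return null; // every run in this cell failed
  return kept.reduce((sum, r) => sum + r.aggregateScore, 0) / kept.length;
}

console.log(cleanedMean(runs, "control"));   // 89: the errored run is excluded, not counted as 0
console.log(cleanedMean(runs, "treatment")); // 95
```

Excluding rather than zero-filling matters: with the errored run counted as 0, the control cell above would read 44.5 and the comparison would be dominated by infrastructure noise rather than model behaviour.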
+
+These are tracked but not blocking. The honest current finding stands: **LOGIC.md does not produce measurable quality lift on these tasks at this sample size.**
+
+## Re-running
+
+See each run's analysis for the exact reproduction command. Common pattern:
+
+```bash
+cd benchmarks
+export <PROVIDER>_API_KEY=...
+export BENCHMARK_MODEL=<model>  # optional, default is Llama 3.1 70B
+node run.mjs [--task=<task>]    # omit --task to run all 3
+# Then copy results/ to published/<date>-<model>/ before next run
+```

From 2e9890ebee8dd6a2da914e145e8ef1e2f9d685e1 Mon Sep 17 00:00:00 2001
From: Rainier Potgieter
Date: Thu, 7 May 2026 14:31:43 +0200
Subject: [PATCH 2/2] docs(readme): pivot positioning to audit/governance/consistency

Quality-lift framing was unsupported by the cross-model benchmark sweep run 2026-05-07 (Sonnet 4.6 ceiling effect, Llama 3.1 70B flat-to-negative once infra failures are excluded). Phase 0 Ship Honest doctrine requires updating the README to match what the data supports.

What changed:

- Hero pitch: 'declarative reasoning layer' -> 'audit and governance layer for AI agent reasoning'
- 'The problem' rewritten around audit, modifiability, and consistency. Old 'describing-vs-doing' framing dropped (it was Modular9-specific and not generalisable).
- Case study replaced with the 60-trial Archon integration test that DID show clean signal: 87% vs 70% structural hash agreement, 10/10 vs 5/10 identical tuples on auth-sql-injection. Anchored against the new public eval repo at github.com/SingularityAI-Dev/logic-md-archon-eval.
- 'When to use it' updated to recommend AGAINST LOGIC.md for raw quality on capable models. Added explicit honest disclosure.
- New section 'What LOGIC.md actually delivers' enumerating the three real properties: structure as contract, audit trail as default artifact, modifications as structured diffs.
- Benchmarks section rewritten with honest disclosure of the 2026-05-07 cross-model results and the 2026-05-06 Archon results, framing them as complementary rather than conflicting. - Roadmap 'Near term' updated to reflect that the benchmark suite has run (partially) and to queue actual next experiments. Pitch and adopt LOGIC.md for structure and governance, not quality. The technical features are unchanged. --- README.md | 131 +++++++++++++++++++++++++++++++++++------------------- 1 file changed, 86 insertions(+), 45 deletions(-) diff --git a/README.md b/README.md index 89199bc..0c01dde 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,8 @@ # LOGIC.md -**The declarative reasoning layer for AI agents.** +**The audit and governance layer for AI agent reasoning.** -A portable, framework-agnostic file format for specifying *how* an agent thinks: strategy, step DAGs, contracts, quality gates, and fallback policies: declared in YAML rather than hardcoded in Python. +A portable, framework-agnostic file format for declaring agent reasoning as structured contracts: step DAGs, output schemas, tool permissions, and quality gates. Contracts are validated at compile time, executed deterministically, and produce auditable event traces by default. Where prose prompts give you behaviour, LOGIC.md gives you accountability. 
[![npm](https://img.shields.io/npm/v/@logic-md/core?color=7c6fe0&label=%40logic-md%2Fcore)](https://www.npmjs.com/package/@logic-md/core) [![npm](https://img.shields.io/npm/v/@logic-md/cli?color=2db88a&label=%40logic-md%2Fcli)](https://www.npmjs.com/package/@logic-md/cli) @@ -11,7 +11,7 @@ A portable, framework-agnostic file format for specifying *how* an agent thinks: [![Tests](https://img.shields.io/badge/tests-307%20core%20%2F%2018%20mcp-brightgreen)](packages/core) [![Coverage](https://img.shields.io/badge/coverage-95.9%25%20branch-brightgreen)](packages/core) -Developed alongside and validated through [Modular9](https://github.com/SingleSourceStudios/modular9), a visual node-based agent builder by the same author, where it addressed a common agent pipeline failure mode that ad-hoc prompt engineering could not reliably solve at scale. +Developed alongside and validated in production through [Modular9](https://github.com/SingleSourceStudios/modular9), a visual node-based agent builder by the same author, where the contract-enforcement architecture was first proven at scale. Independently evaluated against [Archon](https://github.com/coleam00/Archon) in a 60-trial controlled experiment ([logic-md-archon-eval](https://github.com/SingularityAI-Dev/logic-md-archon-eval)).

LOGIC.md sits between agent identity (CLAUDE.md), capability (SKILL.md), and protocols (MCP, A2A) as the missing declarative reasoning layer. @@ -21,23 +21,17 @@ Developed alongside and validated through [Modular9](https://github.com/SingleSo ## The problem -Your agent describes what it would do instead of doing it. +Your agent reasoning is locked inside prose prompts. That has three consequences: -``` -"As a Security Auditor, I would perform an OWASP Top 10 review -and map findings to CWE IDs. I would then scan for injection -vulnerabilities..." -``` +**You cannot audit it.** When a regulator, security team, or user asks "why did the agent take that action?", the answer is buried in the model's hidden reasoning trace. Replaying the exact same input rarely produces the exact same output. The audit trail is whatever logs you remembered to add. -The next node in your pipeline receives an intent description, not data. Your workflow becomes a chain of *I would do X* statements that never produce real artifacts. +**You cannot modify it safely.** Updating a multi-step workflow means editing prose. There are no contracts, no types, no `validate()`. A six-word change can quietly break four downstream consumers. The diff doesn't tell you what changed semantically. -

-[Image: "Before and after LOGIC.md: agents that describe intent versus agents that emit structured artifacts that flow to the next step."]
+**You cannot trust the consistency.** Two runs of the same workflow on the same input produce different structured outputs at non-trivial rates. On reasoning tasks where the model has multiple plausible paths, prose prompts under-constrain the output enough that the variance becomes a real problem for downstream consumers. -This is not a prompt engineering problem. It is a missing contracts problem. +This is not a prompt-engineering problem you can fix by writing better prompts. It is a missing-contracts problem. -Every agent framework gives you identity (`CLAUDE.md`), tools (`SKILL.md`), and memory. None of them give you a portable, framework-agnostic file format for declaring reasoning contracts: step dependencies, quality gates, and multi-agent handoffs: that travels with your code and survives framework changes. +Every agent framework gives you identity (`CLAUDE.md`), tools (`SKILL.md`), and memory. None of them give you a portable, framework-agnostic file format for declaring reasoning contracts that travel with your code, validate at compile time, execute deterministically, and produce structured audit trails. **LOGIC.md fills that gap.** @@ -65,11 +59,9 @@ steps: action: retry ``` -When a node has output contracts, the runtime injects: +When a node has output contracts, the runtime compiles a structured prompt segment with a `## Required Output` section listing every field the agent must produce, with type and description, and emits the resulting compiled prompt + schema as part of the workflow event trace. -> *You MUST produce a concrete artifact. Your output IS the deliverable.* - -Agents stop describing. They start doing. +The contract is enforceable, diffable, and auditable. The compiled prompt is deterministic given the spec. --- @@ -99,47 +91,75 @@ Agents stop describing. They start doing. 
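To make the compiled `## Required Output` idea concrete, here is a minimal JavaScript sketch of contract-to-prompt compilation. It is illustrative only: the function and field names are invented for this sketch, not the actual `@logic-md/core` compiler API.

```javascript
// Hypothetical sketch: a pure function from a contract's output fields to a
// markdown prompt segment. Deterministic by construction, so the compiled
// segment can be diffed and snapshot-tested alongside the spec.
function compileRequiredOutput(outputs) {
  const lines = ["## Required Output", ""];
  for (const [name, spec] of Object.entries(outputs)) {
    lines.push(`- \`${name}\` (${spec.type}): ${spec.description}`);
  }
  return lines.join("\n");
}

// Invented contract fields for illustration.
const contract = {
  verdict: { type: "string", description: "APPROVE or REQUEST_CHANGES" },
  critical_count: { type: "integer", description: "number of critical findings" },
};

console.log(compileRequiredOutput(contract));
```

Because the function is pure, the same contract object always yields the same prompt segment, which is what makes the compiled prompt auditable as an artifact rather than a transient string.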
--- -## Case study: the describing-vs-doing fix +## Case study: structural consistency under LOGIC.md + +This was validated in May 2026 against [Archon](https://github.com/coleam00/Archon), an open-source AI workflow engine, in a 60-trial controlled experiment. Full repo: [logic-md-archon-eval](https://github.com/SingularityAI-Dev/logic-md-archon-eval). -This was validated during the Modular9 integration. +**Setup.** Identical PR-review task. Two configurations on the same Archon node: +- **Case A**: stock prose prompt with Zod output schema. No LOGIC.md. +- **Case B**: same node, reasoning step delegated to LOGIC.md's MCP server (`@logic-md/mcp`) via Archon's existing `mcp:` field. The LOGIC.md spec defines 4 reasoning steps, output contracts, and a deterministic precedence rule for verdict aggregation. -Before LOGIC.md, running a Modular9 workflow produced output like *"As a Security Auditor, I would perform an OWASP Top 10 review..."* from every node: intent descriptions instead of artifacts. The next node in the chain received a summary of what the previous node would have done, not the thing itself. Pipelines never produced real deliverables end-to-end. +Same model, same fixtures, same harness. Zero patches to Archon. We measure verdict-agreement rate (do the same inputs produce the same verdicts across 10 runs?) and structural hash agreement (`sha256({verdict, critical_count, high_count})`). -The root cause was two compounding gaps. Plugin identity prompts said *"You are a Security Auditor specialist"* but never said *"produce the actual artifact"*: role-descriptive, not action-directive. And the user prompt framing was vague enough that the LLM defaulted to a conversational summary instead of structured output. 
+**Results across 3 fixtures × 2 cases × 10 runs (60 trials):** -Three changes, all enabled by LOGIC.md contracts, solved it permanently: +| | Case A (prose) | Case B (LOGIC.md) | +|---|---|---| +| Verdict agreement (auth-sql-injection) | 5/10 different tuples | **10/10 identical** | +| Structural hash agreement (overall) | 70% | **87%** | +| Audit trail | Manual reconstruction | Workflow event JSONL out of the box | +| Modifiability (add HIGH-blocks-PR rule) | 8-line prose edit, no validation | 9-line structured rule + `validate()` | +| Runtime overhead | 1× | 2.6× | -1. **Execution mandate**: every node's system prompt gains: *"You are a node in an automated pipeline. Your output IS the artifact."* +**The headline.** On the auth-sql-injection fixture, Case A produced 5 different `(verdict, critical_count, high_count)` tuples across 10 identical runs. The model's choice between e.g. "1 critical / 2 high" and "2 critical / 1 high" was essentially a coin flip. Case B produced 1 unique tuple — 10/10 identical structured output. -2. **Output contract injection**: when a node has `contracts.outputs`, the user prompt gains a structured `## Required Output` section listing every field the agent must produce with type and description. +LOGIC.md does not change the verdict (both cases reach `REQUEST_CHANGES`). It eliminates the structural variance in the supporting metadata that downstream consumers depend on. -3. **Input framing**: previous node output is labeled `## Input Data` with contract field descriptions, so the agent knows it is transforming structured data, not answering a question. +**Audit and modifiability are properties of the runtime, not separately measured.** Every step's compiled prompt, output schema, and quality-gate result is emitted as a structured workflow event. Adding a new rule (e.g. "any HIGH-severity issue blocks the PR") is a 9-line YAML change with `@logic-md/cli validate` as the contract check. 
Without LOGIC.md, the equivalent is an 8-line prose edit with no enforcement and no signal that downstream consumers might break. -The result: each node now produces actual deliverables. Node A writes the audit report. Node B receives it as data and produces the threat model. Node C receives that and produces the Slack summary. The pipeline produces real artifacts end-to-end. +**The 2.6× runtime cost is real.** LOGIC.md is not free. The trade is consistency, auditability, and safe modifiability against execution speed. For regulated domains, multi-agent pipelines, or any workflow where agent decisions need to be defensible, the trade is worth making. For one-shot prototypes or fast classification gates, it is not. -The underlying techniques: execution mandates, output contract injection, structured input framing: are established patterns in prompt engineering. What made them hard to apply was doing it consistently across every node in a pipeline without framework-specific glue code. LOGIC.md provides a declarative, portable way to apply them systematically: **the fix travels with the spec rather than being buried in imperative code**. +[Full report with methodology, traces, and per-fixture analysis](https://github.com/SingularityAI-Dev/logic-md-archon-eval/blob/main/REPORT.md). --- ## When to use it: and when not to **Use LOGIC.md when:** -- You have multi-step agent pipelines where one agent's output feeds another -- You need reproducible, auditable reasoning that survives model swaps -- You're hitting the describing-vs-doing failure mode -- You need per-step tool permissions, confidence thresholds, or structured fallback -- You want reasoning configuration that is portable across LangGraph, CrewAI, AutoGen, or your own runtime +- You need agent decisions to be auditable. Regulated domains, security teams, internal compliance. +- You have multi-step pipelines where one agent's output feeds another and you need the contract to be enforceable. 
+- You need consistent verdicts across runs. The same input should produce the same structured output, not a different paraphrase each time. +- Your workflow needs to be safely modifiable by people who didn't author it. Structured contracts catch what prose review misses. +- You need per-step tool permissions, confidence thresholds, or governed fallback policies. +- You want reasoning configuration that is portable across LangGraph, CrewAI, AutoGen, or your own runtime. **You probably don't need LOGIC.md when:** -- Your agent is a single LLM call with no downstream consumers -- You're prototyping and don't yet know the shape of your reasoning steps -- Your workflow is fully covered by a DSPy signature or a single LangGraph node -- You have no quality gates, no contracts between stages, and no multi-agent handoffs +- Your agent is a single LLM call with no downstream consumers. +- You're prototyping and don't yet know the shape of your reasoning steps. +- Your workflow is fully covered by a DSPy signature or a single LangGraph node. +- You're optimising for raw output quality on a frontier model. LOGIC.md does not provide measurable quality lift on capable models; structured outputs add overhead. +- You have no quality gates, no contracts between stages, and no multi-agent handoffs. + +**Honest disclosure on quality lift.** Cross-model benchmarks (Claude Sonnet 4.6 and Llama 3.1 70B at n=10 per condition) do not show measurable quality lift from LOGIC.md on these tasks. The value is structural — consistency, auditability, modifiability — not generative. Pitch and adopt accordingly. [Benchmark methodology and raw results](benchmarks/published/INDEX.md). LOGIC.md is a reasoning *contract* format. If you don't have anything to contract between, you don't need it yet. --- +## What LOGIC.md actually delivers + +Three properties that fall out of the runtime, not the model: + +**1. 
Structure as a contract.** A LOGIC.md spec compiles into a deterministic execution plan: a DAG with named steps, typed input/output schemas, and quality gates. The compiler is pure: same spec, same plan. The CLI's `validate()` catches contract violations before any LLM is called. Refactors are diffable as structure, not prose. + +**2. Audit trail as a default artifact.** Every compiled step's prompt, schema, and gate evaluation is emitted as a structured workflow event (JSONL). The trail is a property of the runtime, not an instrumentation library you remembered to add. When someone asks "what did the agent do?", you have a structured record, not a screen scrape. + +**3. Modifications as structured diffs.** Changing a workflow is a YAML edit, validated against the spec. Adding a new rule, tool restriction, or quality gate produces a reviewable diff with type-checked semantics. Adding the equivalent to a prose prompt produces an English edit with no enforcement. + +These properties matter most where agent decisions are consequential, replayable, and questioned: regulated industries, security review, compliance workflows, internal governance. They matter least where prompts are throwaway and outputs are not audited. + +--- + ## How this differs from DSPy DSPy is the most prominent project in the field that approaches declarative reasoning, so it's the right comparison to start with. @@ -400,15 +420,35 @@ For building implementations in other languages, see [`docs/IMPLEMENTER-GUIDE.md --- -## Benchmarks +## Benchmarks and honest disclosure + +LOGIC.md's value proposition is structural — consistency, auditability, modifiability — not generative quality lift. The benchmarks below substantiate that distinction. + +**Quality lift: no measurable signal on these tasks (2026-05-07).** Cross-model sweep at n=10 per condition: + +| Model | Tasks | Result | +|---|---|---| +| Claude Sonnet 4.6 | code-review | Control 99/100, treatment 100/100. Ceiling effect. 
+| Llama 3.1 70B (Nvidia NIM) | code-review, research-synthesis, security-audit | Flat to slightly negative after excluding 7 infrastructure-failure runs. |
+
+Raw results, per-trial JSON, and analysis: [`benchmarks/published/`](benchmarks/published/INDEX.md).
+
+The honest reading: on capable models with reasonable prose prompts on these tasks, LOGIC.md does not produce measurable quality lift. Treat any positioning that claims otherwise with scepticism, including older versions of this README. **Use LOGIC.md for structure and governance, not for quality.**
 
-LOGIC.md's thesis: that declarative contracts + quality gates measurably improve multi-step reasoning reliability: is internally validated (Modular9 integration) but not yet benchmarked across model families.
+Open methodology questions are catalogued in `benchmarks/published/INDEX.md` (rigid scoring rubric, strict enum validation, n=10 too small, fixtures may be too easy). Re-runs at higher n on a paid tier are queued.
+
+**Structural consistency: clean positive signal (2026-05-06).** A separate 60-trial integration test against [Archon](https://github.com/coleam00/Archon) measured verdict-agreement and structural-hash agreement under stock prose vs LOGIC.md compiled prompts on the same node:
+
+| | Case A (prose) | Case B (LOGIC.md) |
+|---|---|---|
+| Hash agreement (overall) | 70% | **87%** |
+| auth-sql-injection fixture | 5/10 different `(verdict, critical, high)` tuples | **10/10 identical** |
 
-**Preliminary (Llama 3.1 70B, April 2026):** single-model artifact-rate comparison between freeform prompting and LOGIC.md-compiled prompts. Deltas were within variance. Conclusion: inconclusive on weaker open-weight models; a meaningful signal likely requires frontier models where instruction-following is tight enough to expose the contract effect. See [`benchmarks/`](benchmarks/) for the harness and raw runs.
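The structural-hash agreement metric in that table can be sketched as follows. This is a minimal illustration of the metric, not the eval repo's actual code; the field names follow the `(verdict, critical, high)` tuple reported above, and the sample data is hypothetical:

```python
import hashlib
import json
from collections import Counter

def structural_hash(output: dict, fields=("verdict", "critical", "high")) -> str:
    # Hash only the structural fields of a run's output; prose is ignored,
    # so two runs agree iff they emit the same (verdict, critical, high) tuple.
    canonical = json.dumps({f: output.get(f) for f in fields}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def hash_agreement(runs: list) -> float:
    # Fraction of runs whose hash matches the most common hash across runs.
    hashes = [structural_hash(r) for r in runs]
    modal_count = Counter(hashes).most_common(1)[0][1]
    return modal_count / len(hashes)
```

At 10 runs per fixture, the 10/10-identical case above corresponds to an agreement of 1.0, and 7 matching runs out of 10 to 0.7.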
+Full repo with methodology, raw traces, and report: [logic-md-archon-eval](https://github.com/SingularityAI-Dev/logic-md-archon-eval).
 
-**Next:** cross-model sweep on Claude Sonnet and GPT-4o class models, same harness, measuring (a) artifact-rate: does the step produce the declared output shape, (b) handoff fidelity: does the next step receive usable data, and (c) latency-adjusted reliability under a fixed retry budget.
+**Why the two findings don't conflict.** Quality lift measures whether outputs are *better*. Structural consistency measures whether outputs are *the same across runs*. LOGIC.md doesn't appear to make individual outputs better — it makes the distribution across runs tighter, which is what audit, replay, and downstream-pipeline contracts actually need.
 
-If you have API credits and want to co-run the benchmark, [open an issue](https://github.com/SingularityAI-Dev/logic-md/issues): the harness is reproducible and raw outputs will be published regardless of outcome.
+**Reproducing.** The cross-model harness is at [`benchmarks/`](benchmarks/) and runs against any Anthropic, OpenAI, or Nvidia NIM endpoint. The Archon eval harness is in the [eval repo](https://github.com/SingularityAI-Dev/logic-md-archon-eval). All raw outputs are published. If you have credits and want to co-run at higher n or on additional models, [open an issue](https://github.com/SingularityAI-Dev/logic-md/issues).
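One way to quantify the "n=10 too small" concern is a permutation test on per-run scores. The sketch below uses hypothetical score lists shaped like the published means, not the actual published per-run data:

```python
import random
from statistics import mean

def permutation_test(control, treatment, iters=10_000, seed=0):
    # Two-sided permutation test: how often does a random relabelling of the
    # pooled scores produce a mean gap at least as large as the observed one?
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n = len(control)
    hits = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        if abs(mean(pooled[n:]) - mean(pooled[:n])) >= observed:
            hits += 1
    return hits / iters

# Hypothetical n=10 score lists with a small gap between conditions.
control = [99, 98, 100, 97, 99, 98, 100, 99, 98, 95]
treatment = [100, 99, 98, 100, 99, 97, 100, 98, 99, 99]
p_value = permutation_test(control, treatment)
```

A large p-value here does not prove equivalence; it only says n=10 cannot resolve deltas of this size, which is why re-runs at higher n are queued.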
 ---
@@ -421,10 +461,11 @@ If you have API credits and want to co-run the benchmark, [open an issue](https:
 - 325 tests · 95.9% branch coverage on compiler · 18 conformance fixtures · canonical JSON Schema
 
 **Near term**
-- Cross-model benchmark suite on frontier models (Claude Sonnet, GPT-4o)
+- Cross-model benchmark expansion: re-run at n=30 on a paid tier (eliminate Nvidia connection-drop noise), add Haiku 4.5 within-vendor comparison, add DeepSeek V3 / Qwen 2.5 as current open-model data points
+- Benchmark scoring system audit: investigate the rigid penalty thresholds surfaced in 2026-05-07 results
 - VSCode marketplace publish
 - Python SDK feature parity: compiler + dry-run executor matching TypeScript
-- LangGraph adapter Phase 2: branch support, quality-gate enforcement, parallel execution: timeline TBD, pending frontier-model benchmark results
+- LangGraph adapter Phase 2: branch support, quality-gate enforcement, parallel execution
 - Documentation site at logic-md.org
 
 **Medium term**