From aeeee0a71836671655b1a5c639e55efb91fb3a00 Mon Sep 17 00:00:00 2001 From: Rainier Potgieter Date: Thu, 7 May 2026 13:14:28 +0200 Subject: [PATCH] chore(benchmarks): publish 2026-05-07 cross-model results Two independent runs on Claude Sonnet 4.6 (code-review, n=10) and Llama 3.1 70B via Nvidia NIM (3 tasks, n=10) show no measurable quality lift from LOGIC.md on these tasks at this sample size. - Sonnet 4.6 ceiling effect: control 99/100, treatment 100/100 - Llama 3.1 70B flat to slightly negative after excluding 7 Nvidia connection-drop runs (cleaned means: code-review 98.3 vs 98.9, research-synthesis 94.6 vs 94.0, security-audit 89.0 vs 83.0) Per-run results, raw harness output, and honest analysis preserved at benchmarks/published/-/. INDEX.md catalogues runs and documents open methodology questions. Motivates positioning pivot toward structural consistency / audit / governance, anchored in the 2026-05-06 Archon integration test (87% hash agreement under LOGIC.md vs 70% without). --- benchmarks/.gitignore | 4 +- .../summary.md | 68 + .../2026-05-07-llama-3.1-70b/analysis.md | 131 ++ .../2026-05-07-llama-3.1-70b/results.json | 1534 +++++++++++++++++ .../2026-05-07-llama-3.1-70b/results.md | 80 + benchmarks/published/INDEX.md | 44 + 6 files changed, 1858 insertions(+), 3 deletions(-) create mode 100644 benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md create mode 100644 benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md create mode 100644 benchmarks/published/2026-05-07-llama-3.1-70b/results.json create mode 100644 benchmarks/published/2026-05-07-llama-3.1-70b/results.md create mode 100644 benchmarks/published/INDEX.md diff --git a/benchmarks/.gitignore b/benchmarks/.gitignore index 2d74893..ad971da 100644 --- a/benchmarks/.gitignore +++ b/benchmarks/.gitignore @@ -3,10 +3,8 @@ node_modules/ package-lock.json yarn.lock -# Results +# Ephemeral run output (committed runs live in published/) results/ -results.json -results.md # Logs 
*.log diff --git a/benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md b/benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md new file mode 100644 index 0000000..a724b5d --- /dev/null +++ b/benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md @@ -0,0 +1,68 @@ +# Benchmark Run Summary — Claude Sonnet 4.6, code-review only (2026-05-07) + +## Setup + +| Field | Value | +|---|---| +| Date | 2026-05-07 | +| Model | `claude-sonnet-4-6` | +| Provider | Anthropic API | +| Task | code-review (only — single-task validation run) | +| Conditions | control (prose prompt), treatment (LOGIC.md compiled prompt) | +| Runs per condition | 10 | +| Total runs | 20 | +| Cost | ~$0.50 | + +## Why no raw JSON + +This run was followed by a Llama 3.1 70B all-tasks run that overwrote `results/results.json` and `results/results.md` (the harness writes to fixed paths, not dated paths). This summary captures the headline numbers from the markdown report before overwrite. + +For future runs, recommend either: (a) the harness writes to dated filenames by default, or (b) operators copy results to `published/` before triggering the next run. Tracked as a benchmark-harness improvement. + +## Results + +``` + Control Treatment Diff +code-review 99 ± 1 100 ± 0 +1.0 + (range 97-100) (range 99-100) +``` + +Per-dimension: + +| Dimension | Control | Treatment | +|---|---|---| +| Structured Compliance | 100% ± 0% | 100% ± 0% | +| Describing vs Doing | 4% ± 2% | 2% ± 2% | +| Pipeline Completion | 100% ± 0% | 100% ± 0% | + +## Interpretation + +**Ceiling effect.** Claude Sonnet 4.6 scores 99/100 on the control prompt for this task. There is no headroom for LOGIC.md to add measurable value: treatment can at best go to 100, which it does. The +1 difference is within sampling noise. + +This is consistent with the hypothesis that LOGIC.md's value (if any on raw quality) shows on weaker models. 
Subsequent Llama 3.1 70B testing (see `2026-05-07-llama-3.1-70b/`) showed flat-to-negative results, contradicting that hypothesis. + +**Net: on Sonnet 4.6 code-review at n=10, LOGIC.md is a no-op on quality.** + +## Reproducibility + +```bash +cd benchmarks +export ANTHROPIC_API_KEY=sk-ant-... +export BENCHMARK_MODEL=claude-sonnet-4-6 +node run.mjs --task=code-review +``` + +## Limitations + +- Single task only (code-review), so this is not a complete cross-task picture for Sonnet. +- n=10 is small; CIs are wide. +- The code-review sample input (`tasks/inputs/code-review-sample.js`) is short and contains obvious vulnerabilities, making the task easy for any capable model. +- A harder code-review fixture might create headroom for LOGIC.md to differentiate; this was not tested. + +## Honest disclosure + +This single-task run cost ~$0.50. The next planned step (a 3-task Sonnet sweep) was not executed because the preliminary run showed clear ceiling effects and the marginal value of more Sonnet data was low compared to running a free-tier Llama sweep on all 3 tasks. That decision is reflected in `2026-05-07-llama-3.1-70b/`. 
+ +## Files + +- `summary.md` — this document (raw JSON not preserved due to harness overwrite, see "Why no raw JSON" above) diff --git a/benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md b/benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md new file mode 100644 index 0000000..f4e173e --- /dev/null +++ b/benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md @@ -0,0 +1,131 @@ +# Benchmark Run Analysis — Llama 3.1 70B (2026-05-07) + +## Setup + +| Field | Value | +|---|---| +| Date | 2026-05-07 | +| Model | `meta/llama-3.1-70b-instruct` | +| Provider | Nvidia NIM (free tier) | +| Tasks | code-review, research-synthesis, security-audit | +| Conditions | control (prose prompt), treatment (LOGIC.md compiled prompt) | +| Runs per condition | 10 | +| Total runs | 60 | +| Harness | `benchmarks/run.mjs` (commit at run-time) | +| Scoring | 4-dimensional aggregate: structured-compliance (40%), describing-vs-doing (30%), pipeline-completion (20%), quality-gate-compliance (10%) | +| Cost | $0 (Nvidia free tier) | + +## Raw aggregate scores (as reported by harness) + +``` + Control Treatment Diff +code-review 98 ± 2 89 ± 30 -9 +research-synthesis 66 ± 44 85 ± 29 +19 +security-audit 71 ± 36 83 ± 17 +12 +``` + +## Investigation: variance and zero-score runs + +Minimum scores of 0 in four of the six task/condition groups indicated catastrophic outliers. Inspecting the per-run records revealed that **7 of 60 runs failed with `"Connection error."` from the Nvidia NIM endpoint**, a failure of the endpoint rather than of the LLM or LOGIC.md: + +| Task | Condition | Connection-drop runs | +|---|---|---| +| code-review | treatment | 1 | +| research-synthesis | control | 3 | +| research-synthesis | treatment | 1 | +| security-audit | control | 2 | +| (everywhere else) | | 0 | + +These are infrastructure failures, not signal. Each scored 0 (no output to score) and dragged the mean of its group down. 
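The size of that drag can be checked by hand for one group. The following is a minimal sketch, not part of the harness: it uses the ten research-synthesis control aggregate scores from `results.json`, typed in as a literal for illustration, and drops failed runs by `stopReason` in the same spirit as the cleanup filter used for this analysis.

```javascript
// Aggregate scores for research-synthesis / control, copied from results.json.
// Runs 4, 7 and 10 ended with "Connection error." and were scored 0.
const runs = [
  { score: 88, stopReason: 'stop' },
  { score: 97, stopReason: 'stop' },
  { score: 100, stopReason: 'stop' },
  { score: 0, stopReason: 'error' },
  { score: 100, stopReason: 'stop' },
  { score: 89, stopReason: 'stop' },
  { score: 0, stopReason: 'error' },
  { score: 88, stopReason: 'stop' },
  { score: 100, stopReason: 'stop' },
  { score: 0, stopReason: 'error' },
];

// Arithmetic mean of a non-empty list of numbers.
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Raw mean counts every run; cleaned mean skips infrastructure failures.
const rawMean = mean(runs.map((r) => r.score));
const cleanedMean = mean(
  runs.filter((r) => r.stopReason !== 'error').map((r) => r.score)
);

console.log('raw:', rawMean.toFixed(1), 'cleaned:', cleanedMean.toFixed(1));
// prints "raw: 66.2 cleaned: 94.6"
```

Three zeros out of ten runs are enough to move this group from about 95 to about 66, which is the gap between the raw and cleaned figures reported for research-synthesis control.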
+ +## Cleaned aggregate scores (excluding fatal connection drops) + +``` + Control Treatment Diff +code-review 98.3 (n=10) 98.9 (n=9) +0.6 +research-synthesis 94.6 (n=7) 94.0 (n=9) -0.6 +security-audit 89.0 (n=8) 83.0 (n=10) -6.0 +``` + +## Honest interpretation + +The original "+19 / +12" lifts were artifacts of unequal connection-drop incidence between conditions: control runs failed infrastructurally more often than treatment runs, dragging control means down. Once corrected: + +- **code-review**: ceiling effect. Both ~98. LOGIC.md adds nothing measurable on a task this easy for a 70B-parameter model. +- **research-synthesis**: flat. Both ~94. LOGIC.md adds nothing measurable. +- **security-audit**: treatment underperforms control by 6 points (89 → 83). Treatment also showed wider range (50-100 vs control's 89-89). + +**On Llama 3.1 70B at n=10, LOGIC.md does not produce measurable quality lift on these tasks.** + +A separate Sonnet 4.6 single-task run earlier the same day showed the same flat result on code-review (control 99, treatment 100). Two independent flat data points. + +## Anomalies worth investigating before re-running + +**1. The 89-89-89-89 pattern on security-audit control.** Every successful control run scored exactly 89, with range 89-89. This is improbable from a stochastic LLM and points to a scoring-system quirk: every run hits the same fixed-magnitude penalty in the rubric, suggesting the scorer applies binary deductions rather than graduated ones. The penalty calibration may need review before the security-audit results can be trusted as quality signal. + +**2. Strict JSON-schema enum validation.** Example error from a `research-synthesis` control run: + +``` +"errors": ["/sources/3/type: must be equal to one of the allowed values"] +``` + +The output was otherwise valid (structured-compliance 75%, aggregate 88). The scoring system applies the schema-validation error harshly. Real-world consumers might accept the output as good. 
The scorer rejects it. This contributes variance that doesn't reflect real quality. + +**3. n=10 with high inherent LLM variance.** Confidence intervals are wide. A real but small effect may be invisible at this sample size. n=30 is the minimum recommended in the benchmark MANIFEST for tighter intervals. + +## What this does and does not say about LOGIC.md + +**Does not say**: LOGIC.md is broken, doesn't work, or has no value. + +**Does say**: on these specific tasks, this specific sample size, and this specific scoring rubric, LOGIC.md does not produce measurable quality lift on Llama 3.1 70B. The benchmark was designed to test the README's quality-lift claim. The claim is not supported by this data. + +## What evidence we DO have for LOGIC.md value + +The Archon integration test (2026-05-06, separate experiment, 60 trials) showed: + +| Metric | Without LOGIC.md | With LOGIC.md | +|---|---|---| +| Verdict-agreement (auth-sql-injection) | 50% (5/10) | 100% (10/10) | +| Structural hash agreement (overall) | 70% | 87% | +| Audit trail | Manual reconstruction | Workflow event JSONL out of the box | +| Modifiability | Prose edit, no validation | Structured rule + CLI contract check | + +That experiment measured **structural consistency** rather than **quality lift**, and produced a clean positive result. The two experiments together suggest LOGIC.md's value is in consistency, audit, and modifiability — not in making individual outputs better. + +## Reproducibility + +To re-run this exact configuration: + +```bash +cd benchmarks +unset BENCHMARK_MODEL # default is meta/llama-3.1-70b-instruct +export NVIDIA_API_KEY=nvapi-... +node run.mjs # runs all 3 tasks at 10 runs/condition +``` + +Caveats: Nvidia NIM free tier introduces non-deterministic connection drops. Paid tier or alternative endpoint recommended for cleaner data. 
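A further mitigation, which the published harness is not known to implement, would be to retry a call when the endpoint drops the connection instead of scoring the run as 0. The following is a sketch under that assumption: `withRetry` and its options are hypothetical names, and the only detail taken from the run data is the `"Connection error."` message text.

```javascript
// Hypothetical helper (not part of run.mjs): retries an async call when it
// fails with a transient connection error, using exponential backoff.
async function withRetry(fn, { retries = 3, baseDelayMs = 1000 } = {}) {
  // Treat only the connection-drop message seen in results.json as transient.
  const isTransient = (err) =>
    err instanceof Error && err.message.includes('Connection error');

  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Give up on non-transient errors or once the retry budget is spent.
      if (!isTransient(err) || attempt === retries) break;
      // Wait baseDelayMs, then 2x, 4x, ... before the next attempt.
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

A caller would wrap whatever function issues the model request, for example `await withRetry(() => callModel(prompt))`, where `callModel` stands in for the harness's own request function. Errors other than connection drops are rethrown immediately so that genuine model or scoring failures still surface in the results.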
+ +To analyse with infra failures excluded: + +```bash +node -e " +const r = require('./results/results.json'); +const groups = {}; +r.results.forEach(x => { + if (x.stopReason === 'error' || x.outputLength === 0) return; + const k = x.task + ':' + x.condition; + groups[k] = groups[k] || []; + groups[k].push(x.aggregateScore); +}); +Object.entries(groups).forEach(([k, scores]) => { + const mean = scores.reduce((a,b)=>a+b,0) / scores.length; + console.log(k.padEnd(40), 'n=' + scores.length, '| mean=' + mean.toFixed(1)); +}); +" +``` + +## Files + +- `results.json` — full per-run output as emitted by `run.mjs` +- `results.md` — auto-generated summary (raw, before cleanup) +- `analysis.md` — this document (honest interpretation post-cleanup) diff --git a/benchmarks/published/2026-05-07-llama-3.1-70b/results.json b/benchmarks/published/2026-05-07-llama-3.1-70b/results.json new file mode 100644 index 0000000..29df33e --- /dev/null +++ b/benchmarks/published/2026-05-07-llama-3.1-70b/results.json @@ -0,0 +1,1534 @@ +{ + "results": [ + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 1, + "timestamp": "2026-05-07T07:23:30.798Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 65197, + "tokens": { + "input": 803, + "output": 439 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1991 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 2, + "timestamp": "2026-05-07T07:24:47.591Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 76785, + "tokens": { + "input": 803, + "output": 367 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1619 + }, + { + "task": "code-review", + 
"model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 3, + "timestamp": "2026-05-07T07:28:41.132Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 6, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 98, + "executionTime": 233539, + "tokens": { + "input": 803, + "output": 418 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1881 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 4, + "timestamp": "2026-05-07T07:29:40.918Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 18, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 94, + "executionTime": 59775, + "tokens": { + "input": 803, + "output": 393 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1757 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 5, + "timestamp": "2026-05-07T07:29:58.158Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 9, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 97, + "executionTime": 17232, + "tokens": { + "input": 803, + "output": 401 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1693 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 6, + "timestamp": "2026-05-07T07:30:25.643Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 17, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 94, + "executionTime": 27478, + "tokens": { + "input": 803, + "output": 425 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1861 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 7, + "timestamp": "2026-05-07T07:31:22.558Z", + "metrics": { + 
"structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 56908, + "tokens": { + "input": 803, + "output": 342 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1421 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 8, + "timestamp": "2026-05-07T07:32:00.496Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 37936, + "tokens": { + "input": 803, + "output": 434 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1854 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 9, + "timestamp": "2026-05-07T07:32:26.046Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 25544, + "tokens": { + "input": 803, + "output": 358 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1543 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 10, + "timestamp": "2026-05-07T07:32:54.833Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 28785, + "tokens": { + "input": 803, + "output": 364 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1590 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 1, + "timestamp": "2026-05-07T07:34:46.237Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + 
"executionTime": 111399, + "tokens": { + "input": 1508, + "output": 390 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1608 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 2, + "timestamp": "2026-05-07T07:36:44.204Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 8, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 98, + "executionTime": 117964, + "tokens": { + "input": 1508, + "output": 483 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 2076 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 3, + "timestamp": "2026-05-07T07:38:32.983Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 108777, + "tokens": { + "input": 1508, + "output": 397 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1663 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 4, + "timestamp": "2026-05-07T07:40:56.814Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 143829, + "tokens": { + "input": 1508, + "output": 397 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1638 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 5, + "timestamp": "2026-05-07T07:41:35.173Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 38355, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." 
+ ], + "outputLength": 0 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 6, + "timestamp": "2026-05-07T07:53:00.168Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 18, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 95, + "executionTime": 684992, + "tokens": { + "input": 1508, + "output": 430 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1816 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 7, + "timestamp": "2026-05-07T07:56:00.560Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 10, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 97, + "executionTime": 180391, + "tokens": { + "input": 1508, + "output": 402 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1749 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 8, + "timestamp": "2026-05-07T07:56:37.353Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 36791, + "tokens": { + "input": 1508, + "output": 434 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1939 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 9, + "timestamp": "2026-05-07T07:57:19.384Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 42028, + "tokens": { + "input": 1508, + "output": 354 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1461 + }, + { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 10, + 
"timestamp": "2026-05-07T07:57:48.719Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 29333, + "tokens": { + "input": 1508, + "output": 416 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 1812 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 1, + "timestamp": "2026-05-07T08:01:29.302Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 4, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 88, + "executionTime": 220549, + "tokens": { + "input": 777, + "output": 1061 + }, + "stopReason": "stop", + "errors": [ + "/sources/3/type: must be equal to one of the allowed values" + ], + "outputLength": 4540 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 2, + "timestamp": "2026-05-07T08:06:47.444Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 8, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 97, + "executionTime": 318141, + "tokens": { + "input": 777, + "output": 1147 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 4850 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 3, + "timestamp": "2026-05-07T08:10:53.434Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 245985, + "tokens": { + "input": 777, + "output": 1459 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 5988 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 4, + "timestamp": "2026-05-07T08:12:14.691Z", + "metrics": { + 
"structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 81252, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." + ], + "outputLength": 0 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 5, + "timestamp": "2026-05-07T08:13:51.648Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 96954, + "tokens": { + "input": 777, + "output": 1430 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 6206 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 6, + "timestamp": "2026-05-07T08:17:46.103Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 234443, + "tokens": { + "input": 777, + "output": 1442 + }, + "stopReason": "stop", + "errors": [ + "/sources/8/type: must be equal to one of the allowed values" + ], + "outputLength": 5896 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 7, + "timestamp": "2026-05-07T08:19:07.347Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 81243, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." 
+ ], + "outputLength": 0 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 8, + "timestamp": "2026-05-07T08:23:09.927Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 4, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 88, + "executionTime": 242575, + "tokens": { + "input": 777, + "output": 847 + }, + "stopReason": "stop", + "errors": [ + "/sources/3/type: must be equal to one of the allowed values" + ], + "outputLength": 3690 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 9, + "timestamp": "2026-05-07T08:26:45.109Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 100, + "executionTime": 215179, + "tokens": { + "input": 777, + "output": 1395 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 5804 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 10, + "timestamp": "2026-05-07T08:33:07.287Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 382176, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." 
+ ], + "outputLength": 0 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 1, + "timestamp": "2026-05-07T08:34:25.877Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 78583, + "tokens": { + "input": 1552, + "output": 935 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 3966 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 2, + "timestamp": "2026-05-07T08:37:19.329Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 173445, + "tokens": { + "input": 1552, + "output": 1248 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 5434 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 3, + "timestamp": "2026-05-07T08:39:25.421Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 4, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 89, + "executionTime": 126086, + "tokens": { + "input": 1552, + "output": 1013 + }, + "stopReason": "stop", + "errors": [ + "/sources/0/type: must be equal to one of the allowed values" + ], + "outputLength": 4405 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 4, + "timestamp": "2026-05-07T08:42:35.511Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 190082, + "tokens": { + "input": 1552, + "output": 943 + }, + "stopReason": "stop", + "errors": [ + "/sources/0/type: must be equal to one of the allowed 
values" + ], + "outputLength": 4092 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 5, + "timestamp": "2026-05-07T08:47:21.335Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 8, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 88, + "executionTime": 285814, + "tokens": { + "input": 1552, + "output": 929 + }, + "stopReason": "stop", + "errors": [ + "/sources/2/type: must be equal to one of the allowed values" + ], + "outputLength": 3986 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 6, + "timestamp": "2026-05-07T08:49:22.241Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 120901, + "tokens": { + "input": 1552, + "output": 885 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 3796 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 7, + "timestamp": "2026-05-07T08:52:12.362Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 170114, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." 
+ ], + "outputLength": 0 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 8, + "timestamp": "2026-05-07T08:54:25.820Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 133455, + "tokens": { + "input": 1552, + "output": 1032 + }, + "stopReason": "stop", + "errors": [ + "/sources/0/type: must be equal to one of the allowed values" + ], + "outputLength": 4317 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 9, + "timestamp": "2026-05-07T08:57:53.407Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 207585, + "tokens": { + "input": 1552, + "output": 1019 + }, + "stopReason": "stop", + "errors": [ + "/sources/0/type: must be equal to one of the allowed values" + ], + "outputLength": 4297 + }, + { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 10, + "timestamp": "2026-05-07T09:01:47.020Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 3, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 99, + "executionTime": 233610, + "tokens": { + "input": 1552, + "output": 1226 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 5245 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 1, + "timestamp": "2026-05-07T09:06:21.964Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 274911, + "tokens": { + "input": 1124, + "output": 1268 + }, + "stopReason": "stop", + 
"errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5123 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 2, + "timestamp": "2026-05-07T09:09:12.149Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 170173, + "tokens": { + "input": 1124, + "output": 1151 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4560 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 3, + "timestamp": "2026-05-07T09:11:38.172Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 146015, + "tokens": { + "input": 1124, + "output": 1069 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4271 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 4, + "timestamp": "2026-05-07T09:14:12.250Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 154065, + "tokens": { + "input": 1124, + "output": 1171 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4836 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 5, + "timestamp": "2026-05-07T09:15:58.275Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + 
"aggregateScore": 89, + "executionTime": 106011, + "tokens": { + "input": 1124, + "output": 1088 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4249 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 6, + "timestamp": "2026-05-07T09:18:41.235Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 162936, + "tokens": { + "input": 1124, + "output": 1172 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4666 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 7, + "timestamp": "2026-05-07T09:20:04.655Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 83414, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." 
+ ], + "outputLength": 0 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 8, + "timestamp": "2026-05-07T09:21:50.463Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 105777, + "tokens": { + "input": 1124, + "output": 1337 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5374 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 9, + "timestamp": "2026-05-07T09:28:16.308Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 100, + "pipelineCompletion": 0, + "qualityGateCompliance": null + }, + "aggregateScore": 0, + "executionTime": 385839, + "tokens": { + "input": 0, + "output": 0 + }, + "stopReason": "error", + "errors": [ + "Connection error." 
+ ], + "outputLength": 0 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "run": 10, + "timestamp": "2026-05-07T09:32:06.439Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": null + }, + "aggregateScore": 89, + "executionTime": 230125, + "tokens": { + "input": 1124, + "output": 1263 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4936 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 1, + "timestamp": "2026-05-07T09:34:59.410Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 172964, + "tokens": { + "input": 2220, + "output": 1285 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5181 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 2, + "timestamp": "2026-05-07T09:38:03.706Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 184277, + "tokens": { + "input": 2220, + "output": 1386 + }, + "stopReason": "stop", + "errors": [ + "/remediation_plan/5/priority: must be <= 5" + ], + "outputLength": 5755 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 3, + "timestamp": "2026-05-07T09:39:58.110Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 114400, + "tokens": { + "input": 2220, + "output": 1397 + 
}, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5476 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 4, + "timestamp": "2026-05-07T09:41:58.893Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 120774, + "tokens": { + "input": 2220, + "output": 1096 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4289 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 5, + "timestamp": "2026-05-07T09:44:08.774Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 129878, + "tokens": { + "input": 2220, + "output": 1436 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5878 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 6, + "timestamp": "2026-05-07T09:45:41.243Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 92464, + "tokens": { + "input": 2220, + "output": 1079 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 4240 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 7, + "timestamp": "2026-05-07T09:50:54.192Z", + "metrics": { + "structuredCompliance": 75, + "describingVsDoing": 0, + "pipelineCompletion": 100, + 
"qualityGateCompliance": 100 + }, + "aggregateScore": 90, + "executionTime": 312945, + "tokens": { + "input": 2220, + "output": 1267 + }, + "stopReason": "stop", + "errors": [ + "/vulnerabilities/0/id: must match pattern \"^[A-Z0-9]+$\"" + ], + "outputLength": 5136 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 8, + "timestamp": "2026-05-07T09:53:38.447Z", + "metrics": { + "structuredCompliance": 100, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 100 + }, + "aggregateScore": 100, + "executionTime": 164246, + "tokens": { + "input": 2220, + "output": 1097 + }, + "stopReason": "stop", + "errors": [], + "outputLength": 4340 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 9, + "timestamp": "2026-05-07T09:55:26.829Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 0 + }, + "aggregateScore": 50, + "executionTime": 108373, + "tokens": { + "input": 2220, + "output": 1118 + }, + "stopReason": "stop", + "errors": [ + "root: must have required property 'vulnerabilities'" + ], + "outputLength": 4806 + }, + { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "run": 10, + "timestamp": "2026-05-07T09:58:06.640Z", + "metrics": { + "structuredCompliance": 0, + "describingVsDoing": 0, + "pipelineCompletion": 100, + "qualityGateCompliance": 0 + }, + "aggregateScore": 50, + "executionTime": 159801, + "tokens": { + "input": 2220, + "output": 1099 + }, + "stopReason": "stop", + "errors": [ + "root: must have required property 'vulnerabilities'" + ], + "outputLength": 4798 + } + ], + "stats": { + "code-review:meta/llama-3.1-70b-instruct:control": { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "runs": 10, + "aggregateScore": { + "mean": 98, + 
"stddev": 2, + "min": 94, + "max": 100 + }, + "structuredCompliance": { + "mean": 100, + "stddev": 0 + }, + "describingVsDoing": { + "mean": 5, + "stddev": 7 + }, + "pipelineCompletion": { + "mean": 100, + "stddev": 0 + } + }, + "code-review:meta/llama-3.1-70b-instruct:treatment": { + "task": "code-review", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "runs": 10, + "aggregateScore": { + "mean": 89, + "stddev": 30, + "min": 0, + "max": 100 + }, + "structuredCompliance": { + "mean": 90, + "stddev": 30 + }, + "describingVsDoing": { + "mean": 14, + "stddev": 29 + }, + "pipelineCompletion": { + "mean": 90, + "stddev": 30 + } + }, + "research-synthesis:meta/llama-3.1-70b-instruct:control": { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "runs": 10, + "aggregateScore": { + "mean": 66, + "stddev": 44, + "min": 0, + "max": 100 + }, + "structuredCompliance": { + "mean": 63, + "stddev": 42 + }, + "describingVsDoing": { + "mean": 32, + "stddev": 45 + }, + "pipelineCompletion": { + "mean": 70, + "stddev": 46 + } + }, + "research-synthesis:meta/llama-3.1-70b-instruct:treatment": { + "task": "research-synthesis", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "runs": 10, + "aggregateScore": { + "mean": 85, + "stddev": 29, + "min": 0, + "max": 100 + }, + "structuredCompliance": { + "mean": 78, + "stddev": 28 + }, + "describingVsDoing": { + "mean": 12, + "stddev": 30 + }, + "pipelineCompletion": { + "mean": 90, + "stddev": 30 + } + }, + "security-audit:meta/llama-3.1-70b-instruct:control": { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "control", + "runs": 10, + "aggregateScore": { + "mean": 71, + "stddev": 36, + "min": 0, + "max": 89 + }, + "structuredCompliance": { + "mean": 60, + "stddev": 30 + }, + "describingVsDoing": { + "mean": 20, + "stddev": 40 + }, + "pipelineCompletion": { + "mean": 80, + "stddev": 40 + } + }, + 
"security-audit:meta/llama-3.1-70b-instruct:treatment": { + "task": "security-audit", + "model": "meta/llama-3.1-70b-instruct", + "condition": "treatment", + "runs": 10, + "aggregateScore": { + "mean": 83, + "stddev": 17, + "min": 50, + "max": 100 + }, + "structuredCompliance": { + "mean": 63, + "stddev": 32 + }, + "describingVsDoing": { + "mean": 0, + "stddev": 0 + }, + "pipelineCompletion": { + "mean": 100, + "stddev": 0 + } + } + } +} \ No newline at end of file diff --git a/benchmarks/published/2026-05-07-llama-3.1-70b/results.md b/benchmarks/published/2026-05-07-llama-3.1-70b/results.md new file mode 100644 index 0000000..e3d1aab --- /dev/null +++ b/benchmarks/published/2026-05-07-llama-3.1-70b/results.md @@ -0,0 +1,80 @@ +# LOGIC.md Benchmark Results + +Generated: 2026-05-07T09:58:06.657Z + +## Summary Statistics + +### code-review (meta/llama-3.1-70b-instruct) - control + +- Runs: 10 +- Aggregate Score: 98 ± 2 (range: 94-100) +- Structured Compliance: 100% ± 0% +- Describing vs Doing: 5% ± 7% (lower is better) +- Pipeline Completion: 100% ± 0% + +### code-review (meta/llama-3.1-70b-instruct) - treatment + +- Runs: 10 +- Aggregate Score: 89 ± 30 (range: 0-100) +- Structured Compliance: 90% ± 30% +- Describing vs Doing: 14% ± 29% (lower is better) +- Pipeline Completion: 90% ± 30% + +### research-synthesis (meta/llama-3.1-70b-instruct) - control + +- Runs: 10 +- Aggregate Score: 66 ± 44 (range: 0-100) +- Structured Compliance: 63% ± 42% +- Describing vs Doing: 32% ± 45% (lower is better) +- Pipeline Completion: 70% ± 46% + +### research-synthesis (meta/llama-3.1-70b-instruct) - treatment + +- Runs: 10 +- Aggregate Score: 85 ± 29 (range: 0-100) +- Structured Compliance: 78% ± 28% +- Describing vs Doing: 12% ± 30% (lower is better) +- Pipeline Completion: 90% ± 30% + +### security-audit (meta/llama-3.1-70b-instruct) - control + +- Runs: 10 +- Aggregate Score: 71 ± 36 (range: 0-89) +- Structured Compliance: 60% ± 30% +- Describing vs Doing: 20% ± 40% (lower is 
better) +- Pipeline Completion: 80% ± 40% + +### security-audit (meta/llama-3.1-70b-instruct) - treatment + +- Runs: 10 +- Aggregate Score: 83 ± 17 (range: 50-100) +- Structured Compliance: 63% ± 32% +- Describing vs Doing: 0% ± 0% (lower is better) +- Pipeline Completion: 100% ± 0% + +## Key Findings + +### code-review:meta/llama-3.1-70b-instruct +- Control Aggregate Score: 98 +- Treatment Aggregate Score: 89 +- **Difference: -9 (-9.2%)** +- Control Describing vs Doing: 5% +- Treatment Describing vs Doing: 14% +- **Reduction: -9% points** + +### research-synthesis:meta/llama-3.1-70b-instruct +- Control Aggregate Score: 66 +- Treatment Aggregate Score: 85 +- **Difference: +19 (28.8%)** +- Control Describing vs Doing: 32% +- Treatment Describing vs Doing: 12% +- **Reduction: 20% points** + +### security-audit:meta/llama-3.1-70b-instruct +- Control Aggregate Score: 71 +- Treatment Aggregate Score: 83 +- **Difference: +12 (16.9%)** +- Control Describing vs Doing: 20% +- Treatment Describing vs Doing: 0% +- **Reduction: 20% points** + diff --git a/benchmarks/published/INDEX.md b/benchmarks/published/INDEX.md new file mode 100644 index 0000000..94f1caa --- /dev/null +++ b/benchmarks/published/INDEX.md @@ -0,0 +1,44 @@ +# Published Benchmark Runs + +This directory holds committed benchmark runs as evidence artifacts. Each run gets a dated subdirectory with raw output (`results.json`, `results.md`) and an interpretation document (`analysis.md` or `summary.md`). + +The ephemeral `benchmarks/results/` directory holds the most recent run-in-progress and is gitignored. To preserve a run as evidence, copy it into a dated subdirectory here and commit. + +## Index + +| Date | Model | Tasks | n/condition | Result | +|---|---|---|---|---| +| 2026-05-07 | `claude-sonnet-4-6` | code-review only | 10 | Control 99 / Treatment 100. Ceiling effect, LOGIC.md is no-op on quality. 
[Details](./2026-05-07-claude-sonnet-4-6-codereview/summary.md) | +| 2026-05-07 | `meta/llama-3.1-70b-instruct` | code-review, research-synthesis, security-audit | 10 | Flat to slightly negative after excluding 7 Nvidia NIM connection drops. [Details](./2026-05-07-llama-3.1-70b/analysis.md) | + +## Headline finding (2026-05-07) + +Two independent runs on different models show **no measurable quality lift from LOGIC.md** on these tasks at n=10. Sonnet 4.6 is ceiling-bound; Llama 3.1 70B is flat to slightly negative once infrastructure failures are stripped. + +This contradicts the original "describing-vs-doing fix" framing in the README and motivated the positioning pivot toward auditability and structural consistency, anchored instead in the [Archon integration test (2026-05-06)](https://github.com/SingularityAI-Dev/logic-md-archon-eval), which showed clean structural-consistency results (87% hash agreement under LOGIC.md vs 70% without). + +## Open methodology questions (deferred for future runs) + +1. **Scoring system rigidity.** All 8 completed security-audit control runs scored exactly 89 (the other 2 were connection errors scored 0), suggesting the rubric applies fixed-magnitude penalties rather than graduated ones. The `89-89-89-89` pattern is improbable from a stochastic LLM and likely a scorer artifact. + +2. **Strict JSON-schema enum validation.** Outputs that are valid in spirit but use slightly different enum values get scored down harshly. May not reflect real-world acceptance criteria. + +3. **Sample size.** n=10 is below the MANIFEST recommendation of 30. Confidence intervals are wide. Real-but-small effects may be invisible at this sample size. + +4. **Task difficulty.** The current sample inputs (`tasks/inputs/*-sample.{js,txt}`) are short and may be too easy for capable models, eliminating the headroom LOGIC.md needs to differentiate. Harder fixtures might produce different signal. + +5. **Single-temperature.** Temperature 0.7 is hardcoded. Sensitivity analysis across temperatures has not been done. 
+ +These are tracked but not blocking. The honest current finding stands: **LOGIC.md does not produce measurable quality lift on these tasks at this sample size.** + +## Re-running + +See each run's analysis for the exact reproduction command. Common pattern: + +```bash +cd benchmarks +export _API_KEY=... +export BENCHMARK_MODEL= # optional, default is Llama 3.1 70B +node run.mjs [--task=] # omit --task to run all 3 +# Then copy results/ to published/-/ before next run +```
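
The raw `stats` means in `results.json` include connection-error runs scored as 0, while the headline numbers exclude them. The following is a minimal sketch of that exclusion, not the harness's own code: the field names `stopReason` and `aggregateScore` follow the run records in `results.json`, while the helper names `mean` and `summarize` are hypothetical. It uses the security-audit control scores from the Llama 3.1 70B run as sample input.

```javascript
// Sketch only: recompute a condition's mean with infrastructure failures excluded.
// `mean` and `summarize` are illustrative helpers, not part of the benchmark harness.

function mean(values) {
  if (values.length === 0) return NaN;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function summarize(runs) {
  // Treat any run whose stopReason is "error" as an infrastructure failure.
  const completed = runs.filter((r) => r.stopReason !== "error");
  return {
    rawMean: mean(runs.map((r) => r.aggregateScore)),
    cleanedMean: mean(completed.map((r) => r.aggregateScore)),
    excluded: runs.length - completed.length,
  };
}

// Security-audit control: eight completed runs at 89, two connection errors at 0.
const securityAuditControl = [89, 89, 89, 89, 89, 89, 0, 89, 0, 89].map(
  (score) => ({
    aggregateScore: score,
    stopReason: score === 0 ? "error" : "stop",
  })
);

const summary = summarize(securityAuditControl);
console.log(summary);
```

Filtering on `stopReason` rather than on a zero score is a deliberate choice in this sketch: a run that completes but scores poorly should remain in the cleaned mean.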