chore(benchmarks): publish 2026-05-07 cross-model results #49
SingleSourceStudios merged 2 commits into main
Two independent runs on Claude Sonnet 4.6 (code-review, n=10) and Llama 3.1 70B via Nvidia NIM (3 tasks, n=10) show no measurable quality lift from LOGIC.md on these tasks at this sample size.

- Sonnet 4.6 ceiling effect: control 99/100, treatment 100/100
- Llama 3.1 70B flat to slightly negative after excluding 7 Nvidia connection-drop runs (cleaned means: code-review 98.3 vs 98.9, research-synthesis 94.6 vs 94.0, security-audit 89.0 vs 83.0)

Per-run results, raw harness output, and honest analysis are preserved at `benchmarks/published/<date>-<model>/`. INDEX.md catalogues runs and documents open methodology questions.

Motivates a positioning pivot toward structural consistency / audit / governance, anchored in the 2026-05-06 Archon integration test (87% hash agreement under LOGIC.md vs 70% without).
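As a rough illustration of how "cleaned means" like those above could be derived, the sketch below drops runs flagged with an infrastructure error before averaging per task and arm. The record layout (`task`, `arm`, `score`, `error`) and the sample scores are assumptions made up for this example; they are not the schema or contents of the published `results.json`.

```python
from statistics import mean

# Hypothetical run records. Field names and scores are illustrative only
# and are NOT taken from the published results.json.
runs = [
    {"task": "code-review", "arm": "control", "score": 98, "error": None},
    {"task": "code-review", "arm": "control", "score": 0, "error": "connection-drop"},
    {"task": "code-review", "arm": "treatment", "score": 99, "error": None},
    {"task": "code-review", "arm": "treatment", "score": 97, "error": None},
]


def cleaned_means(records):
    """Return {(task, arm): (mean score, n)} over runs without an error flag."""
    groups = {}
    for run in records:
        if run["error"] is not None:
            continue  # skip infrastructure failures instead of scoring them as 0
        groups.setdefault((run["task"], run["arm"]), []).append(run["score"])
    return {key: (mean(scores), len(scores)) for key, scores in groups.items()}


result = cleaned_means(runs)
for (task, arm), (avg, n) in sorted(result.items()):
    print(f"{task:12s} {arm:10s} mean={avg} (n={n})")
```

Reporting n next to each mean matters because excluded runs shrink the groups unevenly, which makes small differences between arms harder to interpret.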
📝 Walkthrough

This PR publishes the results of two benchmark runs evaluating LOGIC.md prompt effectiveness across the Claude Sonnet 4.6 and Llama 3.1 70B models. It adds raw metrics data, statistical analysis identifying infrastructure failures, detailed interpretation showing no measurable quality lift at n=10, and centralized documentation for reproducibility.

Changes: Benchmark Results Publication
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (4)
benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md (1)

24-28: ⚡ Quick win: Add language identifier to fenced code block.

The code block should specify a language (e.g., `text`) to satisfy markdown linting rules and improve documentation consistency.

📝 Proposed fix

    -```
    +```text
                        Control    Treatment    Diff
     code-review        99 ± 1     100 ± 0      +1.0
                        (range 97-100) (range 99-100)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md` around lines 24 - 28, The fenced code block in summary.md lacks a language tag; update the triple-backtick fence that wraps the table starting with "Control Treatment Diff" to include a language identifier (e.g., add `text` after the opening ```), so the block becomes ```text and satisfies markdown linting and documentation consistency.

benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md (3)
68-70: ⚡ Quick win: Add language identifier to fenced code block.

The code block should specify a language (e.g., `json`) to satisfy markdown linting rules, as this contains a JSON error message.

📝 Proposed fix

    -```
    +```json
     "errors": ["/sources/3/type: must be equal to one of the allowed values"]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md` around lines 68 - 70, The fenced code block displaying the JSON error needs a language specifier for markdown linting: change the opening ``` to ```json for the block that contains `"errors": ["/sources/3/type: must be equal to one of the allowed values"]`, ensuring the closing ``` remains; this will mark the snippet as JSON and satisfy the linter.
43-48: ⚡ Quick win: Add language identifier to fenced code block.

The code block should specify a language (e.g., `text`) to satisfy markdown linting rules.

📝 Proposed fix

    -```
    +```text
                          Control         Treatment       Diff
     code-review          98.3 (n=10)     98.9 (n=9)      +0.6
     research-synthesis   94.6 (n=7)      94.0 (n=9)      -0.6
     security-audit       89.0 (n=8)      83.0 (n=10)     -6.0

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md` around lines 43 - 48, The fenced code block in analysis.md is missing a language identifier which fails markdown linting; update the block that currently starts with ``` to include a language token (e.g., change ``` to ```text) for the table-like block showing Control/Treatment/Diff so the snippet is ```text ... ```, ensuring the fenced block around the lines containing "Control Treatment Diff" is updated accordingly.
20-25: ⚡ Quick win: Add language identifier to fenced code block.

The code block should specify a language (e.g., `text`) to satisfy markdown linting rules and improve documentation consistency.

📝 Proposed fix

    -```
    +```text
                          Control     Treatment   Diff
     code-review          98 ± 2      89 ± 30     -9
     research-synthesis   66 ± 44     85 ± 29     +19
     security-audit       71 ± 36     83 ± 17     +12

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md` around lines 20 - 25, The fenced code block in analysis.md lacks a language identifier which fails markdown linting; update the triple-backtick fence in the block containing the table (the block that currently starts with " Control Treatment Diff") to include a language token such as text (e.g., change ``` to ```text) so the block becomes a labeled code fence and satisfies the linter and documentation consistency checks.
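All four nitpicks flag the same pattern: an opening code fence with no language tag. A minimal sketch of how such fences could be located before running a linter is shown below. The function name `unlabeled_fences` and the sample text are hypothetical, and the logic is a simplification: it handles only three-backtick fences, ignores tilde fences and nesting, and is not how any particular markdown linter is implemented.

```python
# Three-backtick marker, built programmatically so the example embeds cleanly.
FENCE = "`" * 3


def unlabeled_fences(markdown_text):
    """Return 1-based line numbers of opening fences that carry no language tag."""
    missing = []
    in_block = False
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if in_block:
            in_block = False  # this fence closes the current block
        else:
            in_block = True  # this fence opens a new block
            if stripped == FENCE:
                missing.append(number)
    return missing


# Hypothetical sample: one unlabeled block followed by one labeled block.
sample = "\n".join(
    ["Intro", FENCE, "Control Treatment Diff", FENCE, "", FENCE + "json", "{}", FENCE]
)
print(unlabeled_fences(sample))
```

Run against the sample, the check reports line 2 (the bare opening fence) and leaves the `json` block alone.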
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2a769d53-ca03-4bcc-a388-b2903c3ef8e9
📒 Files selected for processing (6)
- benchmarks/.gitignore
- benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md
- benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md
- benchmarks/published/2026-05-07-llama-3.1-70b/results.json
- benchmarks/published/2026-05-07-llama-3.1-70b/results.md
- benchmarks/published/INDEX.md
Summary by cubic

Publish 2026-05-07 benchmark artifacts for `claude-sonnet-4-6` and `meta/llama-3.1-70b-instruct` with analysis and an index. Result: no measurable quality lift from LOGIC.md at n=10 (Sonnet ceiling; Llama 70B flat to slightly negative after excluding Nvidia NIM connection drops).

New Features
- `benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md` (single-task; raw JSON not preserved due to harness overwrite).
- `benchmarks/published/2026-05-07-llama-3.1-70b/{results.json,results.md,analysis.md}`; analysis removes 7 NIM connection-drop runs. Cleaned means: code-review 98.3 vs 98.9, research-synthesis 94.6 vs 94.0, security-audit 89.0 vs 83.0.
- `benchmarks/published/INDEX.md` to catalog runs and note methodology questions.

Refactors
- `benchmarks/.gitignore` to treat `benchmarks/results/` as ephemeral; committed runs live under `benchmarks/published/`.

Written for commit 3a2f07a. Summary will update on new commits.