
chore(benchmarks): publish 2026-05-07 cross-model results#49

Merged
SingleSourceStudios merged 2 commits into main from chore/publish-benchmark-results on May 7, 2026
Conversation

Collaborator

@SingleSourceStudios SingleSourceStudios commented May 7, 2026

Two independent runs on Claude Sonnet 4.6 (code-review, n=10) and
Llama 3.1 70B via Nvidia NIM (3 tasks, n=10) show no measurable
quality lift from LOGIC.md on these tasks at this sample size.

  • Sonnet 4.6 ceiling effect: control 99/100, treatment 100/100
  • Llama 3.1 70B flat to slightly negative after excluding 7 Nvidia
    connection-drop runs (cleaned means: code-review 98.3 vs 98.9,
    research-synthesis 94.6 vs 94.0, security-audit 89.0 vs 83.0)

Per-run results, raw harness output, and honest analysis preserved
at benchmarks/published/&lt;date&gt;-&lt;model&gt;/. INDEX.md catalogues runs
and documents open methodology questions.
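The exclusion step described above (dropping the 7 connection-drop runs before recomputing means) can be sketched as a small filter over the raw run records. The field names used here (`error`, `arm`, `score`) are assumptions for illustration, not the actual schema of the published results.json:

```python
# Sketch of the cleanup step: drop runs that failed for infrastructure
# reasons (NIM connection drops) before computing per-arm means.
from statistics import mean

def cleaned_means(runs):
    """Group surviving runs by (task, arm) and average their scores."""
    groups = {}
    for run in runs:
        if run.get("error") == "connection_dropped":
            continue  # infrastructure failure, not a model failure
        key = (run["task"], run["arm"])
        groups.setdefault(key, []).append(run["score"])
    return {key: round(mean(scores), 1) for key, scores in groups.items()}

runs = [
    {"task": "code-review", "arm": "control", "score": 98, "error": None},
    {"task": "code-review", "arm": "control", "score": 0, "error": "connection_dropped"},
    {"task": "code-review", "arm": "treatment", "score": 99, "error": None},
]
print(cleaned_means(runs))
# {('code-review', 'control'): 98, ('code-review', 'treatment'): 99}
```

Note the failed run's score of 0 never enters the control mean; including it would have dragged the control aggregate down and made the comparison meaningless, which is why the analysis reports cleaned means with per-cell n.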

Motivates positioning pivot toward structural consistency / audit /
governance, anchored in the 2026-05-06 Archon integration test
(87% hash agreement under LOGIC.md vs 70% without).
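A hash-agreement rate like the 87% / 70% figures above could be computed by hashing each run's normalized output and reporting the share of runs matching the most common digest. The whitespace-collapsing normalization here is an assumption; the Archon integration test may define agreement differently:

```python
# Sketch: fraction of runs whose normalized output matches the modal hash.
import hashlib
from collections import Counter

def hash_agreement(outputs):
    digests = [
        hashlib.sha256(" ".join(o.split()).encode()).hexdigest()
        for o in outputs
    ]
    modal_count = Counter(digests).most_common(1)[0][1]
    return modal_count / len(digests)

outputs = ["result: A", "result:  A", "result: B", "result: A"]
print(hash_agreement(outputs))  # 0.75
```

Under this framing, LOGIC.md would be judged on structural consistency (how often independent runs converge on byte-identical normalized output) rather than on graded quality scores.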


Summary by cubic

Publish 2026-05-07 benchmark artifacts for claude-sonnet-4-6 and meta/llama-3.1-70b-instruct with analysis and an index. Result: no measurable quality lift from LOGIC.md at n=10 (Sonnet ceiling; Llama 70B flat to slightly negative after excluding Nvidia NIM connection drops).

  • New Features

    • Published benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md (single-task; raw JSON not preserved due to harness overwrite).
    • Published benchmarks/published/2026-05-07-llama-3.1-70b/{results.json,results.md,analysis.md}; analysis removes 7 NIM connection-drop runs. Cleaned means: code-review 98.3 vs 98.9, research-synthesis 94.6 vs 94.0, security-audit 89.0 vs 83.0.
    • Added benchmarks/published/INDEX.md to catalog runs and note methodology questions.
  • Refactors

    • Updated benchmarks/.gitignore to treat benchmarks/results/ as ephemeral; committed runs live under benchmarks/published/.
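The benchmarks/.gitignore arrangement described above might look like the following sketch; the exact patterns are an assumption, since the committed file's contents are not shown in this view:

```gitignore
# Ephemeral harness output: regenerated on every run, never committed.
results/

# Curated runs are copied into published/, which stays under version control.
```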

Written for commit 3a2f07a. Summary will update on new commits.

Summary by CodeRabbit

  • Documentation

    • Added comprehensive benchmark reports and analysis documentation for model evaluations conducted on 2026-05-07.
    • Published detailed results with performance metrics and reproducibility instructions.
    • Created index documentation for published benchmark runs.
  • Chores

    • Updated configuration to exclude benchmark output directories from version control.

@SingleSourceStudios SingleSourceStudios enabled auto-merge (squash) May 7, 2026 11:14

coderabbitai Bot commented May 7, 2026

📝 Walkthrough

This PR publishes the results of two benchmark runs evaluating LOGIC.md prompt effectiveness across Claude Sonnet 4.6 and Llama 3.1 70B models. It adds raw metrics data, statistical analysis identifying infrastructure failures, detailed interpretation showing no measurable quality lift at n=10, and centralized documentation for reproducibility.

Changes

Benchmark Results Publication

  • Output Configuration (benchmarks/.gitignore): Git ignore pattern updated to exclude the ephemeral results/ directory.
  • Benchmark Results Data & Metrics (benchmarks/published/2026-05-07-llama-3.1-70b/results.json, results.md): Raw per-run metrics records and aggregated statistics as JSON; formatted markdown summary tables comparing control vs. treatment with mean±stddev scores across three tasks.
  • Claude Sonnet Code-Review Report (benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md): Single-task code-review benchmark showing a ceiling effect with no lift (control 99±1 vs treatment 100±0); includes setup, results, reproducibility, and limitations analysis.
  • Llama Analysis & Interpretation (benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md): Identifies Nvidia NIM connection failures across runs, recomputes cleaned aggregates after exclusion, interprets no measurable LOGIC.md lift at n=10, documents anomalies (fixed 89-point pattern, harsh enum validation penalties, high LLM variance), and references supporting integration test evidence.
  • Reproducibility & Methodology (benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md): Exact run commands, environment variable guidance, a helper script to filter infrastructure failures, and a list of produced benchmark artifacts.
  • Published Benchmark Index (benchmarks/published/INDEX.md): Central documentation of published runs, the headline finding (no quality lift from LOGIC.md), an index table linking to individual runs, deferred methodology questions, and re-run/publish workflow instructions.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 Benchmarks bouncing, results taking flight,
Two models tested, metrics burning bright,
No lift from LOGIC at this sample size,
But hey, the data's honest—that's the prize!
Infrastructure quirks and anomalies to explore,
Each run recorded, evidence to pour.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Description check ⚠️ Warning: The PR description lacks required sections from the template (Linked issue, Spec impact, Checklist) but provides substantial context about the benchmark results and methodology. Resolution: add the missing required sections, specify the linked issue (or N/A), check the applicable Spec impact boxes, and complete the test/build/lint checklist.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title clearly summarizes the main change: publishing dated benchmark results from two cross-model runs (Sonnet 4.6 and Llama 3.1 70B).
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage; check skipped.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.




@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (4)
benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md (1)

24-28: ⚡ Quick win

Add language identifier to fenced code block.

The code block should specify a language (e.g., text) to satisfy markdown linting rules and improve documentation consistency.

📝 Proposed fix
-```
+```text
                 Control          Treatment        Diff
 code-review     99 ± 1           100 ± 0          +1.0
                 (range 97-100)   (range 99-100)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md`
around lines 24-28: the fenced code block lacks a language tag. Add a language
identifier (e.g., `text`) after the opening triple backticks of the fence that
wraps the table starting with "Control Treatment Diff", so the block satisfies
markdown linting and documentation consistency.
benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md (3)

68-70: ⚡ Quick win

Add language identifier to fenced code block.

The code block should specify a language (e.g., json) to satisfy markdown linting rules, as this contains a JSON error message.

📝 Proposed fix
-```
+```json
 "errors": ["/sources/3/type: must be equal to one of the allowed values"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md` around lines
68-70: the fenced code block displaying the JSON error needs a language
specifier. Change the opening ``` to ```json on the block containing
`"errors": ["/sources/3/type: must be equal to one of the allowed values"]`,
keeping the closing ``` in place, so the snippet is marked as JSON and
satisfies the linter.

43-48: ⚡ Quick win

Add language identifier to fenced code block.

The code block should specify a language (e.g., text) to satisfy markdown linting rules.

📝 Proposed fix
-```
+```text
                         Control          Treatment       Diff
 code-review             98.3 (n=10)      98.9 (n=9)      +0.6
 research-synthesis      94.6 (n=7)       94.0 (n=9)      -0.6
 security-audit          89.0 (n=8)       83.0 (n=10)     -6.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md` around lines
43-48: the fenced code block is missing a language identifier, which fails
markdown linting. Change the opening ``` to ```text on the fence around the
Control/Treatment/Diff table so the block is labeled and passes the linter.

20-25: ⚡ Quick win

Add language identifier to fenced code block.

The code block should specify a language (e.g., text) to satisfy markdown linting rules and improve documentation consistency.

📝 Proposed fix
-```
+```text
                         Control       Treatment    Diff
 code-review             98 ± 2        89 ± 30      -9
 research-synthesis      66 ± 44       85 ± 29      +19
 security-audit          71 ± 36       83 ± 17      +12
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md` around lines
20-25: the fenced code block lacks a language identifier, which fails markdown
linting. Change the opening ``` to ```text on the fence containing the
Control/Treatment/Diff table so the block is labeled and satisfies the linter
and documentation consistency checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2a769d53-ca03-4bcc-a388-b2903c3ef8e9

📥 Commits

Reviewing files that changed from the base of the PR and between 9d9f8dc and aeeee0a.

📒 Files selected for processing (6)
  • benchmarks/.gitignore
  • benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md
  • benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md
  • benchmarks/published/2026-05-07-llama-3.1-70b/results.json
  • benchmarks/published/2026-05-07-llama-3.1-70b/results.md
  • benchmarks/published/INDEX.md


@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 6 files

@SingleSourceStudios SingleSourceStudios merged commit c263324 into main May 7, 2026
4 checks passed
@SingleSourceStudios SingleSourceStudios deleted the chore/publish-benchmark-results branch May 7, 2026 12:38
