
chore(benchmarks): publish 2026-05-07 cross-model results#49

Merged
SingleSourceStudios merged 2 commits into main from chore/publish-benchmark-results on May 7, 2026
Conversation

Collaborator

@SingleSourceStudios SingleSourceStudios commented May 7, 2026

Two independent runs on Claude Sonnet 4.6 (code-review, n=10) and
Llama 3.1 70B via Nvidia NIM (3 tasks, n=10) show no measurable
quality lift from LOGIC.md on these tasks at this sample size.

  • Sonnet 4.6 ceiling effect: control 99/100, treatment 100/100
  • Llama 3.1 70B flat to slightly negative after excluding 7 Nvidia
    connection-drop runs (cleaned means: code-review 98.3 vs 98.9,
    research-synthesis 94.6 vs 94.0, security-audit 89.0 vs 83.0)

Per-run results, raw harness output, and honest analysis preserved
at benchmarks/published/&lt;date&gt;-&lt;model&gt;/. INDEX.md catalogues runs
and documents open methodology questions.
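The exclusion step described above (dropping the 7 connection-drop runs before recomputing means) can be sketched as a small filter over the raw run records. The field names used here (`error`, `arm`, `score`) are assumptions for illustration, not the actual schema of the published results.json:

```python
# Sketch of the cleanup step: drop runs that failed for infrastructure
# reasons (NIM connection drops) before computing per-arm means.
from statistics import mean

def cleaned_means(runs):
    """Group surviving runs by (task, arm) and average their scores."""
    groups = {}
    for run in runs:
        if run.get("error") == "connection_dropped":
            continue  # infrastructure failure, not a model failure
        key = (run["task"], run["arm"])
        groups.setdefault(key, []).append(run["score"])
    return {key: round(mean(scores), 1) for key, scores in groups.items()}

runs = [
    {"task": "code-review", "arm": "control", "score": 98, "error": None},
    {"task": "code-review", "arm": "control", "score": 0, "error": "connection_dropped"},
    {"task": "code-review", "arm": "treatment", "score": 99, "error": None},
]
print(cleaned_means(runs))
# {('code-review', 'control'): 98, ('code-review', 'treatment'): 99}
```

Note the failed run's score of 0 never enters the control mean; including it would have dragged the control aggregate down and made the comparison meaningless, which is why the analysis reports cleaned means with per-cell n.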

Motivates positioning pivot toward structural consistency / audit /
governance, anchored in the 2026-05-06 Archon integration test
(87% hash agreement under LOGIC.md vs 70% without).
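A hash-agreement rate like the 87% / 70% figures above could be computed by hashing each run's normalized output and reporting the share of runs matching the most common digest. The whitespace-collapsing normalization here is an assumption; the Archon integration test may define agreement differently:

```python
# Sketch: fraction of runs whose normalized output matches the modal hash.
import hashlib
from collections import Counter

def hash_agreement(outputs):
    digests = [
        hashlib.sha256(" ".join(o.split()).encode()).hexdigest()
        for o in outputs
    ]
    modal_count = Counter(digests).most_common(1)[0][1]
    return modal_count / len(digests)

outputs = ["result: A", "result:  A", "result: B", "result: A"]
print(hash_agreement(outputs))  # 0.75
```

Under this framing, LOGIC.md would be judged on structural consistency (how often independent runs converge on byte-identical normalized output) rather than on graded quality scores.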


Summary by cubic

Publish 2026-05-07 benchmark artifacts for claude-sonnet-4-6 and meta/llama-3.1-70b-instruct with analysis and an index. Result: no measurable quality lift from LOGIC.md at n=10 (Sonnet ceiling; Llama 70B flat to slightly negative after excluding Nvidia NIM connection drops).

  • New Features

    • Published benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md (single-task; raw JSON not preserved due to harness overwrite).
    • Published benchmarks/published/2026-05-07-llama-3.1-70b/{results.json,results.md,analysis.md}; analysis removes 7 NIM connection-drop runs. Cleaned means: code-review 98.3 vs 98.9, research-synthesis 94.6 vs 94.0, security-audit 89.0 vs 83.0.
    • Added benchmarks/published/INDEX.md to catalog runs and note methodology questions.
  • Refactors

    • Updated benchmarks/.gitignore to treat benchmarks/results/ as ephemeral; committed runs live under benchmarks/published/.
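The benchmarks/.gitignore arrangement described above might look like the following sketch; the exact patterns are an assumption, since the committed file's contents are not shown in this view:

```gitignore
# Ephemeral harness output: regenerated on every run, never committed.
results/

# Curated runs are copied into published/, which stays under version control.
```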

Written for commit 3a2f07a. Summary will update on new commits.

Summary by CodeRabbit

  • Documentation

    • Added comprehensive benchmark reports and analysis documentation for model evaluations conducted on 2026-05-07.
    • Published detailed results with performance metrics and reproducibility instructions.
    • Created index documentation for published benchmark runs.
  • Chores

    • Updated configuration to exclude benchmark output directories from version control.

@SingleSourceStudios SingleSourceStudios enabled auto-merge (squash) May 7, 2026 11:14

coderabbitai Bot commented May 7, 2026

📝 Walkthrough

This PR publishes the results of two benchmark runs evaluating LOGIC.md prompt effectiveness across Claude Sonnet 4.6 and Llama 3.1 70B models. It adds raw metrics data, statistical analysis identifying infrastructure failures, detailed interpretation showing no measurable quality lift at n=10, and centralized documentation for reproducibility.

Changes

Benchmark Results Publication

  • Output Configuration (benchmarks/.gitignore): Git ignore pattern updated to exclude the ephemeral results/ directory.
  • Benchmark Results Data & Metrics (benchmarks/published/2026-05-07-llama-3.1-70b/results.json, results.md): Raw per-run metrics records and aggregated statistics as JSON; formatted markdown summary tables comparing control vs. treatment with mean±stddev scores across three tasks.
  • Claude Sonnet Code-Review Report (benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md): Single-task code-review benchmark showing a ceiling effect with no lift (control 99±1 vs treatment 100±0); includes setup, results, reproducibility, and limitations analysis.
  • Llama Analysis & Interpretation (benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md): Identifies Nvidia NIM connection failures across runs, recomputes cleaned aggregates after exclusion, interprets no measurable LOGIC.md lift at n=10, documents anomalies (fixed 89-point pattern, harsh enum validation penalties, high LLM variance), and references supporting integration test evidence.
  • Reproducibility & Methodology (benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md): Exact run commands, environment variable guidance, a helper script to filter infrastructure failures, and a list of produced benchmark artifacts.
  • Published Benchmark Index (benchmarks/published/INDEX.md): Central documentation of published runs, the headline finding (no quality lift from LOGIC.md), an index table linking to individual runs, deferred methodology questions, and re-run/publish workflow instructions.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 Benchmarks bouncing, results taking flight,
Two models tested, metrics burning bright,
No lift from LOGIC at this sample size,
But hey, the data's honest—that's the prize!
Infrastructure quirks and anomalies to explore,
Each run recorded, evidence to pour.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Description check ⚠️ Warning: The PR description lacks required sections from the template (Linked issue, Spec impact, Checklist) but provides substantial context about the benchmark results and methodology. Resolution: add the missing required sections, specify the linked issue (or N/A), check the applicable Spec impact boxes, and complete the test/build/lint checklist.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title clearly summarizes the main change: publishing dated benchmark results from two cross-model runs (Sonnet 4.6 and Llama 3.1 70B).
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage; check skipped.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.




@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (4)
benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md (1)

24-28: ⚡ Quick win

Add language identifier to fenced code block.

The code block should specify a language (e.g., text) to satisfy markdown linting rules and improve documentation consistency.

📝 Proposed fix
-```
+```text
                 Control          Treatment        Diff
 code-review     99 ± 1           100 ± 0          +1.0
                 (range 97-100)   (range 99-100)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md`
around lines 24-28: the fenced code block lacks a language tag. Add a language
identifier (e.g., `text`) after the opening triple backticks of the fence that
wraps the table starting with "Control Treatment Diff", so the block satisfies
markdown linting and documentation consistency.
benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md (3)

68-70: ⚡ Quick win

Add language identifier to fenced code block.

The code block should specify a language (e.g., json) to satisfy markdown linting rules, as this contains a JSON error message.

📝 Proposed fix
-```
+```json
 "errors": ["/sources/3/type: must be equal to one of the allowed values"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md` around lines
68-70: the fenced code block displaying the JSON error needs a language
specifier. Change the opening ``` to ```json on the block containing
`"errors": ["/sources/3/type: must be equal to one of the allowed values"]`,
keeping the closing ``` in place, so the snippet is marked as JSON and
satisfies the linter.

43-48: ⚡ Quick win

Add language identifier to fenced code block.

The code block should specify a language (e.g., text) to satisfy markdown linting rules.

📝 Proposed fix
-```
+```text
                         Control          Treatment       Diff
 code-review             98.3 (n=10)      98.9 (n=9)      +0.6
 research-synthesis      94.6 (n=7)       94.0 (n=9)      -0.6
 security-audit          89.0 (n=8)       83.0 (n=10)     -6.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md` around lines
43-48: the fenced code block is missing a language identifier, which fails
markdown linting. Change the opening ``` to ```text on the fence around the
Control/Treatment/Diff table so the block is labeled and passes the linter.

20-25: ⚡ Quick win

Add language identifier to fenced code block.

The code block should specify a language (e.g., text) to satisfy markdown linting rules and improve documentation consistency.

📝 Proposed fix
-```
+```text
                         Control       Treatment    Diff
 code-review             98 ± 2        89 ± 30      -9
 research-synthesis      66 ± 44       85 ± 29      +19
 security-audit          71 ± 36       83 ± 17      +12
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md` around lines
20-25: the fenced code block lacks a language identifier, which fails markdown
linting. Change the opening ``` to ```text on the fence containing the
Control/Treatment/Diff table so the block is labeled and satisfies the linter
and documentation consistency checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2a769d53-ca03-4bcc-a388-b2903c3ef8e9

📥 Commits

Reviewing files that changed from the base of the PR and between 9d9f8dc and aeeee0a.

📒 Files selected for processing (6)
  • benchmarks/.gitignore
  • benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md
  • benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md
  • benchmarks/published/2026-05-07-llama-3.1-70b/results.json
  • benchmarks/published/2026-05-07-llama-3.1-70b/results.md
  • benchmarks/published/INDEX.md


@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 6 files

@SingleSourceStudios SingleSourceStudios merged commit c263324 into main May 7, 2026
4 checks passed
@SingleSourceStudios SingleSourceStudios deleted the chore/publish-benchmark-results branch May 7, 2026 12:38
