
chore/readme positioning pivot #50

Merged
SingleSourceStudios merged 4 commits into main from chore/readme-positioning-pivot
May 7, 2026

Conversation


SingleSourceStudios (Collaborator) commented May 7, 2026

  • chore(benchmarks): publish 2026-05-07 cross-model results
  • docs(readme): pivot positioning to audit/governance/consistency

Summary by cubic

Repositioned the README to make LOGIC.md the audit and governance layer for agent reasoning, and added 2026-05-07 cross-model benchmarks showing no measurable quality lift on the tested tasks. Published a public evidence index and narrowed the ignore rules so only ephemeral benchmark outputs are excluded, preserving published runs.

  • Docs

    • Updated hero and problem framing around auditability, safe modification, and consistency.
    • Replaced case study with a 60‑trial Archon integration showing higher structural agreement (87% vs 70%) and noted the 2.6× runtime tradeoff.
    • Added “What LOGIC.md delivers” and honest “When to use” guidance; rewrote Benchmarks section to separate structural consistency from quality; refreshed near‑term roadmap.
  • Benchmarks

    • Published Llama 3.1 70B runs (3 tasks, n=10); flat to slightly negative after excluding infra failures.
    • Published Claude Sonnet 4.6 code‑review (n=10); ceiling effect (99→100).
    • Added benchmarks/published/INDEX.md; refined benchmarks/.gitignore to ignore only ephemeral results/ while committing published evidence.
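
A minimal sketch of that .gitignore split, assuming only the patterns described in this PR (results/ newly ignored; the existing .log and npm debug rules retained; published/ left tracked):

```gitignore
# ephemeral run output (regenerated by the harness; safe to discard)
results/

# pre-existing rules: logs and npm debug output
*.log
npm-debug.log*

# note: published/ is intentionally not listed here, so committed
# evidence under benchmarks/published/ stays tracked
```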

Written for commit ac7bfd3. Summary will update on new commits.

Summary by CodeRabbit

  • Documentation

    • Updated README with expanded problem statement and new case study on structural consistency.
    • Added comprehensive benchmark findings showing no measurable quality improvements from LOGIC.md at current sample sizes (n=10).
  • Benchmarks

    • Published results from Llama 3.1 70B and Claude Sonnet 4.6 evaluation runs across multiple tasks.
    • Added index for accessing published benchmark evidence and reproduction methodologies.

Two independent runs on Claude Sonnet 4.6 (code-review, n=10) and
Llama 3.1 70B via Nvidia NIM (3 tasks, n=10) show no measurable
quality lift from LOGIC.md on these tasks at this sample size.

- Sonnet 4.6 ceiling effect: control 99/100, treatment 100/100
- Llama 3.1 70B flat to slightly negative after excluding 7 Nvidia
  connection-drop runs (cleaned means: code-review 98.3 vs 98.9,
  research-synthesis 94.6 vs 94.0, security-audit 89.0 vs 83.0)
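
A sketch of that exclusion step as it might be applied when recomputing
means (the record field names here are guesses; the results.json
published in this PR defines the real schema):

```python
import json

# Hypothetical fields ("task", "arm", "aggregate", "error"); see
# benchmarks/published/2026-05-07-llama-3.1-70b/results.json for
# the actual per-run schema.
with open("results.json") as f:
    runs = json.load(f)["runs"]

def cleaned_mean(task: str, arm: str) -> float:
    """Mean aggregate score for one task/arm, dropping runs that
    failed with provider connection errors."""
    scores = [
        r["aggregate"]
        for r in runs
        if r["task"] == task and r["arm"] == arm and not r.get("error")
    ]
    return sum(scores) / len(scores)
```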

Per-run results, raw harness output, and honest analysis preserved
at benchmarks/published/<date>-<model>/. INDEX.md catalogues runs
and documents open methodology questions.
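
Concretely, the evidence files this PR adds (per the reviewed file list
further down) form this layout:

```text
benchmarks/published/
├── INDEX.md
├── 2026-05-07-claude-sonnet-4-6-codereview/
│   └── summary.md
└── 2026-05-07-llama-3.1-70b/
    ├── analysis.md
    ├── results.json
    └── results.md
```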

Motivates positioning pivot toward structural consistency / audit /
governance, anchored in the 2026-05-06 Archon integration test
(87% hash agreement under LOGIC.md vs 70% without).
Quality-lift framing was unsupported by the cross-model benchmark
sweep run 2026-05-07 (Sonnet 4.6 ceiling effect, Llama 3.1 70B
flat-to-negative once infra failures are excluded). Phase 0 Ship
Honest doctrine requires updating the README to match what the
data supports.

What changed:
- Hero pitch: 'declarative reasoning layer' -> 'audit and
  governance layer for AI agent reasoning'
- 'The problem' rewritten around audit, modifiability, and
  consistency. Old 'describing-vs-doing' framing dropped (it
  was Modular9-specific and not generalisable).
- Case study replaced with the 60-trial Archon integration test
  that DID show clean signal: 87% vs 70% structural hash
  agreement (a sketch of one plausible definition of this metric
  appears at the end of this description), 10/10 vs 5/10
  identical tuples on auth-sql-injection. Anchored against the
  new public eval repo at
  github.com/SingularityAI-Dev/logic-md-archon-eval.
- 'When to use it' updated to recommend AGAINST LOGIC.md for
  raw quality on capable models. Added explicit honest disclosure.
- New section 'What LOGIC.md actually delivers' enumerating the
  three real properties: structure as contract, audit trail as
  default artifact, modifications as structured diffs.
- Benchmarks section rewritten with honest disclosure of the
  2026-05-07 cross-model results and the 2026-05-06 Archon
  results, framing them as complementary rather than conflicting.
- Roadmap 'Near term' updated to reflect that the benchmark suite
  has run (partially) and to queue actual next experiments.

Pitch and adopt LOGIC.md for structure and governance, not
quality. The technical features are unchanged.
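
The PR text does not include the scoring code behind the
hash-agreement metric. A hypothetical Python sketch of one plausible
definition (structure = sorted keys plus value types; agreement =
share of trials matching the modal hash); the names and the shape
definition are assumptions, not the project's actual implementation:

```python
import hashlib
import json
from collections import Counter

def structural_hash(output: dict) -> str:
    """Hash only the *shape* of an agent output: sorted keys and
    value types, not values. Illustrative definition only."""
    shape = {k: type(v).__name__ for k, v in sorted(output.items())}
    digest = hashlib.sha256(json.dumps(shape, sort_keys=True).encode())
    return digest.hexdigest()

def agreement_rate(outputs: list[dict]) -> float:
    """Share of trials whose structural hash equals the modal hash."""
    hashes = [structural_hash(o) for o in outputs]
    _, modal_count = Counter(hashes).most_common(1)[0]
    return modal_count / len(hashes)
```

Under a definition like this, 87% agreement over 60 trials means
roughly 52 of 60 runs produced an identical structure.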
SingleSourceStudios enabled auto-merge (squash) May 7, 2026 12:31

coderabbitai Bot commented May 7, 2026


📝 Walkthrough

This PR updates README.md with revised LOGIC.md framing emphasizing governance and auditability, adds a new structural consistency case study with Archon experimental results, and publishes comprehensive benchmark evidence for Claude Sonnet 4.6 and Llama 3.1 70B models showing no measurable quality lift at n=10. Benchmark infrastructure is updated to ignore ephemeral outputs and organize published evidence.

Changes

Documentation and Benchmark Results

  • README Core Messaging (README.md): Introduction revised to emphasize deterministic execution, compile-time contract validation, and auditable event traces. "The problem" section reworked from the intent-vs-data narrative to explicit claims about auditability, safe modification, and run-to-run consistency gaps. History paragraph updated with the Archon experiment reference.
  • README Implementation Details (README.md): Output contracts explanation updated: the runtime compiles a deterministic prompt segment with a ## Required Output section and emits prompt+schema in event traces. "What LOGIC.md actually delivers" expanded to describe structure as contract, audit trail as default artifact, and modifications as structured diffs.
  • README Case Studies and Evidence (README.md): "Describing-vs-doing fix" section replaced with a new "Case study: structural consistency under LOGIC.md" covering the May 2026 Archon 60-trial experiment results (verdict agreement, structural-hash agreement) and the audit/modifiability/overhead discussion. A new "Benchmarks and honest disclosure" section consolidates cross-model quality-lift results, explicit no-measurable-signal statements, structural consistency findings, reproducibility notes, and the distinction between quality-lift and consistency evidence.
  • README Roadmap (README.md): "Near term" bullets now specify concrete benchmark expansion (higher n, model comparisons) and follow-on tasks (scoring audit, VSCode marketplace, Python SDK parity, LangGraph Phase 2). The earlier generic benchmarks section is removed.
  • Benchmark Infrastructure (benchmarks/.gitignore): Added an "ephemeral run output" ignore section with a results/ directory pattern; .log and npm debug patterns continue to be ignored.
  • Benchmark Evidence Organization (benchmarks/published/INDEX.md): New index document describing the published benchmark structure, evidence-preservation rules, a table of the two 2026-05-07 benchmark runs (Sonnet 4.6 code-review; Llama 3.1 70B three-task), the headline finding of no measurable quality lift, open methodology questions, and the re-running workflow with environment variables and a command pattern.
  • Sonnet 4.6 Code-Review Benchmark (benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md): New benchmark report for the Claude Sonnet 4.6 code-review-only run (n=10): metadata, the rationale for missing raw JSON (harness path overwrite), the headline +1.0 aggregate difference, per-dimension metrics, a ceiling-effect interpretation concluding LOGIC.md is a no-op for quality here, reproducibility commands, limitations, and cost disclosure.
  • Llama 3.1 70B Multi-Task Benchmark (benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md, results.json, results.md): Complete Llama 3.1 70B benchmark. analysis.md documents the experimental setup, identifies 7/60 Nvidia NIM connection-drop failures, reports cleaned aggregate scores post-exclusion, task-specific results (code-review and research-synthesis flat; security-audit treatment −6 points), anomaly enumeration, interpretation statements, reproducibility instructions, and an artifact summary. results.json captures per-run metrics (timestamps, tokens, errors, compliance/quality dimensions, aggregates) plus stats summaries. results.md provides control/treatment summary tables per task with key-findings numeric differences.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A benchmark sprint, now etched in stone—
No quality lift, but structures shown!
With Archon's grace and Llama's dance,
We measure twice, and truth we chance.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Description check (⚠️ Warning): The PR description does not follow the required template structure; it is missing several mandatory sections, including the linked issue, spec impact checkboxes, and the completion checklist. Resolution: update the PR description to include all required template sections: add a linked issue reference (Closes/Refs/#n or N/A), fill in the spec impact checkboxes, and complete the checklist items with verification status.
✅ Passed checks (4 passed)
  • Title check: The title 'chore/readme positioning pivot' accurately reflects the main content change: a repositioning of the README's framing from quality-lift focus to audit/governance/consistency.
  • Docstring Coverage: No functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Linked Issues check: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Skipped because no linked issues were found for this pull request.





coderabbitai (bot) left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md`:
- Around line 24-28: The fenced code block in the summary.md lacks a language
tag which triggers markdownlint MD040; update the block fence that contains the
table-like text (the triple-backtick block showing "Control Treatment Diff" and
the code-review rows) to include a language identifier (e.g., change ``` to
```text) so the block is recognized as plaintext and the linter warning is
resolved.
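
The requested fix, sketched against the block the comment describes
(row values taken from the headline numbers elsewhere in this PR;
layout illustrative):

````markdown
```text
             Control   Treatment   Diff
code-review  99.0      100.0       +1.0
```
````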

In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md`:
- Around line 20-25: The fenced code blocks in analysis.md lack language
specifiers (MD040); update each triple-backtick block (including the shown table
block and the other occurrences at the referenced ranges) to include an
appropriate language token such as "text" or "json" (e.g., change ``` to
```text) so markdown-lint passes; locate the blocks by scanning for the
triple-backtick fences around the table rows (the lines showing
"Control Treatment Diff" and surrounding rows) and the other blocks at
the mentioned ranges, and add the same language token consistently.

In `@benchmarks/published/2026-05-07-llama-3.1-70b/results.md`:
- Around line 55-79: Update the "Key Findings" section to clearly mark these
numbers as raw/unadjusted and note they are confounded by infra failures
(zero-score connection-drop runs); specifically, append "(raw / unadjusted)" to
the "Key Findings" heading or each subsection title such as
"code-review:meta/llama-3.1-70b-instruct", add a short parenthetical sentence
immediately under the Key Findings heading stating that some runs were
zero-scored due to connection drops and that adjusted/cleaned results are in
analysis.md, and insert a hyperlink to analysis.md labeled "See cleaned findings
in analysis.md" so readers are directed to the corrected data.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a5972d08-9491-4f88-a24f-9508242b88e8

📥 Commits

Reviewing files that changed from the base of the PR and between 9d9f8dc and 2e9890e.

📒 Files selected for processing (7)
  • README.md
  • benchmarks/.gitignore
  • benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md
  • benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md
  • benchmarks/published/2026-05-07-llama-3.1-70b/results.json
  • benchmarks/published/2026-05-07-llama-3.1-70b/results.md
  • benchmarks/published/INDEX.md


cubic-dev-ai (bot) left a comment


2 issues found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="benchmarks/published/2026-05-07-llama-3.1-70b/results.md">

<violation number="1" location="benchmarks/published/2026-05-07-llama-3.1-70b/results.md:55">
P2: These "Key Findings" include zero-score runs caused by Nvidia NIM connection drops, which materially skew the reported deltas (e.g. the +19 and +12 lifts are artifacts of unequal failure distribution). Add an explicit caveat that these are raw/unadjusted figures and link to `analysis.md` for the cleaned comparison.</violation>

<violation number="2" location="benchmarks/published/2026-05-07-llama-3.1-70b/results.md:63">
P2: The code-review finding labels a worsening metric as a "Reduction". This should be reported as an increase (+9 percentage points) to avoid misleading conclusions.</violation>
</file>


SingleSourceStudios merged commit baa0763 into main May 7, 2026
4 checks passed
SingleSourceStudios deleted the chore/readme-positioning-pivot branch May 7, 2026 13:11