
chore/readme positioning pivot #50

Merged
SingleSourceStudios merged 4 commits into main from chore/readme-positioning-pivot
May 7, 2026

Conversation


SingleSourceStudios (Collaborator) commented May 7, 2026

  • chore(benchmarks): publish 2026-05-07 cross-model results
  • docs(readme): pivot positioning to audit/governance/consistency

Summary by cubic

Repositioned the README to make LOGIC.md the audit and governance layer for agent reasoning, and added 2026-05-07 cross-model benchmarks showing no measurable quality lift on the tested tasks. Published a public evidence index and narrowed the ignore rules so only ephemeral benchmark outputs are excluded, preserving published runs.

  • Docs

    • Updated hero and problem framing around auditability, safe modification, and consistency.
    • Replaced case study with a 60‑trial Archon integration showing higher structural agreement (87% vs 70%) and noted the 2.6× runtime tradeoff.
    • Added “What LOGIC.md delivers” and honest “When to use” guidance; rewrote Benchmarks section to separate structural consistency from quality; refreshed near‑term roadmap.
  • Benchmarks

    • Published Llama 3.1 70B runs (3 tasks, n=10); flat to slightly negative after excluding infra failures.
    • Published Claude Sonnet 4.6 code‑review (n=10); ceiling effect (99→100).
    • Added benchmarks/published/INDEX.md; refined benchmarks/.gitignore to ignore only ephemeral results/ while committing published evidence.
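
A minimal sketch of that .gitignore split, assuming only the patterns described in this PR (results/ newly ignored; the existing .log and npm debug rules retained; published/ left tracked):

```gitignore
# ephemeral run output (regenerated by the harness; safe to discard)
results/

# pre-existing rules: logs and npm debug output
*.log
npm-debug.log*

# note: published/ is intentionally not listed here, so committed
# evidence under benchmarks/published/ stays tracked
```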

Written for commit ac7bfd3. Summary will update on new commits.

Summary by CodeRabbit

  • Documentation

    • Updated README with expanded problem statement and new case study on structural consistency.
    • Added comprehensive benchmark findings showing no measurable quality improvements from LOGIC.md at current sample sizes (n=10).
  • Benchmarks

    • Published results from Llama 3.1 70B and Claude Sonnet 4.6 evaluation runs across multiple tasks.
    • Added index for accessing published benchmark evidence and reproduction methodologies.

Two independent runs on Claude Sonnet 4.6 (code-review, n=10) and
Llama 3.1 70B via Nvidia NIM (3 tasks, n=10) show no measurable
quality lift from LOGIC.md on these tasks at this sample size.

- Sonnet 4.6 ceiling effect: control 99/100, treatment 100/100
- Llama 3.1 70B flat to slightly negative after excluding 7 Nvidia
  connection-drop runs (cleaned means: code-review 98.3 vs 98.9,
  research-synthesis 94.6 vs 94.0, security-audit 89.0 vs 83.0)
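
A sketch of that exclusion step as it might be applied when recomputing
means (the record field names here are guesses; the results.json
published in this PR defines the real schema):

```python
import json

# Hypothetical fields ("task", "arm", "aggregate", "error"); see
# benchmarks/published/2026-05-07-llama-3.1-70b/results.json for
# the actual per-run schema.
with open("results.json") as f:
    runs = json.load(f)["runs"]

def cleaned_mean(task: str, arm: str) -> float:
    """Mean aggregate score for one task/arm, dropping runs that
    failed with provider connection errors."""
    scores = [
        r["aggregate"]
        for r in runs
        if r["task"] == task and r["arm"] == arm and not r.get("error")
    ]
    return sum(scores) / len(scores)
```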

Per-run results, raw harness output, and honest analysis preserved
at benchmarks/published/<date>-<model>/. INDEX.md catalogues runs
and documents open methodology questions.
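
Concretely, the evidence files this PR adds (per the reviewed file list
further down) form this layout:

```text
benchmarks/published/
├── INDEX.md
├── 2026-05-07-claude-sonnet-4-6-codereview/
│   └── summary.md
└── 2026-05-07-llama-3.1-70b/
    ├── analysis.md
    ├── results.json
    └── results.md
```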

Motivates positioning pivot toward structural consistency / audit /
governance, anchored in the 2026-05-06 Archon integration test
(87% hash agreement under LOGIC.md vs 70% without).
Quality-lift framing was unsupported by the cross-model benchmark
sweep run 2026-05-07 (Sonnet 4.6 ceiling effect, Llama 3.1 70B
flat-to-negative once infra failures are excluded). Phase 0 Ship
Honest doctrine requires updating the README to match what the
data supports.

What changed:
- Hero pitch: 'declarative reasoning layer' -> 'audit and
  governance layer for AI agent reasoning'
- 'The problem' rewritten around audit, modifiability, and
  consistency. Old 'describing-vs-doing' framing dropped (it
  was Modular9-specific and not generalisable).
- Case study replaced with the 60-trial Archon integration test
  that DID show clean signal: 87% vs 70% structural hash
  agreement (a sketch of one plausible definition of this metric
  appears at the end of this description), 10/10 vs 5/10
  identical tuples on auth-sql-injection. Anchored against the
  new public eval repo at
  github.com/SingularityAI-Dev/logic-md-archon-eval.
- 'When to use it' updated to recommend AGAINST LOGIC.md for
  raw quality on capable models. Added explicit honest disclosure.
- New section 'What LOGIC.md actually delivers' enumerating the
  three real properties: structure as contract, audit trail as
  default artifact, modifications as structured diffs.
- Benchmarks section rewritten with honest disclosure of the
  2026-05-07 cross-model results and the 2026-05-06 Archon
  results, framing them as complementary rather than conflicting.
- Roadmap 'Near term' updated to reflect that the benchmark suite
  has run (partially) and to queue actual next experiments.

Pitch and adopt LOGIC.md for structure and governance, not
quality. The technical features are unchanged.
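
The PR text does not include the scoring code behind the
hash-agreement metric. A hypothetical Python sketch of one plausible
definition (structure = sorted keys plus value types; agreement =
share of trials matching the modal hash); the names and the shape
definition are assumptions, not the project's actual implementation:

```python
import hashlib
import json
from collections import Counter

def structural_hash(output: dict) -> str:
    """Hash only the *shape* of an agent output: sorted keys and
    value types, not values. Illustrative definition only."""
    shape = {k: type(v).__name__ for k, v in sorted(output.items())}
    digest = hashlib.sha256(json.dumps(shape, sort_keys=True).encode())
    return digest.hexdigest()

def agreement_rate(outputs: list[dict]) -> float:
    """Share of trials whose structural hash equals the modal hash."""
    hashes = [structural_hash(o) for o in outputs]
    _, modal_count = Counter(hashes).most_common(1)[0]
    return modal_count / len(hashes)
```

Under a definition like this, 87% agreement over 60 trials means
roughly 52 of 60 runs produced an identical structure.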
SingleSourceStudios enabled auto-merge (squash) May 7, 2026 12:31

coderabbitai Bot commented May 7, 2026


📝 Walkthrough

This PR updates README.md with revised LOGIC.md framing emphasizing governance and auditability, adds a new structural consistency case study with Archon experimental results, and publishes comprehensive benchmark evidence for Claude Sonnet 4.6 and Llama 3.1 70B models showing no measurable quality lift at n=10. Benchmark infrastructure is updated to ignore ephemeral outputs and organize published evidence.

Changes

Documentation and Benchmark Results

  • README Core Messaging (README.md): Introduction revised to emphasize deterministic execution, compile-time contract validation, and auditable event traces. "The problem" section reworked from the intent-vs-data narrative to explicit claims about auditability, safe modification, and run-to-run consistency gaps. History paragraph updated with the Archon experiment reference.
  • README Implementation Details (README.md): Output contracts explanation updated: the runtime compiles a deterministic prompt segment with a ## Required Output section and emits prompt+schema in event traces. "What LOGIC.md actually delivers" expanded to describe structure as contract, audit trail as default artifact, and modifications as structured diffs.
  • README Case Studies and Evidence (README.md): "Describing-vs-doing fix" section replaced with a new "Case study: structural consistency under LOGIC.md" covering the May 2026 Archon 60-trial experiment results (verdict agreement, structural-hash agreement) and the audit/modifiability/overhead discussion. A new "Benchmarks and honest disclosure" section consolidates cross-model quality-lift results, explicit no-measurable-signal statements, structural consistency findings, reproducibility notes, and the distinction between quality-lift and consistency evidence.
  • README Roadmap (README.md): "Near term" bullets now specify concrete benchmark expansion (higher n, model comparisons) and follow-on tasks (scoring audit, VSCode marketplace, Python SDK parity, LangGraph Phase 2). The earlier generic benchmarks section is removed.
  • Benchmark Infrastructure (benchmarks/.gitignore): Added an "ephemeral run output" ignore section with a results/ directory pattern; .log and npm debug patterns continue to be ignored.
  • Benchmark Evidence Organization (benchmarks/published/INDEX.md): New index document describing the published benchmark structure, evidence-preservation rules, a table of the two 2026-05-07 benchmark runs (Sonnet 4.6 code-review; Llama 3.1 70B three-task), the headline finding of no measurable quality lift, open methodology questions, and the re-running workflow with environment variables and a command pattern.
  • Sonnet 4.6 Code-Review Benchmark (benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md): New benchmark report for the Claude Sonnet 4.6 code-review-only run (n=10): metadata, the rationale for missing raw JSON (harness path overwrite), the headline +1.0 aggregate difference, per-dimension metrics, a ceiling-effect interpretation concluding LOGIC.md is a no-op for quality here, reproducibility commands, limitations, and cost disclosure.
  • Llama 3.1 70B Multi-Task Benchmark (benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md, results.json, results.md): Complete Llama 3.1 70B benchmark. analysis.md documents the experimental setup, identifies 7/60 Nvidia NIM connection-drop failures, reports cleaned aggregate scores post-exclusion, task-specific results (code-review and research-synthesis flat; security-audit treatment −6 points), anomaly enumeration, interpretation statements, reproducibility instructions, and an artifact summary. results.json captures per-run metrics (timestamps, tokens, errors, compliance/quality dimensions, aggregates) plus stats summaries. results.md provides control/treatment summary tables per task with key-findings numeric differences.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A benchmark sprint, now etched in stone—
No quality lift, but structures shown!
With Archon's grace and Llama's dance,
We measure twice, and truth we chance.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Description check (⚠️ Warning): The PR description does not follow the required template structure; it is missing several mandatory sections, including the linked issue, spec impact checkboxes, and the completion checklist. Resolution: update the PR description to include all required template sections: add a linked issue reference (Closes/Refs/#n or N/A), fill in the spec impact checkboxes, and complete the checklist items with verification status.
✅ Passed checks (4 passed)
  • Title check: The title 'chore/readme positioning pivot' accurately reflects the main content change: a repositioning of the README's framing from quality-lift focus to audit/governance/consistency.
  • Docstring Coverage: No functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Linked Issues check: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Skipped because no linked issues were found for this pull request.





coderabbitai (bot) left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md`:
- Around line 24-28: The fenced code block in the summary.md lacks a language
tag which triggers markdownlint MD040; update the block fence that contains the
table-like text (the triple-backtick block showing "Control Treatment Diff" and
the code-review rows) to include a language identifier (e.g., change ``` to
```text) so the block is recognized as plaintext and the linter warning is
resolved.
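
The requested fix, sketched against the block the comment describes
(row values taken from the headline numbers elsewhere in this PR;
layout illustrative):

````markdown
```text
             Control   Treatment   Diff
code-review  99.0      100.0       +1.0
```
````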

In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md`:
- Around line 20-25: The fenced code blocks in analysis.md lack language
specifiers (MD040); update each triple-backtick block (including the shown table
block and the other occurrences at the referenced ranges) to include an
appropriate language token such as "text" or "json" (e.g., change ``` to
```text) so markdown-lint passes; locate the blocks by scanning for the
triple-backtick fences around the table rows (the lines showing
"Control Treatment Diff" and surrounding rows) and the other blocks at
the mentioned ranges, and add the same language token consistently.

In `@benchmarks/published/2026-05-07-llama-3.1-70b/results.md`:
- Around line 55-79: Update the "Key Findings" section to clearly mark these
numbers as raw/unadjusted and note they are confounded by infra failures
(zero-score connection-drop runs); specifically, append "(raw / unadjusted)" to
the "Key Findings" heading or each subsection title such as
"code-review:meta/llama-3.1-70b-instruct", add a short parenthetical sentence
immediately under the Key Findings heading stating that some runs were
zero-scored due to connection drops and that adjusted/cleaned results are in
analysis.md, and insert a hyperlink to analysis.md labeled "See cleaned findings
in analysis.md" so readers are directed to the corrected data.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a5972d08-9491-4f88-a24f-9508242b88e8

📥 Commits

Reviewing files that changed from the base of the PR and between 9d9f8dc and 2e9890e.

📒 Files selected for processing (7)
  • README.md
  • benchmarks/.gitignore
  • benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md
  • benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md
  • benchmarks/published/2026-05-07-llama-3.1-70b/results.json
  • benchmarks/published/2026-05-07-llama-3.1-70b/results.md
  • benchmarks/published/INDEX.md


cubic-dev-ai (bot) left a comment


2 issues found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="benchmarks/published/2026-05-07-llama-3.1-70b/results.md">

<violation number="1" location="benchmarks/published/2026-05-07-llama-3.1-70b/results.md:55">
P2: These "Key Findings" include zero-score runs caused by Nvidia NIM connection drops, which materially skew the reported deltas (e.g. the +19 and +12 lifts are artifacts of unequal failure distribution). Add an explicit caveat that these are raw/unadjusted figures and link to `analysis.md` for the cleaned comparison.</violation>

<violation number="2" location="benchmarks/published/2026-05-07-llama-3.1-70b/results.md:63">
P2: The code-review finding labels a worsening metric as a "Reduction". This should be reported as an increase (+9 percentage points) to avoid misleading conclusions.</violation>
</file>


SingleSourceStudios merged commit baa0763 into main May 7, 2026
4 checks passed
SingleSourceStudios deleted the chore/readme-positioning-pivot branch May 7, 2026 13:11