chore/readme positioning pivot #50
Conversation
Two independent runs on Claude Sonnet 4.6 (code-review, n=10) and Llama 3.1 70B via Nvidia NIM (3 tasks, n=10) show no measurable quality lift from LOGIC.md on these tasks at this sample size.

- Sonnet 4.6 ceiling effect: control 99/100, treatment 100/100
- Llama 3.1 70B flat to slightly negative after excluding 7 Nvidia connection-drop runs (cleaned means: code-review 98.3 vs 98.9, research-synthesis 94.6 vs 94.0, security-audit 89.0 vs 83.0)

Per-run results, raw harness output, and honest analysis are preserved at benchmarks/published/<date>-<model>/. INDEX.md catalogues the runs and documents open methodology questions. This motivates a positioning pivot toward structural consistency / audit / governance, anchored in the 2026-05-06 Archon integration test (87% hash agreement under LOGIC.md vs 70% without).
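To make the cleaning step concrete, here is a minimal sketch of the exclusion arithmetic; the function name and the score list are illustrative placeholders, not the actual per-run data (that lives in results.json under benchmarks/published/):

```python
# Hypothetical illustration of the cleaning step: drop zero-score runs
# caused by Nvidia NIM connection drops, then average what remains.
def cleaned_mean(scores: list[float]) -> float:
    """Mean over runs, excluding zero-scored infrastructure failures."""
    valid = [s for s in scores if s > 0]  # 0 = connection drop, not a model score
    return sum(valid) / len(valid) if valid else float("nan")

placeholder_runs = [92, 0, 88, 90, 87, 0, 91, 86, 89, 89]  # n=10, two drops
print(cleaned_mean(placeholder_runs))  # -> 89.0 on this placeholder data
```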
The quality-lift framing was unsupported by the cross-model benchmark sweep run 2026-05-07 (Sonnet 4.6 ceiling effect; Llama 3.1 70B flat-to-negative once infra failures are excluded). The Phase 0 Ship Honest doctrine requires updating the README to match what the data supports.

What changed:

- Hero pitch: 'declarative reasoning layer' -> 'audit and governance layer for AI agent reasoning'
- 'The problem' rewritten around audit, modifiability, and consistency. The old 'describing-vs-doing' framing is dropped (it was Modular9-specific and not generalisable).
- Case study replaced with the 60-trial Archon integration test that DID show clean signal: 87% vs 70% structural hash agreement, 10/10 vs 5/10 identical tuples on auth-sql-injection. Anchored against the new public eval repo at github.com/SingularityAI-Dev/logic-md-archon-eval. (A sketch of the hash-agreement metric follows this list.)
- 'When to use it' updated to recommend AGAINST LOGIC.md for raw quality on capable models, with an explicit honest disclosure added.
- New section 'What LOGIC.md actually delivers' enumerating the three real properties: structure as contract, audit trail as default artifact, modifications as structured diffs.
- Benchmarks section rewritten with honest disclosure of the 2026-05-07 cross-model results and the 2026-05-06 Archon results, framing them as complementary rather than conflicting.
- Roadmap 'Near term' updated to reflect that the benchmark suite has (partially) run and to queue the actual next experiments.

Pitch and adopt LOGIC.md for structure and governance, not quality. The technical features are unchanged.
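Since the Archon harness itself is not part of this PR, the following is a hedged sketch of what a structural-hash-agreement metric of the kind cited above could look like; the tuple fields, normalization, and function names are assumptions for illustration, not the actual implementation from logic-md-archon-eval:

```python
import hashlib
from collections import Counter

def structural_hash(steps: list[tuple[str, str, str]]) -> str:
    """Hash the *structure* of a reasoning trace -- e.g. (rule, subject,
    verdict) tuples -- ignoring free-text phrasing. Field choice is assumed."""
    canonical = "|".join(f"{r}:{s}:{v}" for r, s, v in steps)
    return hashlib.sha256(canonical.encode()).hexdigest()

def agreement(traces: list[list[tuple[str, str, str]]]) -> float:
    """Fraction of runs whose structural hash matches the modal hash."""
    hashes = [structural_hash(t) for t in traces]
    _, modal_count = Counter(hashes).most_common(1)[0]
    return modal_count / len(hashes)

# Under this metric, 10/10 identical tuples on a task yields agreement 1.0,
# and the 87% vs 70% figures would be this fraction aggregated across trials.
```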
📝 Walkthrough: This PR updates documentation and benchmark results.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md`:
- Around lines 24-28: The fenced code block in summary.md lacks a language tag, which triggers markdownlint MD040. Update the fence around the table-like text (the triple-backtick block showing "Control Treatment Diff" and the code-review rows) to include a language identifier (e.g., change ``` to ```text) so the block is recognized as plaintext and the linter warning is resolved. A before/after sketch follows.
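A minimal before/after sketch of the suggested fix (the table rows here are abbreviated reconstructions, not verbatim contents of summary.md):

````markdown
<!-- Before: bare fence, trips markdownlint MD040 -->
```
Control   Treatment   Diff
code-review   99/100   100/100   +1
```

<!-- After: language tag marks the block as plaintext -->
```text
Control   Treatment   Diff
code-review   99/100   100/100   +1
```
````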
In `@benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md`:
- Around lines 20-25: The fenced code blocks in analysis.md lack language specifiers (MD040). Update each triple-backtick block, including the table block shown and the other occurrences at the referenced ranges, to include an appropriate language token such as "text" or "json" (e.g., change ``` to ```text) so markdownlint passes. Locate the blocks by scanning for the triple-backtick fences around the table rows (the lines showing "Control Treatment Diff" and surrounding rows), and apply the same language token consistently.
In `@benchmarks/published/2026-05-07-llama-3.1-70b/results.md`:
- Around lines 55-79: Update the "Key Findings" section to clearly mark these numbers as raw/unadjusted and to note that they are confounded by infra failures (zero-score connection-drop runs). Specifically: append "(raw / unadjusted)" to the "Key Findings" heading or to each subsection title such as "code-review:meta/llama-3.1-70b-instruct"; add a short parenthetical sentence immediately under the Key Findings heading stating that some runs were zero-scored due to connection drops and that adjusted/cleaned results are in analysis.md; and insert a hyperlink labeled "See cleaned findings in analysis.md" so readers are directed to the corrected data. A sketch of this edit follows.
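One possible shape for that edit (heading level, wording, and the relative link path are assumptions):

````markdown
## Key Findings (raw / unadjusted)

(Some runs below were zero-scored due to Nvidia NIM connection drops, so the
per-task deltas are confounded by infrastructure failures. See cleaned
findings in [analysis.md](./analysis.md).)
````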
ℹ️ Review info

⚙️ Run configuration
- Configuration used: defaults
- Review profile: CHILL
- Plan: Pro
- Run ID: a5972d08-9491-4f88-a24f-9508242b88e8
📒 Files selected for processing (7)
- README.md
- benchmarks/.gitignore
- benchmarks/published/2026-05-07-claude-sonnet-4-6-codereview/summary.md
- benchmarks/published/2026-05-07-llama-3.1-70b/analysis.md
- benchmarks/published/2026-05-07-llama-3.1-70b/results.json
- benchmarks/published/2026-05-07-llama-3.1-70b/results.md
- benchmarks/published/INDEX.md
2 issues found across 7 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="benchmarks/published/2026-05-07-llama-3.1-70b/results.md">
<violation number="1" location="benchmarks/published/2026-05-07-llama-3.1-70b/results.md:55">
P2: These "Key Findings" include zero-score runs caused by Nvidia NIM connection drops, which materially skew the reported deltas (e.g. the +19 and +12 lifts are artifacts of unequal failure distribution). Add an explicit caveat that these are raw/unadjusted figures and link to `analysis.md` for the cleaned comparison.</violation>
<violation number="2" location="benchmarks/published/2026-05-07-llama-3.1-70b/results.md:63">
P2: The code-review finding labels a worsening metric as a "Reduction". This should be reported as an increase (+9 percentage points) to avoid misleading conclusions.</violation>
</file>
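For violation 2, the relabeling could look like the following; the metric placeholder and exact wording are hypothetical, since the original results.md line is not shown here:

````markdown
<!-- Before (misleading): a worsening delta labeled as a "Reduction" -->
- <metric>: Reduction of 9 percentage points
<!-- After: the direction is stated plainly -->
- <metric>: +9 percentage points (an increase, i.e. worse under treatment)
````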
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Summary by cubic
Repositioned the README to make LOGIC.md the audit and governance layer for agent reasoning, and added 2026-05-07 cross-model benchmarks showing no measurable quality lift on the tested tasks. Published a public evidence index and narrowed the benchmarks gitignore so that only ephemeral outputs are ignored and published runs stay committed.
Docs
Benchmarks
benchmarks/published/INDEX.md; refined benchmarks/.gitignore to ignore only ephemeral results/ while committing published evidence.

Written for commit ac7bfd3. Summary will update on new commits.
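As noted in the cubic summary above, a gitignore with that shape might look like the following; this is a hypothetical sketch, not the committed benchmarks/.gitignore:

```gitignore
# benchmarks/.gitignore (hypothetical sketch)
# Ephemeral harness output, regenerated on every run:
results/
# published/ is deliberately not listed; committed evidence lives there.
```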
Summary by CodeRabbit
Documentation
Benchmarks