Expert code review for data science and research codebases — the kinds of projects where silent data bugs, untracked analytical decisions, and "it ran once on my laptop" are the real failure modes.
Most research code is written by one person and reviewed by nobody before it's cited in a paper. This system closes that gap by running three independent expert reviewers over your codebase in parallel, each with a different lens tailored to what you're building (e.g., data-pipeline lenses for Python, package-design lenses for R — see below). A synthesis agent then aggregates findings, flags where reviewers disagree, and walks you through triage one decision at a time.
It's designed for the projects data scientists actually work on: ETL-and-analysis pipelines, R packages maintained by small teams, pre-publication audits, handoffs to a collaborator who wasn't there for the first 80% of the work. The goal is to catch the things that don't show up in a linter or a type checker — wrong merges, hardcoded thresholds, schema drift, silent NaN propagation, utility modules turning into junk drawers.
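For a concrete sense of what "silent" means here, consider a hypothetical pandas snippet of the kind these reviewers are tuned to catch (illustrative only, not drawn from any reviewed codebase) — it runs without error, but quietly inflates the data:

```python
import pandas as pd

# Hypothetical example: `patients` has one row per patient_id,
# but `visits` has many rows per patient_id.
patients = pd.DataFrame({"patient_id": [1, 2], "age": [34, 51]})
visits = pd.DataFrame({"patient_id": [1, 1, 2, 3], "cost": [100, 250, 80, 40]})

# Silent bug: the merge fans out to one row per visit and, as an inner join,
# drops patient 3 entirely -- no error, no warning.
merged = patients.merge(visits, on="patient_id")

# Patient 1's age is now double-counted; every downstream mean shifts.
print(merged["age"].sum())  # 119, not 85

# The defensive version states both decisions and fails loudly if violated.
checked = patients.merge(visits, on="patient_id", how="left", validate="one_to_many")
assert checked["cost"].notna().all(), "unmatched patients introduced NaNs"
```

A linter passes both versions; only a reviewer thinking about merge cardinality and join direction flags the first.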
```
Human + Interview agent (conversational)
  └── Produces: project_context.md
        │
        ▼
Orchestrator (automated)
  ├── Spawns reviewers in parallel (Claude Code Agent tool)
  │     ├── Perspective A ──┐
  │     ├── Perspective B ──┤  (independent, no cross-talk)
  │     └── Perspective C ──┘
  │           │
  ├── Collects all findings
  │           │
  └── Runs synthesis (dedup, flag disagreements, prioritize)
        │
        ▼
  review_report.md
        │
        ▼
Human triages decisions → Claude implements
```
- Interview (optional): Converse with the interview agent to produce `project_context.md`. This is the only step that requires human input — the agent needs your intent, constraints, and context that can't be inferred from code.
- Launch: Tell the orchestrator to run. ("Run the conclave", "Review this project", etc.)
- Wait: The orchestrator spawns three reviewer agents in parallel, collects findings, and runs synthesis. You'll get status updates as each step completes.
- Triage: The orchestrator presents prioritized findings one at a time, or you can edit `review_report.md` directly to batch your decisions.
- Implement: Tell Claude to implement the "now" items from the action plan.
LLM-to-LLM debate suffers from conformity cascade — agents shift toward consensus rather than maintaining genuinely independent perspectives (Wynn et al., 2025; arxiv.org/abs/2509.05396). By keeping reviewers independent and only aggregating at the synthesis step, we preserve the value of divergent expertise.
Reviewers produce better findings when they understand intent, constraints, and stage. The interview agent explores the codebase and asks the developer targeted questions to produce a project context document that all reviewers receive. This prevents reviewers from flagging intentional decisions as bugs, and helps them calibrate severity to the project's actual priorities.
The interview is optional — you can skip it and let reviewers work from the codebase alone. But it significantly improves review quality, especially for projects with workarounds, temporary scripts, or domain-specific constraints that aren't obvious from the code.
Without it, the human has to manually spawn three agents with the right prompts, wait for each to finish, copy findings into a synthesis step, and manage the whole flow. The orchestrator handles that mechanical work so the human can focus on the two parts that actually need human judgment: the interview and the triage.
This repo contains review configurations for different ecosystems. Each shares the same fan-out/fan-in architecture and interview agent, but the perspectives and base prompts are tailored to the domain.
Focused on reviewing R packages. Three perspectives:
| Perspective | File | Core lens |
|---|---|---|
| Tidyverse API & Package Design | `r/personas/tidyverse_api.md` | API design, function composability, DESCRIPTION hygiene |
| Usability & Developer Experience | `r/personas/developer_experience.md` | Error messages, onboarding, documentation |
| Data Validation & Pipeline Contracts | `r/personas/data_validation.md` | Codebook schema, column contracts, assertions |
See r/README.md for full details.
Focused on reviewing data science pipelines (ETL, EDA, analysis). Three perspectives:
| Perspective | File | Core lens |
|---|---|---|
| Pipeline Integrity | `python/personas/pipeline_integrity.md` | Data correctness, merge safety, silent failures |
| Reproducible Research | `python/personas/reproducible_research.md` | Audit trails, config-driven decisions, re-runnability |
| Pipeline Architecture | `python/personas/pipeline_architecture.md` | Modularity, extensibility, abstraction calibration |
See python/README.md for full details.
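As a rough illustration of the Reproducible Research lens (hypothetical code, not taken from the persona files): an analytical decision hardcoded in the pipeline versus the config-driven form the reviewer would push toward:

```python
import pandas as pd
import yaml

df = pd.DataFrame({"income": [40_000, 95_000, 310_000]})

# Hypothetical "before": an analytical decision buried in code.
# Why 250k? Decided once in a meeting, recorded nowhere auditable.
trimmed = df[df["income"] < 250_000]

# "After": the threshold lives in a versioned config (a config.yml in
# practice; inlined here so the sketch runs standalone), so the decision
# is documented, diff-able, and easy to re-run with a different value.
config = yaml.safe_load("income_outlier_cap: 250000")
trimmed = df[df["income"] < config["income_outlier_cap"]]
```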
```
code-conclave/
  README.md                    <- you are here
  orchestrator.md              <- spawns reviewers, collects findings, runs synthesis
  project_interview.md         <- shared: repo context interview agent
  base_prompt.md               <- shared rules, output format, severity levels
  synthesis_prompt.md          <- shared synthesis logic, convergence rule, triage structure
  project_context_template.md  <- stubbed example of interview output
  r/
    README.md                  <- R-specific usage guide
    base_prompt.md             <- R-specific extension: scope, categories
    synthesis_prompt.md        <- R-specific extension: reviewer names, architecture prompts
    personas/
      tidyverse_api.md
      developer_experience.md
      data_validation.md
  python/
    README.md                  <- Python-specific usage guide
    base_prompt.md             <- Python-specific extension: scope, categories
    synthesis_prompt.md        <- Python-specific extension: reviewer names, architecture prompts
    personas/
      pipeline_integrity.md
      reproducible_research.md
      pipeline_architecture.md
```
The root-level base_prompt.md and synthesis_prompt.md contain everything that's genuinely shared across domains. The language subdirectories contain thin extensions that supply scope, category enums, reviewer names, and domain-specific architecture prompts. When you run a review, the orchestrator feeds both the shared file and the language extension to each reviewer. This keeps the rules in one place so they can't drift across languages.
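The composition is simple concatenation in spirit. A rough sketch of the idea (the actual orchestrator is a markdown prompt, not Python; the path and helper name below are illustrative):

```python
from pathlib import Path

ROOT = Path("~/code/code-conclave").expanduser()  # adjust to wherever the repo lives

def reviewer_prompt(lang: str, persona: str) -> str:
    """Assemble one reviewer's instructions: shared rules + language extension + persona."""
    parts = [
        ROOT / "base_prompt.md",             # shared rules, output format, severity levels
        ROOT / lang / "base_prompt.md",      # domain-specific scope and categories
        ROOT / lang / "personas" / persona,  # the specific reviewing lens
    ]
    return "\n\n".join(p.read_text() for p in parts)

# e.g. the Pipeline Integrity reviewer for a Python project:
prompt = reviewer_prompt("python", "pipeline_integrity.md")
```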
Open a Claude Code session in the project you want to review and paste one of the following:
```
Read the orchestrator instructions at ~/code/code-conclave/orchestrator.md.
Review this codebase following those instructions. It's a [Python data science pipeline / R package].
```

```
Read the project interview instructions at ~/code/code-conclave/project_interview.md.
Interview me about this project to produce a project_context.md, then read
~/code/code-conclave/orchestrator.md and run the full review.
```
The interview agent will ask you questions one at a time about the project's purpose, data, architecture, and priorities. Once it has enough context, it writes project_context.md and hands off to the orchestrator, which spawns reviewers, collects findings, synthesizes, and walks you through triage.
After synthesis, the orchestrator will walk you through prioritization one question at a time. Alternatively, you can edit review_report.md directly to batch your decisions:
- Replace `_pending_` in the Decision column with `now`, `next sprint`, `backlog`, or `wontfix`
- Add rationale in the Rationale column
- Disagreements list named options with a recommended label
Editing the file directly is useful when you want to:
- See all decisions at once and batch your answers
- Come back to it later if you need to think
- Keep a permanent decision record
- Hand off to Claude for implementation while you context-switch
- Structured output format: All reviewers use the same schema so synthesis can deduplicate programmatically
- Confidence scores: Each finding includes a confidence level — the synthesis agent uses this to weigh disagreements
- Explicit scope boundaries: Each reviewer is told what to ignore, not just what to focus on, to prevent overlap
- No cross-talk: Reviewers never see each other's output. Only the synthesis agent sees all findings.
- Project context: The interview agent produces a document that prevents reviewers from working blind
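The actual finding schema is defined in `base_prompt.md`; as a rough sketch of the shape such a record might take (field names here are hypothetical, not the real format):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """Illustrative shape of a single reviewer finding -- field names are hypothetical."""
    reviewer: str        # which perspective produced it, e.g. "Pipeline Integrity"
    file: str            # path the finding refers to
    category: str        # drawn from the domain's category enum
    severity: str        # shared severity levels defined in base_prompt.md
    confidence: float    # 0-1; synthesis uses this to weigh disagreements
    summary: str         # one-line description of the issue
    recommendation: str  # suggested fix or decision for triage

# Because every reviewer emits the same shape, synthesis can deduplicate
# findings that point at the same file and issue, and flag the ones where
# reviewers disagree on severity or recommendation.
```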
To add a new language/domain:
- Create a new subdirectory (e.g., `julia/`, `sql/`)
- Write a small `base_prompt.md` extension — it should only contain what's genuinely domain-specific: a one-paragraph review-domain description, the scope inclusion/exclusion lists, the category enum, domain-sharpened severity notes, and the file path format. Everything shared (rules, output format, severity levels, end-of-review template) lives in the root-level `base_prompt.md` and should not be duplicated.
- Identify 3 perspectives that cover the domain's highest-stakes review concerns
- Write persona files following the established format (see existing personas for structure — "Informed by" sources, expertise, focus areas, what to deprioritize, example findings)
- Write a small `synthesis_prompt.md` extension — reviewer names and abbreviations, report header text, file path format, and domain-specific architecture prompts. The shared convergence rule, triage logic, and report template live in the root.
- Add the new language to the orchestrator's Step 1 file list
- `project_interview.md` is language-agnostic and shared — no new copy needed
I recommend using these agents iteratively. Once you've addressed changes from a review, set off another review cycle in a fresh chat. The fresh chat matters: it prevents the new reviewers from being anchored by the prior review's framing, and it keeps each round independent in the same way the three parallel reviewers are independent.
This system references real people's published expertise to define review perspectives, not to create synthetic personas or digital twins. The distinction matters:
- We are reviewing code through the lens of publicly documented principles — books, talks, blog posts, and open-source work.
- We are NOT simulating these individuals, generating quotes attributed to them, or claiming their endorsement.
- Findings should be framed as "this conflicts with the principles in Column Names as Contracts" — not "Emily Riederer would say..."
- The named references are shorthand for well-defined schools of thought. If any of these individuals expressed discomfort with this use, the names should be replaced with descriptive labels while keeping the review criteria intact.
The project interview agent's approach — relentless one-question-at-a-time interviewing that walks down the decision tree, provides a recommended answer for each question, and explores the codebase before asking — is adapted from Matt Pocock's grill-me skill (MIT licensed). The Code Conclave interview extends that pattern with a structured output schema (project_context.md) tailored to handing off context to downstream reviewer agents, but the core interviewing discipline is Matt's.
- Du et al. (2023) "Improving Factuality and Reasoning through Multiagent Debate" — arxiv.org/abs/2305.14325
- Wynn et al. (2025) "Talk Isn't Always Cheap: Failure Modes in Multi-Agent Debate" — arxiv.org/abs/2509.05396
- Riederer, E. "Column Names as Contracts" — emilyriederer.netlify.app/post/column-name-contracts/
- Ball, P. & HRDAG "The Task Is a Quantum of Workflow" — hrdag.org/2016/06/14/the-task-is-a-quantum-of-workflow/
- Wilson et al. (2017) "Good Enough Practices in Scientific Computing" — doi.org/10.1371/journal.pcbi.1005510
- Wickham, H. & Bryan, J. R Packages (2e) — r-pkgs.org
- Bryan, J. "What They Forgot to Teach You About R" — rstats.wtf