
Python Data Science Pipeline Review

Review configuration for Python data science pipelines — ETL, exploratory analysis, and statistical/ML modeling codebases.

Perspectives

1. Pipeline Integrity (personas/pipeline_integrity.md)

Will this produce correct results, and will I know if it doesn't?

Focuses on the silent failure modes that are unique to data pipelines: wrong joins, dropped rows, NaN propagation, implicit type coercion, aggregation bugs. Informed by Emily Riederer's column contracts philosophy and the Pandera approach to DataFrame schema validation.
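As an illustration of the kind of defensive merge this persona looks for, here is a minimal sketch using hypothetical `orders`/`customers` data; the `validate` and `indicator` arguments to pandas `merge` turn two silent failure modes (join fan-out and unmatched rows becoming NaN) into loud ones:

```python
import pandas as pd

# Hypothetical inputs: each order should match at most one customer.
orders = pd.DataFrame({"customer_id": [1, 2, 2, 3], "amount": [10.0, 20.0, 5.0, 7.5]})
customers = pd.DataFrame({"customer_id": [1, 2, 4], "region": ["east", "west", "north"]})

n_before = len(orders)
merged = orders.merge(
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",  # raise if customers has duplicate ids (join fan-out)
    indicator=True,          # record which rows found a match in a `_merge` column
)

# A left join must never change the row count; a silent fan-out is a wrong join.
assert len(merged) == n_before, f"merge changed row count: {n_before} -> {len(merged)}"

# Unmatched rows get NaN in `region`; surface them instead of letting NaN propagate.
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {n_before} orders had no matching customer")
```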

2. Reproducible Research (personas/reproducible_research.md)

Can this be audited, re-run, and trusted?

Focuses on whether a collaborator, reviewer, or future-you can understand what the pipeline did and verify the results. Covers config-driven analytical decisions, logging/audit trails, sample accounting, determinism, and separation of processing from analytical choices. Informed by HRDAG's reproducible pipeline principles and Wilson et al.'s "Good Enough Practices."
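A small sketch, with hypothetical config keys and data, of three habits this persona checks for: analytical choices held in config rather than code, a seeded source of randomness, and sample accounting logged at each filtering step:

```python
import json
import logging
import random

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

# Hypothetical config: analytical decisions live in data, not buried in code.
CONFIG = {"seed": 20240101, "min_amount": 5.0}

random.seed(CONFIG["seed"])  # determinism: same config, same run

records = [{"id": i, "amount": a} for i, a in enumerate([2.0, 8.0, 12.0, 3.5, 9.9])]

# Sample accounting: log how many rows each analytical decision removes.
kept = [r for r in records if r["amount"] >= CONFIG["min_amount"]]
log.info("min_amount=%s filter: kept %d of %d records",
         CONFIG["min_amount"], len(kept), len(records))

# Persist the exact config alongside the output so the run can be audited.
log.info("run config: %s", json.dumps(CONFIG, sort_keys=True))
```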

3. Pipeline Architecture (personas/pipeline_architecture.md)

Can this grow without becoming brittle?

Focuses on project structure, utility organization, stage independence, config architecture, and knowing when to abstract vs. when to keep it concrete. Informed by Cookiecutter Data Science conventions and the scikit-learn developer guidelines.
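One way to picture the stage independence this persona cares about (the function names and `Config` fields here are hypothetical): each stage is a pure function of `(rows, config)` with no hidden state, so stages can be tested, reordered, and re-run in isolation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Config:
    threshold: float  # hypothetical analytical parameter

def clean(rows: list[dict], cfg: Config) -> list[dict]:
    # Drop rows with missing values; knows nothing about other stages.
    return [r for r in rows if r["value"] is not None]

def filter_low(rows: list[dict], cfg: Config) -> list[dict]:
    # Apply the config-driven threshold; independently testable.
    return [r for r in rows if r["value"] >= cfg.threshold]

Stage = Callable[[list[dict], Config], list[dict]]

def run_pipeline(rows: list[dict], cfg: Config, stages: list[Stage]) -> list[dict]:
    for stage in stages:
        rows = stage(rows, cfg)  # stages share only the explicit config
    return rows

cfg = Config(threshold=10.0)
data = [{"value": 3.0}, {"value": None}, {"value": 15.0}]
result = run_pipeline(data, cfg, [clean, filter_low])
```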

Usage

1. Run the project interview (recommended)

Use ../project_interview.md to produce a project_context.md for your project. This gives reviewers the context they need to distinguish bugs from intentional decisions.

2. Run each reviewer independently

For each persona, provide:

  • ../base_prompt.md (repo-root shared rules: severity levels, output format)
  • base_prompt.md in this directory (Python-specific extension: scope, categories)
  • The persona file (e.g., personas/pipeline_integrity.md)
  • project_context.md (from the interview, if available)
  • Access to the codebase

Run all three in parallel — they must not see each other's output.
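Assembling one reviewer's input can be as simple as concatenating the files above in order; this sketch assumes it runs from this directory, and skips optional files (such as project_context.md) rather than failing when they are absent:

```python
from pathlib import Path

# One reviewer's inputs, in the order listed above; paths are relative to
# this directory and the persona file shown is just one of the three.
parts = [
    Path("../base_prompt.md"),               # shared rules, severity, output format
    Path("base_prompt.md"),                  # Python-specific extension
    Path("personas/pipeline_integrity.md"),  # the persona under review
    Path("project_context.md"),              # optional: from the interview
]

# Missing optional files are skipped rather than aborting the run.
prompt = "\n\n".join(p.read_text() for p in parts if p.exists())
```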

3. Synthesize

Feed all three review outputs into both the shared ../synthesis_prompt.md (synthesis logic and the convergence rule) and this directory's synthesis_prompt.md (Python-specific reviewer names and architecture prompts). This produces review_report.md with:

  • Convergence: issues flagged by 2+ reviewers
  • Disagreements: conflicting recommendations
  • Unique findings: high-value items from single reviewers
  • A priority queue for triage

4. Triage and implement

Edit review_report.md directly to make decisions, then hand off to Claude for implementation.

File structure

python/
  README.md              <- you are here
  base_prompt.md         <- Python-specific extension: scope, categories, domain notes
  synthesis_prompt.md    <- Python-specific extension: reviewer names, architecture prompts
  personas/
    pipeline_integrity.md    <- data correctness, merge safety, assertions
    reproducible_research.md <- audit trails, config, reproducibility
    pipeline_architecture.md <- structure, modularity, extensibility

The shared rules live at the repo root (../base_prompt.md, ../synthesis_prompt.md). The files in this directory layer Python-specific details on top.

Scope note

The current Python perspectives are pipeline- and data-processing-focused (correctness, reproducibility, architecture). Statistical/modeling validity, performance at scale, and security/PII handling are not yet covered by a dedicated persona; they are noted as potential future additions and are out of scope for now.