rowdiff

Semantic CSV diff and merge tool with native Git integration.

git diff treats CSV files as text. A reordered column makes every row look changed. A row reorder makes the entire file look changed. rowdiff understands that CSV files are tables — it diffs by key, detects column renames, classifies type drift, and produces human-readable output whether you're in a terminal, CI pipeline, or PR review.

What it does

Key-based row matching — rows are matched by primary key, not by line number. Reordering rows produces zero false positives.
Column rename detection — combines Levenshtein name similarity with Jaccard content correlation to distinguish a genuine rename from a coincidence.
Type-aware cell diffing — classifies mutations along the type lattice (Integer → Float → String) and flags semantic changes like epoch timestamps becoming ISO 8601.
Statistical summary — mean, stddev, p5/p95, null ratio, and anomaly detection on numeric columns across both versions.
Three-way merge — resolves CSV merge conflicts semantically with configurable strategies: pk-wins-ours, pk-wins-theirs, interleave, sum (delta columns only), last-write-wins-by.
Git-native — installs as a gitattributes diff and merge driver. No storage migration, no new tooling required.

Installation

cargo install rowdiff

Or build from source:

git clone https://github.com/yourorg/rowdiff
cd rowdiff
cargo build --release
# Binary at ./target/release/rowdiff

Basic usage

# Compare two CSV files
rowdiff before.csv after.csv

# JSON output (machine-readable)
rowdiff before.csv after.csv --format json

# Self-contained HTML report
rowdiff before.csv after.csv --format html > report.html

# Explicit primary key (overrides auto-detection)
rowdiff before.csv after.csv --primary-key customer_id

# Summary only (no row-level detail)
rowdiff before.csv after.csv --summary-only

# Cap row diff output (default 50)
rowdiff before.csv after.csv --max-rows 100

Exit codes follow diff convention: 0 = identical, 1 = differences found, 2 = error.

Git integration

Diff driver

Add to .gitattributes in your repo:

*.csv diff=rowdiff
*.tsv diff=rowdiff

Add to ~/.gitconfig or .git/config:

[diff "rowdiff"]
    command = rowdiff git-diff

Now git diff, git log -p, and git show all use semantic CSV output.

Merge driver

Add to .gitattributes:

*.csv merge=rowdiff

Add to ~/.gitconfig or .git/config:

[merge "rowdiff"]
    name = CSV semantic merge
    driver = rowdiff git-merge %O %A %B %L %P

Merge strategy defaults to pk-wins-ours. Override per-file in .rowdiff.yml.

Configuration (`.rowdiff.yml`)

Place in your repo root. Supports glob patterns for file matching.

files:
  "data/customers.csv":
    primary_key: customer_id
    pk_uniqueness_threshold: 0.98
    type_overrides:
      created_at:
        type: iso8601
        semantic: cumulative
      units_delta:
        type: integer
        semantic: delta           # sum strategy is only valid on delta columns

  "exports/*.csv":
    primary_key: auto             # auto-detect with default threshold
    merge_strategy: pk-wins-ours

policies:
  - rule: no_pk_deletion
    severity: error
  - rule: no_type_widening
    severity: warn
  - rule: stat_shift_threshold
    column: revenue
    max_mean_delta_pct: 50
    severity: error

settings:
  pk_uniqueness_threshold: 0.98
  rename_confidence_threshold: 0.80
  type_inference_threshold: 0.90

Primary key detection

When no explicit key is configured, rowdiff scores every column (and candidate composite pairs) for cardinality, null ratio, and name heuristics. A column qualifies if it is unique in ≥ 98% of rows and has zero nulls. If no single column qualifies, the top composite pair is tried. If nothing qualifies, a content hash of the full row is used as a fallback key (and a warning is emitted).

Merge strategies

Strategy	Behaviour
`pk-wins-ours`	On conflict, keep our version of the row
`pk-wins-theirs`	On conflict, keep their version of the row
`interleave`	On conflict, emit a conflict marker (non-zero exit)
`sum`	Add numeric values; only valid for `semantic: delta` columns
`last-write-wins-by: <col>`	Keep the row with the higher value in the named column

Delete-modify conflicts always produce a conflict marker regardless of strategy.

Performance

Tested on commodity hardware (Linux x86-64):

Dataset	Time
1M rows, ~5% modified	~11s
Expected on M-class Mac	~8–12s

rowdiff streams CSV files through a batch-insert SQLite pipeline with WITHOUT ROWID B-tree tables. Memory usage is bounded regardless of file size — it does not load the full dataset into RAM.

Output formats

Terminal (default) — ANSI colored, human-readable. Schema changes, row summary cards, per-row cell diffs, statistics table, policy violations.

JSON (--format json) — structured output for downstream processing. Full report or summary-only via --summary-only.

HTML (--format html) — self-contained single-file report with embedded CSS. Suitable for archiving or attaching to PR reviews.

Development

# Run all tests
cargo test

# Run benchmarks (skipped by default)
cargo test --release --test bench_1m -- --ignored --nocapture

# Lint
cargo clippy --all-targets -- -D warnings

# Format check
cargo fmt --all -- --check

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.github/workflows		.github/workflows
assets		assets
src		src
tests		tests
.gitignore		.gitignore
Cargo.toml		Cargo.toml
README.md		README.md
rowdiff-spec.md		rowdiff-spec.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

rowdiff

What it does

Installation

Basic usage

Git integration

Diff driver

Merge driver

Configuration (`.rowdiff.yml`)

Primary key detection

Merge strategies

Performance

Output formats

Development

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

rowdiff

What it does

Installation

Basic usage

Git integration

Diff driver

Merge driver

Configuration (.rowdiff.yml)

Primary key detection

Merge strategies

Performance

Output formats

Development

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Configuration (`.rowdiff.yml`)

Packages