Semantic CSV diff and merge tool with native Git integration.
git diff treats CSV files as text. A reordered column makes every row look changed. A row reorder makes the entire file look changed. rowdiff understands that CSV files are tables — it diffs by key, detects column renames, classifies type drift, and produces human-readable output whether you're in a terminal, CI pipeline, or PR review.
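To make the idea of diffing by key concrete, here is a minimal sketch using only the Rust standard library. It is not rowdiff's implementation: the `Row` alias, the `RowChange` enum, and the `row`, `index_by_key`, and `diff_by_key` functions are hypothetical names chosen for illustration, and rows are assumed to fit in memory.

```rust
use std::collections::BTreeMap;

/// One CSV row modelled as column name -> cell value (illustrative representation).
type Row = BTreeMap<String, String>;

#[derive(Debug, PartialEq)]
enum RowChange {
    Added(String),
    Removed(String),
    Modified { key: String, columns: Vec<String> },
}

/// Convenience constructor for building rows from (column, value) pairs.
fn row(pairs: &[(&str, &str)]) -> Row {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

/// Index rows by the value of `key_col`. Rows without that column are skipped,
/// and a later row with a duplicate key overwrites an earlier one.
fn index_by_key<'a>(rows: &'a [Row], key_col: &str) -> BTreeMap<&'a str, &'a Row> {
    rows.iter()
        .filter_map(|r| r.get(key_col).map(|k| (k.as_str(), r)))
        .collect()
}

/// Compare two tables by key rather than by line position.
fn diff_by_key(before: &[Row], after: &[Row], key_col: &str) -> Vec<RowChange> {
    let old = index_by_key(before, key_col);
    let new = index_by_key(after, key_col);
    let mut changes = Vec::new();

    for (key, old_row) in &old {
        match new.get(key) {
            None => changes.push(RowChange::Removed(key.to_string())),
            Some(new_row) => {
                // Cells are compared by column name, so column order is irrelevant.
                let mut columns: Vec<String> = old_row
                    .keys()
                    .chain(new_row.keys())
                    .filter(|c| old_row.get(*c) != new_row.get(*c))
                    .cloned()
                    .collect();
                columns.sort();
                columns.dedup();
                if !columns.is_empty() {
                    changes.push(RowChange::Modified { key: key.to_string(), columns });
                }
            }
        }
    }
    for key in new.keys() {
        if !old.contains_key(key) {
            changes.push(RowChange::Added(key.to_string()));
        }
    }
    changes
}
```

Because both sides are indexed by key and cells are looked up by column name, reordering rows or columns yields an empty change list in this sketch.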
- Key-based row matching — rows are matched by primary key, not by line number. Reordering rows produces zero false positives.
- Column rename detection — combines Levenshtein name similarity with Jaccard content correlation to distinguish a genuine rename from a coincidence.
- Type-aware cell diffing — classifies mutations along the type lattice (Integer → Float → String) and flags semantic changes like epoch timestamps becoming ISO 8601.
- Statistical summary — mean, stddev, p5/p95, null ratio, and anomaly detection on numeric columns across both versions.
- Three-way merge — resolves CSV merge conflicts semantically with configurable strategies: pk-wins-ours, pk-wins-theirs, interleave, sum (delta columns only), last-write-wins-by.
- Git-native — installs as a gitattributes diff and merge driver. No storage migration, no new tooling required.
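As an illustration of how name similarity and content overlap might be combined for rename detection, the sketch below computes a normalized Levenshtein similarity on column names and a Jaccard index on the sets of cell values, then blends them. The 0.4/0.6 weights and all function names are assumptions made for this example; rowdiff's actual scoring formula is not documented here. The 0.80 cut-off used in the usage note is the `rename_confidence_threshold` from the sample configuration.

```rust
use std::collections::HashSet;

// Hypothetical weights; the real tool may combine the two signals differently.
const NAME_WEIGHT: f64 = 0.4;
const CONTENT_WEIGHT: f64 = 0.6;

/// Classic dynamic-programming edit distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // Minimum of substitution, deletion, and insertion.
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// 1.0 for identical names, approaching 0.0 as the names diverge.
fn name_similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}

/// Jaccard index of the distinct values in two columns.
fn jaccard(a: &[&str], b: &[&str]) -> f64 {
    let sa: HashSet<&str> = a.iter().copied().collect();
    let sb: HashSet<&str> = b.iter().copied().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 0.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}

/// Blend both signals into a single confidence score in [0, 1].
fn rename_confidence(old_name: &str, new_name: &str, old_vals: &[&str], new_vals: &[&str]) -> f64 {
    NAME_WEIGHT * name_similarity(old_name, new_name) + CONTENT_WEIGHT * jaccard(old_vals, new_vals)
}
```

Under these assumed weights, a column renamed from `cust_id` to `customer_id` with unchanged contents scores above 0.80, while two unrelated columns with disjoint values cannot exceed 0.4.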
cargo install rowdiff
Or build from source:
git clone https://github.com/yourorg/rowdiff
cd rowdiff
cargo build --release
# Binary at ./target/release/rowdiff
# Compare two CSV files
rowdiff before.csv after.csv
# JSON output (machine-readable)
rowdiff before.csv after.csv --format json
# Self-contained HTML report
rowdiff before.csv after.csv --format html > report.html
# Explicit primary key (overrides auto-detection)
rowdiff before.csv after.csv --primary-key customer_id
# Summary only (no row-level detail)
rowdiff before.csv after.csv --summary-only
# Cap row diff output (default 50)
rowdiff before.csv after.csv --max-rows 100
Exit codes follow diff convention: 0 = identical, 1 = differences found, 2 = error.
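For callers that script rowdiff from Rust, the exit-code convention can be mapped onto an enum. This is a sketch under stated assumptions: `Outcome`, `classify`, and `compare` are hypothetical names, the `rowdiff` binary is assumed to be on `PATH`, and any status other than 0 or 1 (including termination by a signal) is treated as an error.

```rust
use std::process::Command;

#[derive(Debug, PartialEq)]
enum Outcome {
    Identical,
    Different,
    Failed,
}

/// Map a process exit code onto the documented convention.
/// `None` means the process was terminated by a signal and is treated as a failure.
fn classify(code: Option<i32>) -> Outcome {
    match code {
        Some(0) => Outcome::Identical,
        Some(1) => Outcome::Different,
        _ => Outcome::Failed,
    }
}

/// Run `rowdiff <before> <after> --summary-only` and classify the result.
/// Output is inherited, so the summary still appears on the terminal.
fn compare(before: &str, after: &str) -> std::io::Result<Outcome> {
    let status = Command::new("rowdiff")
        .args([before, after, "--summary-only"])
        .status()?;
    Ok(classify(status.code()))
}
```

A CI step could call `compare("before.csv", "after.csv")` and fail the build only on `Outcome::Failed`, treating `Outcome::Different` as informational.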
Add to .gitattributes in your repo:
*.csv diff=rowdiff
*.tsv diff=rowdiff
Add to ~/.gitconfig or .git/config:
[diff "rowdiff"]
command = rowdiff git-diff
Now git diff, git log -p, and git show all use semantic CSV output.
Add to .gitattributes:
*.csv merge=rowdiff
Add to ~/.gitconfig or .git/config:
[merge "rowdiff"]
name = CSV semantic merge
driver = rowdiff git-merge %O %A %B %L %P
Merge strategy defaults to pk-wins-ours. Override per-file in .rowdiff.yml.
Place .rowdiff.yml in your repo root. Keys under files support glob patterns for file matching.
files:
  "data/customers.csv":
    primary_key: customer_id
    pk_uniqueness_threshold: 0.98
    type_overrides:
      created_at:
        type: iso8601
        semantic: cumulative
      units_delta:
        type: integer
        semantic: delta  # sum strategy is only valid on delta columns
  "exports/*.csv":
    primary_key: auto  # auto-detect with default threshold
    merge_strategy: pk-wins-ours
policies:
  - rule: no_pk_deletion
    severity: error
  - rule: no_type_widening
    severity: warn
  - rule: stat_shift_threshold
    column: revenue
    max_mean_delta_pct: 50
    severity: error
settings:
  pk_uniqueness_threshold: 0.98
  rename_confidence_threshold: 0.80
  type_inference_threshold: 0.90
When no explicit key is configured, rowdiff scores every column (and candidate composite pairs) for cardinality, null ratio, and name heuristics. A column qualifies if it is unique in ≥ 98% of rows and has zero nulls. If no single column qualifies, the top composite pair is tried. If nothing qualifies, a content hash of the full row is used as a fallback key (and a warning is emitted).
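The single-column part of this auto-detection can be sketched as follows. This is a simplified illustration, not rowdiff's code: it reads "unique in ≥ 98% of rows" as distinct values divided by row count, treats an empty cell as null, and leaves out composite pairs and name heuristics. The names `uniqueness`, `detect_key`, and `fallback_key` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Ratio of distinct values to rows, or `None` if the column is empty
/// or contains an empty cell (treated as null in this sketch).
fn uniqueness(column: &[&str]) -> Option<f64> {
    if column.is_empty() || column.iter().any(|v| v.is_empty()) {
        return None;
    }
    let distinct: HashSet<&str> = column.iter().copied().collect();
    Some(distinct.len() as f64 / column.len() as f64)
}

/// Return the name of the most unique null-free column that meets `threshold`.
fn detect_key<'a>(columns: &[(&'a str, Vec<&str>)], threshold: f64) -> Option<&'a str> {
    let mut best: Option<(&'a str, f64)> = None;
    for (name, values) in columns {
        if let Some(score) = uniqueness(values) {
            // Keep the first column with the highest qualifying score.
            if score >= threshold && best.map_or(true, |(_, s)| score > s) {
                best = Some((*name, score));
            }
        }
    }
    best.map(|(name, _)| name)
}

/// Fallback when no column qualifies: hash the whole row's content.
fn fallback_key(row: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    row.hash(&mut hasher);
    hasher.finish()
}
```

With a threshold of 0.98, a `customer_id` column holding four distinct values in four rows qualifies, while a `country` column with repeated values or an `email` column with a blank cell does not.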
| Strategy | Behaviour |
|---|---|
| `pk-wins-ours` | On conflict, keep our version of the row |
| `pk-wins-theirs` | On conflict, keep their version of the row |
| `interleave` | On conflict, emit a conflict marker (non-zero exit) |
| `sum` | Add numeric values; only valid for semantic: delta columns |
| `last-write-wins-by: <col>` | Keep the row with the higher value in the named column |
Delete-modify conflicts always produce a conflict marker regardless of strategy.
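The row-level strategies can be modelled roughly as below. This is a sketch under assumptions, not rowdiff's code: `Strategy`, `Resolution`, `row`, and `resolve` are hypothetical names, the cell-level `sum` strategy is left out, and `last-write-wins-by` is assumed to compare the named column as integers, falling back to a conflict when a value is missing, unparseable, or tied.

```rust
use std::collections::BTreeMap;

/// One CSV row modelled as column name -> cell value (illustrative representation).
type Row = BTreeMap<String, String>;

#[derive(Debug, Clone, PartialEq)]
enum Strategy {
    PkWinsOurs,
    PkWinsTheirs,
    Interleave,
    LastWriteWinsBy(String),
}

#[derive(Debug, PartialEq)]
enum Resolution {
    Keep(Row),
    Conflict,
}

/// Convenience constructor for building rows from (column, value) pairs.
fn row(pairs: &[(&str, &str)]) -> Row {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

/// Resolve one conflicting primary key. `None` means that side deleted the row.
fn resolve(ours: Option<&Row>, theirs: Option<&Row>, strategy: &Strategy) -> Resolution {
    let (ours, theirs) = match (ours, theirs) {
        (Some(o), Some(t)) => (o, t),
        // A deletion on either side always yields a conflict marker here;
        // callers are assumed never to pass two deletions.
        _ => return Resolution::Conflict,
    };
    match strategy {
        Strategy::PkWinsOurs => Resolution::Keep(ours.clone()),
        Strategy::PkWinsTheirs => Resolution::Keep(theirs.clone()),
        Strategy::Interleave => Resolution::Conflict,
        Strategy::LastWriteWinsBy(col) => {
            // Assumption: the ordering column holds integers (e.g. epoch seconds).
            let parse = |r: &Row| r.get(col).and_then(|v| v.parse::<i64>().ok());
            match (parse(ours), parse(theirs)) {
                (Some(a), Some(b)) if a > b => Resolution::Keep(ours.clone()),
                (Some(a), Some(b)) if b > a => Resolution::Keep(theirs.clone()),
                _ => Resolution::Conflict,
            }
        }
    }
}
```

Checking for deletions before looking at the strategy mirrors the rule that delete-modify conflicts are never resolved automatically.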
Tested on commodity hardware (Linux x86-64):
| Dataset | Time |
|---|---|
| 1M rows, ~5% modified | ~11s |
| Expected on M-class Mac | ~8–12s |
rowdiff streams CSV files through a batch-insert SQLite pipeline with WITHOUT ROWID B-tree tables. Memory usage is bounded regardless of file size — it does not load the full dataset into RAM.
Terminal (default) — ANSI colored, human-readable. Schema changes, row summary cards, per-row cell diffs, statistics table, policy violations.
JSON (--format json) — structured output for downstream processing. Full report or summary-only via --summary-only.
HTML (--format html) — self-contained single-file report with embedded CSS. Suitable for archiving or attaching to PR reviews.
# Run all tests
cargo test
# Run benchmarks (skipped by default)
cargo test --release --test bench_1m -- --ignored --nocapture
# Lint
cargo clippy --all-targets -- -D warnings
# Format check
cargo fmt --all -- --check
License: MIT
