Kodaxadev/RowDiFF

rowdiff

Semantic CSV diff and merge tool with native Git integration.

git diff treats CSV files as text. A reordered column makes every row look changed. A row reorder makes the entire file look changed. rowdiff understands that CSV files are tables — it diffs by key, detects column renames, classifies type drift, and produces human-readable output whether you're in a terminal, CI pipeline, or PR review.
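To make the contrast with a line-based diff concrete, here is a minimal Python sketch of diffing by key. It is an illustration of the idea only, not rowdiff's implementation; the function name `diff_by_key` and the sample data are made up for this example.

```python
import csv
import io


def diff_by_key(before_text, after_text, key):
    """Return (added, removed, changed), matching rows by the key column."""
    before = {row[key]: row for row in csv.DictReader(io.StringIO(before_text))}
    after = {row[key]: row for row in csv.DictReader(io.StringIO(after_text))}

    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = {}
    for k in before.keys() & after.keys():
        # Cells are compared by column name, so column order is irrelevant.
        cells = {
            col: (before[k].get(col), after[k].get(col))
            for col in before[k].keys() | after[k].keys()
            if before[k].get(col) != after[k].get(col)
        }
        if cells:
            changed[k] = cells
    return added, removed, changed


BEFORE = "id,name,plan\n1,Ada,free\n2,Grace,pro\n3,Linus,free\n"
# Same table with columns and rows reordered, one cell edited,
# one row added and one row removed.
AFTER = "plan,id,name\npro,2,Grace\nteam,1,Ada\nfree,4,Edsger\n"

added, removed, changed = diff_by_key(BEFORE, AFTER, key="id")
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

A text diff would report every line of this file as changed. Matching on `id` reports one added row, one removed row, and a single edited cell.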

What it does

  • Key-based row matching — rows are matched by primary key, not by line number. Reordering rows produces zero false positives.
  • Column rename detection — combines Levenshtein name similarity with Jaccard content correlation to distinguish a genuine rename from a coincidence.
  • Type-aware cell diffing — classifies mutations along the type lattice (Integer → Float → String) and flags semantic changes like epoch timestamps becoming ISO 8601.
  • Statistical summary — mean, stddev, p5/p95, null ratio, and anomaly detection on numeric columns across both versions.
  • Three-way merge — resolves CSV merge conflicts semantically with configurable strategies: pk-wins-ours, pk-wins-theirs, interleave, sum (delta columns only), last-write-wins-by.
  • Git-native — installs as a gitattributes diff and merge driver. No storage migration, no new tooling required.
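As an illustration of the rename detection described above, the sketch below combines a Levenshtein-based name similarity with a Jaccard overlap of column values. The weighting (`name_weight=0.4`), the function names, and the way the two signals are combined are assumptions made for this example; the README does not specify rowdiff's actual formula. The 0.80 cut-off mirrors the `rename_confidence_threshold` setting shown in the configuration section.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """1.0 for identical names, 0.0 for names with nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(values_a, values_b):
    """Overlap of the distinct values seen in two columns."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def rename_confidence(old_name, new_name, old_values, new_values, name_weight=0.4):
    # Hypothetical weighting: content overlap counts for more than the name.
    return (name_weight * name_similarity(old_name, new_name)
            + (1 - name_weight) * jaccard(old_values, new_values))


THRESHOLD = 0.80  # same value as rename_confidence_threshold in .rowdiff.yml

# Similar name, identical content: a plausible rename.
likely = rename_confidence("cust_id", "customer_id", ["1", "2", "3"], ["1", "2", "3"])
# Very similar name, unrelated content: a coincidence.
unlikely = rename_confidence("cust_id", "cust_ip", ["1", "2", "3"], ["10.0.0.1", "10.0.0.2"])
print(likely >= THRESHOLD, unlikely >= THRESHOLD)
```

Using both signals is what separates the two cases: `cust_ip` is closer to `cust_id` by name than `customer_id` is, but only the latter carries the same values.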

Installation

cargo install rowdiff

Or build from source:

git clone https://github.com/yourorg/rowdiff
cd rowdiff
cargo build --release
# Binary at ./target/release/rowdiff

Basic usage

# Compare two CSV files
rowdiff before.csv after.csv

# JSON output (machine-readable)
rowdiff before.csv after.csv --format json

# Self-contained HTML report
rowdiff before.csv after.csv --format html > report.html

# Explicit primary key (overrides auto-detection)
rowdiff before.csv after.csv --primary-key customer_id

# Summary only (no row-level detail)
rowdiff before.csv after.csv --summary-only

# Cap row diff output (default 50)
rowdiff before.csv after.csv --max-rows 100

Exit codes follow diff convention: 0 = identical, 1 = differences found, 2 = error.
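Because exit code 1 means "differences found" rather than failure, scripts that call rowdiff should treat 1 and 2 differently. A minimal Python sketch, assuming `rowdiff` is on the PATH; the helper names `interpret_exit_code` and `compare` are hypothetical, and only the command line and the three exit codes come from this README.

```python
import subprocess

MEANINGS = {0: "identical", 1: "differences found", 2: "error"}


def interpret_exit_code(code):
    """Map a rowdiff exit code to the meaning documented above."""
    return MEANINGS.get(code, f"unexpected exit code {code}")


def compare(before, after):
    """Run rowdiff on two files; return True if they differ, False if identical."""
    result = subprocess.run(["rowdiff", before, after], capture_output=True, text=True)
    if result.returncode == 0:
        return False
    if result.returncode == 1:
        return True
    # Anything else is a genuine failure, not a diff result.
    raise RuntimeError(
        f"rowdiff failed ({interpret_exit_code(result.returncode)}): {result.stderr.strip()}"
    )
```

Note that `subprocess.run(..., check=True)` would be the wrong choice here, since it raises on exit code 1 as well.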

Git integration

Diff driver

Add to .gitattributes in your repo:

*.csv diff=rowdiff
*.tsv diff=rowdiff

Add to ~/.gitconfig or .git/config:

[diff "rowdiff"]
    command = rowdiff git-diff

Now git diff uses semantic CSV output. git log -p and git show run external diff drivers only when passed --ext-diff.

Merge driver

Add to .gitattributes:

*.csv merge=rowdiff

Add to ~/.gitconfig or .git/config:

[merge "rowdiff"]
    name = CSV semantic merge
    driver = rowdiff git-merge %O %A %B %L %P

Merge strategy defaults to pk-wins-ours. Override per-file in .rowdiff.yml.

Configuration (.rowdiff.yml)

Place in your repo root. Supports glob patterns for file matching.

files:
  "data/customers.csv":
    primary_key: customer_id
    pk_uniqueness_threshold: 0.98
    type_overrides:
      created_at:
        type: iso8601
        semantic: cumulative
      units_delta:
        type: integer
        semantic: delta           # sum strategy is only valid on delta columns

  "exports/*.csv":
    primary_key: auto             # auto-detect with default threshold
    merge_strategy: pk-wins-ours

policies:
  - rule: no_pk_deletion
    severity: error
  - rule: no_type_widening
    severity: warn
  - rule: stat_shift_threshold
    column: revenue
    max_mean_delta_pct: 50
    severity: error

settings:
  pk_uniqueness_threshold: 0.98
  rename_confidence_threshold: 0.80
  type_inference_threshold: 0.90
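To show what a rule such as stat_shift_threshold could check, here is a small Python sketch. It assumes the rule compares the column mean before and after and fires when the relative change exceeds max_mean_delta_pct; that reading, and the function names, are assumptions for illustration rather than documented behaviour.

```python
from statistics import fmean


def mean_delta_pct(before, after):
    """Absolute change of the mean, as a percentage of the 'before' mean."""
    base = fmean(before)
    if base == 0:
        raise ValueError("baseline mean is zero; percentage shift is undefined")
    return abs(fmean(after) - base) / abs(base) * 100


def violates_stat_shift(before, after, max_mean_delta_pct):
    # Assumed semantics: strictly greater than the limit is a violation.
    return mean_delta_pct(before, after) > max_mean_delta_pct


revenue_before = [100, 200, 300]   # mean 200
revenue_after = [100, 200, 630]    # mean 310, a 55% shift
print(violates_stat_shift(revenue_before, revenue_after, max_mean_delta_pct=50))
```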

Primary key detection

When no explicit key is configured, rowdiff scores every column (and candidate composite pairs) for cardinality, null ratio, and name heuristics. A column qualifies if it is unique in ≥ 98% of rows and has zero nulls. If no single column qualifies, the top composite pair is tried. If nothing qualifies, a content hash of the full row is used as a fallback key (and a warning is emitted).
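The fallback chain above can be sketched in Python as follows. This is a simplified illustration: it measures uniqueness as distinct values over total rows, omits the name heuristics and scoring details, and uses hypothetical function names. rowdiff's real scoring may differ.

```python
import hashlib
from itertools import combinations


def uniqueness(rows, cols):
    """Share of rows whose value (or value tuple) in cols is distinct."""
    values = [tuple(row[c] for c in cols) for row in rows]
    return len(set(values)) / len(values) if values else 0.0


def has_nulls(rows, cols):
    return any(row[c] in ("", None) for row in rows for c in cols)


def detect_primary_key(rows, columns, threshold=0.98):
    """Return a tuple of key columns, or None if the row-hash fallback is needed."""
    def qualifies(cols):
        return not has_nulls(rows, cols) and uniqueness(rows, cols) >= threshold

    # 1. Single columns first.
    singles = [(c,) for c in columns if qualifies((c,))]
    if singles:
        return max(singles, key=lambda cols: uniqueness(rows, cols))
    # 2. Then composite pairs.
    pairs = [p for p in combinations(columns, 2) if qualifies(p)]
    if pairs:
        return max(pairs, key=lambda cols: uniqueness(rows, cols))
    # 3. Nothing qualifies: the caller should key rows by row_hash() and warn.
    return None


def row_hash(row, columns):
    """Content hash of the full row, used as a last-resort key."""
    joined = "\x1f".join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


rows = [
    {"region": "eu", "sku": "a", "qty": "1"},
    {"region": "eu", "sku": "b", "qty": "1"},
    {"region": "us", "sku": "a", "qty": "1"},
    {"region": "us", "sku": "b", "qty": "1"},
]
print(detect_primary_key(rows, ["region", "sku", "qty"]))
```

In this sample no single column is unique, so the composite pair `("region", "sku")` is chosen.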

Merge strategies

| Strategy | Behaviour |
| --- | --- |
| pk-wins-ours | On conflict, keep our version of the row |
| pk-wins-theirs | On conflict, keep their version of the row |
| interleave | On conflict, emit a conflict marker (non-zero exit) |
| sum | Add numeric values; only valid for semantic: delta columns |
| last-write-wins-by: <col> | Keep the row with the higher value in the named column |

Delete-modify conflicts always produce a conflict marker regardless of strategy.
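The strategies can be pictured as a per-row resolution function. The Python sketch below is an illustration under stated assumptions, not rowdiff's code: `merge_row` and `MergeConflict` are hypothetical names, `sum` is read here as applying both sides' changes relative to the base (ours + theirs - base), non-delta columns under `sum` are taken from our side, and ties in last-write-wins-by go to our side.

```python
class MergeConflict(Exception):
    """Raised when a row cannot be merged automatically."""


def merge_row(strategy, base, ours, theirs, delta_columns=(), order_by=None):
    """Merge one row. Rows are dicts; None means the row is absent on that side."""
    if ours == theirs:
        return ours        # both sides agree (including both deleted)
    if ours == base:
        return theirs      # only their side changed
    if theirs == base:
        return ours        # only our side changed
    if ours is None or theirs is None:
        # One side deleted the row, the other modified it.
        raise MergeConflict("delete-modify conflict")
    if strategy == "pk-wins-ours":
        return ours
    if strategy == "pk-wins-theirs":
        return theirs
    if strategy == "interleave":
        raise MergeConflict("both sides modified the row")
    if strategy == "last-write-wins-by":
        return ours if ours[order_by] >= theirs[order_by] else theirs
    if strategy == "sum":
        merged = dict(ours)
        for col in delta_columns:
            start = base[col] if base else 0
            merged[col] = ours[col] + theirs[col] - start
        return merged
    raise ValueError(f"unknown strategy: {strategy}")


base = {"id": "1", "units_delta": 10, "updated_at": "2024-01-01"}
ours = {"id": "1", "units_delta": 12, "updated_at": "2024-01-03"}
theirs = {"id": "1", "units_delta": 15, "updated_at": "2024-01-02"}

print(merge_row("sum", base, ours, theirs, delta_columns=("units_delta",)))
print(merge_row("last-write-wins-by", base, ours, theirs, order_by="updated_at"))
```

The delete-modify check sits before the strategy dispatch, which is why no strategy can override it.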

Performance

Tested on commodity hardware (Linux x86-64):

| Dataset | Time |
| --- | --- |
| 1M rows, ~5% modified | ~11s |
| Expected on M-class Mac | ~8–12s |

rowdiff streams CSV files through a batch-insert SQLite pipeline with WITHOUT ROWID B-tree tables. Memory usage is bounded regardless of file size — it does not load the full dataset into RAM.
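The general technique can be sketched with Python's built-in sqlite3 module. This is not rowdiff's pipeline: the schema, table names (`old_rows`, `new_rows`), JSON payload, and batch size are assumptions chosen for illustration. The demo uses an in-memory database for brevity; bounded memory in practice would need a file-backed database.

```python
import csv
import io
import json
import sqlite3
from itertools import islice


def load_csv(conn, table, lines, key, batch_size=10_000):
    """Stream CSV lines into a WITHOUT ROWID table, one batch at a time."""
    conn.execute(
        f'CREATE TABLE "{table}" (pk TEXT PRIMARY KEY, payload TEXT NOT NULL) WITHOUT ROWID'
    )
    reader = csv.DictReader(lines)
    while True:
        # Only batch_size rows are held in Python memory at once.
        batch = [
            (row[key], json.dumps(row, sort_keys=True))
            for row in islice(reader, batch_size)
        ]
        if not batch:
            break
        conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', batch)
    conn.commit()


def diff_tables(conn):
    """Let SQLite match rows by key; returns (added, removed, changed) keys."""
    def keys(sql):
        return [r[0] for r in conn.execute(sql)]

    added = keys("SELECT pk FROM new_rows WHERE pk NOT IN (SELECT pk FROM old_rows) ORDER BY pk")
    removed = keys("SELECT pk FROM old_rows WHERE pk NOT IN (SELECT pk FROM new_rows) ORDER BY pk")
    changed = keys(
        "SELECT o.pk FROM old_rows o JOIN new_rows n ON n.pk = o.pk "
        "WHERE n.payload != o.payload ORDER BY o.pk"
    )
    return added, removed, changed


OLD = "id,name,plan\n1,Ada,free\n2,Grace,pro\n3,Linus,free\n"
NEW = "plan,id,name\npro,2,Grace\nteam,1,Ada\nfree,4,Edsger\n"

conn = sqlite3.connect(":memory:")  # use a temporary file for large inputs
load_csv(conn, "old_rows", io.StringIO(OLD), key="id", batch_size=2)
load_csv(conn, "new_rows", io.StringIO(NEW), key="id", batch_size=2)
sql_added, sql_removed, sql_changed = diff_tables(conn)
print(sql_added, sql_removed, sql_changed)
```

A WITHOUT ROWID table stores rows directly in the primary-key B-tree, so key lookups during the join need no second index hop.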

Output formats

Terminal (default) — ANSI colored, human-readable. Schema changes, row summary cards, per-row cell diffs, statistics table, policy violations.

JSON (--format json) — structured output for downstream processing. Full report or summary-only via --summary-only.

HTML (--format html) — self-contained single-file report with embedded CSS. Suitable for archiving or attaching to PR reviews.

Development

# Run all tests
cargo test

# Run benchmarks (skipped by default)
cargo test --release --test bench_1m -- --ignored --nocapture

# Lint
cargo clippy --all-targets -- -D warnings

# Format check
cargo fmt --all -- --check

License

MIT
