feat: data-parity skill — algorithm guardrails and output style#493
feat: data-parity skill — algorithm guardrails and output style#493suryaiyer95 wants to merge 4 commits intomainfrom
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Closing — .opencode/ skill config and model defaults should not live in the open source repo. |
2bc4608 to
0f8c7ac
Compare
- Add DataParity engine integration via native Rust bindings - Add data-diff tool for LLM agent (profile, joindiff, hashdiff, cascade, auto) - Add ClickHouse driver support - Add data-parity skill: profile-first workflow, algorithm selection guide, CRITICAL warning that joindiff cannot run cross-database (always returns 0 diffs), output style rules (facts only, no editorializing) - Gitignore .altimate-code/ (credentials) and *.node (platform binaries)
0f8c7ac to
7909e55
Compare
Split large tables by a date or numeric column before diffing. Each partition is diffed independently then results are aggregated. New params: - partition_column: column to split on (date or numeric) - partition_granularity: day | week | month | year (for dates) - partition_bucket_size: bucket width for numeric columns New output field: - partition_results: per-partition breakdown (identical / differ / error) Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL. Skill updated with partition guidance and examples.
When partition_column is set without partition_granularity or partition_bucket_size, groups by raw DISTINCT values. Works for any non-date, non-numeric column: status, region, country, etc. WHERE clause uses equality: col = 'value' with proper escaping.
Rust serializes ReladiffOutcome with serde tag 'mode', producing:
{mode: 'diff', diff_rows: [...], stats: {rows_table1, rows_table2, exclusive_table1, exclusive_table2, updated, unchanged}}
Previous code checked for {Match: {...}} / {Diff: {...}} shapes that
never matched, causing partitioned diff to report all partitions as
'identical' with 0 rows.
- extractStats(): check outcome.mode === 'diff', read from stats fields
- mergeOutcomes(): aggregate mode-based outcomes correctly
- summarize()/formatOutcome(): display mode-based shape with correct labels
Summary
Two improvements to the data-parity LLM skill based on real-world testing:
Algorithm guardrail —
joindiffphysically cannot see a second table whensource_warehouse ≠ target_warehouse. It runs a single FULL OUTER JOIN on one connection, so it always reports 0 differences cross-database. Added aCRITICALwarning to the skill so the LLM always chooseshashdifforautofor cross-DB comparisons.Output style — Added explicit instruction to report facts only: counts, changed values, missing rows. No editorializing, no pitching the tool, no "this is exactly why row-level diffing matters" commentary.
Default model — Set
anthropic/claude-sonnet-4-6as the default inopencode.jsonc.Test plan
hashdiffautomaticallyjoindiffstill used correctly for same-DB