feat: data-parity skill — algorithm guardrails and output style #493

Draft
suryaiyer95 wants to merge 4 commits into main from feat/data-parity-skill-improvements
@suryaiyer95
Contributor

Summary

Two improvements to the data-parity LLM skill based on real-world testing, plus a default-model change:

Algorithm guardrail: joindiff physically cannot see a second table when source_warehouse ≠ target_warehouse. It runs a single FULL OUTER JOIN on one connection, so it always reports 0 differences cross-database. Added a CRITICAL warning to the skill so the LLM always chooses hashdiff or auto for cross-DB comparisons.

Output style: added an explicit instruction to report facts only (counts, changed values, missing rows). No editorializing, no pitching the tool, no "this is exactly why row-level diffing matters" commentary.

Default model: set anthropic/claude-sonnet-4-6 as the default in opencode.jsonc.
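
The guardrail in this PR is a prompt-level warning in the skill. The same rule could also be checked in code before dispatch. The following is a minimal sketch under stated assumptions: the `DiffRequest` type and the `guardAlgorithm` function are hypothetical names used for illustration and are not part of the tool's actual API. Only the parameter names `source_warehouse` and `target_warehouse` and the algorithm names come from this PR.

```typescript
// Hypothetical types for illustration; not the tool's real interface.
type Algorithm = "profile" | "joindiff" | "hashdiff" | "cascade" | "auto";

interface DiffRequest {
  source_warehouse: string;
  target_warehouse: string;
  algorithm: Algorithm;
}

// joindiff issues one FULL OUTER JOIN on a single connection, so it is only
// meaningful when both tables live in the same warehouse. For a cross-database
// request this sketch falls back to hashdiff instead of silently reporting 0 diffs.
function guardAlgorithm(req: DiffRequest): Algorithm {
  const crossDb = req.source_warehouse !== req.target_warehouse;
  if (crossDb && req.algorithm === "joindiff") {
    return "hashdiff";
  }
  return req.algorithm;
}
```

With this check, a request for joindiff between pg_source and pg_target would be rerouted to hashdiff, while a same-warehouse joindiff request passes through unchanged.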

Test plan

  • Ran cross-DB comparison (pg_source vs pg_target) — agent now uses hashdiff automatically
  • Ran TPC-H migration validation — output is clean fact-reporting, no promotional commentary
  • Ran SQL query comparison (same-warehouse) — joindiff still used correctly for same-DB

@coderabbitai

coderabbitai bot commented Mar 27, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@suryaiyer95
Contributor Author

Closing — .opencode/ skill config and model defaults should not live in the open source repo.

suryaiyer95 reopened this on Mar 27, 2026
suryaiyer95 force-pushed the feat/data-parity-skill-improvements branch from 2bc4608 to 0f8c7ac on March 27, 2026 00:39

- Add DataParity engine integration via native Rust bindings
- Add data-diff tool for LLM agent (profile, joindiff, hashdiff, cascade, auto)
- Add ClickHouse driver support
- Add data-parity skill: profile-first workflow, algorithm selection guide,
  CRITICAL warning that joindiff cannot run cross-database (always returns 0 diffs),
  output style rules (facts only, no editorializing)
- Gitignore .altimate-code/ (credentials) and *.node (platform binaries)

suryaiyer95 force-pushed the feat/data-parity-skill-improvements branch from 0f8c7ac to 7909e55 on March 27, 2026 00:41

Split large tables by a date or numeric column before diffing.
Each partition is diffed independently then results are aggregated.

New params:
- partition_column: column to split on (date or numeric)
- partition_granularity: day | week | month | year (for dates)
- partition_bucket_size: bucket width for numeric columns

New output field:
- partition_results: per-partition breakdown (identical / differ / error)

Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL.

Skill updated with partition guidance and examples.
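
To illustrate the numeric case described above, here is a rough sketch of how a column range might be split into fixed-width buckets and how per-partition statuses might be tallied. It assumes the column's minimum and maximum are already known (for example from a profile step). The names `numericBuckets`, `tallyPartitions`, and the `Bucket` shape are hypothetical and are not taken from the PR's code.

```typescript
// Hypothetical shapes for illustration only.
interface Bucket {
  lo: number;    // inclusive lower bound
  hi: number;    // exclusive upper bound
  where: string; // WHERE fragment selecting this bucket
}

type PartitionStatus = "identical" | "differ" | "error";

// Split [min, max] into buckets of width bucketSize, aligned to multiples of
// bucketSize so both tables produce the same bucket boundaries.
function numericBuckets(column: string, min: number, max: number, bucketSize: number): Bucket[] {
  if (bucketSize <= 0) throw new Error("bucket size must be positive");
  const buckets: Bucket[] = [];
  const start = Math.floor(min / bucketSize) * bucketSize;
  for (let lo = start; lo <= max; lo += bucketSize) {
    const hi = lo + bucketSize;
    buckets.push({ lo, hi, where: `${column} >= ${lo} AND ${column} < ${hi}` });
  }
  return buckets;
}

// Count how many partitions ended in each status.
function tallyPartitions(statuses: PartitionStatus[]): Record<PartitionStatus, number> {
  const counts: Record<PartitionStatus, number> = { identical: 0, differ: 0, error: 0 };
  for (const s of statuses) counts[s] += 1;
  return counts;
}
```

Half-open ranges (`>= lo AND < hi`) are used here so that no row can fall into two buckets. Date partitioning would need a dialect-specific truncation expression in place of the arithmetic and is not shown.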

When partition_column is set without partition_granularity or
partition_bucket_size, the diff groups by the column's raw DISTINCT values.
This works for any non-date, non-numeric column: status, region, country, etc.

WHERE clause uses equality: col = 'value' with proper escaping.
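
The equality filter with escaping could look roughly like the sketch below. It assumes the standard SQL convention of doubling single quotes inside a string literal. The function names are hypothetical, and the NULL branch is an added assumption that the commit message does not mention. Dialects that also treat backslash as an escape character in string literals (MySQL by default, for instance) would need extra handling that is not shown.

```typescript
// Quote a value as a SQL string literal by doubling embedded single quotes.
// Assumes standard SQL escaping; backslash-escaping dialects need more care.
function sqlStringLiteral(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

// Build the WHERE fragment for one distinct partition value.
// The null branch is an assumption: `col = NULL` never matches in SQL.
function categoricalWhere(column: string, value: string | null): string {
  if (value === null) return `${column} IS NULL`;
  return `${column} = ${sqlStringLiteral(value)}`;
}
```

For example, a partition value of `O'Brien` on column `region` would produce `region = 'O''Brien'`.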

Rust serializes ReladiffOutcome with serde tag 'mode', producing:
  {mode: 'diff', diff_rows: [...], stats: {rows_table1, rows_table2, exclusive_table1, exclusive_table2, updated, unchanged}}

Previous code checked for {Match: {...}} / {Diff: {...}} shapes that
never matched, causing partitioned diff to report all partitions as
'identical' with 0 rows.

- extractStats(): check outcome.mode === 'diff', read from stats fields
- mergeOutcomes(): aggregate mode-based outcomes correctly
- summarize()/formatOutcome(): display mode-based shape with correct labels
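
As an illustration of the mode-based check, the sketch below reads stats from the internally tagged shape shown above and returns null for anything else, including the old `{Diff: {...}}` wrapper. It is a simplified stand-in, not the PR's actual `extractStats()`. The `DiffStats` interface mirrors the field names in the commit message, while `hasDifferences` and the zero defaults are assumptions.

```typescript
// Field names follow the serialized shape quoted in the commit message.
interface DiffStats {
  rows_table1: number;
  rows_table2: number;
  exclusive_table1: number;
  exclusive_table2: number;
  updated: number;
  unchanged: number;
}

// Accepts unknown input because the value comes from a native binding.
// Returns null when the outcome is not in the {mode: 'diff', stats} shape.
function extractStats(outcome: unknown): DiffStats | null {
  if (typeof outcome !== "object" || outcome === null) return null;
  const o = outcome as { mode?: unknown; stats?: Partial<DiffStats> };
  if (o.mode !== "diff" || !o.stats) return null;
  const s = o.stats;
  return {
    rows_table1: s.rows_table1 ?? 0, // missing fields default to 0 (assumption)
    rows_table2: s.rows_table2 ?? 0,
    exclusive_table1: s.exclusive_table1 ?? 0,
    exclusive_table2: s.exclusive_table2 ?? 0,
    updated: s.updated ?? 0,
    unchanged: s.unchanged ?? 0,
  };
}

// A partition differs if any row is exclusive to one side or was updated.
function hasDifferences(stats: DiffStats): boolean {
  return stats.exclusive_table1 + stats.exclusive_table2 + stats.updated > 0;
}
```

Returning null for an unrecognized shape lets the caller mark the partition as an error rather than silently treating it as identical, which is the failure mode the commit describes.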