Fix BloomFilter buffer incompatibility between Spark and Comet #3003
Conversation
Thanks @Shekharrajak. Looks like you need to run

Done. Please help in triggering the workflow. Thanks!

@Shekharrajak there are compilation errors

@andygrove, it now looks fine locally. Is there a way to run all the workflow checks locally, so we can make sure everything is fine before running the workflow on GitHub?
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##               main    #3003      +/-   ##
============================================
+ Coverage     56.12%   59.55%    +3.43%
- Complexity      976     1379      +403
============================================
  Files           119      167       +48
  Lines         11743    15496     +3753
  Branches       2251     2569      +318
============================================
+ Hits           6591     9229     +2638
- Misses         4012     4970      +958
- Partials       1140     1297      +157

View full report in Codecov by Sentry.
I think in the compilation error case that should be pretty reproducible locally. I definitely recommend running

In the absence of any new tests, it feels like we should be relaxing a fallback constraint in operators.scala or modifying existing tests to exercise this behavior. Otherwise I suspect we're still falling back. @andygrove, do you recall where we might want to make changes to test this behavior?
I think this is the condition: https://github.com/apache/datafusion-comet/blob/main/spark/src/main/scala/org/apache/spark/sql/comet/operators.scala#L1074 |
mbutrovich left a comment:
Thanks @Shekharrajak! Can you remove any relevant fallbacks and/or modify tests so we know that we're exercising this behavior?
Handle Spark's full serialization format (12-byte header + bits data) in merge_filter() to support Spark partial / Comet final execution. The fix automatically detects the format and extracts the bits data accordingly.
Fixes #2889
Rationale for this change

- Spark's serialize() returns the full format: a 12-byte header (version + numHashFunctions + numWords) followed by the bits data.
- Comet's state_as_bytes() returns the bits data only.
- When a Spark partial aggregate sends the full format, Comet's merge_filter() expects bits-only data, causing a mismatch.

Ref: https://github.com/apache/spark/blob/master/common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java#L99
Ref: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L219

Spark format: BloomFilterImpl.writeTo() (4 + 4 bytes) + BitArray.writeTo() (4 bytes + bits data)
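As a hedged sketch of that layout (assuming Java's big-endian DataOutputStream encoding; the function name here is illustrative, not Spark's actual API):

```rust
// Illustrative reconstruction of Spark's serialized BloomFilter layout:
//   BloomFilterImpl.writeTo(): 4-byte version + 4-byte numHashFunctions
//   BitArray.writeTo():        4-byte numWords + numWords * 8 bytes of bit data
// All integers are big-endian, as written by Java's DataOutputStream.
fn spark_bloom_filter_bytes(num_hash_functions: i32, words: &[i64]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(12 + words.len() * 8);
    buf.extend_from_slice(&1i32.to_be_bytes()); // version (V_1 = 1)
    buf.extend_from_slice(&num_hash_functions.to_be_bytes());
    buf.extend_from_slice(&(words.len() as i32).to_be_bytes());
    for w in words {
        buf.extend_from_slice(&w.to_be_bytes()); // bits data
    }
    buf
}

fn main() {
    // A filter with 3 hash functions and a 2-word bit array:
    let buf = spark_bloom_filter_bytes(3, &[0x1, 0x2]);
    // 12-byte header + 2 words * 8 bytes each
    assert_eq!(buf.len(), 12 + 16);
    println!("total serialized length: {}", buf.len());
}
```

This is where the fixed "12 + expected bits size" in the detection below comes from: the header contributes exactly 12 bytes regardless of filter size.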
What changes are included in this PR?

- Detects the Spark format (buffer size = 12 + expected bits size)
- Extracts the bits data by skipping the 12-byte header if it is the Spark format
- Returns the bits data as-is if it is the Comet format
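The steps above can be sketched as follows (a minimal sketch: `extract_bits` and its signature are illustrative, not Comet's actual merge_filter() API; `expected_bits_len` stands for the byte length of this filter's bit array):

```rust
/// Number of header bytes in Spark's full serialization:
/// version + numHashFunctions + numWords, 4 bytes each.
const SPARK_HEADER_LEN: usize = 12;

/// Return the bits-only portion of `buf`, whichever format it arrived in.
fn extract_bits(buf: &[u8], expected_bits_len: usize) -> Result<&[u8], String> {
    if buf.len() == SPARK_HEADER_LEN + expected_bits_len {
        // Spark's full serialization: skip the 12-byte header.
        Ok(&buf[SPARK_HEADER_LEN..])
    } else if buf.len() == expected_bits_len {
        // Comet's state_as_bytes(): already bits-only.
        Ok(buf)
    } else {
        Err(format!(
            "unexpected BloomFilter buffer length {}, expected {} or {}",
            buf.len(),
            expected_bits_len,
            SPARK_HEADER_LEN + expected_bits_len
        ))
    }
}

fn main() {
    let spark_buf = vec![0u8; 12 + 16]; // full format, 2-word bit array
    let comet_buf = vec![0u8; 16]; // bits-only format
    assert_eq!(extract_bits(&spark_buf, 16).unwrap().len(), 16);
    assert_eq!(extract_bits(&comet_buf, 16).unwrap().len(), 16);
    assert!(extract_bits(&[0u8; 10], 16).is_err());
    println!("format detection ok");
}
```

Note the detection is purely size-based: for a fixed expected bits length the two candidate buffer lengths always differ by exactly 12 bytes, so the check is unambiguous.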
How are these changes tested?
Spark SQL test