Conversation

@Shekharrajak
Contributor

@Shekharrajak Shekharrajak commented Dec 28, 2025

Handle Spark's full serialization format (12-byte header + bits) in merge_filter() to support Spark partial / Comet final execution. The fix automatically detects the format and extracts bits data accordingly.

Fixes #2889

Rationale for this change

- Spark's serialize() returns the full format: a 12-byte header (version + numHashFunctions + numWords) followed by the bits data
- Comet's state_as_bytes() returns the bits data only
- When a Spark partial aggregate sends the full format, Comet's merge_filter() expects bits-only data, causing a mismatch

Ref https://github.com/apache/spark/blob/master/common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java#L99

Ref https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L219

Spark format: BloomFilterImpl.writeTo() (4+4 bytes) + BitArray.writeTo() (4 bytes + bits)
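As a minimal sketch of the layout described above: BloomFilterImpl.writeTo() and BitArray.writeTo() write their header fields via java.io.DataOutputStream, which emits big-endian 32-bit ints. The function name below is illustrative, not Comet's actual API.

```rust
/// Parse the 12-byte header that Spark's BloomFilterImpl.writeTo() emits.
/// Returns (version, numHashFunctions, numWords), or None if the buffer is
/// too short to contain a header. Hypothetical helper for illustration only.
fn parse_spark_header(buf: &[u8]) -> Option<(u32, u32, u32)> {
    if buf.len() < 12 {
        return None;
    }
    // DataOutputStream.writeInt() is big-endian.
    let read_u32 = |i: usize| u32::from_be_bytes(buf[i..i + 4].try_into().unwrap());
    let version = read_u32(0);            // BloomFilterImpl.writeTo: version
    let num_hash_functions = read_u32(4); // BloomFilterImpl.writeTo: numHashFunctions
    let num_words = read_u32(8);          // BitArray.writeTo: word count
    Some((version, num_hash_functions, num_words))
}

fn main() {
    // Build a fake header: version 1, 3 hash functions, 2 words.
    let mut buf = Vec::new();
    buf.extend_from_slice(&1u32.to_be_bytes());
    buf.extend_from_slice(&3u32.to_be_bytes());
    buf.extend_from_slice(&2u32.to_be_bytes());
    assert_eq!(parse_spark_header(&buf), Some((1, 3, 2)));
    assert_eq!(parse_spark_header(&[0u8; 11]), None);
}
```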

What changes are included in this PR?

- Detects the Spark format (buffer size = 12 + expected_bits_size)
- Extracts the bits data by skipping the 12-byte header if the buffer is in Spark format
- Returns the bits as-is if the buffer is in Comet format
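The detection logic above can be sketched as follows. This is a simplified standalone version under the stated size-based heuristic; the function name and signature are illustrative, not the actual merge_filter() code.

```rust
/// Return the bits-only portion of a serialized bloom filter buffer.
/// If the buffer length equals 12 + expected_bits_size, it is assumed to be
/// Spark's full format (12-byte header + bits) and the header is skipped;
/// otherwise the buffer is assumed to already be Comet's bits-only format.
fn extract_bits(buf: &[u8], expected_bits_size: usize) -> &[u8] {
    const SPARK_HEADER_LEN: usize = 12; // version + numHashFunctions + numWords
    if buf.len() == SPARK_HEADER_LEN + expected_bits_size {
        // Spark full format: skip the header, keep only the bits.
        &buf[SPARK_HEADER_LEN..]
    } else {
        // Comet format: the buffer is already bits-only.
        buf
    }
}

fn main() {
    let bits = vec![0xAAu8; 16];
    // Simulate Spark's full format: 12-byte header followed by the bits.
    let mut spark_buf = vec![0u8; 12];
    spark_buf.extend_from_slice(&bits);
    assert_eq!(extract_bits(&spark_buf, bits.len()), &bits[..]);
    // Comet's bits-only format passes through unchanged.
    assert_eq!(extract_bits(&bits, bits.len()), &bits[..]);
}
```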

How are these changes tested?

Spark SQL test

@andygrove
Member

Thanks @Shekharrajak. Looks like you need to run make format to fix build failures

@Shekharrajak
Contributor Author

> Thanks @Shekharrajak. Looks like you need to run make format to fix build failures

Done. Please help trigger the workflow. Thanks!

@mbutrovich mbutrovich self-requested a review January 5, 2026 17:06
@andygrove
Member

@Shekharrajak there are compilation errors

@Shekharrajak
Contributor Author

@andygrove, it is now building fine locally. Is there a way to run all the workflow checks locally, so we can make sure everything is fine before triggering the workflow on GitHub?

@codecov-commenter

codecov-commenter commented Jan 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.55%. Comparing base (f09f8af) to head (030c67b).
⚠️ Report is 822 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3003      +/-   ##
============================================
+ Coverage     56.12%   59.55%   +3.43%     
- Complexity      976     1379     +403     
============================================
  Files           119      167      +48     
  Lines         11743    15496    +3753     
  Branches       2251     2569     +318     
============================================
+ Hits           6591     9229    +2638     
- Misses         4012     4970     +958     
- Partials       1140     1297     +157     


@mbutrovich
Contributor

> @andygrove, it is now building fine locally. Is there a way to run all the workflow checks locally, so we can make sure everything is fine before triggering the workflow on GitHub?

I think the compilation errors should be pretty reproducible locally. I definitely recommend running cargo clippy in the native directory first, since that will catch native compilation and linting errors.

@mbutrovich
Contributor

In the absence of any new tests, it feels like we should be relaxing a fallback constraint in operators.scala or modifying existing tests to exercise this behavior. Otherwise I suspect we're still falling back. @andygrove do you recall where we might want to make changes to test this behavior?

@mbutrovich mbutrovich left a comment

Thanks @Shekharrajak! Can you remove any relevant fallbacks and/or modify tests so we know that we're exercising this behavior?


Development

Successfully merging this pull request may close these issues.

Bloom filter intermediate aggregate buffers are not compatible between Spark and Comet
