Add bloom filter pruning on the indexed parquet read path #21904
Bukhtawar wants to merge 3 commits into
Persistent review updated to latest commit 55b2ccc
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

@@             Coverage Diff              @@
##               main   #21904      +/-   ##
============================================
- Coverage     73.51%   73.49%   -0.03%
+ Complexity    75582    75553      -29
============================================
  Files          6034     6034
  Lines        342661   342661
  Branches      49294    49294
============================================
- Hits         251918   251827      -91
- Misses        70712    70810      +98
+ Partials      20031    20024       -7
Force-pushed from 55b2ccc to 2b97c12
Persistent review updated to latest commit 2b97c12
Force-pushed from 2b97c12 to 7911c00
PR Code Analyzer ❗
AI-powered 'Code-Diff-Analyzer' found issues on commit 7911c00: 'Diff too large, requires skip by maintainers after manual review'.
Pull Request Author(s): Please update your Pull Request according to the report above.
Repository Maintainer(s): You can [...] Thanks.
Integrate per-row-group bloom filter checks into the indexed execution prefetch phase. Reuses DataFusion's PruningPredicate::prune() with a BloomFilterStatistics PruningStatistics impl, the same pattern DataFusion uses internally for its vanilla ParquetExec bloom filter pruning.

When enabled, the PruningPredicate's literal_columns() identifies which columns to check, bloom filter bytes are read from the object store, and PruningPredicate::prune() evaluates whether the row group (RG) can be skipped. If pruned, the FFM collector call, page pruning, and decode are all avoided.

Java:
- Add datafusion.indexed.bloom_filter_on_read (Dynamic, NodeScope, default true)
- Wire through WireConfigSnapshot at offset 72 (i32, 0/1)

Rust:
- New module: indexed_table/bloom_pruner.rs
- BloomFilterStatistics implements PruningStatistics (same as DataFusion)
- check_scalar: type-aware bloom check (aligned with DataFusion's impl)
- bloom_prune_rg: loads per-column Sbbf, calls predicate.prune()
- SingleCollectorEvaluator: calls bloom_prune_rg at top of prefetch_rg
- DatafusionQueryConfig: bloom_filter_on_read field in wire format

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
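To make the gating described in this commit message concrete, here is a minimal, std-only sketch of the idea: check a per-row-group bloom filter first and call the expensive collector only for row groups that survive. It is not the code in this PR. `ToyBloom`, `RowGroup`, `can_skip_rg`, and `prefetch` are hypothetical names, and the toy filter is a plain bit vector rather than parquet's split-block filter (Sbbf) or DataFusion's PruningPredicate machinery.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter for illustration only (assumption: parquet's real
/// filter is a split-block bloom filter with a different layout).
pub struct ToyBloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl ToyBloom {
    pub fn new(num_bits: usize, hashes: u64) -> Self {
        ToyBloom { bits: vec![false; num_bits.max(1)], hashes: hashes.max(1) }
    }

    /// Maps (value, seed) to a bit position.
    fn index(&self, value: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        value.hash(&mut h);
        (h.finish() % self.bits.len() as u64) as usize
    }

    pub fn insert(&mut self, value: &str) {
        for seed in 0..self.hashes {
            let i = self.index(value, seed);
            self.bits[i] = true;
        }
    }

    /// false => value is definitely absent; true => value may be present.
    pub fn might_contain(&self, value: &str) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.index(value, seed)])
    }
}

/// Hypothetical per-row-group state: an optional filter for one column.
pub struct RowGroup {
    pub bloom: Option<ToyBloom>,
}

/// Returns true when the row group can be skipped for `column = value`.
pub fn can_skip_rg(rg: &RowGroup, value: &str) -> bool {
    match rg.bloom {
        Some(ref b) => !b.might_contain(value),
        None => false, // no filter written: absence cannot be proven
    }
}

/// Runs `collect` (a stand-in for the FFM collector call) only for row
/// groups that are not pruned. Returns the number of skipped row groups.
pub fn prefetch<F: FnMut(usize)>(
    rgs: &[RowGroup],
    value: &str,
    bloom_filter_on_read: bool,
    mut collect: F,
) -> usize {
    let mut skipped = 0;
    for (i, rg) in rgs.iter().enumerate() {
        if bloom_filter_on_read && can_skip_rg(rg, value) {
            skipped += 1;
            continue;
        }
        collect(i);
    }
    skipped
}
```

Because a bloom filter has no false negatives, a "definitely absent" answer is safe to act on, while a "may be present" answer (or a missing filter) falls through to the normal path. Passing `false` for the flag reproduces the unpruned behavior.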
Force-pushed from 7911c00 to 5d5ed91
Persistent review updated to latest commit 5d5ed91
❌ Gradle check result for 5d5ed91: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Two benchmarks measuring bloom filter pruning performance:
1. bloom_filter_bench: Micro-benchmark of bloom_prune_rg in isolation.
Measures IO + check cost per RG for present/absent/no-bloom scenarios.
2. bloom_filter_e2e_bench: End-to-end benchmark of full prefetch_rg
pipeline with a mock collector simulating 1ms FFM latency.
Demonstrates 7x speedup when bloom filter prunes an absent value.
Run: cargo bench --bench bloom_filter_bench
cargo bench --bench bloom_filter_e2e_bench
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
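For readers who want the shape of the end-to-end comparison without running the Criterion benches, the following is a rough std-only sketch. It assumes a mock collector that simply sleeps for the configured latency; `mock_collect` and `run_pipeline` are hypothetical names, and the sketch does not reproduce the 7x figure reported above.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Mock collector standing in for the FFM call; it only sleeps to
/// simulate the round-trip latency (assumption, not the real collector).
pub fn mock_collect(latency: Duration) {
    sleep(latency);
}

/// Processes one entry per row group; `pruned[i] == true` means the bloom
/// check already proved row group i can be skipped.
/// Returns (number of collector calls, total elapsed time).
pub fn run_pipeline(pruned: &[bool], latency: Duration) -> (usize, Duration) {
    let start = Instant::now();
    let mut calls = 0;
    for &skip in pruned {
        if skip {
            continue; // pruned: no collector call, no decode
        }
        mock_collect(latency);
        calls += 1;
    }
    (calls, start.elapsed())
}
```

Calling `run_pipeline(&[false; 20], Duration::from_millis(1))` and then `run_pipeline(&[true; 20], Duration::from_millis(1))` contrasts the fully unpruned and fully pruned cases. The real benchmarks also include bloom filter IO and check cost, which this sketch leaves out.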
Force-pushed from 5d5ed91 to 5fa55f5
Persistent review updated to latest commit 5fa55f5
❌ Gradle check result for 5fa55f5: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Persistent review updated to latest commit c6b93d9
Integrate per-row-group bloom filter checks into the indexed execution prefetch phase. When enabled, equality predicates (=, IN) are checked against parquet bloom filters before invoking the Lucene collector FFM call. If the bloom filter proves the queried value is absent from a row group, the entire RG is skipped — saving the FFM round-trip, page pruning, and parquet decode.
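The restriction to equality predicates can be illustrated with a small sketch. It assumes a membership oracle `might_contain` that answers like a bloom filter (false means definitely absent); the `Predicate` enum and `can_prune` are hypothetical names for illustration, not types from this PR or from DataFusion.

```rust
/// Hypothetical predicate shapes on a single string column.
pub enum Predicate<'a> {
    Eq(&'a str),
    In(Vec<&'a str>),
    GreaterThan(&'a str),
}

/// Decides whether a row group can be skipped, given a bloom-style oracle
/// where `might_contain(v) == false` means `v` is definitely absent.
pub fn can_prune<F: Fn(&str) -> bool>(pred: &Predicate, might_contain: F) -> bool {
    match *pred {
        // `col = v`: skip only if v is provably absent.
        Predicate::Eq(v) => !might_contain(v),
        // `col IN (...)`: skip only if every listed value is provably absent.
        // An empty list matches no rows, so skipping is also correct there.
        Predicate::In(ref vs) => !vs.iter().any(|v| might_contain(*v)),
        // Range predicates cannot be answered by a membership test.
        Predicate::GreaterThan(_) => false,
    }
}
```

A single "may be present" answer for any value in an IN list is enough to keep the row group, which is why the check is conservative: it can only remove work, never rows.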
Java:
Rust:
Design:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.