perf: [iceberg] Remove IcebergFileStream, use iceberg-rust's parallelization, bump iceberg-rust to latest, cache SchemaAdapter #3051
Merged
mbutrovich
merged 20 commits into
apache:main
from
mbutrovich:more_more_iceberg_file_stream
Jan 20, 2026
Codecov Report
✅ All modified and coverable lines are covered by tests.

| Metric | main | #3051 | +/- |
|---|---|---|---|
| Coverage | 56.12% | 60.02% | +3.89% |
| Complexity | 976 | 1429 | +453 |
| Files | 119 | 170 | +51 |
| Lines | 11743 | 15746 | +4003 |
| Branches | 2251 | 2602 | +351 |
| Hits | 6591 | 9451 | +2860 |
| Misses | 4012 | 4976 | +964 |
| Partials | 1140 | 1319 | +179 |
# Conflicts: # native/core/src/execution/operators/iceberg_scan.rs
…s, which in turn is set to match Spark's spark.task.cpus.
liurenjie1024
added a commit
to apache/iceberg-rust
that referenced
this pull request
Jan 20, 2026
…oid waker churn and add determinism to FileScanTask processing (#2020)

## Which issue does this PR close?

- N/A.

## What changes are included in this PR?

- Due to the way Comet maps DataFusion `SessionContext`, the tokio runtime, and Spark Tasks, we see frequent waker churn when concurrency is set to 1 in the `ArrowReader`. This adds a fast path that does not use `try_flatten_unordered` and its internal `replace_waker` calls.
- This also prevents tasks from being reordered at runtime. Several Iceberg Java tests expect specific query results without an `ORDER BY`, so this enables those tests to keep working when concurrency is set to 1. See apache/datafusion-comet#3051 and <img width="3804" height="754" alt="flamegraph" src="https://github.com/user-attachments/assets/26b93e85-5835-4bf4-b7f1-b136face940d" />

## Are these changes tested?

New test for determinism, also running the entire Iceberg Java Spark suite via Comet in apache/datafusion-comet#3051.

---------

Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>
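The fast path described in that commit message can be pictured with a small synchronous analogy. This is only a sketch under stated assumptions: `ScanTask`, `read_task`, and `read_all` are hypothetical names, and threads plus a channel stand in for the async `try_flatten_unordered` machinery. It is not the iceberg-rust implementation. The point it shows is that when concurrency is 1, skipping the concurrent merge entirely avoids coordination overhead and keeps results in task order.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a file scan task: a file name and a row count.
#[derive(Debug)]
struct ScanTask {
    file: String,
    rows: usize,
}

/// Hypothetical "read": emits one label per row so that ordering is observable.
fn read_task(task: &ScanTask) -> Vec<String> {
    (0..task.rows).map(|i| format!("{}:{}", task.file, i)).collect()
}

/// Reads all tasks. With `concurrency <= 1` it takes a sequential fast path
/// that preserves task order. Otherwise it fans out to worker threads and
/// returns rows in whatever order they arrive on the channel.
fn read_all(tasks: &[ScanTask], concurrency: usize) -> Vec<String> {
    if concurrency <= 1 {
        // Fast path: no channel, no threads, deterministic order.
        return tasks.iter().flat_map(read_task).collect();
    }

    // General path (analogy for an unordered concurrent merge).
    let chunk_size = ((tasks.len() + concurrency - 1) / concurrency).max(1);
    let (tx, rx) = mpsc::channel();
    thread::scope(|s| {
        for chunk in tasks.chunks(chunk_size) {
            let tx = tx.clone();
            s.spawn(move || {
                for task in chunk {
                    for row in read_task(task) {
                        tx.send(row).expect("receiver is alive");
                    }
                }
            });
        }
    });
    // All workers have finished; drop the last sender so the receiver ends.
    drop(tx);
    rx.into_iter().collect()
}
```

Calling `read_all(&tasks, 1)` returns rows in exactly the order the tasks were given, which is the property the Iceberg Java tests without an `ORDER BY` rely on. With a higher concurrency the same rows come back, but their order is not guaranteed.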
Contributor
Author
apache/iceberg-rust#2020 merged so we should be good to review this now.
andygrove
approved these changes
Jan 20, 2026
Contributor
Author
Thanks for the review @andygrove! Will merge once CI goes green again.
Which issue does this PR close?
N/A
Rationale for this change
Profiling Iceberg native scans revealed significant overhead in async stream polling, particularly:
- `tokio::drop_waker` and `tokio::park::clone` consuming substantial time in `IcebergStreamWrapper::poll_next`
- `futures_util::stream::flatten_unordered::SharedPollState::{start_polling,stop_polling}` showing lock contention

I think this is due to competing parallelization logic:
- `IcebergFileStream` passed one `FileScanTask` at a time to iceberg-rust, causing `flatten_unordered` to coordinate parallelization across a single task (pure overhead). Stream nesting created excessive waker churn.

What changes are included in this PR?
- Remove `IcebergFileStream` and pass all `FileScanTask`s directly to iceberg-rust. I tried this in the past but I can't remember why I abandoned it. Let's try again.

How are these changes tested?
Existing tests.
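
To make the rationale above concrete, here is a minimal sketch of the two call shapes. All names (`ScanTask`, `Reader`, `merge_setups`, and both functions) are hypothetical and exist only for illustration; the counter stands in for the per-call cost of building merge and poll state, and the code is synchronous rather than stream-based. It is not the Comet or iceberg-rust code.

```rust
/// Hypothetical stand-in for a scan task; only the file name matters here.
struct ScanTask {
    file: &'static str,
}

/// Hypothetical reader that counts how often its merge machinery is set up.
#[derive(Default)]
struct Reader {
    merge_setups: usize,
}

impl Reader {
    /// Accepts a batch of tasks and sets up one merge coordinator per call.
    fn read(&mut self, tasks: &[ScanTask]) -> Vec<String> {
        self.merge_setups += 1; // stands in for building shared poll state
        tasks.iter().map(|t| format!("rows from {}", t.file)).collect()
    }
}

/// Old shape: an outer wrapper hands the reader one task per call, so the
/// reader coordinates "parallelism" across a single task every time.
fn read_one_at_a_time(reader: &mut Reader, tasks: &[ScanTask]) -> Vec<String> {
    tasks
        .iter()
        .flat_map(|t| reader.read(std::slice::from_ref(t)))
        .collect()
}

/// New shape: the reader receives every task in a single call and can
/// schedule them itself.
fn read_all_at_once(reader: &mut Reader, tasks: &[ScanTask]) -> Vec<String> {
    reader.read(tasks)
}
```

Both functions return the same rows. The difference is that the one-at-a-time shape pays the setup cost once per task, while the all-at-once shape pays it once per scan.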