feat(theta): introduce intersection theta set operation#100
feat(theta): introduce intersection theta set operation#100tisonkun merged 11 commits intoapache:mainfrom
Conversation
bb8b9fd to
9f17971
Compare
…hTable used for set operation
9f17971 to
054107c
Compare
There was a problem hiding this comment.
Pull request overview
This PR refactors the ThetaHashTable and introduces the ThetaSketchView trait to prepare for implementing set operations (union, intersection, difference) on theta sketches. The changes improve the API design by separating hash computation from insertion logic and clarifying the semantics of emptiness.
Changes:
- Introduced
ThetaSketchViewtrait providing a unified read-only interface for both mutableThetaSketchand immutableCompactThetaSketch - Refactored
ThetaHashTableto separate hash computation (hash) from insertion with theta screening (try_insert_hash) - Changed
is_emptysemantics to track logical emptiness (whether updates were attempted) rather than physical emptiness (whether entries exist) - Renamed
num_entriestonum_retainedfor clarity - Added
new_with_stateconstructor andfrom_partsfactory method for creating tables/sketches with explicit state
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| datasketches/src/theta/sketch.rs | Introduced ThetaSketchView trait, implemented it for ThetaSketch and CompactThetaSketch, added seed_hash() method, refactored update() to use try_insert(), added from_parts() constructor |
| datasketches/src/theta/mod.rs | Exported ThetaSketchView trait |
| datasketches/src/theta/hash_table.rs | Split hash_and_screen into hash() and try_insert_hash(), added is_empty field for logical emptiness, renamed num_entries to num_retained, added new_with_state() constructor, made REBUILD_THRESHOLD pub(crate), updated all tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ordered, | ||
| empty, | ||
| } | ||
| CompactThetaSketch::from_parts(entries, theta, self.table.seed_hash(), ordered, empty) |
There was a problem hiding this comment.
The refactoring to use from_parts is good, but note that the empty parameter passed to from_parts is computed from entries.is_empty() (line 239), which checks if there are retained hashes. However, with the new is_empty semantics in ThetaHashTable that tracks logical emptiness (whether any updates were attempted), there's now a potential inconsistency: if a ThetaSketch has been updated but all values were screened out, self.table.is_empty() would return false, but the CompactThetaSketch created here would have empty = true. This may be intentional for CompactThetaSketch, but it creates a semantic difference between sketch.is_empty() and sketch.compact(false).is_empty().
tisonkun
left a comment
There was a problem hiding this comment.
I suggest you directly implement all the requirements (set ops).
These abstractions look intermediate - I can't judge whether they are correct isolated.
Have push commit for intersection op |
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
This PR: