fix(trie): collect_reachable empty-subtree check must use the right d…#714
Conversation
…epth `collect_reachable` is the GC live-set builder. It walks every surviving committed root and inserts every reachable node + leaf-value hash into a HashSet that `gc_orphaned_nodes` uses to decide what to delete. The early-out check was `hash == empty_hash(0)`, but `empty_hash(d)` is depth-specific — each level has its own canonical "empty subtree" sentinel, computed bottom-up. Hitting an empty child at depth > 0 didn't match `empty_hash(0)`, so the walk fell through to `load_node`, which returns `Ok(None)` for those (empty subtrees are never materialised in `trie_nodes`), and the function silently exited without recording anything in `live`. That alone doesn't corrupt anything in the steady state, but it silently diverged from `walk_verify` (which has always taken `depth` and checks `hash == empty_hash(depth.min(256))` — see tree.rs:571). Two functions reasoning about reachability with different definitions of "empty" is the kind of thing that bites later, and on 2026-05-24 it showed up: same value-hash `516a0dc27c...` surfaced as an orphan-value on one validator (h=2215224) and an orphan-node on a fullnode (h=2131563) — different roles, different heights, same hash, all on the same v2.2.15 binary. Each node's prune ran at its own 5000-block boundary with a slightly different live-set snapshot. This commit aligns `collect_reachable` with `walk_verify`: take an explicit `depth` parameter, check `hash == empty_hash(depth)`, and pass `depth + 1` on each recurse. Top-level callers pass 0 (root). NOTE: this fix is necessary but possibly not sufficient for the underlying production symptom. The same incident may also involve the trie writes not being fully atomic across `trie_nodes` + `trie_values` across an unclean shutdown — vps3 had just finalised h=2215224 milliseconds before the crash, exactly the window where a non-atomic write would leave the leaf node persisted but the value not. That's a storage-layer follow-up tracked separately. Includes regression test `test_prune_survives_multi_key_at_varied_depths` that builds a multi-key trie, rotates values to make v1 unreachable from v2, runs `prune(keep=0)`, then asserts every v2 key still resolves and `verify_integrity` passes.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'sentrix-trie benches'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.50.
| Benchmark suite | Current: 26b82a8 | Previous: c6ae261 | Ratio |
|---|---|---|---|
insert_single |
169836 ns/iter (± 22784) |
83470 ns/iter (± 10721) |
2.03 |
This comment was automatically generated by workflow using github-action-benchmark.
📝 WalkthroughWalkthrough
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/sentrix-trie/src/tree.rs (1)
732-744: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueMinor inconsistency with
walk_verify's defensive depth bound.The fix correctly addresses the depth-specific empty-hash check. However,
walk_verify(line 571) usesdepth.min(256)as a defensive bound, whilecollect_reachableusesdepthdirectly. In a well-formed 256-level SMT, depth should never exceed 256, so this doesn't affect correctness. For defensive consistency between the two tree-walking functions, consider aligning the pattern.Per coding guidelines, this is a suggestion only—not a required change.
🔧 Optional: align with walk_verify's defensive bound
fn collect_reachable( &self, hash: NodeHash, depth: usize, live: &mut std::collections::HashSet<NodeHash>, ) -> SentrixResult<()> { use crate::node::empty_hash; // Skip empty subtrees (depth-specific) and already-visited nodes - if hash == empty_hash(depth) || live.contains(&hash) { + if hash == empty_hash(depth.min(256)) || live.contains(&hash) { return Ok(()); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/sentrix-trie/src/tree.rs` around lines 732 - 744, collect_reachable currently uses depth directly when checking empty_hash and recursing, while walk_verify defensively bounds depth with depth.min(256); to align behavior, change collect_reachable to use a bounded depth (e.g., let d = depth.min(256)) when calling empty_hash and when recursing (or apply .min(256) to recursive calls), keeping references to empty_hash(depth), collect_reachable(left/right, depth + 1, live) and the TrieNode::Internal match so the defensive bound mirrors walk_verify's pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/sentrix-trie/src/tree.rs`:
- Around line 732-744: collect_reachable currently uses depth directly when
checking empty_hash and recursing, while walk_verify defensively bounds depth
with depth.min(256); to align behavior, change collect_reachable to use a
bounded depth (e.g., let d = depth.min(256)) when calling empty_hash and when
recursing (or apply .min(256) to recursive calls), keeping references to
empty_hash(depth), collect_reachable(left/right, depth + 1, live) and the
TrieNode::Internal match so the defensive bound mirrors walk_verify's pattern.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: ce42db11-8923-4bb5-95b4-196914dcfe38
📒 Files selected for processing (1)
crates/sentrix-trie/src/tree.rs
…epth
collect_reachableis the GC live-set builder. It walks every surviving committed root and inserts every reachable node + leaf-value hash into a HashSet thatgc_orphaned_nodesuses to decide what to delete.The early-out check was
hash == empty_hash(0), butempty_hash(d)is depth-specific — each level has its own canonical "empty subtree" sentinel, computed bottom-up. Hitting an empty child at depth > 0 didn't matchempty_hash(0), so the walk fell through toload_node, which returnsOk(None)for those (empty subtrees are never materialised intrie_nodes), and the function silently exited without recording anything inlive.That alone doesn't corrupt anything in the steady state, but it silently diverged from
walk_verify(which has always takendepthand checkshash == empty_hash(depth.min(256))— see tree.rs:571). Two functions reasoning about reachability with different definitions of "empty" is the kind of thing that bites later, and on 2026-05-24 it showed up: same value-hash516a0dc27c...surfaced as an orphan-value on one validator (h=2215224) and an orphan-node on a fullnode (h=2131563) — different roles, different heights, same hash, all on the same v2.2.15 binary. Each node's prune ran at its own 5000-block boundary with a slightly different live-set snapshot.This commit aligns
collect_reachablewithwalk_verify: take an explicitdepthparameter, checkhash == empty_hash(depth), and passdepth + 1on each recurse. Top-level callers pass 0 (root).NOTE: this fix is necessary but possibly not sufficient for the underlying production symptom. The same incident may also involve the trie writes not being fully atomic across
trie_nodes+trie_valuesacross an unclean shutdown — vps3 had just finalised h=2215224 milliseconds before the crash, exactly the window where a non-atomic write would leave the leaf node persisted but the value not. That's a storage-layer follow-up tracked separately.Includes regression test
test_prune_survives_multi_key_at_varied_depthsthat builds a multi-key trie, rotates values to make v1 unreachable from v2, runsprune(keep=0), then asserts every v2 key still resolves andverify_integritypasses.Summary
Risk tier
Check ONE:
sentrix-core,sentrix-trie,sentrix-staking,sentrix-bft),block_executor,apply_block_*,state_rootpathRequired by tier
🟢 Low — minimum bar
🟡 Medium — adds
#[test]in same PR🟠 High — adds
🔴 Critical — adds
Test plan
Rollback plan
Related
Summary by CodeRabbit