Skip to content

fix(trie): collect_reachable empty-subtree check must use the right d…#714

Merged
github-actions[bot] merged 1 commit into
mainfrom
fix/trie-gc-empty-hash-depth-check
May 25, 2026
Merged

fix(trie): collect_reachable empty-subtree check must use the right d…#714
github-actions[bot] merged 1 commit into
mainfrom
fix/trie-gc-empty-hash-depth-check

Conversation

@satyakwok
Copy link
Copy Markdown
Collaborator

@satyakwok satyakwok commented May 25, 2026

…epth

collect_reachable is the GC live-set builder. It walks every surviving committed root and inserts every reachable node + leaf-value hash into a HashSet that gc_orphaned_nodes uses to decide what to delete.

The early-out check was hash == empty_hash(0), but empty_hash(d) is depth-specific — each level has its own canonical "empty subtree" sentinel, computed bottom-up. Hitting an empty child at depth > 0 didn't match empty_hash(0), so the walk fell through to load_node, which returns Ok(None) for those (empty subtrees are never materialised in trie_nodes), and the function silently exited without recording anything in live.

That alone doesn't corrupt anything in the steady state, but it silently diverged from walk_verify (which has always taken depth and checks hash == empty_hash(depth.min(256)) — see tree.rs:571). Two functions reasoning about reachability with different definitions of "empty" is the kind of thing that bites later, and on 2026-05-24 it showed up: same value-hash 516a0dc27c... surfaced as an orphan-value on one validator (h=2215224) and an orphan-node on a fullnode (h=2131563) — different roles, different heights, same hash, all on the same v2.2.15 binary. Each node's prune ran at its own 5000-block boundary with a slightly different live-set snapshot.

This commit aligns collect_reachable with walk_verify: take an explicit depth parameter, check hash == empty_hash(depth), and pass depth + 1 on each recurse. Top-level callers pass 0 (root).

NOTE: this fix is necessary but possibly not sufficient for the underlying production symptom. The same incident may also involve the trie writes not being fully atomic across trie_nodes + trie_values across an unclean shutdown — vps3 had just finalised h=2215224 milliseconds before the crash, exactly the window where a non-atomic write would leave the leaf node persisted but the value not. That's a storage-layer follow-up tracked separately.

Includes regression test test_prune_survives_multi_key_at_varied_depths that builds a multi-key trie, rotates values to make v1 unreachable from v2, runs prune(keep=0), then asserts every v2 key still resolves and verify_integrity passes.

Summary

Risk tier

Check ONE:

  • 🟢 Low — docs, tools/, tests/, CI configs, dependency patch bumps in dev-only crates, comments
  • 🟡 Medium — non-consensus production code (RPC handlers, network plumbing, observability, ops scripts)
  • 🟠 High — consensus-critical crates (sentrix-core, sentrix-trie, sentrix-staking, sentrix-bft), block_executor, apply_block_*, state_root path
  • 🔴 Critical — Voyager activation, fork-height changes, hard-fork rollouts, anything that flips env vars on mainnet

Required by tier

🟢 Low — minimum bar

  • CI green (tests + clippy + audit + gitleaks)

🟡 Medium — adds

  • New public function or behavioural change has at least one corresponding #[test] in same PR
  • Brief description of how this was tested (manual run, integration test, etc.)

🟠 High — adds

  • Regression test that fails on main and passes with this change — paste test name in PR body
  • Designed against documented invariant (link the audit/runbook/design doc)
  • Fresh-brain review by someone other than the author (per the consensus-change review checklist)
  • Single conceptual unit per PR (no bundling — bundling consensus changes burned us on v2.1.12 → 2026-04-25 livelock)

🔴 Critical — adds

  • Testnet rehearsal completed with success criteria + log evidence linked here
  • Bake window observed: minimum 2h on testnet at the same configuration before mainnet
  • Coordinated rollback plan documented in PR body — exact commands operator runs if it fails
  • Operator sign-off at activation moment (not just PR approval — separate moment for the actual flip)

Test plan

Rollback plan

Related

Summary by CodeRabbit

  • Bug Fixes
    • Fixed a data pruning issue that could cause keys to become inaccessible at certain structural depths. The trie now correctly maintains data integrity across all levels during pruning operations.

Review Change Stack

…epth

`collect_reachable` is the GC live-set builder. It walks every surviving
committed root and inserts every reachable node + leaf-value hash into a
HashSet that `gc_orphaned_nodes` uses to decide what to delete.

The early-out check was `hash == empty_hash(0)`, but `empty_hash(d)` is
depth-specific — each level has its own canonical "empty subtree"
sentinel, computed bottom-up. Hitting an empty child at depth > 0 didn't
match `empty_hash(0)`, so the walk fell through to `load_node`, which
returns `Ok(None)` for those (empty subtrees are never materialised in
`trie_nodes`), and the function silently exited without recording
anything in `live`.

That alone doesn't corrupt anything in the steady state, but it
silently diverged from `walk_verify` (which has always taken `depth` and
checks `hash == empty_hash(depth.min(256))` — see tree.rs:571). Two
functions reasoning about reachability with different definitions of
"empty" is the kind of thing that bites later, and on 2026-05-24 it
showed up: same value-hash `516a0dc27c...` surfaced as an orphan-value
on one validator (h=2215224) and an orphan-node on a fullnode
(h=2131563) — different roles, different heights, same hash, all on the
same v2.2.15 binary. Each node's prune ran at its own 5000-block
boundary with a slightly different live-set snapshot.

This commit aligns `collect_reachable` with `walk_verify`: take an
explicit `depth` parameter, check `hash == empty_hash(depth)`, and pass
`depth + 1` on each recurse. Top-level callers pass 0 (root).

NOTE: this fix is necessary but possibly not sufficient for the
underlying production symptom. The same incident may also involve the
trie writes not being fully atomic across `trie_nodes` + `trie_values`
across an unclean shutdown — vps3 had just finalised h=2215224 milliseconds
before the crash, exactly the window where a non-atomic write would
leave the leaf node persisted but the value not. That's a storage-layer
follow-up tracked separately.

Includes regression test `test_prune_survives_multi_key_at_varied_depths`
that builds a multi-key trie, rotates values to make v1 unreachable
from v2, runs `prune(keep=0)`, then asserts every v2 key still resolves
and `verify_integrity` passes.
@github-actions github-actions Bot enabled auto-merge (squash) May 25, 2026 14:04
@codecov
Copy link
Copy Markdown

codecov Bot commented May 25, 2026

Codecov Report

❌ Patch coverage is 97.05882% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
crates/sentrix-trie/src/tree.rs 97.05% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'sentrix-trie benches'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.50.

Benchmark suite Current: 26b82a8 Previous: c6ae261 Ratio
insert_single 169836 ns/iter (± 22784) 83470 ns/iter (± 10721) 2.03

This comment was automatically generated by workflow using github-action-benchmark.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 25, 2026

📝 Walkthrough

Walkthrough

SentrixTrie::prune() now tracks depth when building its "live set" of reachable nodes. The collect_reachable() helper signature is extended to accept a depth parameter starting at 0 for the root. Empty-subtree checks are changed from root-only sentinel detection (empty_hash(0)) to depth-specific detection (empty_hash(depth)) to correctly skip canonical empty hashes at all tree levels. Recursive calls propagate depth + 1 to maintain correct empty-subtree skipping at deeper levels. A new test validates that pruning with keep=0 preserves all inserted keys across varied subtree depths and passes integrity verification.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • sentrix-labs/sentrix#711: Both PRs modify SentrixTrie::prune()'s "live set" reachability walk in crates/sentrix-trie/src/tree.rs, with this PR adding depth-aware empty-subtree detection and the other augmenting coverage of concurrently committed roots.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description covers the technical problem, root cause, fix details, regression test, and risk tier; however, the 'Summary' section under 'Risk tier' is blank and several required high-tier checkboxes are unchecked. Complete the 'Summary' section, check the appropriate 'Risk tier' box (🟠 High for consensus-critical trie changes), and verify all high-tier required items are addressed (regression test name, invariant documentation, fresh-brain review, single-unit scope).
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly references the main fix: aligning the empty-subtree check in collect_reachable to use the correct depth parameter, which is the core change.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/trie-gc-empty-hash-depth-check

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/sentrix-trie/src/tree.rs (1)

732-744: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Minor inconsistency with walk_verify's defensive depth bound.

The fix correctly addresses the depth-specific empty-hash check. However, walk_verify (line 571) uses depth.min(256) as a defensive bound, while collect_reachable uses depth directly. In a well-formed 256-level SMT, depth should never exceed 256, so this doesn't affect correctness. For defensive consistency between the two tree-walking functions, consider aligning the pattern.

Per coding guidelines, this is a suggestion only—not a required change.

🔧 Optional: align with walk_verify's defensive bound
     fn collect_reachable(
         &self,
         hash: NodeHash,
         depth: usize,
         live: &mut std::collections::HashSet<NodeHash>,
     ) -> SentrixResult<()> {
         use crate::node::empty_hash;
         // Skip empty subtrees (depth-specific) and already-visited nodes
-        if hash == empty_hash(depth) || live.contains(&hash) {
+        if hash == empty_hash(depth.min(256)) || live.contains(&hash) {
             return Ok(());
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/sentrix-trie/src/tree.rs` around lines 732 - 744, collect_reachable
currently uses depth directly when checking empty_hash and recursing, while
walk_verify defensively bounds depth with depth.min(256); to align behavior,
change collect_reachable to use a bounded depth (e.g., let d = depth.min(256))
when calling empty_hash and when recursing (or apply .min(256) to recursive
calls), keeping references to empty_hash(depth), collect_reachable(left/right,
depth + 1, live) and the TrieNode::Internal match so the defensive bound mirrors
walk_verify's pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/sentrix-trie/src/tree.rs`:
- Around line 732-744: collect_reachable currently uses depth directly when
checking empty_hash and recursing, while walk_verify defensively bounds depth
with depth.min(256); to align behavior, change collect_reachable to use a
bounded depth (e.g., let d = depth.min(256)) when calling empty_hash and when
recursing (or apply .min(256) to recursive calls), keeping references to
empty_hash(depth), collect_reachable(left/right, depth + 1, live) and the
TrieNode::Internal match so the defensive bound mirrors walk_verify's pattern.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: ce42db11-8923-4bb5-95b4-196914dcfe38

📥 Commits

Reviewing files that changed from the base of the PR and between 7e4b8cb and 26b82a8.

📒 Files selected for processing (1)
  • crates/sentrix-trie/src/tree.rs

@github-actions github-actions Bot merged commit cfdb43f into main May 25, 2026
18 checks passed
github-actions Bot pushed a commit that referenced this pull request May 25, 2026
Ships PRs #711 (trie prune race augment-walk), #712 (B3b total_minted
self-heal), #713 (pay_one_signer sorted delegators), #714
(collect_reachable depth-aware empty check). See CHANGELOG entry for
the per-PR narrative and the 2026-05-24 incident root-cause analysis.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant