fix(trie): collect_reachable empty-subtree check must use the right d… by satyakwok · Pull Request #714 · sentrix-labs/sentrix

satyakwok · 2026-05-25T14:04:19Z

…epth

collect_reachable is the GC live-set builder. It walks every surviving committed root and inserts every reachable node + leaf-value hash into a HashSet that gc_orphaned_nodes uses to decide what to delete.

The early-out check was hash == empty_hash(0), but empty_hash(d) is depth-specific — each level has its own canonical "empty subtree" sentinel, computed bottom-up. Hitting an empty child at depth > 0 didn't match empty_hash(0), so the walk fell through to load_node, which returns Ok(None) for those (empty subtrees are never materialised in trie_nodes), and the function silently exited without recording anything in live.

That alone doesn't corrupt anything in the steady state, but it silently diverged from walk_verify (which has always taken depth and checks hash == empty_hash(depth.min(256)) — see tree.rs:571). Two functions reasoning about reachability with different definitions of "empty" is the kind of thing that bites later, and on 2026-05-24 it showed up: same value-hash 516a0dc27c... surfaced as an orphan-value on one validator (h=2215224) and an orphan-node on a fullnode (h=2131563) — different roles, different heights, same hash, all on the same v2.2.15 binary. Each node's prune ran at its own 5000-block boundary with a slightly different live-set snapshot.

This commit aligns collect_reachable with walk_verify: take an explicit depth parameter, check hash == empty_hash(depth), and pass depth + 1 on each recurse. Top-level callers pass 0 (root).

NOTE: this fix is necessary but possibly not sufficient for the underlying production symptom. The same incident may also involve the trie writes not being fully atomic across trie_nodes + trie_values across an unclean shutdown — vps3 had just finalised h=2215224 milliseconds before the crash, exactly the window where a non-atomic write would leave the leaf node persisted but the value not. That's a storage-layer follow-up tracked separately.

Includes regression test test_prune_survives_multi_key_at_varied_depths that builds a multi-key trie, rotates values to make v1 unreachable from v2, runs prune(keep=0), then asserts every v2 key still resolves and verify_integrity passes.

Summary

Risk tier

Check ONE:

🟢 Low — docs, tools/, tests/, CI configs, dependency patch bumps in dev-only crates, comments
🟡 Medium — non-consensus production code (RPC handlers, network plumbing, observability, ops scripts)
🟠 High — consensus-critical crates (sentrix-core, sentrix-trie, sentrix-staking, sentrix-bft), block_executor, apply_block_*, state_root path
🔴 Critical — Voyager activation, fork-height changes, hard-fork rollouts, anything that flips env vars on mainnet

Required by tier

🟢 Low — minimum bar

CI green (tests + clippy + audit + gitleaks)

🟡 Medium — adds

New public function or behavioural change has at least one corresponding #[test] in same PR
Brief description of how this was tested (manual run, integration test, etc.)

🟠 High — adds

Regression test that fails on main and passes with this change — paste test name in PR body
Designed against documented invariant (link the audit/runbook/design doc)
Fresh-brain review by someone other than the author (per the consensus-change review checklist)
Single conceptual unit per PR (no bundling — bundling consensus changes burned us on v2.1.12 → 2026-04-25 livelock)

🔴 Critical — adds

Testnet rehearsal completed with success criteria + log evidence linked here
Bake window observed: minimum 2h on testnet at the same configuration before mainnet
Coordinated rollback plan documented in PR body — exact commands operator runs if it fails
Operator sign-off at activation moment (not just PR approval — separate moment for the actual flip)

Test plan

Rollback plan

Summary by CodeRabbit

Bug Fixes
- Fixed a data pruning issue that could cause keys to become inaccessible at certain structural depths. The trie now correctly maintains data integrity across all levels during pruning operations.

…epth `collect_reachable` is the GC live-set builder. It walks every surviving committed root and inserts every reachable node + leaf-value hash into a HashSet that `gc_orphaned_nodes` uses to decide what to delete. The early-out check was `hash == empty_hash(0)`, but `empty_hash(d)` is depth-specific — each level has its own canonical "empty subtree" sentinel, computed bottom-up. Hitting an empty child at depth > 0 didn't match `empty_hash(0)`, so the walk fell through to `load_node`, which returns `Ok(None)` for those (empty subtrees are never materialised in `trie_nodes`), and the function silently exited without recording anything in `live`. That alone doesn't corrupt anything in the steady state, but it silently diverged from `walk_verify` (which has always taken `depth` and checks `hash == empty_hash(depth.min(256))` — see tree.rs:571). Two functions reasoning about reachability with different definitions of "empty" is the kind of thing that bites later, and on 2026-05-24 it showed up: same value-hash `516a0dc27c...` surfaced as an orphan-value on one validator (h=2215224) and an orphan-node on a fullnode (h=2131563) — different roles, different heights, same hash, all on the same v2.2.15 binary. Each node's prune ran at its own 5000-block boundary with a slightly different live-set snapshot. This commit aligns `collect_reachable` with `walk_verify`: take an explicit `depth` parameter, check `hash == empty_hash(depth)`, and pass `depth + 1` on each recurse. Top-level callers pass 0 (root). NOTE: this fix is necessary but possibly not sufficient for the underlying production symptom. The same incident may also involve the trie writes not being fully atomic across `trie_nodes` + `trie_values` across an unclean shutdown — vps3 had just finalised h=2215224 milliseconds before the crash, exactly the window where a non-atomic write would leave the leaf node persisted but the value not. That's a storage-layer follow-up tracked separately. Includes regression test `test_prune_survives_multi_key_at_varied_depths` that builds a multi-key trie, rotates values to make v1 unreachable from v2, runs `prune(keep=0)`, then asserts every v2 key still resolves and `verify_integrity` passes.

codecov · 2026-05-25T14:06:02Z

Codecov Report

❌ Patch coverage is 97.05882% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
crates/sentrix-trie/src/tree.rs	97.05%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'sentrix-trie benches'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.50.

Benchmark suite	Current: `26b82a8`	Previous: `c6ae261`	Ratio
`insert_single`	`169836` ns/iter (`± 22784`)	`83470` ns/iter (`± 10721`)	`2.03`

This comment was automatically generated by workflow using github-action-benchmark.

coderabbitai · 2026-05-25T14:08:07Z

📝 Walkthrough

Walkthrough

SentrixTrie::prune() now tracks depth when building its "live set" of reachable nodes. The collect_reachable() helper signature is extended to accept a depth parameter starting at 0 for the root. Empty-subtree checks are changed from root-only sentinel detection (empty_hash(0)) to depth-specific detection (empty_hash(depth)) to correctly skip canonical empty hashes at all tree levels. Recursive calls propagate depth + 1 to maintain correct empty-subtree skipping at deeper levels. A new test validates that pruning with keep=0 preserves all inserted keys across varied subtree depths and passes integrity verification.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

sentrix-labs/sentrix#711: Both PRs modify SentrixTrie::prune()'s "live set" reachability walk in crates/sentrix-trie/src/tree.rs, with this PR adding depth-aware empty-subtree detection and the other augmenting coverage of concurrently committed roots.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description covers the technical problem, root cause, fix details, regression test, and risk tier; however, the 'Summary' section under 'Risk tier' is blank and several required high-tier checkboxes are unchecked.	Complete the 'Summary' section, check the appropriate 'Risk tier' box (🟠 High for consensus-critical trie changes), and verify all high-tier required items are addressed (regression test name, invariant documentation, fresh-brain review, single-unit scope).

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly references the main fix: aligning the empty-subtree check in collect_reachable to use the correct depth parameter, which is the core change.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/trie-gc-empty-hash-depth-check

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/sentrix-trie/src/tree.rs (1)
732-744: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Minor inconsistency with walk_verify's defensive depth bound.

The fix correctly addresses the depth-specific empty-hash check. However, walk_verify (line 571) uses depth.min(256) as a defensive bound, while collect_reachable uses depth directly. In a well-formed 256-level SMT, depth should never exceed 256, so this doesn't affect correctness. For defensive consistency between the two tree-walking functions, consider aligning the pattern.

Per coding guidelines, this is a suggestion only—not a required change.
🔧 Optional: align with walk_verify's defensive bound
     fn collect_reachable(
         &self,
         hash: NodeHash,
         depth: usize,
         live: &mut std::collections::HashSet<NodeHash>,
     ) -> SentrixResult<()> {
         use crate::node::empty_hash;
         // Skip empty subtrees (depth-specific) and already-visited nodes
-        if hash == empty_hash(depth) || live.contains(&hash) {
+        if hash == empty_hash(depth.min(256)) || live.contains(&hash) {
             return Ok(());
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/sentrix-trie/src/tree.rs` around lines 732 - 744, collect_reachable
currently uses depth directly when checking empty_hash and recursing, while
walk_verify defensively bounds depth with depth.min(256); to align behavior,
change collect_reachable to use a bounded depth (e.g., let d = depth.min(256))
when calling empty_hash and when recursing (or apply .min(256) to recursive
calls), keeping references to empty_hash(depth), collect_reachable(left/right,
depth + 1, live) and the TrieNode::Internal match so the defensive bound mirrors
walk_verify's pattern.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/sentrix-trie/src/tree.rs`:
- Around line 732-744: collect_reachable currently uses depth directly when
checking empty_hash and recursing, while walk_verify defensively bounds depth
with depth.min(256); to align behavior, change collect_reachable to use a
bounded depth (e.g., let d = depth.min(256)) when calling empty_hash and when
recursing (or apply .min(256) to recursive calls), keeping references to
empty_hash(depth), collect_reachable(left/right, depth + 1, live) and the
TrieNode::Internal match so the defensive bound mirrors walk_verify's pattern.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: ce42db11-8923-4bb5-95b4-196914dcfe38

📥 Commits

Reviewing files that changed from the base of the PR and between 7e4b8cb and 26b82a8.

📒 Files selected for processing (1)

crates/sentrix-trie/src/tree.rs

Ships PRs #711 (trie prune race augment-walk), #712 (B3b total_minted self-heal), #713 (pay_one_signer sorted delegators), #714 (collect_reachable depth-aware empty check). See CHANGELOG entry for the per-PR narrative and the 2026-05-24 incident root-cause analysis.

github-actions Bot enabled auto-merge (squash) May 25, 2026 14:04

github-actions Bot reviewed May 25, 2026

View reviewed changes

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

github-actions Bot merged commit cfdb43f into main May 25, 2026
18 checks passed

satyakwok mentioned this pull request May 25, 2026

chore: bump workspace 2.2.15 -> 2.2.16 #715

Merged

15 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(trie): collect_reachable empty-subtree check must use the right d…#714

fix(trie): collect_reachable empty-subtree check must use the right d…#714
github-actions[bot] merged 1 commit into
mainfrom
fix/trie-gc-empty-hash-depth-check

satyakwok commented May 25, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

codecov Bot commented May 25, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

coderabbitai Bot commented May 25, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

satyakwok commented May 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Risk tier

Required by tier

🟢 Low — minimum bar

🟡 Medium — adds

🟠 High — adds

🔴 Critical — adds

Test plan

Rollback plan

Related

Summary by CodeRabbit

Uh oh!

codecov Bot commented May 25, 2026

Codecov Report

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

⚠️ Performance Alert ⚠️

Uh oh!

coderabbitai Bot commented May 25, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

satyakwok commented May 25, 2026 •

edited by coderabbitai Bot

Loading