fix(core): B3 reconcile skips missing trie nodes instead of crashing … by satyakwok · Pull Request #696 · sentrix-labs/sentrix

satyakwok · 2026-05-20T04:56:13Z

…boot

The trie-reconcile pass at boot used to propagate any trie.get error all the way out of load_blockchain, which turned a single dangling trie node into an unbootable validator — even when the other 99.99% of the trie was healthy and the chain was actively producing blocks before the restart.

Hit live on testnet 2026-05-20 during the v2.2.12 RPC swap. One address (0x044daee895f2bb2ffecd4dd2e771f5ce405ab0ea at h=5003961) had a dangling reference to node 314e57bd...; the validator had been up for 5 hours serving that exact state without complaint, but the next boot crashed on phase-1 trie scan. All four testnet validators were stuck because they share the same trie shape.

The reconcile already has well-defined behavior for "no trie leaf for this address" — phase-2 keeps the blob view as canonical. Treat a failed trie.get the same way: log the gap, count it, surface a summary warning if any were skipped, then continue. Touched accounts will rewrite the trie node on next apply, so the gap closes on its own without operator intervention.

This is strictly a fail-soft on a previously fatal path. Healthy chains never hit it; the only behavioral change is that broken state loads instead of refusing to load.

Summary

Risk tier

Check ONE:

🟢 Low — docs, tools/, tests/, CI configs, dependency patch bumps in dev-only crates, comments
🟡 Medium — non-consensus production code (RPC handlers, network plumbing, observability, ops scripts)
🟠 High — consensus-critical crates (sentrix-core, sentrix-trie, sentrix-staking, sentrix-bft), block_executor, apply_block_*, state_root path
🔴 Critical — Voyager activation, fork-height changes, hard-fork rollouts, anything that flips env vars on mainnet

Required by tier

🟢 Low — minimum bar

CI green (tests + clippy + audit + gitleaks)

🟡 Medium — adds

New public function or behavioural change has at least one corresponding #[test] in same PR
Brief description of how this was tested (manual run, integration test, etc.)

🟠 High — adds

Regression test that fails on main and passes with this change — paste test name in PR body
Designed against documented invariant (link the audit/runbook/design doc)
Fresh-brain review by someone other than the author (per the consensus-change review checklist)
Single conceptual unit per PR (no bundling — bundling consensus changes burned us on v2.1.12 → 2026-04-25 livelock)

🔴 Critical — adds

Testnet rehearsal completed with success criteria + log evidence linked here
Bake window observed: minimum 2h on testnet at the same configuration before mainnet
Coordinated rollback plan documented in PR body — exact commands operator runs if it fails
Operator sign-off at activation moment (not just PR approval — separate moment for the actual flip)

Test plan

Rollback plan

Summary by CodeRabbit

Bug Fixes
- Startup is more resilient: individual trie/account lookup failures no longer abort initialization. The system logs warnings for each skipped address and leaves cached account values unchanged until they can be updated.
- Emits an aggregate warning after scanning when any addresses were skipped due to missing or corrupted trie nodes.

codecov · 2026-05-20T04:58:04Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

coderabbitai · 2026-05-20T04:58:09Z

📝 Walkthrough

Walkthrough

This PR modifies the reconcile_accounts_from_trie function in crates/sentrix-core/src/storage.rs to handle trie lookup failures resiliently. Instead of aborting blockchain load when a trie lookup fails, the code now logs a warning per failing address, tracks the count of missing trie nodes, and continues iteration by inserting None for that account so phase 2 reconciliation preserves the blob's cached values until the account is next accessed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	Required sections are incomplete. Summary section is blank, Test plan section has only a dash with no actual test details, and Rollback plan (recommended for High-tier PRs) is missing.	Fill in the Summary section with 1-3 sentences, provide specific test plan details (especially the regression test name that fails on main), and add a rollback plan for the High-tier consensus change.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: reconcile now skips missing trie nodes instead of crashing boot. It accurately reflects the primary fix.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/b3-reconcile-fail-soft-on-missing-trie

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/sentrix-core/src/storage.rs (1)

367-369: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update docstring to reflect fail-soft behavior.

Lines 367-369 still describe the old hard-fail semantics ("Trie-lookup errors (Err(_)) propagate: a corrupted trie is a hard-fail, not silent fallback"), but the implementation now catches errors and continues. This contradiction could mislead future maintainers.

📝 Proposed docstring update

-/// Trie-lookup errors (`Err(_)`) propagate: a corrupted trie is a
-/// hard-fail, not silent fallback.
+/// Trie-lookup errors (`Err(_)`) are logged and skipped: a missing or
+/// corrupted trie node for a single address no longer aborts boot.
+/// The address is treated as having no trie leaf (blob value preserved),
+/// and the next block touching that account will rewrite the trie entry.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/sentrix-core/src/storage.rs` around lines 367 - 369, Replace the
outdated sentence "Trie-lookup errors (`Err(_)`) propagate: a corrupted trie is
a hard-fail, not silent fallback." in the docstring with a short note that the
implementation now treats trie-lookup errors as fail-soft: errors are caught and
the code continues using fallback behavior (i.e., corrupted trie entries are
skipped/ignored rather than propagating). Update the wording in
crates/sentrix-core/src/storage.rs so the docstring for the trie lookup behavior
reflects the current catch-and-continue semantics.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/sentrix-core/src/storage.rs`:
- Around line 367-369: Replace the outdated sentence "Trie-lookup errors
(`Err(_)`) propagate: a corrupted trie is a hard-fail, not silent fallback." in
the docstring with a short note that the implementation now treats trie-lookup
errors as fail-soft: errors are caught and the code continues using fallback
behavior (i.e., corrupted trie entries are skipped/ignored rather than
propagating). Update the wording in crates/sentrix-core/src/storage.rs so the
docstring for the trie lookup behavior reflects the current catch-and-continue
semantics.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: ea6d6dc0-7e42-4c63-bb5e-f5c22711c346

📥 Commits

Reviewing files that changed from the base of the PR and between 3debf85 and a517677.

📒 Files selected for processing (1)

crates/sentrix-core/src/storage.rs

…boot The trie-reconcile pass at boot used to propagate any `trie.get` error all the way out of `load_blockchain`, which turned a single dangling trie node into an unbootable validator — even when the other 99.99% of the trie was healthy and the chain was actively producing blocks before the restart. Hit live on testnet 2026-05-20 during a binary swap. One address (0x044daee895f2bb2ffecd4dd2e771f5ce405ab0ea at h=5003961) had a dangling reference to node 314e57bd...; the validator had been up for 5 hours serving that exact state without complaint, but the next boot crashed on phase-1 trie scan. All four testnet validators were stuck because they shared the same trie shape. The reconcile already has well-defined behavior for "no trie leaf for this address" — phase-2 keeps the blob view as canonical. Treat a failed `trie.get` the same way: log the gap, count it, surface a summary warning if any were skipped, then continue. Touched accounts will rewrite the trie node on next apply, so the gap closes on its own without operator intervention. This is strictly a fail-soft on a previously fatal path. Healthy chains never hit it; the only behavioral change is that broken state loads instead of refusing to load. Also updates the `reconcile_accounts_from_trie` docstring so it matches the new catch-and-continue semantics (was still claiming hard-fail).

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/sentrix-core/src/storage.rs`:
- Around line 425-449: The current catch-all for trie.get(&key) treats all
errors as recoverable gaps; change the Err branch to match on the SentrixError
enum: if Err(SentrixError::Internal(msg)) and msg.contains("missing node") then
log the missing-node warning, increment trie_gaps and continue (as you do now),
but if Err(SentrixError::StorageError(_)) or
Err(SentrixError::SerializationError(_)) then propagate the error (return Err(e)
or use ?), since those indicate broken storage/serialization; keep the same
handling for successful Ok(leaf) and subsequent account_value_decode usage and
push into trie_values.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 57aeb45f-d992-4541-8699-d8a225d3bd29

📥 Commits

Reviewing files that changed from the base of the PR and between a517677 and a74d53f.

📒 Files selected for processing (1)

crates/sentrix-core/src/storage.rs

coderabbitai · 2026-05-21T00:01:27Z

+        let mut trie_gaps: usize = 0;
        for addr in &addrs {
            let key = address_to_key(addr);
-            let leaf = trie.get(&key).map_err(|e| {
-                SentrixError::Internal(format!(
-                    "B3 reconcile: trie lookup for {addr} failed at h={}: {e}",
-                    bc.chain.last().map(|b| b.index).unwrap_or(0)
-                ))
-            })?;
+            let leaf = match trie.get(&key) {
+                Ok(leaf) => leaf,
+                Err(e) => {
+                    tracing::warn!(
+                        "B3 reconcile: trie lookup for {addr} failed at h={}: {e} — \
+                         skipping reconcile for this address (will rewrite on next touch)",
+                        bc.chain.last().map(|b| b.index).unwrap_or(0)
+                    );
+                    trie_gaps += 1;
+                    None
+                }
+            };
            let decoded = leaf.and_then(|bytes| account_value_decode(&bytes));
            trie_values.push((addr.clone(), decoded));
        }
+        if trie_gaps > 0 {
+            tracing::warn!(
+                "B3 reconcile: skipped {trie_gaps}/{} addresses due to missing trie nodes \
+                 at h={}; chain will continue producing — touched accounts repair themselves",
+                addrs.len(),
+                bc.chain.last().map(|b| b.index).unwrap_or(0)
+            );


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail TRIE_FILES=$(fd -i 'trie|error' crates --extension rs) echo "=== trie.get definitions ===" rg -n -C4 '\bfn\s+get\s*\(' $TRIE_FILES echo echo "=== trie-related error enums / aliases / variants ===" rg -n -C4 '\b(enum|type)\s+\w*Error\b|Missing|Dangling|Corrupt|Decode|Storage|Io' $TRIE_FILES echo echo "=== existing trie.get call sites ===" rg -n -C4 'trie\.get\s*\(' crates/sentrix-core/src/storage.rs $TRIE_FILES

Repository: sentrix-labs/sentrix

Length of output: 93

🏁 Script executed:

# Check repository structure and find trie-related files find . -type f -name "*.rs" | grep -i trie | head -20

Repository: sentrix-labs/sentrix

Length of output: 454

🏁 Script executed:

# Also check Cargo.toml files to understand dependencies find . -name "Cargo.toml" | xargs grep -l "trie\|sentrix" | head -10

Repository: sentrix-labs/sentrix

Length of output: 374

🏁 Script executed:

# Check the storage.rs file context around the error handling sed -n '420,455p' crates/sentrix-core/src/storage.rs

Repository: sentrix-labs/sentrix

Length of output: 1744

🏁 Script executed:

# Look for trie-related modules in the codebase rg "mod trie|use.*trie" crates --type rs | head -20

Repository: sentrix-labs/sentrix

Length of output: 91

🏁 Script executed:

# Correct the previous command rg "mod trie|use.*trie" crates/sentrix-core/src/storage.rs -A2 -B2

Repository: sentrix-labs/sentrix

Length of output: 280

🏁 Script executed:

# Check the trie module's error type definition cat crates/sentrix-trie/src/lib.rs

Repository: sentrix-labs/sentrix

Length of output: 777

🏁 Script executed:

# Look for get function and error types in tree.rs rg -n "fn get|enum.*Error|type.*Error" crates/sentrix-trie/src/tree.rs -A3

Repository: sentrix-labs/sentrix

Length of output: 228

🏁 Script executed:

# Check tree.rs for the get implementation head -200 crates/sentrix-trie/src/tree.rs | tail -150

Repository: sentrix-labs/sentrix

Length of output: 7105

🏁 Script executed:

# Get the full get() implementation sed -n '200,280p' crates/sentrix-trie/src/tree.rs

Repository: sentrix-labs/sentrix

Length of output: 3312

🏁 Script executed:

# Find SentrixResult and SentrixError definitions rg "type SentrixResult|enum SentrixError" crates/sentrix-trie/src -A20

Repository: sentrix-labs/sentrix

Length of output: 46

🏁 Script executed:

# Look for where SentrixError is defined in the entire repo rg "enum SentrixError" crates -A30

Repository: sentrix-labs/sentrix

Length of output: 2299

🏁 Script executed:

# Check sentrix-primitives or common error handling fd -i error crates -name "*.rs" -type f | head -10

Repository: sentrix-labs/sentrix

Length of output: 295

🏁 Script executed:

# Get the full SentrixError enum cat crates/sentrix-primitives/src/error.rs | head -100

Repository: sentrix-labs/sentrix

Length of output: 2063

🏁 Script executed:

# Also check if there are more variants rg "SentrixError::" crates/sentrix-primitives/src/error.rs

Repository: sentrix-labs/sentrix

Length of output: 170

🏁 Script executed:

# Search for all uses of SentrixError in the trie module to understand all variants rg "SentrixError::" crates/sentrix-trie/src -B1 -A1

Repository: sentrix-labs/sentrix

Length of output: 12687

🏁 Script executed:

# Verify what cache.get_node returns and how it handles errors sed -n '1,150p' crates/sentrix-trie/src/cache.rs | grep -A30 "fn get_node"

Repository: sentrix-labs/sentrix

Length of output: 1434

🏁 Script executed:

# Check if there are patterns elsewhere in storage.rs for selective error handling rg "match.*Err|SentrixError::" crates/sentrix-core/src/storage.rs -B2 -A3 | head -80

Repository: sentrix-labs/sentrix

Length of output: 3173

🏁 Script executed:

# Check if there's a specific error message pattern for "missing node" vs other Internal errors rg "trie: missing node" crates/sentrix-trie/src -B2 -A2

Repository: sentrix-labs/sentrix

Length of output: 2842

🏁 Script executed:

# Look for any pattern matching on error messages elsewhere in the codebase rg "SentrixError::Internal.*missing|if.*contains.*missing" crates/sentrix-core/src -B2 -A2

Repository: sentrix-labs/sentrix

Length of output: 46

🏁 Script executed:

# Check the full context of the reconcile function to understand the intent sed -n '360,380p' crates/sentrix-core/src/storage.rs

Repository: sentrix-labs/sentrix

Length of output: 1169

🏁 Script executed:

# Check if cache.get_node can return StorageError by looking at the full chain sed -n '1,50p' crates/sentrix-trie/src/cache.rs

Repository: sentrix-labs/sentrix

Length of output: 2244

🏁 Script executed:

# Verify load_node implementation which is called by cache.get_node rg "fn load_node" crates/sentrix-trie/src/storage.rs -A20

Repository: sentrix-labs/sentrix

Length of output: 895

🏁 Script executed:

# Check what other errors could be returned from trie operations rg "map_err.*SentrixError" crates/sentrix-trie/src/storage.rs | head -20

Repository: sentrix-labs/sentrix

Length of output: 1510

Distinguish missing-node errors from storage/serialization failures in error handling.

Lines 428-439 catch all trie.get() errors identically. However, trie.get() can fail with three distinct error types: SentrixError::Internal("trie: missing node ...") (node not found), SentrixError::StorageError(...) (I/O failure), and SentrixError::SerializationError(...) (deserialization failure). The current code treats them all as recoverable gaps, but storage and serialization errors indicate a fundamentally broken trie and should not be suppressed.

Match specifically on SentrixError::Internal with a message pattern check for "missing node", and let StorageError and SerializationError propagate. This preserves the fail-soft behavior for dangling trie nodes while preventing boot with a corrupted or unreadable trie.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/sentrix-core/src/storage.rs` around lines 425 - 449, The current catch-all for trie.get(&key) treats all errors as recoverable gaps; change the Err branch to match on the SentrixError enum: if Err(SentrixError::Internal(msg)) and msg.contains("missing node") then log the missing-node warning, increment trie_gaps and continue (as you do now), but if Err(SentrixError::StorageError(_)) or Err(SentrixError::SerializationError(_)) then propagate the error (return Err(e) or use ?), since those indicate broken storage/serialization; keep the same handling for successful Ok(leaf) and subsequent account_value_decode usage and push into trie_values.

2.2.14 went live on mainnet 2026-05-21 17:02 UTC carrying PRs #692/#696/#702/#703/#704. Document the five fixes + flag that the eth_call + logsBloom fixes on this branch are scheduled for 2.2.15.

* fix(rpc): eth_getTransactionCount + eth_call + logsBloom length Three discrete chain-RPC fixes bundled because they share one cause (strict-but-non-spec behavior breaking off-the-shelf EVM agents). All discovered together standing up the Hyperlane relayer for the Base Sepolia ↔ Sentrix Testnet bridge (audit fix H-4). 1) eth_getTransactionCount accepts any block tag. Relayer + ethers + viem all pin nonce queries to a recent past block. A stale nonce is self-correcting (chain rejects wrong-nonce tx, caller retries) so this method serves current nonce regardless of block tag. 2) eth_call accepts any block tag. Same pattern: Hyperlane queries Mailbox.delivered(msgId) and recipientIsm(addr) at past blocks. Returns current-state. The strict gate stays on eth_getBalance / eth_getCode / eth_getStorageAt where wrong = wrong protocol decision. 3) Empty logsBloom is now actually 256 bytes (Ethereum spec). The EMPTY_LOGS_BLOOM const was 304 bytes (608 hex chars) because of an off-by-one in the original hand-typed string. The doc comment said 256 but the literal was 304. ethers' fee-oracle middleware strict-parses Block.logsBloom — Hyperlane gas estimation SerdeJson'd on this with "invalid length 608, expected 256 bytes" before any process() tx could be submitted. Bumps workspace 2.2.13 -> 2.2.14. * docs(changelog): add 2.2.14 entry — five RPC compat fixes 2.2.14 went live on mainnet 2026-05-21 17:02 UTC carrying PRs #692/#696/#702/#703/#704. Document the five fixes + flag that the eth_call + logsBloom fixes on this branch are scheduled for 2.2.15.

…bytes (#707) * fix(rpc): eth_getTransactionCount + eth_call + logsBloom length Three discrete chain-RPC fixes bundled because they share one cause (strict-but-non-spec behavior breaking off-the-shelf EVM agents). All discovered together standing up the Hyperlane relayer for the Base Sepolia ↔ Sentrix Testnet bridge (audit fix H-4). 1) eth_getTransactionCount accepts any block tag. Relayer + ethers + viem all pin nonce queries to a recent past block. A stale nonce is self-correcting (chain rejects wrong-nonce tx, caller retries) so this method serves current nonce regardless of block tag. 2) eth_call accepts any block tag. Same pattern: Hyperlane queries Mailbox.delivered(msgId) and recipientIsm(addr) at past blocks. Returns current-state. The strict gate stays on eth_getBalance / eth_getCode / eth_getStorageAt where wrong = wrong protocol decision. 3) Empty logsBloom is now actually 256 bytes (Ethereum spec). The EMPTY_LOGS_BLOOM const was 304 bytes (608 hex chars) because of an off-by-one in the original hand-typed string. The doc comment said 256 but the literal was 304. ethers' fee-oracle middleware strict-parses Block.logsBloom — Hyperlane gas estimation SerdeJson'd on this with "invalid length 608, expected 256 bytes" before any process() tx could be submitted. Bumps workspace 2.2.13 -> 2.2.14. * docs(changelog): add 2.2.14 entry — five RPC compat fixes 2.2.14 went live on mainnet 2026-05-21 17:02 UTC carrying PRs #692/#696/#702/#703/#704. Document the five fixes + flag that the eth_call + logsBloom fixes on this branch are scheduled for 2.2.15.

github-actions Bot enabled auto-merge (squash) May 20, 2026 04:56

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

satyakwok force-pushed the fix/b3-reconcile-fail-soft-on-missing-trie branch from a517677 to a74d53f Compare May 20, 2026 23:55

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

github-actions Bot merged commit a06fa94 into main May 21, 2026
14 checks passed

coderabbitai Bot mentioned this pull request May 25, 2026

fix(core): B3b — recompute total_minted from blocks when blob is stale #712

Merged

15 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core): B3 reconcile skips missing trie nodes instead of crashing …#696

fix(core): B3 reconcile skips missing trie nodes instead of crashing …#696
github-actions[bot] merged 1 commit into
mainfrom
fix/b3-reconcile-fail-soft-on-missing-trie

satyakwok commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

codecov Bot commented May 20, 2026

Uh oh!

coderabbitai Bot commented May 20, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

satyakwok commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Risk tier

Required by tier

🟢 Low — minimum bar

🟡 Medium — adds

🟠 High — adds

🔴 Critical — adds

Test plan

Rollback plan

Related

Summary by CodeRabbit

Uh oh!

codecov Bot commented May 20, 2026

Codecov Report

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

satyakwok commented May 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading