fix(core): B3 reconcile skips missing trie nodes instead of crashing …#696
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
📝 WalkthroughWalkthroughThis PR modifies the Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/sentrix-core/src/storage.rs (1)
367-369:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate docstring to reflect fail-soft behavior.
Lines 367-369 still describe the old hard-fail semantics ("Trie-lookup errors (
Err(_)) propagate: a corrupted trie is a hard-fail, not silent fallback"), but the implementation now catches errors and continues. This contradiction could mislead future maintainers.📝 Proposed docstring update
-/// Trie-lookup errors (`Err(_)`) propagate: a corrupted trie is a -/// hard-fail, not silent fallback. +/// Trie-lookup errors (`Err(_)`) are logged and skipped: a missing or +/// corrupted trie node for a single address no longer aborts boot. +/// The address is treated as having no trie leaf (blob value preserved), +/// and the next block touching that account will rewrite the trie entry.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/sentrix-core/src/storage.rs` around lines 367 - 369, Replace the outdated sentence "Trie-lookup errors (`Err(_)`) propagate: a corrupted trie is a hard-fail, not silent fallback." in the docstring with a short note that the implementation now treats trie-lookup errors as fail-soft: errors are caught and the code continues using fallback behavior (i.e., corrupted trie entries are skipped/ignored rather than propagating). Update the wording in crates/sentrix-core/src/storage.rs so the docstring for the trie lookup behavior reflects the current catch-and-continue semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/sentrix-core/src/storage.rs`:
- Around line 367-369: Replace the outdated sentence "Trie-lookup errors
(`Err(_)`) propagate: a corrupted trie is a hard-fail, not silent fallback." in
the docstring with a short note that the implementation now treats trie-lookup
errors as fail-soft: errors are caught and the code continues using fallback
behavior (i.e., corrupted trie entries are skipped/ignored rather than
propagating). Update the wording in crates/sentrix-core/src/storage.rs so the
docstring for the trie lookup behavior reflects the current catch-and-continue
semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: ea6d6dc0-7e42-4c63-bb5e-f5c22711c346
📒 Files selected for processing (1)
crates/sentrix-core/src/storage.rs
…boot The trie-reconcile pass at boot used to propagate any `trie.get` error all the way out of `load_blockchain`, which turned a single dangling trie node into an unbootable validator — even when the other 99.99% of the trie was healthy and the chain was actively producing blocks before the restart. Hit live on testnet 2026-05-20 during a binary swap. One address (0x044daee895f2bb2ffecd4dd2e771f5ce405ab0ea at h=5003961) had a dangling reference to node 314e57bd...; the validator had been up for 5 hours serving that exact state without complaint, but the next boot crashed on phase-1 trie scan. All four testnet validators were stuck because they shared the same trie shape. The reconcile already has well-defined behavior for "no trie leaf for this address" — phase-2 keeps the blob view as canonical. Treat a failed `trie.get` the same way: log the gap, count it, surface a summary warning if any were skipped, then continue. Touched accounts will rewrite the trie node on next apply, so the gap closes on its own without operator intervention. This is strictly a fail-soft on a previously fatal path. Healthy chains never hit it; the only behavioral change is that broken state loads instead of refusing to load. Also updates the `reconcile_accounts_from_trie` docstring so it matches the new catch-and-continue semantics (was still claiming hard-fail).
a517677 to
a74d53f
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/sentrix-core/src/storage.rs`:
- Around line 425-449: The current catch-all for trie.get(&key) treats all
errors as recoverable gaps; change the Err branch to match on the SentrixError
enum: if Err(SentrixError::Internal(msg)) and msg.contains("missing node") then
log the missing-node warning, increment trie_gaps and continue (as you do now),
but if Err(SentrixError::StorageError(_)) or
Err(SentrixError::SerializationError(_)) then propagate the error (return Err(e)
or use ?), since those indicate broken storage/serialization; keep the same
handling for successful Ok(leaf) and subsequent account_value_decode usage and
push into trie_values.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 57aeb45f-d992-4541-8699-d8a225d3bd29
📒 Files selected for processing (1)
crates/sentrix-core/src/storage.rs
| let mut trie_gaps: usize = 0; | ||
| for addr in &addrs { | ||
| let key = address_to_key(addr); | ||
| let leaf = trie.get(&key).map_err(|e| { | ||
| SentrixError::Internal(format!( | ||
| "B3 reconcile: trie lookup for {addr} failed at h={}: {e}", | ||
| bc.chain.last().map(|b| b.index).unwrap_or(0) | ||
| )) | ||
| })?; | ||
| let leaf = match trie.get(&key) { | ||
| Ok(leaf) => leaf, | ||
| Err(e) => { | ||
| tracing::warn!( | ||
| "B3 reconcile: trie lookup for {addr} failed at h={}: {e} — \ | ||
| skipping reconcile for this address (will rewrite on next touch)", | ||
| bc.chain.last().map(|b| b.index).unwrap_or(0) | ||
| ); | ||
| trie_gaps += 1; | ||
| None | ||
| } | ||
| }; | ||
| let decoded = leaf.and_then(|bytes| account_value_decode(&bytes)); | ||
| trie_values.push((addr.clone(), decoded)); | ||
| } | ||
| if trie_gaps > 0 { | ||
| tracing::warn!( | ||
| "B3 reconcile: skipped {trie_gaps}/{} addresses due to missing trie nodes \ | ||
| at h={}; chain will continue producing — touched accounts repair themselves", | ||
| addrs.len(), | ||
| bc.chain.last().map(|b| b.index).unwrap_or(0) | ||
| ); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
TRIE_FILES=$(fd -i 'trie|error' crates --extension rs)
echo "=== trie.get definitions ==="
rg -n -C4 '\bfn\s+get\s*\(' $TRIE_FILES
echo
echo "=== trie-related error enums / aliases / variants ==="
rg -n -C4 '\b(enum|type)\s+\w*Error\b|Missing|Dangling|Corrupt|Decode|Storage|Io' $TRIE_FILES
echo
echo "=== existing trie.get call sites ==="
rg -n -C4 'trie\.get\s*\(' crates/sentrix-core/src/storage.rs $TRIE_FILESRepository: sentrix-labs/sentrix
Length of output: 93
🏁 Script executed:
# Check repository structure and find trie-related files
find . -type f -name "*.rs" | grep -i trie | head -20Repository: sentrix-labs/sentrix
Length of output: 454
🏁 Script executed:
# Also check Cargo.toml files to understand dependencies
find . -name "Cargo.toml" | xargs grep -l "trie\|sentrix" | head -10Repository: sentrix-labs/sentrix
Length of output: 374
🏁 Script executed:
# Check the storage.rs file context around the error handling
sed -n '420,455p' crates/sentrix-core/src/storage.rsRepository: sentrix-labs/sentrix
Length of output: 1744
🏁 Script executed:
# Look for trie-related modules in the codebase
rg "mod trie|use.*trie" crates --type rs | head -20Repository: sentrix-labs/sentrix
Length of output: 91
🏁 Script executed:
# Correct the previous command
rg "mod trie|use.*trie" crates/sentrix-core/src/storage.rs -A2 -B2Repository: sentrix-labs/sentrix
Length of output: 280
🏁 Script executed:
# Check the trie module's error type definition
cat crates/sentrix-trie/src/lib.rsRepository: sentrix-labs/sentrix
Length of output: 777
🏁 Script executed:
# Look for get function and error types in tree.rs
rg -n "fn get|enum.*Error|type.*Error" crates/sentrix-trie/src/tree.rs -A3Repository: sentrix-labs/sentrix
Length of output: 228
🏁 Script executed:
# Check tree.rs for the get implementation
head -200 crates/sentrix-trie/src/tree.rs | tail -150Repository: sentrix-labs/sentrix
Length of output: 7105
🏁 Script executed:
# Get the full get() implementation
sed -n '200,280p' crates/sentrix-trie/src/tree.rsRepository: sentrix-labs/sentrix
Length of output: 3312
🏁 Script executed:
# Find SentrixResult and SentrixError definitions
rg "type SentrixResult|enum SentrixError" crates/sentrix-trie/src -A20Repository: sentrix-labs/sentrix
Length of output: 46
🏁 Script executed:
# Look for where SentrixError is defined in the entire repo
rg "enum SentrixError" crates -A30Repository: sentrix-labs/sentrix
Length of output: 2299
🏁 Script executed:
# Check sentrix-primitives or common error handling
fd -i error crates -name "*.rs" -type f | head -10Repository: sentrix-labs/sentrix
Length of output: 295
🏁 Script executed:
# Get the full SentrixError enum
cat crates/sentrix-primitives/src/error.rs | head -100Repository: sentrix-labs/sentrix
Length of output: 2063
🏁 Script executed:
# Also check if there are more variants
rg "SentrixError::" crates/sentrix-primitives/src/error.rsRepository: sentrix-labs/sentrix
Length of output: 170
🏁 Script executed:
# Search for all uses of SentrixError in the trie module to understand all variants
rg "SentrixError::" crates/sentrix-trie/src -B1 -A1Repository: sentrix-labs/sentrix
Length of output: 12687
🏁 Script executed:
# Verify what cache.get_node returns and how it handles errors
sed -n '1,150p' crates/sentrix-trie/src/cache.rs | grep -A30 "fn get_node"Repository: sentrix-labs/sentrix
Length of output: 1434
🏁 Script executed:
# Check if there are patterns elsewhere in storage.rs for selective error handling
rg "match.*Err|SentrixError::" crates/sentrix-core/src/storage.rs -B2 -A3 | head -80Repository: sentrix-labs/sentrix
Length of output: 3173
🏁 Script executed:
# Check if there's a specific error message pattern for "missing node" vs other Internal errors
rg "trie: missing node" crates/sentrix-trie/src -B2 -A2Repository: sentrix-labs/sentrix
Length of output: 2842
🏁 Script executed:
# Look for any pattern matching on error messages elsewhere in the codebase
rg "SentrixError::Internal.*missing|if.*contains.*missing" crates/sentrix-core/src -B2 -A2Repository: sentrix-labs/sentrix
Length of output: 46
🏁 Script executed:
# Check the full context of the reconcile function to understand the intent
sed -n '360,380p' crates/sentrix-core/src/storage.rsRepository: sentrix-labs/sentrix
Length of output: 1169
🏁 Script executed:
# Check if cache.get_node can return StorageError by looking at the full chain
sed -n '1,50p' crates/sentrix-trie/src/cache.rsRepository: sentrix-labs/sentrix
Length of output: 2244
🏁 Script executed:
# Verify load_node implementation which is called by cache.get_node
rg "fn load_node" crates/sentrix-trie/src/storage.rs -A20Repository: sentrix-labs/sentrix
Length of output: 895
🏁 Script executed:
# Check what other errors could be returned from trie operations
rg "map_err.*SentrixError" crates/sentrix-trie/src/storage.rs | head -20Repository: sentrix-labs/sentrix
Length of output: 1510
Distinguish missing-node errors from storage/serialization failures in error handling.
Lines 428-439 catch all trie.get() errors identically. However, trie.get() can fail with three distinct error types: SentrixError::Internal("trie: missing node ...") (node not found), SentrixError::StorageError(...) (I/O failure), and SentrixError::SerializationError(...) (deserialization failure). The current code treats them all as recoverable gaps, but storage and serialization errors indicate a fundamentally broken trie and should not be suppressed.
Match specifically on SentrixError::Internal with a message pattern check for "missing node", and let StorageError and SerializationError propagate. This preserves the fail-soft behavior for dangling trie nodes while preventing boot with a corrupted or unreadable trie.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/sentrix-core/src/storage.rs` around lines 425 - 449, The current
catch-all for trie.get(&key) treats all errors as recoverable gaps; change the
Err branch to match on the SentrixError enum: if
Err(SentrixError::Internal(msg)) and msg.contains("missing node") then log the
missing-node warning, increment trie_gaps and continue (as you do now), but if
Err(SentrixError::StorageError(_)) or Err(SentrixError::SerializationError(_))
then propagate the error (return Err(e) or use ?), since those indicate broken
storage/serialization; keep the same handling for successful Ok(leaf) and
subsequent account_value_decode usage and push into trie_values.
* fix(rpc): eth_getTransactionCount + eth_call + logsBloom length Three discrete chain-RPC fixes bundled because they share one cause (strict-but-non-spec behavior breaking off-the-shelf EVM agents). All discovered together standing up the Hyperlane relayer for the Base Sepolia ↔ Sentrix Testnet bridge (audit fix H-4). 1) eth_getTransactionCount accepts any block tag. Relayer + ethers + viem all pin nonce queries to a recent past block. A stale nonce is self-correcting (chain rejects wrong-nonce tx, caller retries) so this method serves current nonce regardless of block tag. 2) eth_call accepts any block tag. Same pattern: Hyperlane queries Mailbox.delivered(msgId) and recipientIsm(addr) at past blocks. Returns current-state. The strict gate stays on eth_getBalance / eth_getCode / eth_getStorageAt where wrong = wrong protocol decision. 3) Empty logsBloom is now actually 256 bytes (Ethereum spec). The EMPTY_LOGS_BLOOM const was 304 bytes (608 hex chars) because of an off-by-one in the original hand-typed string. The doc comment said 256 but the literal was 304. ethers' fee-oracle middleware strict-parses Block.logsBloom — Hyperlane gas estimation SerdeJson'd on this with "invalid length 608, expected 256 bytes" before any process() tx could be submitted. Bumps workspace 2.2.13 -> 2.2.14. * docs(changelog): add 2.2.14 entry — five RPC compat fixes 2.2.14 went live on mainnet 2026-05-21 17:02 UTC carrying PRs #692/#696/#702/#703/#704. Document the five fixes + flag that the eth_call + logsBloom fixes on this branch are scheduled for 2.2.15.
…bytes (#707) * fix(rpc): eth_getTransactionCount + eth_call + logsBloom length Three discrete chain-RPC fixes bundled because they share one cause (strict-but-non-spec behavior breaking off-the-shelf EVM agents). All discovered together standing up the Hyperlane relayer for the Base Sepolia ↔ Sentrix Testnet bridge (audit fix H-4). 1) eth_getTransactionCount accepts any block tag. Relayer + ethers + viem all pin nonce queries to a recent past block. A stale nonce is self-correcting (chain rejects wrong-nonce tx, caller retries) so this method serves current nonce regardless of block tag. 2) eth_call accepts any block tag. Same pattern: Hyperlane queries Mailbox.delivered(msgId) and recipientIsm(addr) at past blocks. Returns current-state. The strict gate stays on eth_getBalance / eth_getCode / eth_getStorageAt where wrong = wrong protocol decision. 3) Empty logsBloom is now actually 256 bytes (Ethereum spec). The EMPTY_LOGS_BLOOM const was 304 bytes (608 hex chars) because of an off-by-one in the original hand-typed string. The doc comment said 256 but the literal was 304. ethers' fee-oracle middleware strict-parses Block.logsBloom — Hyperlane gas estimation SerdeJson'd on this with "invalid length 608, expected 256 bytes" before any process() tx could be submitted. Bumps workspace 2.2.13 -> 2.2.14. * docs(changelog): add 2.2.14 entry — five RPC compat fixes 2.2.14 went live on mainnet 2026-05-21 17:02 UTC carrying PRs #692/#696/#702/#703/#704. Document the five fixes + flag that the eth_call + logsBloom fixes on this branch are scheduled for 2.2.15.
…boot
The trie-reconcile pass at boot used to propagate any
trie.geterror all the way out ofload_blockchain, which turned a single dangling trie node into an unbootable validator — even when the other 99.99% of the trie was healthy and the chain was actively producing blocks before the restart.Hit live on testnet 2026-05-20 during the v2.2.12 RPC swap. One address (0x044daee895f2bb2ffecd4dd2e771f5ce405ab0ea at h=5003961) had a dangling reference to node 314e57bd...; the validator had been up for 5 hours serving that exact state without complaint, but the next boot crashed on phase-1 trie scan. All four testnet validators were stuck because they share the same trie shape.
The reconcile already has well-defined behavior for "no trie leaf for this address" — phase-2 keeps the blob view as canonical. Treat a failed
trie.getthe same way: log the gap, count it, surface a summary warning if any were skipped, then continue. Touched accounts will rewrite the trie node on next apply, so the gap closes on its own without operator intervention.This is strictly a fail-soft on a previously fatal path. Healthy chains never hit it; the only behavioral change is that broken state loads instead of refusing to load.
Summary
Risk tier
Check ONE:
sentrix-core,sentrix-trie,sentrix-staking,sentrix-bft),block_executor,apply_block_*,state_rootpathRequired by tier
🟢 Low — minimum bar
🟡 Medium — adds
#[test]in same PR🟠 High — adds
🔴 Critical — adds
Test plan
Rollback plan
Related
Summary by CodeRabbit