[PM-21802] fix: R-012 on_finalize expect() panics in consensus-critical path#1438
[PM-21802] fix: R-012 on_finalize expect() panics in consensus-critical path#1438m2ux wants to merge 4 commits into
Conversation
Branch setup for replacing .expect() calls in pallet-midnight on_finalize with graceful error handling (shielded-security-engineering#116, PM-21802). Signed-off-by: Mike Clay <mike.clay@shielded.io>
Replace .expect() in pallet-midnight on_finalize with match-based error handling. On post_block_update failure, log a diagnostic with block number and error, and preserve the previous state key so block production continues without panicking. Refs PM-21802, R-012, issue #116. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Mike Clay <mike.clay@shielded.io>
Adds runtime change fragment under changes/runtime/changed/ documenting the on_finalize panic remediation for security audit finding R-012 (PM-21802). Signed-off-by: Mike Clay <mike.clay@shielded.io>
Adds the GitHub issue link (shieldedtech/shielded-security-engineering#116) to the audit-on-finalize-graceful-error-handling change fragment. Signed-off-by: Mike Clay <mike.clay@shielded.io>
| match LedgerApi::post_block_update(&state_key, block_context.clone()) { | ||
| Ok(state_root) => { | ||
| StateKey::<T>::put(state_root); | ||
| LedgerApi::flush_storage(); | ||
| }, | ||
| Err(e) => { | ||
| log::error!( | ||
| "on_finalize: post_block_update failed at block {:?}: {:?}; preserving previous state key", | ||
| _block, | ||
| e | ||
| ); | ||
| }, | ||
| } |
There was a problem hiding this comment.
I am not sure it is the right direction. Suppressing this error may have its consequences, like not reseting the block fullness/advancing the time or accepting a block that should be rejected. I am thinking rather about preventing this to fail or minimizing it as much as possible. Here are few conditions that can cause it to fail, like serialization of the key or BlockLimitExceededError. Maybe we can try to split Ledger::post_block_update into verification and execution steps, which will allow us to trigger verification in the places we can fail, like at the end of the apply_transaction
There was a problem hiding this comment.
Definitely we should not ignore a post_block_update failure - arguably, a panic is correct here - failing post_block_update is a fatal error
There was a problem hiding this comment.
I'm happy to adopt the prevention strategy from #1448 and close this PR if everyone agrees?
|
Superceded by #1448 |
Summary
Replace the
.expect()panic inpallet-midnight::on_finalizewith graceful error handling that logs diagnostic information and preserves block production continuity.🎫 Ticket | 📐 Engineering
Motivation
The
on_finalizehook in pallet-midnight commits pending ledger state changes and produces a new state root every block. Previously, this operation used.expect()which panics the runtime on failure. In Substrate, a panic insideon_finalizeis unrecoverable — it halts block production and can leave the ledger in an inconsistent state requiring manual operator intervention.While the original reasoning (fail-fast on corrupted state) is defensible,
LedgerApiErrorincludes variants likeLedgerCacheErrorandLedgerStateScaleDecodingErrorthat may represent transient or recoverable conditions. A cache error should not halt the entire network.Changes
.expect("Post block update failed")with amatchexpression that logs the error atlog::error!level with block number and error variant, then preserves the previous state key by skippingStateKey::putandflush_storage().post_block_updatefails).changes/runtime/changed/audit-on-finalize-graceful-error-handling.mddescribing the consensus-safety fix.cargo check, workspaceclippy, and pallet tests pass; full test suite was scoped to the affected pallet to avoid long workspace runs.📌 Submission Checklist
🔱 Fork Strategy
🗹 TODO before merging