Skip to content

[PM-21802] fix: R-012 on_finalize expect() panics in consensus-critical path#1438

Closed
m2ux wants to merge 4 commits into
mainfrom
fix/116-r012-on-finalize-expect-panics
Closed

[PM-21802] fix: R-012 on_finalize expect() panics in consensus-critical path#1438
m2ux wants to merge 4 commits into
mainfrom
fix/116-r012-on-finalize-expect-panics

Conversation

@m2ux
Copy link
Copy Markdown
Contributor

@m2ux m2ux commented Apr 28, 2026

Summary

Replace the .expect() panic in pallet-midnight::on_finalize with graceful error handling that logs diagnostic information and preserves block production continuity.

🎫 Ticket | 📐 Engineering


Motivation

The on_finalize hook in pallet-midnight commits pending ledger state changes and produces a new state root every block. Previously, this operation used .expect() which panics the runtime on failure. In Substrate, a panic inside on_finalize is unrecoverable — it halts block production and can leave the ledger in an inconsistent state requiring manual operator intervention.

While the original reasoning (fail-fast on corrupted state) is defensible, LedgerApiError includes variants like LedgerCacheError and LedgerStateScaleDecodingError that may represent transient or recoverable conditions. A cache error should not halt the entire network.


Changes

  • on_finalize error handling — Replaced .expect("Post block update failed") with a match expression that logs the error at log::error! level with block number and error variant, then preserves the previous state key by skipping StateKey::put and flush_storage().
  • Tests — Added a unit test covering the error path (state key preservation when post_block_update fails).
  • Change fragment — Added changes/runtime/changed/audit-on-finalize-graceful-error-handling.md describing the consensus-safety fix.
  • Verification — Confirmed cargo check, workspace clippy, and pallet tests pass; full test suite was scoped to the affected pallet to avoid long workspace runs.

📌 Submission Checklist

  • Changes are backward-compatible (or flagged if breaking)
  • Pull request description explains why the change is needed
  • Self-reviewed the diff
  • I have included a change file
  • If the changes introduce a new feature, I have bumped the node minor version
  • Update documentation (if relevant)
  • No new todos introduced

🔱 Fork Strategy

  • Node Runtime Update
  • Node Client Update
  • Other
  • N/A

🗹 TODO before merging

  • Ready for review

Branch setup for replacing .expect() calls in pallet-midnight on_finalize
with graceful error handling (shielded-security-engineering#116, PM-21802).

Signed-off-by: Mike Clay <mike.clay@shielded.io>
@m2ux m2ux self-assigned this Apr 28, 2026
m2ux and others added 2 commits April 28, 2026 20:30
Replace .expect() in pallet-midnight on_finalize with match-based error
handling. On post_block_update failure, log a diagnostic with block
number and error, and preserve the previous state key so block
production continues without panicking.

Refs PM-21802, R-012, issue #116.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Clay <mike.clay@shielded.io>
Adds runtime change fragment under changes/runtime/changed/ documenting the
on_finalize panic remediation for security audit finding R-012 (PM-21802).

Signed-off-by: Mike Clay <mike.clay@shielded.io>
@m2ux m2ux marked this pull request as ready for review April 29, 2026 08:15
@m2ux m2ux requested a review from a team as a code owner April 29, 2026 08:15
Adds the GitHub issue link (shieldedtech/shielded-security-engineering#116)
to the audit-on-finalize-graceful-error-handling change fragment.

Signed-off-by: Mike Clay <mike.clay@shielded.io>
@m2ux m2ux changed the title fix/116: R-012 on_finalize expect() panics in consensus-critical path [PM-21802] fix: R-012 on_finalize expect() panics in consensus-critical path Apr 29, 2026
Comment on lines +339 to +351
match LedgerApi::post_block_update(&state_key, block_context.clone()) {
Ok(state_root) => {
StateKey::<T>::put(state_root);
LedgerApi::flush_storage();
},
Err(e) => {
log::error!(
"on_finalize: post_block_update failed at block {:?}: {:?}; preserving previous state key",
_block,
e
);
},
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it is the right direction. Suppressing this error may have its consequences, like not reseting the block fullness/advancing the time or accepting a block that should be rejected. I am thinking rather about preventing this to fail or minimizing it as much as possible. Here are few conditions that can cause it to fail, like serialization of the key or BlockLimitExceededError. Maybe we can try to split Ledger::post_block_update into verification and execution steps, which will allow us to trigger verification in the places we can fail, like at the end of the apply_transaction

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some PoC #1448

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely we should not ignore a post_block_update failure - arguably, a panic is correct here - failing post_block_update is a fatal error

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to adopt the prevention strategy from #1448 and close this PR if everyone agrees?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@m2ux
Copy link
Copy Markdown
Contributor Author

m2ux commented May 6, 2026

Superceded by #1448

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants