
feat: peer sync protection #1333

Draft
Dsorken wants to merge 1 commit into ReamLabs:master from Dsorken:bad-peer-root-protection

Conversation


Dsorken (Contributor) commented Mar 26, 2026

What was wrong?

Ream does not yet protect against a bad or malicious peer feeding improper data via sync. Currently, if Ream is seeded a bad root from an improper checkpoint at genesis, it errors and crashes: forward sync attempts to walk back to a previously stored root, which does not exist at genesis. If the bad root is supplied during regular runtime, Ream does not immediately error, but the queue entry holding the root stays in circulation with no mechanism to evict it, nor to penalize and avoid the peer that fed the bad checkpoint.

How was it fixed?

Several mechanisms were added or modified to protect against this issue:

  1. common_checkpoint_by() update: Previously the method returned the highest checkpoint across all peers, so if a misbehaving client was ahead of or diverged from the canonical chain, Ream would try to sync from that client. It now surveys its peers' checkpoints and selects the most agreed-upon checkpoint to treat as the canonical checkpoint to act on.
  2. Forward Sync upgrade: Previously, forward sync could still hit a wall when the queue's starting root was missing, an edge case triggered by the walk-back after a root mismatch. It now returns the result ChainIncomplete so that Ream can continue processing and attempt a requeue, which is then handled by the service quarantine system.
  3. Checkpoint Queue recovery/quarantine: When a forward sync ends in ChainIncomplete or RootMismatch, bad checkpoint data may be the cause at different stages of the sync. Each checkpoint root then passes through a set of conditions that increment its failure count on each failure, up to an arbitrary maximum threshold. If the threshold is reached and the suspect checkpoint refers to a slot higher than network finality, it is marked quarantined, since it failed to be found on every retry. If the root is later recovered through normal network processing, the quarantined root is reintroduced into queue circulation. However, if network finality passes the queued slot while it is in quarantine, or once the retry threshold is hit again, the root is permanently discarded. This prevents bad checkpoints from circulating forever.
  4. Synced state backfill relaxing: If the network head is slow to propagate into local storage, backfill sync can trigger unnecessarily, because the sync tick interval and the regular tick interval are asynchronous. This is not strictly incorrect behavior, but it makes an unneeded sync attempt, which increases the chance of fetching a bad checkpoint if one exists. Giving the head an arbitrary grace period to propagate through normal gossip should stop backfill sync from triggering when it is not needed. This change applies only while Ream is in the SYNCED state.
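To illustrate mechanism 1, here is a minimal sketch of majority-based checkpoint selection. The `Checkpoint` struct, the tie-break by higher slot, and the function shape are all assumptions for illustration, not Ream's actual types or the real `common_checkpoint_by()` signature:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for Ream's real checkpoint type.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Checkpoint {
    root: [u8; 32],
    slot: u64,
}

/// Pick the checkpoint reported by the largest number of peers,
/// rather than the highest one. A lone misbehaving peer that is
/// ahead of the canonical chain can no longer win the selection.
/// Ties are broken by the higher slot (an assumed policy).
fn common_checkpoint(peer_checkpoints: &[Checkpoint]) -> Option<Checkpoint> {
    let mut counts: HashMap<Checkpoint, usize> = HashMap::new();
    for checkpoint in peer_checkpoints {
        *counts.entry(*checkpoint).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|(checkpoint, count)| (*count, checkpoint.slot))
        .map(|(checkpoint, _)| checkpoint)
}
```

With this policy, two honest peers agreeing on one checkpoint outvote a single peer advertising a higher but bogus one.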
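Mechanism 3 can be sketched as a small state machine per queued root. This is one possible reading of the description above, not the PR's actual code: the threshold value, type names, and method names are all invented for illustration.

```rust
/// Arbitrary retry threshold, as described in the PR text.
const MAX_FAILURES: u32 = 3;

#[derive(Debug, PartialEq)]
enum RootStatus {
    Queued,
    Quarantined,
    Discarded,
}

/// Hypothetical per-root tracking entry in the checkpoint queue.
struct TrackedRoot {
    slot: u64,
    failures: u32,
    status: RootStatus,
}

impl TrackedRoot {
    /// Record one sync failure (e.g. ChainIncomplete or RootMismatch).
    fn record_failure(&mut self, finalized_slot: u64) {
        self.failures += 1;
        if self.failures >= MAX_FAILURES {
            // At the threshold: quarantine only if the root is still
            // ahead of network finality; otherwise discard permanently.
            self.status = if self.slot > finalized_slot {
                RootStatus::Quarantined
            } else {
                RootStatus::Discarded
            };
        }
    }

    /// Called when finality advances: a quarantined root whose slot
    /// finality has passed is permanently discarded.
    fn on_finalized(&mut self, finalized_slot: u64) {
        if self.status == RootStatus::Quarantined && finalized_slot >= self.slot {
            self.status = RootStatus::Discarded;
        }
    }

    /// If the root turns up through normal network processing,
    /// reintroduce it into queue circulation.
    fn recover(&mut self) {
        if self.status == RootStatus::Quarantined {
            self.status = RootStatus::Queued;
            self.failures = 0;
        }
    }
}
```

The key property is that a root can only sit in quarantine while it is ahead of finality; once finality catches up, it is gone for good, so bad checkpoints cannot circulate permanently.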

To-Do
