Fix remaining synchronization bugs #3626

Merged
vicsn merged 14 commits into ProvableHQ:staging from kaimast:validator-deep-sync
Jun 12, 2025

@kaimast
Contributor

@kaimast kaimast commented May 7, 2025

This PR reverts the revert of PR #3543 and fixes two issues we detected during stress tests.

The changes are structured as follows:

  • db40b35 reverts the revert (i.e., undoes the revert commit). These changes have already been reviewed as part of the reverted PR.
  • 7f6762b ensures that, for validators, BFT's sync module processes new blocks, instead of them being processed by the generic BlockSync module.
  • 434a406 fixes another bug that is unlikely to occur on mainnet but surfaced during stress testing (see the section on "pending blocks" below).

Pending Blocks

Currently, validators have two different sync modes. For older blocks (>=100 rounds ago), they sync like clients, performing only some of the DAG checks. For more recent blocks (<100 rounds ago), validators check that some subsequent block contains enough votes for each new block's leader certificate. This check happens in snarkos_node_bft::sync::sync_storage_with_block.
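
As a rough illustration of the two modes, the choice can be sketched as a function of how many rounds ago a block was created. This is not the snarkOS implementation: `RECENT_ROUNDS`, `SyncMode`, and `sync_mode` are hypothetical names, and only the 100-round cutoff comes from the description above.

```rust
/// Hypothetical cutoff taken from the description above; the real constant
/// name and location in snarkOS may differ.
const RECENT_ROUNDS: u64 = 100;

#[derive(Debug, PartialEq, Eq)]
enum SyncMode {
    /// Older blocks: sync like a client, with only some of the DAG checks.
    ClientLike,
    /// Recent blocks: require enough votes for the block's leader certificate.
    VoteChecked,
}

/// Picks the sync mode for a block from how many rounds ago it was created.
fn sync_mode(current_round: u64, block_round: u64) -> SyncMode {
    // Saturate at zero so a block from a future round counts as "recent".
    let rounds_ago = current_round.saturating_sub(block_round);
    if rounds_ago >= RECENT_ROUNDS {
        SyncMode::ClientLike
    } else {
        SyncMode::VoteChecked
    }
}
```

Under these assumptions, a block from exactly 100 rounds ago takes the client-like path, while a block from 99 rounds ago goes through the vote check.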

Checking for sufficient votes means that, in addition to the blocks added to the ledger, there is also a set of "pending blocks" that cannot be added to the ledger because they have not yet reached the availability threshold.
The generic BlockSync module, however, is unaware of these pending blocks, which can cause issues, as outlined below.

Detected Issue and Proposed Fix

In the common case, there is only one pending block. However, during stress testing we triggered a case where a majority of validators restart after resetting their ledgers, and the most recent block did not receive sufficient votes. Here, there exist a block i-1 and a block i, where block i does not contain sufficient votes for i-1.
Usually, a subsequent block, here w.l.o.g. block i+1, contains sufficient votes to add all its pending predecessors to the ledger.
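
The scenario can be modeled with a toy ledger, sketched below under simplifying assumptions. `Block`, `SyncState`, `votes_for_previous`, and the numeric `availability_threshold` are hypothetical stand-ins for illustration only; the real check operates on leader certificates and is not reproduced here.

```rust
/// Toy block: `votes_for_previous` counts the votes this block carries for
/// its predecessor (a simplification of the leader-certificate check).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Block {
    height: u32,
    votes_for_previous: u64,
}

/// Hypothetical model of a ledger plus its set of pending blocks.
struct SyncState {
    ledger: Vec<Block>,
    pending: Vec<Block>,
    availability_threshold: u64,
}

impl SyncState {
    fn new(availability_threshold: u64) -> Self {
        Self { ledger: Vec::new(), pending: Vec::new(), availability_threshold }
    }

    /// Adds a block. If it carries enough votes, every block that is already
    /// pending moves to the ledger. The new block itself always stays pending
    /// until a successor carries enough votes.
    fn add_block(&mut self, block: Block) {
        if block.votes_for_previous >= self.availability_threshold {
            // `append` moves all pending blocks and leaves `pending` empty.
            self.ledger.append(&mut self.pending);
        }
        self.pending.push(block);
    }
}
```

In this model, adding block i with too few votes leaves both i-1 and i pending, and only the arrival of block i+1 with enough votes moves both into the ledger.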

The issue we found is that, in this scenario, BlockSync considers itself more than one block behind and, thus, not fully synced. The network then cannot make progress and create block i+1 because most validators are still syncing, but without block i+1, block i-1 never gets added to the ledger.
The final commit in this PR considers pending blocks as part of the sync state to address this.
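
The effect of the fix can be sketched by comparing two sync checks. This is a simplified model, not the code in the commit: the function names and `MAX_LAG` are hypothetical, and the tolerance of one block is inferred from the phrase "more than one block behind" above.

```rust
/// Tolerated lag in blocks, inferred from "more than one block behind" above.
const MAX_LAG: u32 = 1;

/// Sync check that ignores pending blocks (the behavior described as the bug).
fn is_synced_ledger_only(ledger_height: u32, network_height: u32) -> bool {
    network_height.saturating_sub(ledger_height) <= MAX_LAG
}

/// Sync check that counts pending blocks as part of the local sync state.
fn is_synced_with_pending(ledger_height: u32, num_pending: u32, network_height: u32) -> bool {
    let local_height = ledger_height.saturating_add(num_pending);
    network_height.saturating_sub(local_height) <= MAX_LAG
}
```

With the ledger at block i-2 and blocks i-1 and i both pending, the first check reports a lag of two blocks and stays unsynced, while the second treats the node as caught up, so it can take part in creating block i+1.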

Testing Plan

We already ran stress tests on this branch and did not detect any errors. However, I want to perform another stress-test run to be on the safe side.

Notes

There is a larger PR in the works to perform the checks on pending blocks on all nodes, not just validators, and move more of the DAG checks into snarkVM. However, I want to merge this version first, as the delta to the last PR is fairly small, and the larger PR requires more extensive testing and benchmarking.

acoglio and others added 3 commits March 24, 2025 21:50
These are some preconditions, postconditions, and invariants in the syncing
code.

These are currently *not* formally proved -- reviewers should double-check
them. They are based on an examination of the code, with informal reasoning.
Signed-off-by: vicsn <victor.s.nicolaas@protonmail.com>
@kaimast kaimast changed the title Validator deep sync Fix remaining synchronization bugs May 7, 2025
@kaimast kaimast force-pushed the validator-deep-sync branch 2 times, most recently from c71c76d to 6188044 Compare May 7, 2025 02:08
kaimast added 3 commits May 9, 2025 10:24
 * Validators now correctly set themselves to synced, even if the most recent block has not reached the availability threshold yet.
 * Improved documentation and logging in block synchronization code.
@kaimast kaimast force-pushed the validator-deep-sync branch from 6188044 to 434a406 Compare May 9, 2025 17:25
@kaimast kaimast marked this pull request as ready for review May 19, 2025 21:16
@niklaslong niklaslong requested a review from ljedrz May 20, 2025 13:11
@ljedrz
Collaborator

ljedrz commented May 20, 2025

Nit: the link to the 2nd commit should be 7f6762b.

Collaborator

@ljedrz ljedrz left a comment

LGTM 👍

@vicsn vicsn requested a review from raychu86 June 2, 2025 11:27
@vicsn vicsn removed the v3.8.0 label Jun 2, 2025
@vicsn vicsn removed the request for review from raychu86 June 2, 2025 15:53
acoglio previously approved these changes Jun 7, 2025
@kaimast kaimast force-pushed the validator-deep-sync branch from e04ce9a to 4a4e5e0 Compare June 11, 2025 00:06
vicsn previously approved these changes Jun 11, 2025
Collaborator

@vicsn vicsn left a comment

Did not review the code in depth, but I know the other reviewers did, and the tests ran successfully, so approving. Amazing work, folks!

@vicsn vicsn merged commit 58a302c into ProvableHQ:staging Jun 12, 2025
2 checks passed
5 participants