Fix remaining synchronization bugs#3626
Merged
vicsn merged 14 commits intoProvableHQ:stagingfrom Jun 12, 2025
Merged
Conversation
These are some preconditions, postconditions, and invariants in the syncing code. These are currently *not* formally proved -- reviewers should double-check them. They are based on an examination of the code, with informal reasoning.
Signed-off-by: vicsn <victor.s.nicolaas@protonmail.com>
c71c76d to
6188044
Compare
This reverts commit 70d6a84.
* Validators now correctly set themselves to synced, even if the most recent block has not reached the availability threshold yet. * Improved documentation and logging in block synchronization code.
6188044 to
434a406
Compare
ljedrz
reviewed
May 20, 2025
ljedrz
reviewed
May 20, 2025
Collaborator
|
Nit: the link to the 2nd commit should be 7f6762b. |
vicsn
reviewed
Jun 2, 2025
acoglio
reviewed
Jun 6, 2025
acoglio
reviewed
Jun 6, 2025
acoglio
reviewed
Jun 7, 2025
acoglio
previously approved these changes
Jun 7, 2025
e04ce9a to
4a4e5e0
Compare
vicsn
previously approved these changes
Jun 11, 2025
vicsn
approved these changes
Jun 12, 2025
Collaborator
vicsn
left a comment
There was a problem hiding this comment.
Did not review the code in depth, but know the other reviewers did, and tests were run succesfully so approving. Amazing work folks!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR reverts the revert of PR #3543 and fixes two issues we detected during stress tests.
The changes are structured as follows:
Pending Blocks
Currently, validators have two different sync modes. For older blocks (>=100 rounds ago) they sync like clients performing only some of the DAG checks, for more recent blocks (<100 rounds ago) validators check that there are enough votes for each new block's leader certificate in some subsequent block. This check happens in
snarkos_node_bft::sync::sync_storage_with_block.Checking for sufficient votes means that, in addition to blocks added to the ledger, there is also a set of "pending blocks" which cannot be added to the ledger yet because they did not reach the availability threshold yet.
The generic BlockSync module, however, is unaware of these pending blocks which can cause issues, as outlined below.
Detected Issue and Proposed Fix
In the common case, there is only one pending block. However, during stress testing we triggered a case where a majority of validators restart after resetting their ledgers, and the most recent block did not receive sufficient votes. Here, there exists a block
i-1and a blocki, where blockidoes not contain sufficient votes fori-1.Usually, a subsequent block, here w.l.o.g. block
i+1, contains sufficient votes to add all its pending predecessors to the ledger.The issue we found is that, in this scenario, BlockSync considers itself more than one block behind and, thus, not fully synced. The network then cannot make progress to create block i+1 because most validators are still syncing, but without
i+1blocki-1never gets added to the ledger.The final commit in this PR considers pending blocks as part of the sync state to address this.
Testing Plan
We already ran stresstests on this branch and did not detect any errors. However, I want to perform another run of stresstest to be on the safe side.
Notes
There is a larger PR in the works to perform the checks on pending blocks on all nodes, not just validators, and move more of the DAG checks into snarkVM. However, I want to merge this version first, as the delta to the last PR is fairly small, and the larger PR requires more extensive testing and benchmarking.