Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not request blocks below the common number when syncing #2045
Do not request blocks below the common number when syncing #2045
Changes from 7 commits
5801994
e64695f
481e6e1
9e6f2a9
cf2494a
84cbd69
9e81f3c
590ebb8
61f89b0
4a838ea
ede8a20
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much does this help with the situation?
Based on our discussion in Element, I understood the issue to be in the overlapping ranges which for some reason don't get cleared and instead get downloaded again, something I still don't fully understand given what the logs said. It's all because of parallel downloads?
In #1915 (comment) you said the peers' common numbers are updated inconsistently (block is queued vs imported) so I'm wondering can we trust these common numbers with any further logic?
Thanks for adding comments for these cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this patch the issue is happening every time versi is scaled up. With this patch applied the only incident (literally two nodes banned) that I observed was after failed import when the sync was restarted.
It was not happening due to overlapping ranges in the first place, but because when some peer had a common number lower than others, we requested old ranges from that peer, but also made requests to other peers below their common numbers (either parallel requests or requests of adjacent block ranges).
I also worry that common numbers are not very trustworthy to rely on. Not sure if "probabilistic" approach to common numbers works good enough to consider this PR safe merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P.S. I also observed the issue with the fix applied when versi was scaled before it was reset to a new chainspec, but it's not reproducible anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the syncing code and how it updates common number for peers but I don't think there is anything problematic happening, at least not anything I can see. Common number is updated for all peers that have already gone through ancestry search and the code verifies that the common number is not above peer's best number.
Do you have a concrete concern with the usage of common numbers or can we got ahead and merge this? I understood this has gone through extensive Versi testing already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my specific concern is that after we queue blocks for import in
on_block_queued
, we move common numbers of all peers forward to the number of the queued block (see here), not checking if the peers are actually on the same chain as the block imported.IIRC, there were also some other places where we just put something into the common number (our best block?), not actually checking if we are on the same chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced it is a problem. We've done an ancestry search for all the peers so they're all in the same chain and the new common block is only set if it's less than the peer's best number. What are the potential issues that can arise from this?
@arkpar do you have any comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't it be that the peer is on the same chain up to the last common number, but above the common number it's actually on a different fork up to it's best block? In this scenario it's unsafe to fast forward peer's common number between the last common number and peer's best number.
Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think anything's missing but I don't understand what is going to happen if this is merged as-is. Node attempts to download a range from peer which it doesn't have because the common number was unsafely modified? What happens then? What was happening in Westend sounded far worse because there's no alternative faulty behavior to weigh against it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible situation that bothers me — we won't be able to download a block range from a peer if it's common number is believed to be higher than it is. The bug this PR fixes might have facilitated block download in cases where the common number was estimated to be higher than it was in fact. With this PR merged, we will lose opportunity to download block ranges from such peers, and importing the range downloaded above the (incorrectly) estimated common number might fail, if the peer is in fact on a different fork.
The worst that might happen — we are stuck on a fork with common numbers of all peers incorrectly estimated to be higher than they are, downloading block ranges above these common numbers from the real canonical chain, but failing to import them because we lack some blocks in between. Something like #493.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposed change looks correct to me. Indeed, there should be no requests for other peers below the common block. Duplicate requests may happen because the same block range is also requested with a fork download mechanism.
Common ancestor search can't be reliable when the node is syncing at full speed. While the ancestry search is happening the node may import thousands of blocks and then detected common block is obsolete and too far behind.
In general it is not possible to guarantee that block that we consider "common" is up to date. Other peers may reorg at any time wihouth notification and the common number may indeed decrease.
The whole notion of "common number" and "best chain" is really obsolete and should be removed for sync 2.0. There's just the finalized chain (which can be dowloaded in parallel, with gaps) and a buch of unfinalized forks, all threated as equal.