Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not request blocks below the common number when syncing #2045

Merged
merged 11 commits into from
Nov 3, 2023
232 changes: 223 additions & 9 deletions substrate/client/network/sync/src/blocks.rs
Original file line number Diff line number Diff line change
Expand Up @@ -123,20 +123,36 @@ impl<B: BlockT> BlockCollection<B> {
let first_different = common + <NumberFor<B>>::one();
let count = (count as u32).into();
let (mut range, downloading) = {
// Iterate through the ranges in `self.blocks` looking for a range to download
let mut downloading_iter = self.blocks.iter().peekable();
let mut prev: Option<(&NumberFor<B>, &BlockRangeState<B>)> = None;
loop {
let next = downloading_iter.next();
break match (prev, next) {
// If we are already downloading this range, request it from `max_parallel`
// peers (`max_parallel = 5` by default).
altonen marked this conversation as resolved.
Show resolved Hide resolved
// Do not request already downloading range from peers with common number above
// the range start.
(Some((start, &BlockRangeState::Downloading { ref len, downloading })), _)
if downloading < max_parallel =>
if downloading < max_parallel && *start >= first_different =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much does this help with the situation?

Based on our discussion in Element, I understood the issue to be in the overlapping ranges which for some reason don't get cleared and instead get downloaded again, something I still don't fully understand given what the logs said. It's all because of parallel downloads?

In #1915 (comment) you said the peers' common numbers are updated inconsistently (block is queued vs imported) so I'm wondering can we trust these common numbers with any further logic?

Thanks for adding comments for these cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this patch the issue is happening every time versi is scaled up. With this patch applied the only incident (literally two nodes banned) that I observed was after failed import when the sync was restarted.

It was not happening due to overlapping ranges in the first place, but because when some peer had a common number lower than others, we requested old ranges from that peer, but also made requests to other peers below their common numbers (either parallel requests or requests of adjacent block ranges).

I also worry that common numbers are not very trustworthy to rely on. Not sure if "probabilistic" approach to common numbers works good enough to consider this PR safe merging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. I also observed the issue with the fix applied when versi was scaled before it was reset to a new chainspec, but it's not reproducible anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the syncing code and how it updates common number for peers but I don't think there is anything problematic happening, at least not anything I can see. Common number is updated for all peers that have already gone through ancestry search and the code verifies that the common number is not above peer's best number.

Do you have a concrete concern with the usage of common numbers or can we got ahead and merge this? I understood this has gone through extensive Versi testing already

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a concrete concern with the usage of common numbers or can we got ahead and merge this? I understood this has gone through extensive Versi testing already

Yes, my specific concern is that after we queue blocks for import in on_block_queued, we move common numbers of all peers forward to the number of the queued block (see here), not checking if the peers are actually on the same chain as the block imported.

IIRC, there were also some other places where we just put something into the common number (our best block?), not actually checking if we are on the same chain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced it is a problem. We've done an ancestry search for all the peers so they're all in the same chain and the new common block is only set if it's less than the peer's best number. What are the potential issues that can arise from this?

@arkpar do you have any comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't it be that the peer is on the same chain up to the last common number, but above the common number it's actually on a different fork up to it's best block? In this scenario it's unsafe to fast forward peer's common number between the last common number and peer's best number.

Am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think anything's missing but I don't understand what is going to happen if this is merged as-is. Node attempts to download a range from peer which it doesn't have because the common number was unsafely modified? What happens then? What was happening in Westend sounded far worse because there's no alternative faulty behavior to weigh against it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible situation that bothers me — we won't be able to download a block range from a peer if it's common number is believed to be higher than it is. The bug this PR fixes might have facilitated block download in cases where the common number was estimated to be higher than it was in fact. With this PR merged, we will lose opportunity to download block ranges from such peers, and importing the range downloaded above the (incorrectly) estimated common number might fail, if the peer is in fact on a different fork.

The worst that might happen — we are stuck on a fork with common numbers of all peers incorrectly estimated to be higher than they are, downloading block ranges above these common numbers from the real canonical chain, but failing to import them because we lack some blocks in between. Something like #493.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed change looks correct to me. Indeed, there should be no requests for other peers below the common block. Duplicate requests may happen because the same block range is also requested with a fork download mechanism.

Here is this another common number update after ancestor search that I don't understand:

Common ancestor search can't be reliable when the node is syncing at full speed. While the ancestry search is happening the node may import thousands of blocks and then detected common block is obsolete and too far behind.

In general it is not possible to guarantee that block that we consider "common" is up to date. Other peers may reorg at any time wihouth notification and the common number may indeed decrease.

The whole notion of "common number" and "best chain" is really obsolete and should be removed for sync 2.0. There's just the finalized chain (which can be dowloaded in parallel, with gaps) and a buch of unfinalized forks, all threated as equal.

(*start..*start + *len, downloading),
(Some((start, r)), Some((next_start, _))) if *start + r.len() < *next_start =>
(*start + r.len()..cmp::min(*next_start, *start + r.len() + count), 0), // gap
(Some((start, r)), None) => (*start + r.len()..*start + r.len() + count, 0), /* last range */
(None, None) => (first_different..first_different + count, 0), /* empty */
// If there is a gap between ranges requested, download this gap unless the peer
// has common number above the gap start
(Some((start, r)), Some((next_start, _)))
if *start + r.len() < *next_start &&
*start + r.len() >= first_different =>
(*start + r.len()..cmp::min(*next_start, *start + r.len() + count), 0),
// Download `count` blocks after the last range requested unless the peer
// has common number above this new range
(Some((start, r)), None) if *start + r.len() >= first_different =>
(*start + r.len()..*start + r.len() + count, 0),
// If there are no ranges currently requested, download `count` blocks after
// `common` number
(None, None) => (first_different..first_different + count, 0),
// If the first range starts above `common + 1`, download the gap at the start
(None, Some((start, _))) if *start > first_different =>
(first_different..cmp::min(first_different + count, *start), 0), /* gap at the start */
(first_different..cmp::min(first_different + count, *start), 0),
// Move on to the next range pair
_ => {
prev = next;
continue
Expand Down Expand Up @@ -365,10 +381,10 @@ mod test {
bc.blocks.insert(114305, BlockRangeState::Complete(blocks));

let peer0 = PeerId::random();
assert_eq!(bc.needed_blocks(peer0, 128, 10000, 000, 1, 200), Some(1..100));
assert_eq!(bc.needed_blocks(peer0, 128, 10000, 600, 1, 200), None); // too far ahead
assert_eq!(bc.needed_blocks(peer0, 128, 10000, 0, 1, 200), Some(1..100));
assert_eq!(bc.needed_blocks(peer0, 128, 10000, 0, 1, 200), None); // too far ahead
assert_eq!(
bc.needed_blocks(peer0, 128, 10000, 600, 1, 200000),
bc.needed_blocks(peer0, 128, 10000, 0, 1, 200000),
altonen marked this conversation as resolved.
Show resolved Hide resolved
Some(100 + 128..100 + 128 + 128)
);
}
Expand Down Expand Up @@ -431,4 +447,202 @@ mod test {
assert!(bc.blocks.is_empty());
assert!(bc.queued_blocks.is_empty());
}

#[test]
fn downloaded_range_is_requested_from_max_parallel_peers() {
let mut bc = BlockCollection::new();
assert!(is_empty(&bc));

let count = 5;
// identical ranges requested from 2 peers
let max_parallel = 2;
let max_ahead = 200;

let peer1 = PeerId::random();
let peer2 = PeerId::random();
let peer3 = PeerId::random();

// common for all peers
let best = 100;
let common = 10;

assert_eq!(
bc.needed_blocks(peer1, count, best, common, max_parallel, max_ahead),
Some(11..16)
);
assert_eq!(
bc.needed_blocks(peer2, count, best, common, max_parallel, max_ahead),
Some(11..16)
);
assert_eq!(
bc.needed_blocks(peer3, count, best, common, max_parallel, max_ahead),
Some(16..21)
);
}
#[test]
fn downloaded_range_not_requested_from_peers_with_higher_common_number() {
// A peer connects with a common number falling behind our best number
// (either a fork or lagging behind).
// We request a range from this peer starting at its common number + 1.
// Even though we have less than `max_parallel` downloads, we do not request
// this range from peers with a common number above the start of this range.

let mut bc = BlockCollection::new();
assert!(is_empty(&bc));

let count = 5;
let max_parallel = 2;
let max_ahead = 200;

let peer1 = PeerId::random();
let peer1_best = 20;
let peer1_common = 10;

// `peer2` has first different above the start of the range downloaded from `peer1`
let peer2 = PeerId::random();
let peer2_best = 20;
let peer2_common = 11; // first_different = 12

assert_eq!(
bc.needed_blocks(peer1, count, peer1_best, peer1_common, max_parallel, max_ahead),
Some(11..16),
);
assert_eq!(
bc.needed_blocks(peer2, count, peer2_best, peer2_common, max_parallel, max_ahead),
Some(16..21),
);
}

#[test]
fn gap_above_common_number_requested() {
let mut bc = BlockCollection::new();
assert!(is_empty(&bc));

let count = 5;
let best = 30;
// We need at least 3 ranges requested to have a gap, so to minimize the number of peers
// set `max_parallel = 1`
let max_parallel = 1;
let max_ahead = 200;

let peer1 = PeerId::random();
let peer2 = PeerId::random();
let peer3 = PeerId::random();

let common = 10;
assert_eq!(
bc.needed_blocks(peer1, count, best, common, max_parallel, max_ahead),
Some(11..16),
);
assert_eq!(
bc.needed_blocks(peer2, count, best, common, max_parallel, max_ahead),
Some(16..21),
);
assert_eq!(
bc.needed_blocks(peer3, count, best, common, max_parallel, max_ahead),
Some(21..26),
);

// For some reason there is now a gap at 16..21. We just disconnect `peer2`, but it might
// also happen that 16..21 received first and got imported if our best is actually >= 15.
bc.clear_peer_download(&peer2);

// Some peer connects with common number below the gap. The gap is requested from it.
assert_eq!(
bc.needed_blocks(peer2, count, best, common, max_parallel, max_ahead),
Some(16..21),
);
}

#[test]
fn gap_below_common_number_not_requested() {
let mut bc = BlockCollection::new();
assert!(is_empty(&bc));

let count = 5;
let best = 30;
// We need at least 3 ranges requested to have a gap, so to minimize the number of peers
// set `max_parallel = 1`
let max_parallel = 1;
let max_ahead = 200;

let peer1 = PeerId::random();
let peer2 = PeerId::random();
let peer3 = PeerId::random();

let common = 10;
assert_eq!(
bc.needed_blocks(peer1, count, best, common, max_parallel, max_ahead),
Some(11..16),
);
assert_eq!(
bc.needed_blocks(peer2, count, best, common, max_parallel, max_ahead),
Some(16..21),
);
assert_eq!(
bc.needed_blocks(peer3, count, best, common, max_parallel, max_ahead),
Some(21..26),
);

// For some reason there is now a gap at 16..21. We just disconnect `peer2`, but it might
// also happen that 16..21 received first and got imported if our best is actually >= 15.
bc.clear_peer_download(&peer2);

// Some peer connects with common number above the gap. The gap is not requested from it.
let common = 23;
assert_eq!(
bc.needed_blocks(peer2, count, best, common, max_parallel, max_ahead),
Some(26..31), // not 16..21
);
}

#[test]
fn range_at_the_end_above_common_number_requested() {
let mut bc = BlockCollection::new();
assert!(is_empty(&bc));

let count = 5;
let best = 30;
let max_parallel = 1;
let max_ahead = 200;

let peer1 = PeerId::random();
let peer2 = PeerId::random();

let common = 10;
assert_eq!(
bc.needed_blocks(peer1, count, best, common, max_parallel, max_ahead),
Some(11..16),
);
assert_eq!(
bc.needed_blocks(peer2, count, best, common, max_parallel, max_ahead),
Some(16..21),
);
}

#[test]
fn range_at_the_end_below_common_number_not_requested() {
let mut bc = BlockCollection::new();
assert!(is_empty(&bc));

let count = 5;
let best = 30;
let max_parallel = 1;
let max_ahead = 200;

let peer1 = PeerId::random();
let peer2 = PeerId::random();

let common = 10;
assert_eq!(
bc.needed_blocks(peer1, count, best, common, max_parallel, max_ahead),
Some(11..16),
);

let common = 20;
assert_eq!(
bc.needed_blocks(peer2, count, best, common, max_parallel, max_ahead),
Some(21..26), // not 16..21
);
}
}
Loading