-
Notifications
You must be signed in to change notification settings - Fork 767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
network/strategy: Backoff and ban overloaded peers to avoid submitting the same request multiple times #5029
Conversation
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
The CI pipeline was cancelled due to failure one of the required jobs. |
Signed-off-by: Alexandru Vasile <[email protected]>
…block-req-banned-timeout
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Tested for around 7 days in kusama: 3 | warn | .* banned, disconnecting, reason: .* ( Peer disconnected with inflight after backoffs )
2 | warn | .* banned, disconnecting, reason: .* ( Same block request multiple times )
1 | warn | .* banned, disconnecting, reason: .* ( same small block request multiple times )
TLDR; only 3 out of 1978 backoff incidents resulted in banning from our side, while the others recovered. This makes things a bit better in not overloading peers, while not adding too much complexity. Another place to look in the future would be disconnecting peers, that can happen on a request failure. In the litep2p cases most requests are reported as rejected, which doesn't say exactly what happened and if the error is recoverable or not (paritytech/litep2p#188). However that investigation can happen at a later time. |
let should_ban = state.num_disconnects() >= MAX_NUM_DISCONNECTS; | ||
log::debug!( | ||
target: LOG_TARGET, | ||
"Remove known peer {peer} state: {state:?}, should ban: {should_ban}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The state is only removed below if should_ban
is true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed I see the confusion, the function was called remove
before the last refactor :D will adjust the log here.
We are losing track of the peer when:
- reaching LRU capacity (configured 512)
- the peer is reported as disconnected (tracking then shifts to the peerstore)
Thanks for catching this!
substrate/client/network/sync/src/strategy/disconnected_peers.rs
Outdated
Show resolved
Hide resolved
substrate/client/network/sync/src/strategy/disconnected_peers.rs
Outdated
Show resolved
Hide resolved
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 👍
|
||
/// The state of a disconnected peer with a request in flight. | ||
#[derive(Debug)] | ||
struct DisconnectedState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: may be call the struct SlowPeers
to highlight the fact we only track disconnections / connection time outs during requests?
substrate/client/network/sync/src/strategy/disconnected_peers.rs
Outdated
Show resolved
Hide resolved
@@ -250,6 +251,7 @@ pub struct ChainSync<B: BlockT, Client> { | |||
client: Arc<Client>, | |||
/// The active peers that we are using to sync and their PeerSync status | |||
peers: HashMap<PeerId, PeerSync<B>>, | |||
disconnected_peers: DisconnectedPeers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: may be call it slow_peers
to highlight we track only disconnected peers during active requests, not all disconnected peers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense, the primary focus of the backoff is to handle slow peers 🤔 Although, I think there might be a chance that the remote peer disconnects immediately because it cannot handle the request for some reason (ie the request resulted in a possible recoverable error). This gives other peers priority to handle the request, while the one that disconnected is backoff for a while
if !peer.state.is_available() || !allowed_requests.contains(&id) { | ||
if !peer.state.is_available() || | ||
!allowed_requests.contains(&id) || | ||
!disconnected_peers.is_peer_available(&id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this can then be !slow_peers.is_peer_available(&id)
, corresponding to the PR description.
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
DQ: If I see it correctly, we only handle the case where a peer disconnects while we have send a request to it and not received anything until now.
What do we do in cases where a request fails for example because of a timeout? Would it be fine to request again in that case?
title: Backoff slow peers to avoid duplicate requests | ||
|
||
doc: | ||
- audience: Node Dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this should also have Node Operator, as this is a operator facing bug. Meaning that operators might have seen this duplicate request error message in their logs. I think it might even be useful to include a reference to the logs message this aims to tackle.
Coming back to this, I had a look at some older logs that we have:
In those cases, we are still backing off the peer. The peer request is timedout, then we disconnect the peer and later the disconnect is reported back to strategies. I think this should be fine for most slow cases. However, I think we are running into a scenario from the logs where the timeout happens because we submitted a request just before the 20 seconds timeout expired. |
Validated the timeout case:
|
* master: (51 commits) Remove unused feature gated code from the minimal template (#5237) make polkadot-parachain startup errors pretty (#5214) Coretime auto-renew (#4424) network/strategy: Backoff and ban overloaded peers to avoid submitting the same request multiple times (#5029) Fix frame crate usage doc (#5222) beefy: Tolerate pruned state on runtime API call (#5197) rpc: Enable ChainSpec for polkadot-parachain (#5205) Add an adapter for configuring AssetExchanger (#5130) Replace env_logger with sp_tracing (#5065) Adjust sync templates flow to use new release branch (#5182) litep2p/discovery: Publish authority records with external addresses only (#5176) Run UI tests in CI for some other crates (#5167) Remove `pallet::getter` usage from the pallet-balances (#4967) pallet-timestamp: `UnixTime::now` implementation logs error only if called at genesis (#5055) [CI] Cache try-runtime check (#5179) [Backport] version bumps and the prdocs reordering from stable2407 (#5178) [subsystem-benchmark] Update availability-distribution-regression-bench baseline after recent subsystem changes (#5180) Remove pallet::getter usage from proxy (#4963) Remove pallet::getter macro usage from pallet-election-provider-multi-phase (#4487) [email protected] (#5177) ...
…g the same request multiple times (paritytech#5029) This PR avoids submitting the same block or state request multiple times to the same slow peer. Previously, we submitted the same request to the same slow peer, which resulted in reputation bans on the slow peer side. Furthermore, the strategy selected the same slow peer multiple times to submit queries to, although a better candidate may exist. Instead, in this PR we: - introduce a `DisconnectedPeers` via LRU with 512 peer capacity to only track the state of disconnected peers with a request in flight - when the `DisconnectedPeers` detects a peer disconnected with a request in flight, the peer is backed off - on the first disconnection: 60 seconds - on second disconnection: 120 seconds - on the third disconnection the peer is banned, and the peer remains banned until the peerstore decays its reputation This PR lifts the pressure from overloaded nodes that cannot process requests in due time. And if a peer is detected to be slow after backoffs, the peer is banned. Theoretically, submitting the same request multiple times can still happen when: - (a) we backoff and ban the peer - (b) the network does not discover other peers -- this may also be a test net - (c) the peer gets reconnected after the reputation decay and is still slow to respond Aims to improve: - paritytech#4924 - paritytech#531 Next Steps: - Investigate the network after this is deployed, possibly bumping the keep-alive timeout or seeing if there's something else misbehaving This PR builds on top of: - paritytech#4987 ### Testing Done - Added a couple of unit tests where test harness were set in place - Local testnet ```bash 13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD 13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false 13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true 13:16:49.108 WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting. ``` cc @paritytech/networking --------- Signed-off-by: Alexandru Vasile <[email protected]>
This PR avoids submitting the same block or state request multiple times to the same slow peer.
Previously, we submitted the same request to the same slow peer, which resulted in reputation bans on the slow peer side.
Furthermore, the strategy selected the same slow peer multiple times to submit queries to, although a better candidate may exist.
Instead, in this PR we:
DisconnectedPeers
via LRU with 512 peer capacity to only track the state of disconnected peers with a request in flightDisconnectedPeers
detects a peer disconnected with a request in flight, the peer is backed offThis PR lifts the pressure from overloaded nodes that cannot process requests in due time.
And if a peer is detected to be slow after backoffs, the peer is banned.
Theoretically, submitting the same request multiple times can still happen when:
Aims to improve:
Next Steps:
This PR builds on top of:
Testing Done
Added a couple of unit tests where test harness were set in place
Local testnet
cc @paritytech/networking