Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

network/strategy: Backoff and ban overloaded peers to avoid submitting the same request multiple times #5029

Merged
merged 25 commits into from
Aug 5, 2024

Conversation

lexnv
Copy link
Contributor

@lexnv lexnv commented Jul 16, 2024

This PR avoids submitting the same block or state request multiple times to the same slow peer.

Previously, we submitted the same request to the same slow peer, which resulted in reputation bans on the slow peer side.
Furthermore, the strategy selected the same slow peer multiple times to submit queries to, although a better candidate may exist.

Instead, in this PR we:

  • introduce a DisconnectedPeers via LRU with 512 peer capacity to only track the state of disconnected peers with a request in flight
  • when the DisconnectedPeers detects a peer disconnected with a request in flight, the peer is backed off
    • on the first disconnection: 60 seconds
    • on second disconnection: 120 seconds
    • on the third disconnection the peer is banned, and the peer remains banned until the peerstore decays its reputation

This PR lifts the pressure from overloaded nodes that cannot process requests in due time.
And if a peer is detected to be slow after backoffs, the peer is banned.

Theoretically, submitting the same request multiple times can still happen when:

  • (a) we backoff and ban the peer
  • (b) the network does not discover other peers -- this may also be a test net
  • (c) the peer gets reconnected after the reputation decay and is still slow to respond

Aims to improve:

Next Steps:

  • Investigate the network after this is deployed, possibly bumping the keep-alive timeout or seeing if there's something else misbehaving

This PR builds on top of:

Testing Done

  • Added a couple of unit tests where test harness were set in place

  • Local testnet

13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD

13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false

13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true

13:16:49.108  WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting.

cc @paritytech/networking

@lexnv lexnv added A1-insubstantial Pull request requires no code review (e.g., a sub-repository hash update). R0-silent Changes should not be mentioned in any release notes I5-enhancement An additional feature request. D1-medium Can be fixed by a coder with good Rust knowledge but little knowledge of the codebase. labels Jul 16, 2024
@lexnv lexnv self-assigned this Jul 16, 2024
lexnv added 4 commits July 16, 2024 14:46
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable 3/3
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6714708

@lexnv
Copy link
Contributor Author

lexnv commented Jul 30, 2024

Tested for around 7 days in kusama:

3          | warn       | .* banned, disconnecting, reason: .* ( Peer disconnected with inflight after backoffs    )
2          | warn       | .* banned, disconnecting, reason: .* ( Same block request multiple times    )
1          | warn       | .* banned, disconnecting, reason: .* ( same small block request multiple times    )
  • Have detected 3 occurrences where different peers were banned and did not recover after the backoffs.
  • Encountered 3 occurrences of "same block" (2 for blocks and 1 for small block) from peers that don't have this backoff strategy enabled.
  • Have backoff-off a total of 1978 peers over the lifetime of the node which recovered after 1 backoff period. Which would have otherwise resulted in overloading the nodes and in potentially other "same node requests" warnings.
Added peer 12D3KooWPycfFryUnM73FYX2GtTRCWgYSvVR9BdhjdHfQsUSB51x for the first time
Added peer 12D3KooWKSxWR4DztMySFP5bqEuabwTXGSdjGFbKrjNrwakezm1t for the first time
Added peer 12D3KooWFkQ6kpypoubNxArj7Bh3S26ZW22EVwmzgruDfccK5L2M for the first time
Added peer 12D3KooWFCzn6iLLhZktdmktszZZYtCRMJmqEYaUWAaeS5QZcLQJ for the first time
Added peer 12D3KooWLy64aBQgr7D69vjoiC9EiK3zGzBuAErRoZEunkHGL5tS for the first time
Added peer 12D3KooWNgBTwW3GNERNTtTVTxKV7uiM48t3ADYimUwkSBGcmzqC for the first time
Added peer 12D3KooWPycfFryUnM73FYX2GtTRCWgYSvVR9BdhjdHfQsUSB51x for the first time
Added peer 12D3KooWMJ9K555gcAWyRdTLEUWNB6GTrJcy9CCusPBHGvUK7i7F for the first time
Added peer 12D3KooWFGjQPK4nus67n2jVLsJdgnHxNcspRxi2B9nuSJ8jseqg for the first time
...

TLDR; only 3 out of 1978 backoff incidents resulted in banning from our side, while the others recovered. This makes things a bit better in not overloading peers, while not adding too much complexity.

Another place to look in the future would be disconnecting peers, that can happen on a request failure. In the litep2p cases most requests are reported as rejected, which doesn't say exactly what happened and if the error is recoverable or not (paritytech/litep2p#188). However that investigation can happen at a later time.

@lexnv lexnv added the T0-node This PR/Issue is related to the topic “node”. label Jul 30, 2024
@lexnv lexnv requested a review from a team July 30, 2024 13:57
let should_ban = state.num_disconnects() >= MAX_NUM_DISCONNECTS;
log::debug!(
target: LOG_TARGET,
"Remove known peer {peer} state: {state:?}, should ban: {should_ban}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state is only removed below if should_ban is true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I see the confusion, the function was called remove before the last refactor :D will adjust the log here.
We are losing track of the peer when:

  • reaching LRU capacity (configured 512)
  • the peer is reported as disconnected (tracking then shifts to the peerstore)
    Thanks for catching this!

lexnv added 2 commits July 30, 2024 18:43
@lexnv lexnv removed the R0-silent Changes should not be mentioned in any release notes label Jul 30, 2024
Signed-off-by: Alexandru Vasile <[email protected]>
Copy link
Contributor

@dmitry-markin dmitry-markin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

substrate/client/network/sync/src/strategy/state.rs Outdated Show resolved Hide resolved
substrate/client/network/sync/src/strategy/warp.rs Outdated Show resolved Hide resolved

/// The state of a disconnected peer with a request in flight.
#[derive(Debug)]
struct DisconnectedState {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: may be call the struct SlowPeers to highlight the fact we only track disconnections / connection time outs during requests?

@@ -250,6 +251,7 @@ pub struct ChainSync<B: BlockT, Client> {
client: Arc<Client>,
/// The active peers that we are using to sync and their PeerSync status
peers: HashMap<PeerId, PeerSync<B>>,
disconnected_peers: DisconnectedPeers,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: may be call it slow_peers to highlight we track only disconnected peers during active requests, not all disconnected peers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense, the primary focus of the backoff is to handle slow peers 🤔 Although, I think there might be a chance that the remote peer disconnects immediately because it cannot handle the request for some reason (ie the request resulted in a possible recoverable error). This gives other peers priority to handle the request, while the one that disconnected is backoff for a while

if !peer.state.is_available() || !allowed_requests.contains(&id) {
if !peer.state.is_available() ||
!allowed_requests.contains(&id) ||
!disconnected_peers.is_peer_available(&id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this can then be !slow_peers.is_peer_available(&id), corresponding to the PR description.

@lexnv lexnv changed the title network/strategy: Backoff and ban slow peers to avoid submitting the same request multiple times network/strategy: Backoff and ban overloaded peers to avoid submitting the same request multiple times Aug 1, 2024
Copy link
Contributor

@skunert skunert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

DQ: If I see it correctly, we only handle the case where a peer disconnects while we have send a request to it and not received anything until now.

What do we do in cases where a request fails for example because of a timeout? Would it be fine to request again in that case?

title: Backoff slow peers to avoid duplicate requests

doc:
- audience: Node Dev
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this should also have Node Operator, as this is a operator facing bug. Meaning that operators might have seen this duplicate request error message in their logs. I think it might even be useful to include a reference to the logs message this aims to tackle.

@lexnv
Copy link
Contributor Author

lexnv commented Aug 1, 2024

What do we do in cases where a request fails for example because of a timeout? Would it be fine to request again in that case?

Coming back to this, I had a look at some older logs that we have:

2024-07-23 18:38:18.869 DEBUG tokio-runtime-worker sync: Pre-validating received block announcement 0x1afbfb4b8f9862a8bf1a82750af18a393c41b885a9fd72f1deeb6cae0425b4bf with number 24169952 from 12D3KooWE1tq9ZL9AAxMiUBBqy1ENmh5pwfWabnoBPMo8gFPXhn6
2024-07-23 18:38:18.869 TRACE tokio-runtime-worker sync: Block announcement validated successfully: from 12D3KooWE1tq9ZL9AAxMiUBBqy1ENmh5pwfWabnoBPMo8gFPXhn6: AnnouncementSummary { block_hash: 0x1afbfb4b8f9862a8bf1a82750af18a393c41b885a9fd72f1deeb6cae0425b4bf, number: 24169952, parent_hash: 0xdf6f7654c69eaddbe94e7db7d38cb79f74181765a6cde6c690b6d67c2d474d49, state: Some(Best) }. Local best: true.
2024-07-23 18:38:18.869 TRACE tokio-runtime-worker sync: 12D3KooWE1tq9ZL9AAxMiUBBqy1ENmh5pwfWabnoBPMo8gFPXhn6 Ignoring transactions while major syncing
..

2024-07-23 18:38:20.768 DEBUG tokio-runtime-worker sync: Request to peer PeerId("12D3KooWE1tq9ZL9AAxMiUBBqy1ENmh5pwfWabnoBPMo8gFPXhn6") failed: Network(Timeout).
2024-07-23 18:38:20.768 DEBUG tokio-runtime-worker sync: 12D3KooWE1tq9ZL9AAxMiUBBqy1ENmh5pwfWabnoBPMo8gFPXhn6 disconnected
2024-07-23 18:38:20.768 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added peer 12D3KooWE1tq9ZL9AAxMiUBBqy1ENmh5pwfWabnoBPMo8gFPXhn6 for the first time

In those cases, we are still backing off the peer. The peer request is timedout, then we disconnect the peer and later the disconnect is reported back to strategies. I think this should be fine for most slow cases.

However, I think we are running into a scenario from the logs where the timeout happens because we submitted a request just before the 20 seconds timeout expired.

@lexnv
Copy link
Contributor Author

lexnv commented Aug 5, 2024

Validated the timeout case:

2024-08-02 16:43:21.214 TRACE tokio-runtime-worker sync: New WarpProofRequest to 12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD, begin hash: 0xb0a8…dafe.    
2024-08-02 16:43:21.214 TRACE tokio-runtime-worker sync: Processed `ChainSyncAction::SendWarpProofRequest` to 12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD, request: WarpProofRequest { begin: 0xb0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe }.
2024-08-02 16:43:21.214 TRACE tokio-runtime-worker sub-libp2p::request-response: /b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/sync/warp: send request to PeerId("12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD") (fallback None) (dial options: Reject)    

2024-08-02 16:43:24.420 TRACE tokio-runtime-worker sync: 12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD Ignoring transactions while major syncing    
2024-08-02 16:43:24.420 TRACE tokio-runtime-worker sync: 12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD Ignoring transactions while major syncing    

2024-08-02 16:43:25.217 TRACE tokio-runtime-worker sync: Block announcement validated successfully: from 12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD: AnnouncementSummary { block_hash: 0x87a833d73f39cf7b5f0dc5fb6ef17d26fd4b4e38640e08a4ec44f1f2cabce8b8, number: 24311078, parent_hash: 0xcc23e1fc80eb287c3f23e0686cdc8e26a14c7d4ade7f3d260acfbb5e11a3f173, state: Some(Best) }. Local best: true.    

2024-08-02 16:43:30.879 DEBUG tokio-runtime-worker sync: Pre-validating received block announcement 0xd793510541eb81890d43ec2df01e2c8d2ac90031c9c66ce6c7838db3dd0e954e with number 24311079 from 12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD    
2024-08-02 16:43:30.879 TRACE tokio-runtime-worker sync: Block announcement validated successfully: from 12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD: AnnouncementSummary { block_hash: 0xd793510541eb81890d43ec2df01e2c8d2ac90031c9c66ce6c7838db3dd0e954e, number: 24311079, parent_hash: 0x87a833d73f39cf7b5f0dc5fb6ef17d26fd4b4e38640e08a4ec44f1f2cabce8b8, state: Some(Best) }. Local best: true.    

2024-08-02 16:43:31.281 DEBUG tokio-runtime-worker sub-libp2p::request-response: OnHeap("/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/sync/warp"): request failed for PeerId("12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD") (RequestId(0)): Timeout    
2024-08-02 16:43:31.282 DEBUG tokio-runtime-worker sync: Request to peer PeerId("12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD") failed: Network(Timeout).    
2024-08-02 16:43:31.282 DEBUG tokio-runtime-worker sync: 12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD disconnected    
2024-08-02 16:43:31.282 DEBUG tokio-runtime-worker sync::disconnected_peers: Added peer 12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD for the first time    
  • 43:21: We send peer A a request over /sync/warp (configured with 10s timeouts)
  • (in the meanwhile) peer A is active over block-announces protocol (different proto)
  • 43:31: Request to peer A failed: Network(Timeout)
  • peer is added to the backoff list

@lexnv lexnv added this pull request to the merge queue Aug 5, 2024
Merged via the queue into master with commit 6619277 Aug 5, 2024
163 of 164 checks passed
@lexnv lexnv deleted the lexnv/fix-multiple-block-req-banned-timeout branch August 5, 2024 10:21
ordian added a commit that referenced this pull request Aug 6, 2024
* master: (51 commits)
  Remove unused feature gated code from the minimal template (#5237)
  make polkadot-parachain startup errors pretty (#5214)
  Coretime auto-renew (#4424)
  network/strategy: Backoff and ban overloaded peers to avoid submitting the same request multiple times (#5029)
  Fix frame crate usage doc (#5222)
  beefy: Tolerate pruned state on runtime API call (#5197)
  rpc: Enable ChainSpec for polkadot-parachain (#5205)
  Add an adapter for configuring AssetExchanger (#5130)
  Replace env_logger with sp_tracing (#5065)
  Adjust sync templates flow to use new release branch (#5182)
  litep2p/discovery: Publish authority records with external addresses only (#5176)
  Run UI tests in CI for some other crates (#5167)
  Remove `pallet::getter` usage from the pallet-balances (#4967)
  pallet-timestamp: `UnixTime::now` implementation logs error only if called at genesis (#5055)
  [CI] Cache try-runtime check (#5179)
  [Backport] version bumps and the prdocs reordering from stable2407 (#5178)
  [subsystem-benchmark] Update availability-distribution-regression-bench baseline after recent subsystem changes (#5180)
  Remove pallet::getter usage from proxy (#4963)
  Remove pallet::getter macro usage from pallet-election-provider-multi-phase (#4487)
  [email protected] (#5177)
  ...
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this pull request Aug 28, 2024
…g the same request multiple times (paritytech#5029)

This PR avoids submitting the same block or state request multiple times
to the same slow peer.

Previously, we submitted the same request to the same slow peer, which
resulted in reputation bans on the slow peer side.
Furthermore, the strategy selected the same slow peer multiple times to
submit queries to, although a better candidate may exist.

Instead, in this PR we:
- introduce a `DisconnectedPeers` via LRU with 512 peer capacity to only
track the state of disconnected peers with a request in flight
- when the `DisconnectedPeers` detects a peer disconnected with a
request in flight, the peer is backed off
  - on the first disconnection: 60 seconds
  - on second disconnection: 120 seconds
- on the third disconnection the peer is banned, and the peer remains
banned until the peerstore decays its reputation
  
This PR lifts the pressure from overloaded nodes that cannot process
requests in due time.
And if a peer is detected to be slow after backoffs, the peer is banned.

Theoretically, submitting the same request multiple times can still
happen when:
- (a) we backoff and ban the peer 
- (b) the network does not discover other peers -- this may also be a
test net
- (c) the peer gets reconnected after the reputation decay and is still
slow to respond



Aims to improve:
- paritytech#4924
- paritytech#531

Next Steps:
- Investigate the network after this is deployed, possibly bumping the
keep-alive timeout or seeing if there's something else misbehaving




This PR builds on top of:
- paritytech#4987


### Testing Done
- Added a couple of unit tests where test harness were set in place

- Local testnet

```bash
13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD

13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false

13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true

13:16:49.108  WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting.
```

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A1-insubstantial Pull request requires no code review (e.g., a sub-repository hash update). D1-medium Can be fixed by a coder with good Rust knowledge but little knowledge of the codebase. I5-enhancement An additional feature request. T0-node This PR/Issue is related to the topic “node”.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants