net/libp2p: Enforce outbound request-response timeout limits#7222
net/libp2p: Enforce outbound request-response timeout limits#7222
Conversation
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
alexggh
left a comment
There was a problem hiding this comment.
Overall, looks good to me, one thing I would do is burn it in on the kusama validator over the weekend, just to see if this triggers in real work conditions, our assumption it doesn't but we might be surprised.
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
| use std::{iter, time::Duration}; | ||
|
|
||
| struct TokioExecutor(tokio::runtime::Runtime); | ||
| struct TokioExecutor; |
There was a problem hiding this comment.
I don't understand why we need this change.
There was a problem hiding this comment.
Just a tiny change to not pass the runtime when building the swarm and let tokio::spawn below do the job
failure Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
…nv/enforce-timeouts
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
bkchr
left a comment
There was a problem hiding this comment.
I would change the collect & remove code path to a simple retain.
| .iter() | ||
| .filter_map(|(id, req)| { |
There was a problem hiding this comment.
Why not retain? Then we don't need collect & remove.
There was a problem hiding this comment.
There's a tiny borrow-checker issue coming from the sender consuming self: retain will give us &mut req and the req.response oneshot::Sender consumes self when we send out the send(Err(RequestFailure::Network(OutboundFailure::Timeout))).
We could still avoid the extra alloc by adding an Option over the sender, will have a look if it complicates the code, thanks
| .collect(); | ||
|
|
||
| if !timedout_requests.is_empty() { | ||
| log::warn!( |
There was a problem hiding this comment.
So this could only be triggered by a bug in libp2p/litep2p?
There was a problem hiding this comment.
Yep this could be triggered only by a bug in libp2p (libp2p/rust-libp2p#5429). I expected that the latest libp2p version would solve this issue, however, I could reproduce it on our libp2p kusama validator (and locally):
2025-01-21 05:43:14.885 WARN tokio-runtime-worker sub-libp2p: Requests [(ProtocolRequestId { protocol: OnHeap("/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/sync/2"), request_id: OutboundRequestId(53207) }, Some(20.479579184s))] force detected as timeout.
2025-01-21 05:43:18.178 WARN tokio-runtime-worker sub-libp2p: Received `RequestResponseEvent::OutboundFailure` with unexpected request id OutboundRequestId(53207) error Timeout from PeerId("12D3KooWJZuPc7KPtJcbYurX7brn3booPwbK53ucJhhPmuc1zZPt")
In this case libp2p produces a timeout event ~4s after we have detected the request as timed out. I'll also have a look at our poll implementation, maybe there's something there
Co-authored-by: Bastian Köcher <git@kchr.de>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
timeouts Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
|
All GitHub workflows were cancelled due to failure one of the required jobs. |
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
|
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts. git fetch origin backport-7222-to-stable2407
git worktree add --checkout .worktree/backport-7222-to-stable2407 backport-7222-to-stable2407
cd .worktree/backport-7222-to-stable2407
git reset --hard HEAD^
git cherry-pick -x fd64a1e7768ba6e8676cbbf25c4e821a901c0a7f
git push --force-with-lease |
|
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts. git fetch origin backport-7222-to-stable2409
git worktree add --checkout .worktree/backport-7222-to-stable2409 backport-7222-to-stable2409
cd .worktree/backport-7222-to-stable2409
git reset --hard HEAD^
git cherry-pick -x fd64a1e7768ba6e8676cbbf25c4e821a901c0a7f
git push --force-with-lease |
|
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts. git fetch origin backport-7222-to-stable2412
git worktree add --checkout .worktree/backport-7222-to-stable2412 backport-7222-to-stable2412
cd .worktree/backport-7222-to-stable2412
git reset --hard HEAD^
git cherry-pick -x fd64a1e7768ba6e8676cbbf25c4e821a901c0a7f
git push --force-with-lease |
…imeout limits (#7302) Backport #7222 into `stable2412` from lexnv. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot. <!-- # To be used by other automation, do not modify: original-pr-number: #${pull_number} --> --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io> Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com> Co-authored-by: Alexandru Vasile <alexandru.vasile@parity.io>
…Enforce outbound request-response timeout limits (paritytech#7222)
This PR enforces that outbound requests are finished within the specified protocol timeout.
The stable2412 version running libp2p 0.52.4 contains a bug which does not track request timeouts properly:
The issue has been detected while submitting libp2p -> litep2p requests in kusama. This aims to check that pending outbound requests have not timedout. Although the issue has been fixed in libp2p, there might be other cases where this may happen. For example:
For more context see: #7076 (comment)
RequestFinished { error: .. }. That event is only used internally by substrate to increment metrics. However, we don't have the peer information available to propagate the event properly when we force-timeout the request. Considering this should most likely not happen in production (origin/master) and that we'll be able to extract information by warnings, I would say this is a good tradeoff for code simplicity:polkadot-sdk/substrate/client/network/src/service.rs
Line 1543 in 06e3b5c
Testing
Added a new test to ensure the timeout is reached properly, even if libp2p does not produce a response in due time.
I've also transitioned the tests to using
tokio::testdue to a limitation of CIcc @paritytech/networking