
increased quic connection timeout #29

Merged
lijunwangs merged 2 commits into master from reduce_quic_connection_keepalive_overhead
Feb 15, 2025

Conversation

@lijunwangs
Contributor

@lijunwangs lijunwangs commented Feb 12, 2025

This resurrects PR anza-xyz/agave#4585, which was reverted due to a CI failure.
The CI failure was root-caused to defunct connections being reused for a long time before the increased idle timeout expired: anza-xyz/agave#4841.

For backward compatibility, QUIC_KEEP_ALIVE is kept at the old value; it will be increased once the network has been upgraded to the higher IDLE_TIMEOUT.
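(For illustration, a minimal sketch of the intended upgrade ordering; the concrete numbers here, roughly a 1s keep-alive and 2s vs 60s idle timeouts, are assumptions taken from the discussion below, not the literal constants in the diff.)

use std::time::Duration;

// Hypothetical constants; the values are assumptions from this conversation.
const QUIC_KEEP_ALIVE: Duration = Duration::from_secs(1); // unchanged by this PR
const IDLE_TIMEOUT: Duration = Duration::from_secs(60); // increased by this PR
const OLD_IDLE_TIMEOUT: Duration = Duration::from_secs(2); // still advertised by un-upgraded peers

fn main() {
    // Keep-alives only help if they fire before the negotiated idle timeout,
    // and against an un-upgraded peer that timeout is still the old, short one.
    // Hence the keep-alive interval can only be raised in a follow-up PR once
    // the whole network advertises the higher IDLE_TIMEOUT.
    assert!(QUIC_KEEP_ALIVE < IDLE_TIMEOUT.min(OLD_IDLE_TIMEOUT));
}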

@lijunwangs lijunwangs force-pushed the reduce_quic_connection_keepalive_overhead branch from 6017593 to d84678e on February 12, 2025 at 22:57
@lijunwangs
Contributor Author

lijunwangs commented Feb 12, 2025

The timeout negotiation mechanism uses the shorter of the server's and client's values:

/// Compute the negotiated idle timeout based on local and remote max_idle_timeout transport parameters.
///
/// According to the definition of max_idle_timeout, a value of 0 means the timeout is disabled; see https://www.rfc-editor.org/rfc/rfc9000#section-18.2-4.4.1.
///
/// According to the negotiation procedure, either the minimum of the timeouts or one specified is used as the negotiated value; see https://www.rfc-editor.org/rfc/rfc9000#section-10.1-2.
///
/// Returns the negotiated idle timeout as a Duration, or None when both endpoints have opted out of idle timeout.
fn negotiate_max_idle_timeout(x: Option<VarInt>, y: Option<VarInt>) -> Option<Duration> {
    match (x, y) {
        (Some(VarInt(0)) | None, Some(VarInt(0)) | None) => None,
        (Some(VarInt(0)) | None, Some(y)) => Some(Duration::from_millis(y.0)),
        (Some(x), Some(VarInt(0)) | None) => Some(Duration::from_millis(x.0)),
        (Some(x), Some(y)) => Some(Duration::from_millis(cmp::min(x, y).0)),
    }
}
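A self-contained way to see the consequence for a mixed-version network: the sketch below mirrors the min-based rule above using plain millisecond values instead of quinn's VarInt, and the 60s/2s figures are assumptions taken from this conversation.

use std::{cmp, time::Duration};

// Standalone mirror of the negotiation rule above; Some(0) or None both mean
// "idle timeout disabled" on that side.
fn negotiated_idle_timeout(local_ms: Option<u64>, remote_ms: Option<u64>) -> Option<Duration> {
    match (local_ms, remote_ms) {
        (Some(0) | None, Some(0) | None) => None,
        (Some(0) | None, Some(r)) => Some(Duration::from_millis(r)),
        (Some(l), Some(0) | None) => Some(Duration::from_millis(l)),
        (Some(l), Some(r)) => Some(Duration::from_millis(cmp::min(l, r))),
    }
}

fn main() {
    // Upgraded client (60s) against an un-upgraded server (2s): the shorter
    // side wins, so the connection still idles out after 2 seconds unless
    // keep-alives arrive faster than that.
    assert_eq!(
        negotiated_idle_timeout(Some(60_000), Some(2_000)),
        Some(Duration::from_secs(2))
    );
}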

@lijunwangs
Contributor Author

lijunwangs commented Feb 13, 2025

However, the keep_alive_interval is not a negotiated value; it is currently set by configuration on the client side. This means the keep-alive interval can be greater than the negotiated idle timeout when talking to an un-upgraded server, and the connection may time out if there is no explicit activity during that time.

/// Period of inactivity before sending a keep-alive packet
///
/// Keep-alive packets prevent an inactive but otherwise healthy connection from timing out.
///
/// `None` to disable, which is the default. Only one side of any given connection needs keep-alive
/// enabled for the connection to be preserved. Must be set lower than the idle_timeout of both
/// peers to be effective.
pub fn keep_alive_interval(&mut self, value: Option<Duration>) -> &mut Self {
    self.keep_alive_interval = value;
    self
}
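For illustration, roughly how a client could set both knobs through quinn's public TransportConfig API; the 60s/1s values are assumptions from this thread, not the exact code in this change.

use std::{sync::Arc, time::Duration};
use quinn::{IdleTimeout, TransportConfig};

// Sketch only: the advertised 60s idle timeout and 1s keep-alive are
// assumptions taken from the discussion, not the constants in this PR.
fn client_transport_config() -> Arc<TransportConfig> {
    let mut transport = TransportConfig::default();
    // Advertised max_idle_timeout; the effective value is still the minimum
    // of what both endpoints advertise.
    transport.max_idle_timeout(Some(
        IdleTimeout::try_from(Duration::from_secs(60)).expect("within VarInt bounds"),
    ));
    // Not negotiated: a purely local client setting, so it must stay below the
    // idle timeout of the oldest peer until the whole network is upgraded.
    transport.keep_alive_interval(Some(Duration::from_secs(1)));
    Arc::new(transport)
}

fn main() {
    let _transport = client_transport_config();
}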

@lijunwangs
Contributor Author

It is prudent to first increase the idle timeout, followed by another PR to increase the keep_alive_interval, for upgrade purposes.

@lijunwangs
Contributor Author

cc. @0x0ece

@lijunwangs lijunwangs changed the title from "increased quic connection timeout and keep_alive interval to reduce ping traffic overhead." to "increased quic connection timeout" on Feb 13, 2025
Contributor

@alexpyattaev alexpyattaev left a comment


Increasing this timeout globally could make resource exhaustion attacks easier. Sending a ping every second is not so much traffic. I guess we can safely increase the timeout when talking to trusted peers such as high-staked validators.

@0x0ece
Copy link
Copy Markdown
Contributor

0x0ece commented Feb 14, 2025

Sending a ping every second is not so much traffic.

Intuitively I think we all thought so. But in practice the majority of TPU traffic is pings (please feel free to double-check if you have other metrics). Around leader slots, I see 10x more pings than transactions, and outside leader slots of course there are just pings. The goal of the original PR was to reduce pings to a "reasonable" level (but it caused instability issues with tests).

Increasing this timeout globally could make resource exhaustion attacks easier.

There's already protection limiting the total number of connections. I agree that 60s makes it easier, but exhausting that number seems pretty trivial to me even with a 2s timeout. And anyway, this only affects non-staked connections, because the counter for staked connections is independent.

@lijunwangs
Contributor Author

Increasing this timeout globally could make resource exhaustion attacks easier. Sending a ping every second is not so much traffic. I guess we can safely increase the timeout when talking to trusted peers such as high-staked validators.

We need to guard against resource exhaustion regardless of the timeout -- we are limiting the maximum number of connections open across all IPs and per unique IP. An attacker could keep the server's resources in use simply by sending PINGs, as is already happening today. The goal of this change is to reduce the overall network overhead of the ping traffic needed to maintain connections.

