
increased quic connection timeout #29

Merged
lijunwangs merged 2 commits into master from reduce_quic_connection_keepalive_overhead
Feb 15, 2025

Conversation

@lijunwangs
Contributor

@lijunwangs lijunwangs commented Feb 12, 2025

This resurrects PR anza-xyz/agave#4585, which was reverted due to a CI failure.
The CI failure was root-caused to defunct connections being reused for a long time before the increased idle timeout expired: anza-xyz/agave#4841.

For backward compatibility, QUIC_KEEP_ALIVE is kept at the old value; it will be increased once the network has been upgraded to the higher IDLE_TIMEOUT.
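(For illustration, a minimal sketch of the intended upgrade ordering; the concrete numbers here, roughly a 1s keep-alive and 2s vs 60s idle timeouts, are assumptions taken from the discussion below, not the literal constants in the diff.)

use std::time::Duration;

// Hypothetical constants; the values are assumptions from this conversation.
const QUIC_KEEP_ALIVE: Duration = Duration::from_secs(1); // unchanged by this PR
const IDLE_TIMEOUT: Duration = Duration::from_secs(60); // increased by this PR
const OLD_IDLE_TIMEOUT: Duration = Duration::from_secs(2); // still advertised by un-upgraded peers

fn main() {
    // Keep-alives only help if they fire before the negotiated idle timeout,
    // and against an un-upgraded peer that timeout is still the old, short one.
    // Hence the keep-alive interval can only be raised in a follow-up PR once
    // the whole network advertises the higher IDLE_TIMEOUT.
    assert!(QUIC_KEEP_ALIVE < IDLE_TIMEOUT.min(OLD_IDLE_TIMEOUT));
}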

@lijunwangs lijunwangs force-pushed the reduce_quic_connection_keepalive_overhead branch from 6017593 to d84678e on February 12, 2025 at 22:57
@lijunwangs
Contributor Author

lijunwangs commented Feb 12, 2025

The timeout negotiation mechanism uses the shorter of the server's and client's values:

/// Compute the negotiated idle timeout based on local and remote max_idle_timeout transport parameters.
///
/// According to the definition of max_idle_timeout, a value of 0 means the timeout is disabled; see https://www.rfc-editor.org/rfc/rfc9000#section-18.2-4.4.1.
///
/// According to the negotiation procedure, either the minimum of the timeouts or one specified is used as the negotiated value; see https://www.rfc-editor.org/rfc/rfc9000#section-10.1-2.
///
/// Returns the negotiated idle timeout as a Duration, or None when both endpoints have opted out of idle timeout.
fn negotiate_max_idle_timeout(x: Option<VarInt>, y: Option<VarInt>) -> Option<Duration> {
    match (x, y) {
        (Some(VarInt(0)) | None, Some(VarInt(0)) | None) => None,
        (Some(VarInt(0)) | None, Some(y)) => Some(Duration::from_millis(y.0)),
        (Some(x), Some(VarInt(0)) | None) => Some(Duration::from_millis(x.0)),
        (Some(x), Some(y)) => Some(Duration::from_millis(cmp::min(x, y).0)),
    }
}
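A self-contained way to see the consequence for a mixed-version network: the sketch below mirrors the min-based rule above using plain millisecond values instead of quinn's VarInt, and the 60s/2s figures are assumptions taken from this conversation.

use std::{cmp, time::Duration};

// Standalone mirror of the negotiation rule above; Some(0) or None both mean
// "idle timeout disabled" on that side.
fn negotiated_idle_timeout(local_ms: Option<u64>, remote_ms: Option<u64>) -> Option<Duration> {
    match (local_ms, remote_ms) {
        (Some(0) | None, Some(0) | None) => None,
        (Some(0) | None, Some(r)) => Some(Duration::from_millis(r)),
        (Some(l), Some(0) | None) => Some(Duration::from_millis(l)),
        (Some(l), Some(r)) => Some(Duration::from_millis(cmp::min(l, r))),
    }
}

fn main() {
    // Upgraded client (60s) against an un-upgraded server (2s): the shorter
    // side wins, so the connection still idles out after 2 seconds unless
    // keep-alives arrive faster than that.
    assert_eq!(
        negotiated_idle_timeout(Some(60_000), Some(2_000)),
        Some(Duration::from_secs(2))
    );
}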

@lijunwangs
Contributor Author

lijunwangs commented Feb 13, 2025

However, the keep_alive_interval is not a negotiated value; it is currently set by configuration on the client side. This means the keep-alive interval can be greater than the negotiated idle timeout when talking to an un-upgraded server, and the connection may time out if there is no explicit activity during that time.

/// Period of inactivity before sending a keep-alive packet
///
/// Keep-alive packets prevent an inactive but otherwise healthy connection from timing out.
///
/// `None` to disable, which is the default. Only one side of any given connection needs keep-alive
/// enabled for the connection to be preserved. Must be set lower than the idle_timeout of both
/// peers to be effective.
pub fn keep_alive_interval(&mut self, value: Option<Duration>) -> &mut Self {
    self.keep_alive_interval = value;
    self
}
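For illustration, roughly how a client could set both knobs through quinn's public TransportConfig API; the 60s/1s values are assumptions from this thread, not the exact code in this change.

use std::{sync::Arc, time::Duration};
use quinn::{IdleTimeout, TransportConfig};

// Sketch only: the advertised 60s idle timeout and 1s keep-alive are
// assumptions taken from the discussion, not the constants in this PR.
fn client_transport_config() -> Arc<TransportConfig> {
    let mut transport = TransportConfig::default();
    // Advertised max_idle_timeout; the effective value is still the minimum
    // of what both endpoints advertise.
    transport.max_idle_timeout(Some(
        IdleTimeout::try_from(Duration::from_secs(60)).expect("within VarInt bounds"),
    ));
    // Not negotiated: a purely local client setting, so it must stay below the
    // idle timeout of the oldest peer until the whole network is upgraded.
    transport.keep_alive_interval(Some(Duration::from_secs(1)));
    Arc::new(transport)
}

fn main() {
    let _transport = client_transport_config();
}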

@lijunwangs
Contributor Author

It is prudent to first increase the idle timeout, followed by another PR to increase the keep_alive_interval, for upgrade purposes.

@lijunwangs
Contributor Author

cc. @0x0ece

@lijunwangs lijunwangs changed the title from "increased quic connection timeout and keep_alive interval to reduce ping traffic overhead." to "increased quic connection timeout" on Feb 13, 2025
Contributor

@alexpyattaev alexpyattaev left a comment


Increasing this timeout globally could make resource exhaustion attacks easier. Sending a ping every second is not so much traffic. I guess we can safely increase the timeout when talking to trusted peers such as high-staked validators.

@0x0ece
Copy link
Copy Markdown
Contributor

0x0ece commented Feb 14, 2025

Sending a ping every second is not so much traffic.

Intuitively I think we all thought so. But in practice the majority of TPU traffic is pings (please feel free to double-check if you have other metrics). Around leader slots, I see 10x more pings than transactions, and outside leader slots of course there are just pings. The goal of the original PR was to reduce pings to a "reasonable" level (but it caused instability issues with tests).

Increasing this timeout globally could make resource exhaustion attacks easier.

There's already protection limiting the total number of connections. I agree that 60s makes it easier, but exhausting that number seems pretty trivial to me even with a 2s timeout. And anyway, this only affects non-staked connections, because the counter for staked connections is independent.

@lijunwangs
Contributor Author

Increasing this timeout globally could make resource exhaustion attacks easier. Sending a ping every second is not so much traffic. I guess we can safely increase the timeout when talking to trusted peers such as high-staked validators.

We need to guard against resource exhaustion regardless of the timeout -- we are limiting the maximum number of connections open across all IPs and per unique IP. An attacker could keep the server's resources in use simply by sending PINGs, as is already happening today. The goal of this change is to reduce the overall network overhead of the ping traffic needed to maintain connections.

