Increased QUIC connection timeout #29
The timeout negotiation mechanism uses the shorter of the server and client values, per the doc comment in the QUIC library: /// Compute the negotiated idle timeout based on local and remote max_idle_timeout transport parameters.
However, keep_alive_interval is not a negotiated value -- it is currently set by configuration on the client side. This means the ping interval can be greater than the negotiated idle timeout when talking to an un-upgraded server, and the connection may time out if there is no explicit activity during that window.
It is prudent to first increase the idle timeout, followed by another PR to increase the keep_alive_interval once the network has upgraded.
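The negotiation described above can be sketched as follows. This is a minimal illustration, not the actual quinn/agave code: `negotiated_idle_timeout` is a hypothetical helper, and the 60s/2s/1s values are assumptions standing in for the new timeout, the old timeout, and the client keep-alive.

```rust
use std::time::Duration;

/// Sketch of the negotiation rule: QUIC uses the smaller of the two
/// peers' max_idle_timeout transport parameters; `None` means a peer
/// disables idle timeout on its side. (Hypothetical helper for
/// illustration; the real logic lives in the QUIC library.)
fn negotiated_idle_timeout(
    local: Option<Duration>,
    remote: Option<Duration>,
) -> Option<Duration> {
    match (local, remote) {
        (Some(l), Some(r)) => Some(l.min(r)),
        (Some(l), None) => Some(l),
        (None, Some(r)) => Some(r),
        (None, None) => None,
    }
}

fn main() {
    // An un-upgraded server still advertises the old short idle
    // timeout (assumed 2s here), while the upgraded client asks for
    // 60s: the connection ends up with the shorter 2s value.
    let negotiated = negotiated_idle_timeout(
        Some(Duration::from_secs(60)),
        Some(Duration::from_secs(2)),
    );
    assert_eq!(negotiated, Some(Duration::from_secs(2)));

    // The client-side keep_alive_interval is NOT negotiated; it must
    // stay below the negotiated idle timeout or the connection can
    // die between pings.
    let keep_alive = Duration::from_secs(1);
    assert!(keep_alive < negotiated.unwrap());
}
```

This is why the keep-alive increase has to wait for a second PR: against an un-upgraded server, the negotiated timeout is still the old short value.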
cc @0x0ece
alexpyattaev
left a comment
Increasing this timeout globally could make resource exhaustion attacks easier. Sending a ping every second is not much traffic. I guess we can safely increase the timeout when talking to trusted peers such as high-staked validators.
Intuitively, I think we all thought so. But in practice the majority of TPU traffic is pings (please feel free to double-check if you have other metrics). Around leader slots I see 10x more pings than transactions, and outside leader slots of course there are just pings. The goal of the original PR was to reduce pings to a "reasonable" level (but it caused instability issues in tests).
There's already protection limiting the total number of connections. I agree that 60s makes it easier, but exhausting that limit seems pretty trivial to me even with a 2s timeout. And anyway, this only affects non-staked connections, because the counter for staked connections is independent.
We need to guard against resource exhaustion regardless of the timeout -- we limit the maximum number of connections open overall and per IP. An attacker could keep the server using resources simply by sending PINGs, as is already happening today. The goal of this change is to reduce the overall network overhead of the ping traffic used to maintain connections.
alexpyattaev
left a comment
Thank you, lgtm. For reference, connection amount limiting sits in https://github.com/anza-xyz/agave/blob/8e258a9325a3ab2c8b3db408ec11ef932349d800/streamer/src/nonblocking/quic.rs#L709
This resurrects the PR anza-xyz/agave#4585, which was reverted due to a CI failure.
The CI failure was root-caused to defunct connections being used for a long time because of the increased idle timeout value: anza-xyz/agave#4841.
For backward compatibility, QUIC_KEEP_ALIVE is kept at its old value; it will be increased once the network has upgraded to the higher IDLE_TIMEOUT.
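The backward-compatibility constraint can be sketched as an invariant check. The constants below are assumptions for illustration (the real values live in the agave streamer/client code): the keep-alive must stay below the *old* idle timeout as long as un-upgraded peers may negotiate it down.

```rust
use std::time::Duration;

// Hypothetical constants for illustration; actual values are defined
// in the agave codebase.
const OLD_IDLE_TIMEOUT: Duration = Duration::from_secs(2);
const NEW_IDLE_TIMEOUT: Duration = Duration::from_secs(60);
const QUIC_KEEP_ALIVE: Duration = Duration::from_secs(1);

fn main() {
    // Phase 1 (this PR): raise the idle timeout but keep the old
    // keep-alive. Because QUIC negotiates the *minimum* of the two
    // peers' idle timeouts, the keep-alive must still beat the old
    // (smaller) value whenever the peer is un-upgraded.
    let worst_case_negotiated = OLD_IDLE_TIMEOUT.min(NEW_IDLE_TIMEOUT);
    assert!(QUIC_KEEP_ALIVE < worst_case_negotiated);

    // Phase 2 (follow-up PR, once the network has upgraded): the
    // keep-alive can then be raised, needing only to stay below the
    // new, larger timeout.
    assert!(QUIC_KEEP_ALIVE < NEW_IDLE_TIMEOUT);
}
```

Raising the keep-alive in the same release would violate the phase-1 invariant against any peer still advertising the old timeout, which is exactly the ordering rationale stated above.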