quic: ameliorating stream limit bug #18614
Conversation
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
    quic_config.set_max_time_before_crypto_handshake(crypto_timeout);
-   int32_t max_streams =
-       cluster.http3Options().quic_protocol_options().max_concurrent_streams().value();
+   int32_t max_streams = getMaxStreams(cluster);
    ActiveClient::ActiveClient(Envoy::Http::HttpConnPoolImplBase& parent,
                               Upstream::Host::CreateConnectionData& data)
        : MultiplexedActiveClientBase(parent, getMaxStreams(parent.host()->cluster()),
Do MultiplexedActiveClientBase() and quic_config regard the value of getMaxStreams() differently, causing the stream accounting to be inconsistent? How does ActiveClient handle newStream() exceeding the limit? Does it close the request encoder immediately? If so, can we just pass that value +1 into the quic config so that we don't violate the QUICHE assertion?
I think I covered this in the debate on the main thread so I think it's resolved.
yeah I tried, and running with many entries still got the QUICHE_BUG failure: `outgoing_stream_count_ >= outgoing_max_streams_. Details: Attempt to allocate a new outgoing stream that would exceed the limit (100)`. If I had to guess, I'd assume there are issues when we've sent the data but not had it acked, so Envoy thinks the stream is done but QUIC does not, or some such. We do need to sort out what's going on (and I could probably use your help with that), but this at least stops it from crashing in production and switches to logging the failure, crashing only in debug mode.
I think it's going to be tricky for Envoy and QUICHE to keep the same stream counts. Here's the stream count in QUIC: I think the fix in this PR may well result in the connection being closed by the peer because too many streams get opened in violation of the spec :( (Admittedly that might be better than a crash, so short term this may well be a fine idea, but I think it does point to a potential issue.)
yeah, I don't have it marked as fixing the crash because of this - it's still a real and serious bug. I think the key is that in the conn pool, we go from BUSY to READY when Envoy thinks the stream count has dropped below the limit, and READY to BUSY when it goes over. I think instead it needs to be a joint decision: we move to READY when both Envoy and QUICHE think there's capacity, and we move to BUSY when either thinks there's not. The tricky thing for me is I'm not sure where in the QUIC/QUICHE library to hook that in - figured we could chat at the next UCN meeting to sort it out, or have you or Dan instrument that half and you or I tie it into the conn pools.
Oh interesting, I see. OK, then I think there's a QuicSession method which might help here (Chrome uses this, as it queues requests internally until a stream is available). Perhaps the Envoy session subclass can override this to get the signal needed to go from BUSY to READY? I'm not quite familiar with how that plumbs together, but I suspect this method is the missing piece.
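For illustration, here is a minimal, self-contained sketch of the joint-capacity idea being discussed: the pool reports READY only when both its own accounting and the session's stream accounting agree there is room. All class and method names here are hypothetical stand-ins, not the real Envoy conn-pool or QUICHE APIs.

```cpp
#include <cassert>
#include <cstdint>

enum class PoolState { Busy, Ready };

// Stand-in for the QUIC session's own stream accounting.
class MockQuicSession {
public:
  explicit MockQuicSession(uint32_t max_streams) : max_streams_(max_streams) {}
  bool canOpenStream() const { return open_streams_ < max_streams_; }
  void openStream() { ++open_streams_; }
  void closeStream() { --open_streams_; }

private:
  uint32_t max_streams_;
  uint32_t open_streams_{0};
};

// Stand-in for the conn-pool client: READY is a *joint* decision, so the
// pool cannot hand out a stream that QUIC would refuse even when Envoy's
// own count is under the limit.
class MockConnPoolClient {
public:
  MockConnPoolClient(MockQuicSession& session, uint32_t envoy_limit)
      : session_(session), envoy_limit_(envoy_limit) {}

  PoolState state() const {
    return (envoy_active_ < envoy_limit_ && session_.canOpenStream())
               ? PoolState::Ready
               : PoolState::Busy;
  }

  void newStream() {
    ++envoy_active_;
    session_.openStream();
  }

private:
  MockQuicSession& session_;
  uint32_t envoy_limit_;
  uint32_t envoy_active_{0};
};
```

With mismatched limits (session allows 1 stream, Envoy thinks 2 are fine), the joint decision keeps the client BUSY after the first stream, which is exactly the out-of-sync case the thread describes.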
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
yeah, I think that'll be the next step (thanks for the pointer!), but it's still not great, because it'll mean sessions get queued if Envoy stream counts and QUIC stream counts are out of sync. May have to mull further.
    quic::QuicSpdyStream* CreateIncomingStream(quic::PendingStream* pending) override;
    std::unique_ptr<quic::QuicCryptoClientStreamBase> CreateQuicCryptoStream() override;

+   bool ShouldCreateOutgoingBidirectionalStream() override {
Please add a TODO(18160) to honor the stream limit, and state that the current override may result in the peer closing the connection.
It would make sense as a TODO, but I missed this and have the follow-up PR ready, so I'm going to claim a fix counts as better than a TODO :-)
RyanTheOptimist
left a comment
Just a couple comments but looks good otherwise.
    FREEBIND
    FUZZER
    FUZZERS
    dereferencing
Meaning we'll queue a request instead of opening a new connection? Hm. I wonder if we can "trick" the code that checks whether the session is full into only switching from "full" to "not full" after OnCanCreateNewOutgoingStream() is received? Is that code in the pool itself, or in the ClusterManager?
Looks like this needs a main merge to fix the conflict.

/wait
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
RyanTheOptimist
left a comment
LGTM. I do have one question about adding an ASSERT() but I'm fine either way. (Sounds like @danzh2010 might also have a question or two)
This is a 99% fix for the too-many-streams crash: we weren't setting the 100-stream limit in QUICHE, and the default of 0 for Envoy meant "allow thousands of streams", which resulted in stream creation failure and a null pointer dereference.
Unfortunately, even with the stream limits aligned, we still fail a check in QUICHE because the stream count in Envoy and the stream count in QUICHE don't match up. To work around this I'm overriding ShouldCreateOutgoingBidirectionalStream(), which turns a fatal crash into a QUIC_BUG, which seems preferable. Added a TODO to sort out the underlying issue entirely.
Risk Level: low
Testing: new integration test
Part of #18160
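As a rough illustration of the workaround described above (not the real QUICHE classes - only the overridden method name, ShouldCreateOutgoingBidirectionalStream(), comes from the PR; everything else is a mocked stand-in): the strict base behavior refuses stream creation at the limit, which upstream code turns into a fatal failure, while the override degrades that to a logged bug so production keeps running.

```cpp
#include <cassert>
#include <iostream>

// Mock base session: refuses stream creation once the limit is hit.
// In the real code path, the refusal leads to a null stream pointer
// that callers then dereference, i.e. a crash.
class SessionBase {
public:
  virtual ~SessionBase() = default;

  virtual bool ShouldCreateOutgoingBidirectionalStream() {
    return open_streams_ < max_streams_;
  }

  bool CreateOutgoingStream() {
    if (!ShouldCreateOutgoingBidirectionalStream()) {
      return false; // stand-in for "caller crashes on null stream"
    }
    ++open_streams_;
    return true;
  }

protected:
  int max_streams_{1};
  int open_streams_{0};
};

// Mock of the PR's override: always allow creation, but log loudly
// when over the limit (stand-in for QUIC_BUG, which additionally
// asserts in debug builds).
class LenientSession : public SessionBase {
public:
  bool ShouldCreateOutgoingBidirectionalStream() override {
    if (open_streams_ >= max_streams_) {
      std::cerr << "QUIC_BUG: exceeding outgoing stream limit\n";
    }
    return true;
  }
};
```

The trade-off matches the review discussion: the lenient session avoids the crash, but an over-limit stream may still be opened in violation of the peer's advertised limit.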