Skip to content

quic: ameliorating stream limit bug#18614

Merged
alyssawilk merged 7 commits intoenvoyproxy:mainfrom
alyssawilk:max_streams
Oct 20, 2021
Merged

quic: ameliorating stream limit bug#18614
alyssawilk merged 7 commits intoenvoyproxy:mainfrom
alyssawilk:max_streams

Conversation

@alyssawilk
Copy link
Contributor

This is a 99% fix to the too many streams crash: we weren't setting the 100 stream limit in quiche, and the default of 0 for Envoy meant "allow thousands of streams" which resulted in stream creation failure and null pointer deref.
Unfortunately when aligning the stream limits, we still fail a check in quiche where the stream count in Envoy and the stream count in QUICHE don't match up. To overcome this I'm overriding ShouldCreateOutgoingBidirectionalStream which turns a fatal crash into a QUIC_BUG which seems preferable. Added a TODO into sorting out the underlying issue entirely.

Risk Level: low
Testing: new integration test
Part of #18160

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
quic_config.set_max_time_before_crypto_handshake(crypto_timeout);
int32_t max_streams =
cluster.http3Options().quic_protocol_options().max_concurrent_streams().value();
int32_t max_streams = getMaxStreams(cluster);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find!


ActiveClient::ActiveClient(Envoy::Http::HttpConnPoolImplBase& parent,
Upstream::Host::CreateConnectionData& data)
: MultiplexedActiveClientBase(parent, getMaxStreams(parent.host()->cluster()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do MultiplexedActiveClientBase() and quic_config regard the value of getMaxStreams() differently which caused the stream accounting to be inconsistent? How does ActiveClient handle newStream exceeding limit? Does it close the request encoder immediately? If so can we just +1 of that value to pass into quic config so that we don't violate QUICHE assertion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I covered this in the debate on the main thread so I think it's resolved.

@alyssawilk
Copy link
Contributor Author

yeah I tried

  • quic_config.SetMaxBidirectionalStreamsToSend(max_streams + 1);
  • quic_config.SetMaxUnidirectionalStreamsToSend(max_streams + 1);

and running with many entries still got

QUICHE_BUG failure: outgoing_stream_count_ >= outgoing_max_streams_. Details: Attempt to allocate a new outgoing stream that would exceed the limit (100)

If I had to guess I'd assume there's issues say when we've sent the data but not had it acked, and Envoy thinks the stream is done but QUIC does not, or some such. We do need to sort out what's going on (and I could probably use your help with that) but this at least stops it from crashing in production and switches to logging the failure and crashing only in debug mode.

@RyanTheOptimist
Copy link
Contributor

yeah I tried

  • quic_config.SetMaxBidirectionalStreamsToSend(max_streams + 1);
  • quic_config.SetMaxUnidirectionalStreamsToSend(max_streams + 1);

and running with many entries still got

QUICHE_BUG failure: outgoing_stream_count_ >= outgoing_max_streams_. Details: Attempt to allocate a new outgoing stream that would exceed the limit (100)

If I had to guess I'd assume there's issues say when we've sent the data but not had it acked, and Envoy thinks the stream is done but QUIC does not, or some such. We do need to sort out what's going on (and I could probably use your help with that) but this at least stops it from crashing in production and switches to logging the failure and crashing only in debug mode.

I think it's going to be tricky for Envoy and QUICHE to have the same stream counts. Here's the stream count in QUIC:

size_t QuicSession::GetNumActiveStreams() const {
  QUICHE_DCHECK_GE(
      static_cast<QuicStreamCount>(stream_map_.size()),
      num_static_streams_ + num_draining_streams_ + num_zombie_streams_);
  return stream_map_.size() - num_draining_streams_ - num_static_streams_ -
         num_zombie_streams_;
}

I think the fix in this PR may well result in the connection being closed by the peer because too many streams get opened in violation of the spec :( (admittedly that might be better than a crash, so short term this may well be a fine idea but I think it does point to a potential issue)

@alyssawilk
Copy link
Contributor Author

yeah I don't have it marked as fixing the crash because of this - still a real and serious bug.
I think we're going to have to get envoy and quic to agree on number of streams but want to land this first.

I think the key is that in the conn pool, we go from BUSY to READY when Envoy thinks the stream limit has gone down and READY to BUSY when it goes over the limit. I think instead it needs to be a joint decision, we move to READY when both Envoy and QUICE think there's capacity and we move to BUSY when either thinks there's not. The tricky thing for me is I'm not sure where in the QUIC/QUICHE library to stick that - figured we could chat at the next ucn meeting to sort out or have you or Dan instrument that half and you or I tie it into the conn pools.

@RyanTheOptimist
Copy link
Contributor

I think the key is that in the conn pool, we go from BUSY to READY when Envoy thinks the stream limit has gone down and READY to BUSY when it goes over the limit. I think instead it needs to be a joint decision, we move to READY when both Envoy and QUICE think there's capacity and we move to BUSY when either thinks there's not. The tricky thing for me is I'm not sure where in the QUIC/QUICHE library to stick that - figured we could chat at the next ucn meeting to sort out or have you or Dan instrument that half and you or I tie it into the conn pools.

Oh interesting. I see. Ok, then I think there's a QuicSession method which might help here (Chrome uses this as it queues requests internally until there is a stream available).

  // The default implementation does nothing. Subclasses should override if
  // for example they queue up stream requests.
  virtual void OnCanCreateNewOutgoingStream(bool /*unidirectional*/) {}

Perhaps the envoy session subclass can override this to get the signal that is needed to go from BUSY to READY? I'm not quite familiar with how that plumbs together, but I suspect this method is the missing piece.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk
Copy link
Contributor Author

yeah I think that'll be the next step (thanks for the pointer!) but it's still not great because it'll mean sessions get queued if Envoy stream counts and QUIC stream counts are out of sync. may have to mull further

quic::QuicSpdyStream* CreateIncomingStream(quic::PendingStream* pending) override;
std::unique_ptr<quic::QuicCryptoClientStreamBase> CreateQuicCryptoStream() override;

bool ShouldCreateOutgoingBidirectionalStream() override {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add TODO(18160) to honor stream limit and state that current overriding may result in peer close connection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would make sense as a TODO but I missed this and have the follow-up PR ready so I'm going to claim a fix counts as better than a TODO :-)

danzh2010
danzh2010 previously approved these changes Oct 14, 2021
Copy link
Contributor

@RyanTheOptimist RyanTheOptimist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple comments but looks good otherwise.

FREEBIND
FUZZER
FUZZERS
dereferencing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol

@RyanTheOptimist
Copy link
Contributor

yeah I think that'll be the next step (thanks for the pointer!) but it's still not great because it'll mean sessions get queued if Envoy stream counts and QUIC stream counts are out of sync. may have to mull further

Meaning we'll queue a request instead of opening a new connection? Hm. I wonder if we can "trick" the code that checks to see if the session is full to only switch from "full" to "not full" after OnCanCreateNewOutgoingStream() is received? Is that code in the pool itself, or in the ClusterManager?

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk alyssawilk dismissed stale reviews from RyanTheOptimist and danzh2010 via bef23af October 18, 2021 15:55
@RyanTheOptimist
Copy link
Contributor

Looks like this needs a main merge to fix the conflict.

@RyanTheOptimist
Copy link
Contributor

/wait

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Copy link
Contributor

@RyanTheOptimist RyanTheOptimist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I do have one question about adding an ASSERT() but I'm fine either way. (Sounds like @danzh2010 might also have a question or two)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants