grid: Add a new class for tracking HTTP/3 status#16067
grid: Add a new class for tracking HTTP/3 status#16067alyssawilk merged 16 commits intoenvoyproxy:mainfrom
Conversation
Create a new BrokenHttp3Tracker class which can mark HTTP/3 as broken for a period of time, subject to exponential backoff. Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
|
/assign @RenjieTang |
RenjieTang
left a comment
There was a problem hiding this comment.
Nice! some minor comments.
Signed-off-by: Ryan Hamilton <rch@google.com>
RyanTheOptimist
left a comment
There was a problem hiding this comment.
Thanks Renjie!
RenjieTang
left a comment
There was a problem hiding this comment.
LGTM with mod one nit.
antoniovicente
left a comment
There was a problem hiding this comment.
Ryan: Thanks for introducing retry behavior. CI seems to be showing some problems with the test. Also, it seems that some of the methods of the newly introduced object are not used by the connection pool grid.
| bool Http3StatusTracker::isHttp3Confirmed() const { return state_ == State::Confirmed; } | ||
|
|
||
| void Http3StatusTracker::markHttp3Broken() { | ||
| state_ = State::Broken; |
There was a problem hiding this comment.
What are valid values for state_ when entering this method?
There was a problem hiding this comment.
Good question! Mentally, I'm modeling this off of the similar code in Chrome. That code runs up at the request/response layer (the HttpNetworkTransaction) which is above the connection establishment layer. As such and given that requests happen in parallel, it's possible for basically any sequence of markBroken/markConfirmed calls to arrive in any order. I suspect that we'll eventually want something similar. But since we're not doing anything like that now, there's no need to permit such state transitions. So I've added ASSERT() calls to make it clear what the valid states are. Thanks for pointing this out.
(In any case, this should be reachable from any state other than broken)
There was a problem hiding this comment.
It may be possible to trigger this ASSERT, if there are 2 concurrent attempts to connect to the same endpoint. That said, it's possible that there are protections elsewhere to prevent this from happening or it is relatively unlikely to happen without a burst of requests for that service; we would need the number of requests to exceed the upstream's multiplexing factor and trigger creation of a second connection in order to meet demand.
There was a problem hiding this comment.
Ah! Excellent point! That's very true. Ok, in that case we're back to the Chrome situation where the parallelism means that we can really get any sequence of events in any order. I've removed the ASSERT() calls.
There was a problem hiding this comment.
Just for my own curiosity why would we have multiple concurrent connection attempts to the same endpoint on a given worker thread? Is this related to prefetching or part of the grid logic?
There was a problem hiding this comment.
I'm relatively unclear at this on exactly how this all plays together. From what Antonio said, it sound like if there were a sufficient number of simultaneous requests we might trigger the creation of a second attempt. The other case that I think I heard from alyssa is that it's possible to have multiple calls to ConnectivityGrid::newStream() happen before the first call finishes. This won't result in multiple TCP/QUIC connection attempts because the underlying connection pool will do the right thing. But I think this is transparent to the ConnectivtyGrid; each call to newStream creates a new WrapperCallbacks and (up to) 2 ConnectionAttempts which should mean it's possible to get unexpected state transitions. Happy to do something different if I'm not understanding.
There was a problem hiding this comment.
Yeah this sounds plausible to me: newStream is non-blocking (like most things in Envoy), so if multiple streams are established before the connection is established so you'd see multiple newStreams come in before the connection is established.
No change necessary, I was just curious how this all fit together :)
There was a problem hiding this comment.
Optional (I mean in this PR, not optional overall) I wonder if we should start landing docs on this as we start implementing the "real" logic.
Generally we land docs when we unhide config (#15926) but the failover logic is sufficiently complicated I think we could land docs for now in source/docs and then move them to docs/ when the PR lands. Your call if we do them now or in a future iteration :-)
There was a problem hiding this comment.
Docs definitely make sense and I'm happy to work on them. I think I'll do that in a follow-up since this PR is (hopefully) basically done at this point.
Signed-off-by: Ryan Hamilton <rch@google.com>
RyanTheOptimist
left a comment
There was a problem hiding this comment.
Thanks for the thoughtful review!
| bool Http3StatusTracker::isHttp3Confirmed() const { return state_ == State::Confirmed; } | ||
|
|
||
| void Http3StatusTracker::markHttp3Broken() { | ||
| state_ = State::Broken; |
There was a problem hiding this comment.
Good question! Mentally, I'm modeling this off of the similar code in Chrome. That code runs up at the request/response layer (the HttpNetworkTransaction) which is above the connection establishment layer. As such and given that requests happen in parallel, it's possible for basically any sequence of markBroken/markConfirmed calls to arrive in any order. I suspect that we'll eventually want something similar. But since we're not doing anything like that now, there's no need to permit such state transitions. So I've added ASSERT() calls to make it clear what the valid states are. Thanks for pointing this out.
(In any case, this should be reachable from any state other than broken)
Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
snowp
left a comment
There was a problem hiding this comment.
Thanks this looks pretty good, just a few minor comments
| bool Http3StatusTracker::isHttp3Confirmed() const { return state_ == State::Confirmed; } | ||
|
|
||
| void Http3StatusTracker::markHttp3Broken() { | ||
| state_ = State::Broken; |
There was a problem hiding this comment.
Just for my own curiosity why would we have multiple concurrent connection attempts to the same endpoint on a given worker thread? Is this related to prefetching or part of the grid logic?
| class Http3StatusTrackerTest : public testing::Test { | ||
| public: | ||
| Http3StatusTrackerTest() | ||
| : timer_(new StrictMock<MockTimer>(&dispatcher_)), tracker_(dispatcher_) {} |
There was a problem hiding this comment.
I think we usually omit the StrictMock part since by default all the mocks are strict (we're not super consistent here so not a big deal)
There was a problem hiding this comment.
Wow, interesting! What's the magic that makes all the Mocks strict? Some #define something somewhere? Very cool. In any case, done.
There was a problem hiding this comment.
There was a problem hiding this comment.
Oooo! Thanks!
| } | ||
|
|
||
| TEST_F(Http3StatusTrackerTest, MarkBrokenWithBackoff) { | ||
| // markBroken will only be called when the time is not enabled. |
There was a problem hiding this comment.
From reading the test it seems like we're the ones calling markHttp3Broken, what does this comment refer to?
There was a problem hiding this comment.
Ah. I can remove this comment if it doesn't make sense. I wrote the comment because the MockTimer API confused me. I expected that expect() would return true if enableTimer() had been called but disableTimer() or invokeCallback() had not been. In other words, I didn't expect to need to mock out this method. So the comment was saying that when markBroken() is called in this tests, the timer will not have been enabled. Does that make sense? Would you recommend I rephrase or remove the comment?
There was a problem hiding this comment.
Oh I see, I would try to include something that indicates that we're talking about invariants for this is used by prod code. My initial read was that this was trying to explain what we should expect to see in this test, so I expected to see some EXPECT_CALL(.., markBroken()) kind of expectations
There was a problem hiding this comment.
Makes sense. That being said, I took several stabs at trying to write something up and each time ended up with a bit of an essay that didn't really seem to add any readability to the test, so I've just nuked the comment. (Really the problem was my lack of understanding of how MockTimer worked and so this is probably not the best place to address that :>)
Signed-off-by: Ryan Hamilton <rch@google.com>
| bool Http3StatusTracker::isHttp3Confirmed() const { return state_ == State::Confirmed; } | ||
|
|
||
| void Http3StatusTracker::markHttp3Broken() { | ||
| state_ = State::Broken; |
There was a problem hiding this comment.
I'm relatively unclear at this on exactly how this all plays together. From what Antonio said, it sound like if there were a sufficient number of simultaneous requests we might trigger the creation of a second attempt. The other case that I think I heard from alyssa is that it's possible to have multiple calls to ConnectivityGrid::newStream() happen before the first call finishes. This won't result in multiple TCP/QUIC connection attempts because the underlying connection pool will do the right thing. But I think this is transparent to the ConnectivtyGrid; each call to newStream creates a new WrapperCallbacks and (up to) 2 ConnectionAttempts which should mean it's possible to get unexpected state transitions. Happy to do something different if I'm not understanding.
| } | ||
|
|
||
| TEST_F(Http3StatusTrackerTest, MarkBrokenWithBackoff) { | ||
| // markBroken will only be called when the time is not enabled. |
There was a problem hiding this comment.
Ah. I can remove this comment if it doesn't make sense. I wrote the comment because the MockTimer API confused me. I expected that expect() would return true if enableTimer() had been called but disableTimer() or invokeCallback() had not been. In other words, I didn't expect to need to mock out this method. So the comment was saying that when markBroken() is called in this tests, the timer will not have been enabled. Does that make sense? Would you recommend I rephrase or remove the comment?
| class Http3StatusTrackerTest : public testing::Test { | ||
| public: | ||
| Http3StatusTrackerTest() | ||
| : timer_(new StrictMock<MockTimer>(&dispatcher_)), tracker_(dispatcher_) {} |
There was a problem hiding this comment.
Wow, interesting! What's the magic that makes all the Mocks strict? Some #define something somewhere? Very cool. In any case, done.
|
/retest |
|
Retrying Azure Pipelines: |
Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
grid: Add a new class for tracking HTTP/3 status. Create a new Http3StatusTracker class which can mark HTTP/3 as broken for a period of time, subject to exponential backoff. Use this in ConnectivityGrid. Risk Level: Low Testing: New unit tests Docs Changes: N/A Release Notes: N/A Platform Specific Features: N/A Signed-off-by: Ryan Hamilton <rch@google.com> Signed-off-by: Gokul Nair <gnair@twitter.com>
grid: Add a new class for tracking HTTP/3 status.
Create a new Http3StatusTracker class which can mark HTTP/3 as broken
for a period of time, subject to exponential backoff. Use this in ConnectivityGrid.
Risk Level: Low
Testing: New unit tests
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A