Skip to content

grid: Add a new class for tracking HTTP/3 status#16067

Merged
alyssawilk merged 16 commits intoenvoyproxy:mainfrom
RyanTheOptimist:broken_tracker
Apr 28, 2021
Merged

grid: Add a new class for tracking HTTP/3 status#16067
alyssawilk merged 16 commits intoenvoyproxy:mainfrom
RyanTheOptimist:broken_tracker

Conversation

@RyanTheOptimist
Copy link
Copy Markdown
Contributor

@RyanTheOptimist RyanTheOptimist commented Apr 19, 2021

grid: Add a new class for tracking HTTP/3 status.

Create a new Http3StatusTracker class which can mark HTTP/3 as broken
for a period of time, subject to exponential backoff. Use this in ConnectivityGrid.

Risk Level: Low
Testing: New unit tests
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A

Create a new BrokenHttp3Tracker class which can mark HTTP/3 as broken
for a period of time, subject to exponential backoff.

Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
@RyanTheOptimist
Copy link
Copy Markdown
Contributor Author

/assign @RenjieTang

Copy link
Copy Markdown
Contributor

@RenjieTang RenjieTang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! some minor comments.

@antoniovicente antoniovicente self-assigned this Apr 19, 2021
Signed-off-by: Ryan Hamilton <rch@google.com>
Copy link
Copy Markdown
Contributor Author

@RyanTheOptimist RyanTheOptimist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Renjie!

RenjieTang
RenjieTang previously approved these changes Apr 19, 2021
Copy link
Copy Markdown
Contributor

@RenjieTang RenjieTang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with mod one nit.

@RyanTheOptimist RyanTheOptimist changed the title grid: Add a new class for tracking HTTP/3 brokeness. grid: Add a new class for tracking HTTP/3 status Apr 19, 2021
Copy link
Copy Markdown
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ryan: Thanks for introducing retry behavior. CI seems to be showing some problems with the test. Also, it seems that some of the methods of the newly introduced object are not used by the connection pool grid.

bool Http3StatusTracker::isHttp3Confirmed() const { return state_ == State::Confirmed; }

void Http3StatusTracker::markHttp3Broken() {
state_ = State::Broken;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are valid values for state_ when entering this method?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Mentally, I'm modeling this off of the similar code in Chrome. That code runs up at the request/response layer (the HttpNetworkTransaction) which is above the connection establishment layer. As such and given that requests happen in parallel, it's possible for basically any sequence of markBroken/markConfirmed calls to arrive in any order. I suspect that we'll eventually want something similar. But since we're not doing anything like that now, there's no need to permit such state transitions. So I've added ASSERT() calls to make it clear what the valid states are. Thanks for pointing this out.

(In any case, this should be reachable from any state other than broken)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be possible to trigger this ASSERT, if there are 2 concurrent attempts to connect to the same endpoint. That said, it's possible that there are protections elsewhere to prevent this from happening or it is relatively unlikely to happen without a burst of requests for that service; we would need the number of requests to exceed the upstream's multiplexing factor and trigger creation of a second connection in order to meet demand.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Excellent point! That's very true. Ok, in that case we're back to the Chrome situation where the parallelism means that we can really get any sequence of events in any order. I've removed the ASSERT() calls.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my own curiosity why would we have multiple concurrent connection attempts to the same endpoint on a given worker thread? Is this related to prefetching or part of the grid logic?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm relatively unclear at this on exactly how this all plays together. From what Antonio said, it sound like if there were a sufficient number of simultaneous requests we might trigger the creation of a second attempt. The other case that I think I heard from alyssa is that it's possible to have multiple calls to ConnectivityGrid::newStream() happen before the first call finishes. This won't result in multiple TCP/QUIC connection attempts because the underlying connection pool will do the right thing. But I think this is transparent to the ConnectivtyGrid; each call to newStream creates a new WrapperCallbacks and (up to) 2 ConnectionAttempts which should mean it's possible to get unexpected state transitions. Happy to do something different if I'm not understanding.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this sounds plausible to me: newStream is non-blocking (like most things in Envoy), so if multiple streams are established before the connection is established so you'd see multiple newStreams come in before the connection is established.

No change necessary, I was just curious how this all fit together :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional (I mean in this PR, not optional overall) I wonder if we should start landing docs on this as we start implementing the "real" logic.

Generally we land docs when we unhide config (#15926) but the failover logic is sufficiently complicated I think we could land docs for now in source/docs and then move them to docs/ when the PR lands. Your call if we do them now or in a future iteration :-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs definitely make sense and I'm happy to work on them. I think I'll do that in a follow-up since this PR is (hopefully) basically done at this point.

Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
Copy link
Copy Markdown
Contributor Author

@RyanTheOptimist RyanTheOptimist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thoughtful review!

bool Http3StatusTracker::isHttp3Confirmed() const { return state_ == State::Confirmed; }

void Http3StatusTracker::markHttp3Broken() {
state_ = State::Broken;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Mentally, I'm modeling this off of the similar code in Chrome. That code runs up at the request/response layer (the HttpNetworkTransaction) which is above the connection establishment layer. As such and given that requests happen in parallel, it's possible for basically any sequence of markBroken/markConfirmed calls to arrive in any order. I suspect that we'll eventually want something similar. But since we're not doing anything like that now, there's no need to permit such state transitions. So I've added ASSERT() calls to make it clear what the valid states are. Thanks for pointing this out.

(In any case, this should be reachable from any state other than broken)

Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
antoniovicente
antoniovicente previously approved these changes Apr 21, 2021
Copy link
Copy Markdown
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Signed-off-by: Ryan Hamilton <rch@google.com>
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this looks pretty good, just a few minor comments

bool Http3StatusTracker::isHttp3Confirmed() const { return state_ == State::Confirmed; }

void Http3StatusTracker::markHttp3Broken() {
state_ = State::Broken;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my own curiosity why would we have multiple concurrent connection attempts to the same endpoint on a given worker thread? Is this related to prefetching or part of the grid logic?

class Http3StatusTrackerTest : public testing::Test {
public:
Http3StatusTrackerTest()
: timer_(new StrictMock<MockTimer>(&dispatcher_)), tracker_(dispatcher_) {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we usually omit the StrictMock part since by default all the mocks are strict (we're not super consistent here so not a big deal)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, interesting! What's the magic that makes all the Mocks strict? Some #define something somewhere? Very cool. In any case, done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooo! Thanks!

}

TEST_F(Http3StatusTrackerTest, MarkBrokenWithBackoff) {
// markBroken will only be called when the time is not enabled.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From reading the test it seems like we're the ones calling markHttp3Broken, what does this comment refer to?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. I can remove this comment if it doesn't make sense. I wrote the comment because the MockTimer API confused me. I expected that expect() would return true if enableTimer() had been called but disableTimer() or invokeCallback() had not been. In other words, I didn't expect to need to mock out this method. So the comment was saying that when markBroken() is called in this tests, the timer will not have been enabled. Does that make sense? Would you recommend I rephrase or remove the comment?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, I would try to include something that indicates that we're talking about invariants for this is used by prod code. My initial read was that this was trying to explain what we should expect to see in this test, so I expected to see some EXPECT_CALL(.., markBroken()) kind of expectations

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. That being said, I took several stabs at trying to write something up and each time ended up with a bit of an essay that didn't really seem to add any readability to the test, so I've just nuked the comment. (Really the problem was my lack of understanding of how MockTimer worked and so this is probably not the best place to address that :>)

Signed-off-by: Ryan Hamilton <rch@google.com>
Copy link
Copy Markdown
Contributor Author

@RyanTheOptimist RyanTheOptimist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

bool Http3StatusTracker::isHttp3Confirmed() const { return state_ == State::Confirmed; }

void Http3StatusTracker::markHttp3Broken() {
state_ = State::Broken;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm relatively unclear at this on exactly how this all plays together. From what Antonio said, it sound like if there were a sufficient number of simultaneous requests we might trigger the creation of a second attempt. The other case that I think I heard from alyssa is that it's possible to have multiple calls to ConnectivityGrid::newStream() happen before the first call finishes. This won't result in multiple TCP/QUIC connection attempts because the underlying connection pool will do the right thing. But I think this is transparent to the ConnectivtyGrid; each call to newStream creates a new WrapperCallbacks and (up to) 2 ConnectionAttempts which should mean it's possible to get unexpected state transitions. Happy to do something different if I'm not understanding.

}

TEST_F(Http3StatusTrackerTest, MarkBrokenWithBackoff) {
// markBroken will only be called when the time is not enabled.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. I can remove this comment if it doesn't make sense. I wrote the comment because the MockTimer API confused me. I expected that expect() would return true if enableTimer() had been called but disableTimer() or invokeCallback() had not been. In other words, I didn't expect to need to mock out this method. So the comment was saying that when markBroken() is called in this tests, the timer will not have been enabled. Does that make sense? Would you recommend I rephrase or remove the comment?

class Http3StatusTrackerTest : public testing::Test {
public:
Http3StatusTrackerTest()
: timer_(new StrictMock<MockTimer>(&dispatcher_)), tracker_(dispatcher_) {}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, interesting! What's the magic that makes all the Mocks strict? Some #define something somewhere? Very cool. In any case, done.

@RyanTheOptimist
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16067 (comment) was created by @RyanTheOptimist.

see: more, trace.

Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Ryan Hamilton <rch@google.com>
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@alyssawilk alyssawilk merged commit 3756e5b into envoyproxy:main Apr 28, 2021
gokulnair pushed a commit to gokulnair/envoy that referenced this pull request May 6, 2021
grid: Add a new class for tracking HTTP/3 status.

Create a new Http3StatusTracker class which can mark HTTP/3 as broken
for a period of time, subject to exponential backoff. Use this in ConnectivityGrid.

Risk Level: Low
Testing: New unit tests
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A

Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: Gokul Nair <gnair@twitter.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants