Skip to content

refactor router filter to store upstream requests in a list.#6540

Merged
mattklein123 merged 7 commits intoenvoyproxy:masterfrom
mpuncel:mpuncel/upstream-request-list
Apr 16, 2019
Merged

refactor router filter to store upstream requests in a list.#6540
mattklein123 merged 7 commits intoenvoyproxy:masterfrom
mpuncel:mpuncel/upstream-request-list

Conversation

@mpuncel
Copy link
Contributor

@mpuncel mpuncel commented Apr 10, 2019

This is in preparation for implementing #5841 which will introduce
request racing. As of this commit there is no situation where there will
be more than one upstream request in flight, however it organizes the
code in such a way that doing so will cause less code churn.

Signed-off-by: Michael Puncel mpuncel@squareup.com

Description: Change upstream request storage to list from pointer in router
Risk Level: Medium
Testing: Existing unit tests
Docs Changes: N/A
Release Notes: N/A

This is a subset of the changes in https://github.com/envoyproxy/envoy/pull/6228/files which is implementing #5841

This is in preparation for implementing envoyproxy#5841 which will introduce
request racing. As of this commit there is no situation where there will
be more than one upstream request in flight, however it organizes the
code in such a way that doing so will cause less code churn.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Apr 10, 2019

I broke this out from #6228 at Alyssa's suggestion

mpuncel added 2 commits April 10, 2019 13:13
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Apr 10, 2019

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: build_image (failed build)
🔨 rebuilding ci/circleci: ipv6_tests (failed build)

🐱

Caused by: a #6540 (comment) was created by @mpuncel.

see: more, trace.

This is in preparation for there being multiple simultaneous requests in
the router filter

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Apr 11, 2019

added the watermark callbacks change as well

snowp
snowp previously approved these changes Apr 11, 2019
Copy link
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, seems like a good call to split this out from the other PR

Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for breaking this out - it really made it easier to reason about.

}

void ConnectionManagerImpl::ActiveStream::callLowWatermarkCallbacks() {
ASSERT(high_watermark_count_ > 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at this and I think there may be a pre-existing bug which worked OK before because there was one back-up cause, but will not work with two.

If you have the upstream connection call high watermark callbacks, and increment high_watermark_count_, then the hedge connection hits its watermark and increments high_watermark_count_, I don't think we want to resume by calling the low watermark callbacks until the count is back to 0.

If I'm correct here we may have been overenthusiastic resuming, but fixing will be a fairly high risk change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the reason the fix is high risk because the count might not reach 0 if there is a counting bug somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I mean you can land this and do the other separately but I don't think you can land your hedge fixes without both, and the fix is high risk because it may be masking other bugs.


void ConnectionManagerImpl::ActiveStream::callHighWatermarkCallbacks() {
++high_watermark_count_;
if (watermark_callbacks_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should enhance existing unit tests to have two subscribers, to regression test both get the callback

watermark_callbacks.onAboveWriteBufferHighWatermark();
}
}
void ConnectionManagerImpl::ActiveStreamDecoderFilter::removeDownstreamWatermarkCallbacks(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you poke through code and make sure if upstream connection 1 is above the high watermark (and causes the state to transition to high watermark) and upstream connection 2 ends up paused, that if upstream connection 1 goes away that it clears the state so that 2 ends up resuming? We want to make sure we don't get wedged here.

@mattklein123 mattklein123 self-assigned this Apr 11, 2019
@mattklein123
Copy link
Member

@mpuncel sorry I haven't fully tracked the conversation between you and @alyssawilk. Is there anything you need from me on this right now? Or are you working through her comments?

@mpuncel
Copy link
Contributor Author

mpuncel commented Apr 12, 2019

@mattklein123 mostly I've been catching up my understanding of the problem. After doing that for a bit, I think for this PR in particular (hedging not implemented yet) I should be fine to assert in the conn manager that at most 1 callback is registered at a given time. Callbacks are registered/deregistered at upstream request construction/destruction. Since there can only be one upstream request at a time, there should never be more than one callback registered at a time.

For the full PR, I think I should never expect the callback to be invoked on more than one UpstreamRequest, because I only ever write data from one upstream request back downstream. Nothing is written to the downstream until all but the winning upstream request are reset. I think I could put an assert in the callback handler that blows up if the corresponding request isn't the "winning" one.

The other direction (request too big) is more difficult, because I think it is possible to hit a per try timeout before having written the full request upstream, so we might stop reading from downstream and wedge the hedged retry. I don't know the implicated code well enough yet to know how to fix.

@mattklein123
Copy link
Member

@mpuncel OK at a high level that makes sense to me and I think gives me the context I need to help review. So is this PR finished given that or do you need to make further changes?

@mpuncel
Copy link
Contributor Author

mpuncel commented Apr 15, 2019

I think I can add a few unit tests around the subscribe/unsubscribe as Alyssa suggested and possibly a few asserts to cover the assumptions that there shouldn't actually be multiple requests in flight

…s, encode assumptions into asserts

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Apr 15, 2019

okay @mattklein123 I believe this one is ready to go (assuming build passes)

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for splitting this out. Makes sense with one small nit.

/wait

upstream_request_->upstream_host_->stats().rq_timeout_.inc();
ASSERT(upstream_requests_.size() <= 1);
if (upstream_requests_.size() == 1) {
UpstreamRequest* upstream_request = upstream_requests_.front().get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why do we need to grab the raw pointer here? Is to just avoid calling front() a bunch. If so I would either just call front() or grab a reference and not a pointer for non-null clarity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was just to avoid calling front(), will change

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Apr 16, 2019

is there a flake i that lua test?

@mpuncel
Copy link
Contributor Author

mpuncel commented Apr 16, 2019

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: tsan (failed build)

🐱

Caused by: a #6540 (comment) was created by @mpuncel.

see: more, trace.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit 21fd119 into envoyproxy:master Apr 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants