refactor router filter to store upstream requests in a list. #6540
mattklein123 merged 7 commits into envoyproxy:master
Conversation
This is in preparation for implementing envoyproxy#5841 which will introduce request racing. As of this commit there is no situation where there will be more than one upstream request in flight, however it organizes the code in such a way that doing so will cause less code churn.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
I broke this out from #6228 at Alyssa's suggestion
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
/retest

🔨 rebuilding
This is in preparation for there being multiple simultaneous requests in the router filter.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

added the watermark callbacks change as well
snowp
left a comment
This LGTM, seems like a good call to split this out from the other PR
alyssawilk
left a comment
Thanks for breaking this out - it really made it easier to reason about.
```cpp
}

void ConnectionManagerImpl::ActiveStream::callLowWatermarkCallbacks() {
  ASSERT(high_watermark_count_ > 0);
```
I'm looking at this and I think there may be a pre-existing bug which worked OK before because there was only one possible cause of back-up, but will not work with two.

If the upstream connection calls the high watermark callbacks and increments high_watermark_count_, and then the hedge connection hits its watermark and also increments high_watermark_count_, I don't think we want to resume by calling the low watermark callbacks until the count is back to 0.

If I'm correct here we may have been resuming overenthusiastically, but fixing it will be a fairly high risk change.
Is the reason the fix is high risk that the count might not reach 0 if there is a counting bug somewhere?
Yep. I mean you can land this and do the other separately, but I don't think you can land your hedge fixes without both, and the fix is high risk because it may be masking other bugs.
```cpp
void ConnectionManagerImpl::ActiveStream::callHighWatermarkCallbacks() {
  ++high_watermark_count_;
  if (watermark_callbacks_) {
```
I think we should enhance the existing unit tests to have two subscribers, to regression-test that both get the callback.
```cpp
    watermark_callbacks.onAboveWriteBufferHighWatermark();
  }
}

void ConnectionManagerImpl::ActiveStreamDecoderFilter::removeDownstreamWatermarkCallbacks(
```
Can you poke through the code and make sure that if upstream connection 1 is above the high watermark (and causes the state to transition to high watermark), and upstream connection 2 ends up paused, then when upstream connection 1 goes away it clears the state so that 2 ends up resuming? We want to make sure we don't get wedged here.
@mpuncel sorry I haven't fully tracked the conversation between you and @alyssawilk. Is there anything you need from me on this right now? Or are you working through her comments?
@mattklein123 mostly I've been catching up my understanding of the problem. After doing that for a bit, I think for this PR in particular (hedging not implemented yet) I should be fine to assert in the conn manager that at most 1 callback is registered at a given time. Callbacks are registered/deregistered at upstream request construction/destruction. Since there can only be one upstream request at a time, there should never be more than one callback registered at a time.

For the full PR, I think I should never expect the callback to be invoked on more than one UpstreamRequest, because I only ever write data from one upstream request back downstream. Nothing is written to the downstream until all but the winning upstream request are reset. I think I could put an assert in the callback handler that blows up if the corresponding request isn't the "winning" one.

The other direction (request too big) is more difficult, because I think it is possible to hit a per-try timeout before having written the full request upstream, so we might stop reading from downstream and wedge the hedged retry. I don't know the implicated code well enough yet to know how to fix it.
@mpuncel OK, at a high level that makes sense to me and I think gives me the context I need to help review. So is this PR finished given that, or do you need to make further changes?
I think I can add a few unit tests around the subscribe/unsubscribe as Alyssa suggested, and possibly a few asserts to cover the assumption that there shouldn't actually be multiple requests in flight.
…s, encode assumptions into asserts Signed-off-by: Michael Puncel <mpuncel@squareup.com>
okay @mattklein123 I believe this one is ready to go (assuming the build passes)
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
mattklein123
left a comment
Thanks for splitting this out. Makes sense with one small nit.
/wait
source/common/router/router.cc
```cpp
  upstream_request_->upstream_host_->stats().rq_timeout_.inc();
  ASSERT(upstream_requests_.size() <= 1);
  if (upstream_requests_.size() == 1) {
    UpstreamRequest* upstream_request = upstream_requests_.front().get();
```
nit: why do we need to grab the raw pointer here? Is it just to avoid calling front() a bunch? If so I would either just call front() or grab a reference and not a pointer, for non-null clarity.
was just to avoid calling front(), will change
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
is there a flake in that lua test?
/retest

🔨 rebuilding
Description: Change upstream request storage to list from pointer in router
Risk Level: Medium
Testing: Existing unit tests
Docs Changes: N/A
Release Notes: N/A
This is a subset of the changes in https://github.com/envoyproxy/envoy/pull/6228/files, which implements #5841.