
router: generic timeout on request start#14003

Closed
yuval-k wants to merge 3 commits into envoyproxy:master from yuval-k:timeout-on-request-start

Conversation

Contributor

@yuval-k yuval-k commented Nov 12, 2020

Commit Message:
Allow measuring request timeout on request start when using the async client.

Additional Description:
This PR fixes #13580 by first reverting #12778, and then re-adding the functionality in a generic way to the router filter.

Opening a draft PR to get early feedback that this is the right direction; once we agree, I will add tests and modify the async client / ext_authz client to use the new functionality (hopefully in a backwards-compatible way).

For the reviewer: The first commit is just a revert, and the second commit adds the new functionality.

Risk Level: Medium (touches the router filter, but changes are opt-in)
Testing: TBD
Docs Changes: TBD
Release Notes: TBD
Runtime guard: TBD
Deprecates: TBD
Fixes: #13580

cc @mattklein123 @htuch .

… created. (envoyproxy#12778)"

This reverts commit dfa85e8.

Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
@mattklein123 mattklein123 self-assigned this Nov 12, 2020
Member

@htuch htuch left a comment

Thanks for taking on the more general fix here. I'll defer to folks who understand router timeout details better than me on the implementation, but one thing that I think would be useful is some e2e tests that show that both Envoy and Google gRPC clients can have the same deadline set and enforced.

Member

@mattklein123 mattklein123 left a comment

Thanks for fixing this. At a high level this looks on the right track!

/wait

virtual std::chrono::milliseconds timeout() const PURE;

/**
* @return bool measure timeout() from when the request is started, instead of when it is complete.
Member

Can you be more specific about what "started" means? Does this mean after a connection is received from the connection pool? Or before a connection is requested from the pool?

}

void Filter::finalizeTimeoutHeaders() {
// if we have a deadline, update the timeout headers.
Member

All comments start with capital, proper grammar, etc. Here and elsewhere.

Comment on lines +663 to +664
// TODO: I'm not sure if this check is needed. can this method still be called
// if onResponseTimeout was called?
Member

No, once the reply is sent and/or reset this can't happen. Same below.

void Filter::onResponseTimeout() {
ENVOY_STREAM_LOG(debug, "upstream timeout", *callbacks_);

timedout_ = true;
Member

Not needed.

Comment on lines +569 to +571
// start measuring timeout.
response_timeout_ = dispatcher.createTimer([this]() -> void { onResponseTimeout(); });
response_timeout_->enableTimer(timeout_.global_timeout_);
Member

I would make a common method for starting the response timeout timer for clarity.
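The suggestion above could look something like the following minimal sketch. The `FakeDispatcher`/`FakeTimer` types stand in for Envoy's `Event::Dispatcher` API, and `maybeStartResponseTimeout` is an illustrative name, not the actual method in this PR; the point is that both call sites (request start and request complete) would share one guarded helper.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <utility>

// Stand-ins for Envoy's Event::Dispatcher / Event::Timer (illustrative only).
struct FakeTimer {
  std::function<void()> cb;
  std::chrono::milliseconds armed{0};
  void enableTimer(std::chrono::milliseconds d) { armed = d; }
};

struct FakeDispatcher {
  FakeTimer createTimer(std::function<void()> cb) {
    return FakeTimer{std::move(cb)};
  }
};

class Filter {
public:
  explicit Filter(std::chrono::milliseconds global_timeout)
      : global_timeout_(global_timeout) {}

  // Single entry point for arming the response timeout, callable from either
  // the "request started" or the "request complete" path.
  void maybeStartResponseTimeout(FakeDispatcher& dispatcher) {
    if (global_timeout_.count() == 0 || timer_armed_) {
      return; // No timeout configured, or timer already armed.
    }
    response_timeout_ = dispatcher.createTimer([this] { onResponseTimeout(); });
    response_timeout_.enableTimer(global_timeout_);
    timer_armed_ = true;
  }

  bool timerArmed() const { return timer_armed_; }

private:
  void onResponseTimeout() {}

  std::chrono::milliseconds global_timeout_;
  FakeTimer response_timeout_;
  bool timer_armed_{false};
};
```

Guarding on `timer_armed_` also makes it harmless if both paths end up calling the helper.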

Comment on lines +346 to +349
// this _should_ be positive
// TODO: what should be done if it is not positive? as i believe this can happen in rare
// cases.
ASSERT(expected_timeout.count() > 0, "expected_timeout is not positive");
Member

Yeah I think it's possible for this to be delayed. In this case though I think the timeout timer should fire so we can just return here? Make sure this case has coverage.
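The early-return idea above can be sketched as follows; `remainingTimeout` is a hypothetical helper, not code from this PR. If the elapsed time already exceeds the configured global timeout, the remainder is clamped to zero so the caller can skip rewriting timeout headers and let the response timeout timer fire.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: compute the timeout remaining after `elapsed` has
// already passed. A non-positive remainder means the timer is already due,
// so return zero instead of asserting.
std::chrono::milliseconds
remainingTimeout(std::chrono::milliseconds global_timeout,
                 std::chrono::milliseconds elapsed) {
  const auto remaining = global_timeout - elapsed;
  if (remaining.count() <= 0) {
    // Caller should bail out; onResponseTimeout() will handle this request.
    return std::chrono::milliseconds(0);
  }
  return remaining;
}
```

As the reviewer notes, the clamped path deserves explicit test coverage.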

// if it is less than the current per try timeout, or there is no per try timeout, update
// timeout headers
if (timeout_.per_try_timeout_.count() == 0 || expected_timeout < timeout_.per_try_timeout_) {
FilterUtility::setTimeoutHeaders(*route_entry_, expected_timeout.count(), *downstream_headers_,
Member

Even in cases where we start measuring in request complete, shouldn't we still adjust expected timeout HTTP and gRPC time based on how long it took to get a connection from the CP? I think this has been raised elsewhere and it would be good to fix this also?

Contributor Author

I can adjust; my intention was to avoid a change in behavior in order to reduce risk.

Member

I think it would be fine to do this as a separate change, just something to keep in mind.
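For concreteness, the adjustment discussed above could be sketched like this (illustrative function, not code from this PR): the value advertised upstream in `grpc-timeout` would be reduced by the time spent acquiring a connection from the pool, so the upstream deadline matches what the router actually enforces. The `"m"` suffix is the gRPC wire encoding for milliseconds.

```cpp
#include <cassert>
#include <chrono>
#include <string>

// Hypothetical helper: format a grpc-timeout header value from the global
// timeout minus the time already spent waiting on the connection pool,
// clamped at zero.
std::string grpcTimeoutHeader(std::chrono::milliseconds global_timeout,
                              std::chrono::milliseconds time_in_conn_pool) {
  auto remaining = global_timeout - time_in_conn_pool;
  if (remaining.count() < 0) {
    remaining = std::chrono::milliseconds(0);
  }
  return std::to_string(remaining.count()) + "m"; // "m" = milliseconds
}
```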

Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Contributor Author

yuval-k commented Nov 19, 2020

/wait
More tests are coming.

Contributor Author

yuval-k commented Nov 23, 2020

Hi @mattklein123 ,

I think I misunderstood the original behavior of the async client. I now see that the whole request is sent at once; if the circuit breaker trips, the request is rejected immediately. In other words, the reason I didn't originally see timeouts is that the request had already been rejected, not that it was pending.

This means I can take this PR in one of two ways:

  • Option 1: Just revert the original PR (i.e., revert the second commit in this PR).
  • Option 2: Add a measure_timeout_on_request_start option to the route API, allowing timeouts to be measured from request start for regular requests so users can benefit from the work already done, though I personally don't have a use case in mind (in this case, I will add an API and integration tests).

I apologize for the confusion and will make sure to do better research next time.
WDYT?

@mattklein123
Member

@yuval-k I think my preference is to just revert the original PR and we can add (2) later if someone asks? Thank you!

I'm going to close this out and can you open a fresh revert PR please? Thank you!

Contributor Author

yuval-k commented Nov 24, 2020

xref #14152



Successfully merging this pull request may close these issues.

extauth: new timeout behavior does not add the grpc-timeout header
