
router: generic timeout on request start#14003

Closed
yuval-k wants to merge 3 commits into envoyproxy:master from yuval-k:timeout-on-request-start

Conversation

Contributor

@yuval-k yuval-k commented Nov 12, 2020

Commit Message:
Allow measuring request timeout on request start when using the async client.

Additional Description:
This PR fixes #13580 by first reverting #12778, and then re-adding the functionality in a generic way to the router filter.

Opening a draft PR to get early feedback that this is the right direction; once we agree, I will add tests and modify the async client / ext_authz client to use the new functionality (hopefully in a backwards-compatible way).

For the reviewer: The first commit is just a revert, and the second commit adds the new functionality.

Risk Level: Medium (touches the router filter, but changes are opt-in)
Testing: TBD
Docs Changes: TBD
Release Notes: TBD
Runtime guard: TBD
Deprecates: TBD
Fixes: #13580

cc @mattklein123 @htuch .

… created. (envoyproxy#12778)"

This reverts commit dfa85e8.

Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
@mattklein123 mattklein123 self-assigned this Nov 12, 2020
Member

@htuch htuch left a comment

Thanks for taking on the more general fix here. I'll defer to folks who understand router timeout details better than me on the implementation, but one thing that I think would be useful is some e2e tests that show that both Envoy and Google gRPC clients can have the same deadline set and enforced.

Member

@mattklein123 mattklein123 left a comment

Thanks for fixing this. At a high level this looks on the right track!

/wait

virtual std::chrono::milliseconds timeout() const PURE;

/**
* @return bool measure timeout() from when the request is started, instead of when it is complete.
Member

Can you be more specific about what "started" means? Does this mean after a connection is received from the connection pool? Or before a connection is requested from the pool?

}

void Filter::finalizeTimeoutHeaders() {
// if we have a deadline, update the timeout headers.
Member

All comments start with capital, proper grammar, etc. Here and elsewhere.

Comment on lines +663 to +664
// TODO: I'm not sure if this check is needed. can this method still be called
// if onResponseTimeout was called?
Member

No, once the reply is sent and/or reset this can't happen. Same below.

void Filter::onResponseTimeout() {
ENVOY_STREAM_LOG(debug, "upstream timeout", *callbacks_);

timedout_ = true;
Member

Not needed.

Comment on lines +569 to +571
// start measuring timeout.
response_timeout_ = dispatcher.createTimer([this]() -> void { onResponseTimeout(); });
response_timeout_->enableTimer(timeout_.global_timeout_);
Member

I would make a common method for starting the response timeout timer for clarity.
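The suggestion above could look something like the following minimal sketch. The `FakeDispatcher`/`FakeTimer` types stand in for Envoy's `Event::Dispatcher` API, and `maybeStartResponseTimeout` is an illustrative name, not the actual method in this PR; the point is that both call sites (request start and request complete) would share one guarded helper.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <utility>

// Stand-ins for Envoy's Event::Dispatcher / Event::Timer (illustrative only).
struct FakeTimer {
  std::function<void()> cb;
  std::chrono::milliseconds armed{0};
  void enableTimer(std::chrono::milliseconds d) { armed = d; }
};

struct FakeDispatcher {
  FakeTimer createTimer(std::function<void()> cb) {
    return FakeTimer{std::move(cb)};
  }
};

class Filter {
public:
  explicit Filter(std::chrono::milliseconds global_timeout)
      : global_timeout_(global_timeout) {}

  // Single entry point for arming the response timeout, callable from either
  // the "request started" or the "request complete" path.
  void maybeStartResponseTimeout(FakeDispatcher& dispatcher) {
    if (global_timeout_.count() == 0 || timer_armed_) {
      return; // No timeout configured, or timer already armed.
    }
    response_timeout_ = dispatcher.createTimer([this] { onResponseTimeout(); });
    response_timeout_.enableTimer(global_timeout_);
    timer_armed_ = true;
  }

  bool timerArmed() const { return timer_armed_; }

private:
  void onResponseTimeout() {}

  std::chrono::milliseconds global_timeout_;
  FakeTimer response_timeout_;
  bool timer_armed_{false};
};
```

Guarding on `timer_armed_` also makes it harmless if both paths end up calling the helper.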

Comment on lines +346 to +349
// this _should_ be positive
// TODO: what should be done if it is not positive? as i believe this can happen in rare
// cases.
ASSERT(expected_timeout.count() > 0, "expected_timeout is not positive");
Member

Yeah I think it's possible for this to be delayed. In this case though I think the timeout timer should fire so we can just return here? Make sure this case has coverage.
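The early-return idea above can be sketched as follows; `remainingTimeout` is a hypothetical helper, not code from this PR. If the elapsed time already exceeds the configured global timeout, the remainder is clamped to zero so the caller can skip rewriting timeout headers and let the response timeout timer fire.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: compute the timeout remaining after `elapsed` has
// already passed. A non-positive remainder means the timer is already due,
// so return zero instead of asserting.
std::chrono::milliseconds
remainingTimeout(std::chrono::milliseconds global_timeout,
                 std::chrono::milliseconds elapsed) {
  const auto remaining = global_timeout - elapsed;
  if (remaining.count() <= 0) {
    // Caller should bail out; onResponseTimeout() will handle this request.
    return std::chrono::milliseconds(0);
  }
  return remaining;
}
```

As the reviewer notes, the clamped path deserves explicit test coverage.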

// if it is less than the current per try timeout, or there is no per try timeout, update
// timeout headers
if (timeout_.per_try_timeout_.count() == 0 || expected_timeout < timeout_.per_try_timeout_) {
FilterUtility::setTimeoutHeaders(*route_entry_, expected_timeout.count(), *downstream_headers_,
Member

Even in cases where we start measuring in request complete, shouldn't we still adjust expected timeout HTTP and gRPC time based on how long it took to get a connection from the CP? I think this has been raised elsewhere and it would be good to fix this also?

Contributor Author

I can adjust; my intention was to avoid a change in behavior in order to reduce risk.

Member

I think it would be fine to do this as a separate change, just something to keep in mind.
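For concreteness, the adjustment discussed above could be sketched like this (illustrative function, not code from this PR): the value advertised upstream in `grpc-timeout` would be reduced by the time spent acquiring a connection from the pool, so the upstream deadline matches what the router actually enforces. The `"m"` suffix is the gRPC wire encoding for milliseconds.

```cpp
#include <cassert>
#include <chrono>
#include <string>

// Hypothetical helper: format a grpc-timeout header value from the global
// timeout minus the time already spent waiting on the connection pool,
// clamped at zero.
std::string grpcTimeoutHeader(std::chrono::milliseconds global_timeout,
                              std::chrono::milliseconds time_in_conn_pool) {
  auto remaining = global_timeout - time_in_conn_pool;
  if (remaining.count() < 0) {
    remaining = std::chrono::milliseconds(0);
  }
  return std::to_string(remaining.count()) + "m"; // "m" = milliseconds
}
```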

Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Contributor Author

yuval-k commented Nov 19, 2020

/wait
More tests are coming.

Contributor Author

yuval-k commented Nov 23, 2020

Hi @mattklein123 ,

I think I misunderstood the original behavior of the async client. I now see that the whole request is sent at once; if the circuit breaker trips, the request is rejected immediately. In other words, the reason I didn't originally see timeouts is that the request had already been rejected, not that it was pending.

This means I can take this PR in one of two ways:

  • Option 1: Just revert the original PR (i.e., revert the second commit in this PR).
  • Option 2: Add a measure_timeout_on_request_start option to the route API, allowing timeouts to be measured from request start for regular requests so users can benefit from the work already done, though I personally don't have a use case in mind (in this case, I will add an API and integration tests).

I apologize for the confusion and will make sure to do better research next time.
WDYT?

@mattklein123
Member

@yuval-k I think my preference is to just revert the original PR and we can add (2) later if someone asks? Thank you!

I'm going to close this out and can you open a fresh revert PR please? Thank you!

Contributor Author

yuval-k commented Nov 24, 2020

xref #14152



Successfully merging this pull request may close these issues.

extauth: new timeout behavior does not add the grpc-timeout header
