-
Notifications
You must be signed in to change notification settings - Fork 482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GEP-1742: HTTPRoute Timeouts API #1997
Conversation
Hi @frankbu. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
geps/gep-1742.md
Outdated
// One or more conditions can be specified using a ‘,’ delimited list. | ||
RetryOn string `json:"retry_on,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If its a list it should be []string
IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
geps/gep-1742.md
Outdated
// Specifies the conditions under which retry takes place. | ||
// One or more conditions can be specified using a ‘,’ delimited list. | ||
RetryOn string `json:"retry_on,omitempty"` | ||
// Flag to specify whether the retries should retry to other localities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems under-specified given there is no mention of "localities" anywhere else in the repo. Can we expand on this a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added another section below.
geps/gep-1742.md
Outdated
```go | ||
type HTTPRouteRule struct { | ||
// Timeout for HTTP requests, disabled by default. | ||
Timeout *time.Duration `json:"timeout,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we expand the specification for this a bit? Given the GEP already documents probably 20 types of "timeout" in https://gateway-api.sigs.k8s.io/geps/gep-1742/?h=1742#flow-diagrams-with-available-timeouts, having just a single field as "timeout" with no further explanation seems a bit ambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a paragraph above. PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of a flat field, should this be tiered, so you can start off with a request
timeout and add things like connect
timeout in the future ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arkodg Good question. I think it depends on whether implementations support configuring other kinds of timeouts at the HTTPRoute level. Based on the diagrams in the previous sections, I think Envoy does not, but unclear about the others.
@youngnick do you have more insight on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this GEP is actually defining two separate timeouts here... 🙂
HTTPRouteRule.timeout
is defining an end-to-end timeout (a timeout for a single request as measured by the client), andHTTPRouteRule.retries.perTryTimeout
is defining a timeout for a single request as measured by the gateway rather than the client.
Both of these timeouts are useful! though, per the diagrams above, a given gateway may not be able to honor both.
Overall, I would prefer
HTTPRouteRule:
timeouts:
request: duration # end-to-end as seen by the client
workloadRequest: duration # single request as seen by the gateway
with no perTryTimeout
in the retry configuration. (And no, I'm not wedded to the names in the timeouts
dict above.) I'd further suggest that the workloadRequest
timeout not be part of the core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kflynn This suggestion sounds quite good to me. The reason that I dragged retries into this GEP in the first place is because perTryTimeout was so tightly coupled to retries, but moving it out and generalizing it like you suggest allows for better separation of concerns and to only focus on Timeouts in this GEP. I'm happy to make this change but will give others some time to comment first.
geps/gep-1742.md
Outdated
// Timeout per attempt for a given request, including the initial call and any retries. | ||
// Default is the value of the HTTPRouteRule request Timeout. | ||
PerTryTimeout *time.Duration `json:"per_try_timeout,omitempty"` | ||
// Specifies the conditions under which retry takes place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What conditions are valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explain below that HTTP error codes are valid and must be supported by implementations. Other implementation-specific identifier are also valid, depending on the implementation.
- Does that sound reasonable?
- Do you think I need to explain it more clearly in the below text?
geps/gep-1742.md
Outdated
|
||
### Timeout values | ||
|
||
Timeout and PerTryTimeout values are formatted as 1h/1m/1s/1ms and MUST BE >= 1ms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used anywhere else in gateway-api or core k8s? Is there a more formal spec?
is 1h5m valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find any other use of duration in gateway-api. I couldn't find it in core k8s either, but maybe not looking in the right place? Otherwise, I think this is just the Go time.Duration string serialization that we'd use?
/cc Let's hold, to make sure more people get some time to review. /hold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd vote for pulling retries into a separate GEP, but for pulling the perTryTimeout
up into the main timeouts section. Thanks for diving into this!
geps/gep-1742.md
Outdated
```go | ||
type HTTPRouteRule struct { | ||
// Timeout for HTTP requests, disabled by default. | ||
Timeout *time.Duration `json:"timeout,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this GEP is actually defining two separate timeouts here... 🙂
HTTPRouteRule.timeout
is defining an end-to-end timeout (a timeout for a single request as measured by the client), andHTTPRouteRule.retries.perTryTimeout
is defining a timeout for a single request as measured by the gateway rather than the client.
Both of these timeouts are useful! though, per the diagrams above, a given gateway may not be able to honor both.
Overall, I would prefer
HTTPRouteRule:
timeouts:
request: duration # end-to-end as seen by the client
workloadRequest: duration # single request as seen by the gateway
with no perTryTimeout
in the retry configuration. (And no, I'm not wedded to the names in the timeouts
dict above.) I'd further suggest that the workloadRequest
timeout not be part of the core.
geps/gep-1742.md
Outdated
|
||
type HTTPRetry struct { | ||
// Number of retries to be allowed for a given request. | ||
Attempts int32 `json:"attempts,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have quite a few comments on retries, but I really think that the retry portion here needs to be pulled out into a different GEP. 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to this
Just a note to say I have seen this, but need a bit more time to give it a full review. I should hopefully be able to do this later on tomorrow. |
@youngnick @howardjohn @kflynn @arkodg and other reviewers: I have made a number of updates to address reviewer comments and removed retry configuration from this proposal. This PR is now only addressing timeout configuration for GEP issue #1742. Will move discussion of retries to a separate GEP for issue #1731. |
geps/gep-1742.md
Outdated
|
||
There a 2 kinds of timeouts that can be configured in an `HTTPRouteRule`: | ||
|
||
1. `timeouts.request` is an end-to-end timeout for a single request as measured by the client. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pedantic: we cannot measure the client end-to-end time, as there is latency between the client and us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I originally wrote that description... uh... maybe describe it as "an end-to-end timeout for a single request from the client to the Gateway API implementation", contrasted with "a single backend request from the Gateway API implementation to a workload"?
This would probably benefit from repeating the diagrams above to show where the two timeouts should measure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this is not so pedantic. At a high level, the "useful" timeout that we're trying to support here is basically the time it takes to get a response. In the various diagrams above, it seems like there is a timeout that corresponds to the Finishes Response
arrow, which is more-or-less what we want, but the question of when the timeout starts measuring is important. John pointed out to me that in the case of Envoy, the implementation route timeout (R timeout
in the above diagram) does not start until the entire downstream request stream has been received
. Not sure about Nginx and other impls, but seems we need to be more flexible with this definition, to make it implementable.
Maybe something like this:
timeouts.request
is the timeout for the Gateway API implementation to send a response to a client HTTP request. Whether the gateway starts the timeout before or after the entire client request stream has been received, is implementation dependent.
Maybe we could first suggest the timeout begins after the entire client request stream has been received, and see if that is a problem for non-Envoy implementations?
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the diagrams I did earlier (needed to recover context myself), I think that it's going to be difficult to define this timeout in such a way that all the implementations can implement it.
I was going to say something like "from when the proxy has finished receiving the request from the client to when it starts sending the response back to the client", but some proxies can't do that, I think.
I'd definitely encourage @frankbu to take the initial source diagram and try and mark where the request timeout would cover on a diagram for this section, and for the backendRequest
timeout as well.
It may be that we need to be more vague here and say something like "this timeout covers as close to the whole request->response transaction as is possible for the proxy implementation, given its configurable timeouts." and then leave some of it up to the implementations. This would have the downside of being difficult to test though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the common timeout we want to handle is for when it's taking too long to get the backend response. We should be able to test that, even if we make the spec a little vague about whether the timeout value includes other request overhead or not.
I'll update the wording to make it more clear, including adding the diagram with the proposed timeouts marked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated wording and added diagram, PTAL.
geps/gep-1742.md
Outdated
A client HTTP request to a gateway may result in more than one call to the destination | ||
backend service, for example, if automatic retries are supported. Some implementations | ||
support configuring a timeout for the individual backend requests, seperate from the | ||
overall client request timeout. Adding None-Core support for this would also benefit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/None-Core/non-Core/ ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@frankbu Thanks much!! I left a couple of small comments about wordsmithing but honestly, I'd be OK merging this. Others may have other thoughts, of course. 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, the structure looks okay to me - although I think that as it stands, this proposal will not be able to be combined with a TimeoutPolicy to default or override these fields, as we currently don't allow Policy fields to mess with the values of fields within an object.
That would just mean that this setting would only be configurable on two levels: a default in the implementation somehow, configured via runtime config, environment variables, or similar, and then a HTTPRoute-specific value which must be configured on every HTTPRoute which needs a different timeout to the default. (The additional boilerplate adds up over lots of separate objects over time, in my experience).
@youngnick I understand and agree that a TimeoutPolicy shouldn't be able to |
SGTM. Should that be included in this document? If so where? Maybe it should be included in the comments of the Go APIs? #2013 |
I think we should add a new standard section to GEPs - "Conformance Details", that's a subsection of the "API" section, that lists the feature name(s) so that they're agreed on as part of this, and then can be used in standard ways as part of the conformance tests and profiles (this is particularly important for Extended features.) @shaneutt @mlavacca and anyone else who's working on the conformance profiles stuff. |
@youngnick @shaneutt @mlavacca and others - Added Nick's suggested "Conformance Details" section. PTAL. |
The latest proxy-api release, v0.10.0, adds fields to the `OutboundPolicies` API for configuring HTTP request timeouts, based on the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997. This branch updates the proxy-api dependency to v0.10.0 and adds the new timeout configuration fields to the proxy's internal client policy types. In addition, this branch adds a timeout middleware to the HTTP client policy stack, so that the timeout described by the `Rule.request_timeout` field is now applied. Implementing the `RouteBackend.request_timeout` field with semantics as close as possible to those described in GEP-1742 will be somewhat more complex, and will be added in a separate PR.
the following feature names: | ||
|
||
- `HTTPRouteRequestTimeout`: supports `rules.timeouts.request` in an `HTTPRoute`. | ||
- `HTTPRouteBackendTimeout`: supports `rules.timeouts.backendRequest` in an `HTTPRoute`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call it HTTPRouteBackendRequestTimeout
? What if we add other backend timeouts in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's necessary if we think of request
as the "default" backend timeout. If we add other ones in the future, e.g., rules.timeouts.backendIdle
, they need to include the kind explicitly: HTTPRouteBackendIdelTimeout
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to let the feature name matches the field name. But it's fine to keep the current name.
Depends on #2418 The latest proxy-api release, v0.10.0, adds fields to the `OutboundPolicies` API for configuring HTTP request timeouts, based on the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997. PR #2418 updates the proxy to depend on the new proxy-api release, and implements the `Rule.request_timeout` field added to the API. However, that branch does *not* add a timeout for the `RouteBackend.request_timeout` field. This branch changes the proxy to apply the backend request timeout when configured by the policy controller. This branch implements `RouteBackend.request_timeout` by adding an additional timeout layer in the `MatchedBackend` stack. This applies the per-backend timeout once a backend is selected for a route. I've also added stack tests for the interaction between the request and backend request timeouts. Note that once retries are added to client policy stacks, it may be necessary to move the backend request timeout to ensure it occurs "below" retries, depending on where the retry middleware ends up being located in the proxy stack.
* update HTTPRoute interfaces according to latest proposal in GEP-1742 * include Request and BackendRequest in HTTPRouteTimeouts * add validation for HTTPRouteTimeouts * add unit tests in httproute_test.go * relates to kubernetes-sigs#1997
* update HTTPRoute interfaces according to latest proposal in GEP-1742 * include Request and BackendRequest in HTTPRouteTimeouts * add validation for HTTPRouteTimeouts * add unit tests in httproute_test.go * relates to kubernetes-sigs#1997
* update HTTPRoute interfaces according to latest proposal in GEP-1742 * include Request and BackendRequest in HTTPRouteTimeouts * add validation for HTTPRouteTimeouts * add unit tests in httproute_test.go * relates to kubernetes-sigs#1997
Thanks for the work @frankbu, this LGTM. I'll leave someone else to unhold though. /lgtm |
Thanks @frankbu! /hold cancel |
* update HTTPRoute interfaces according to latest proposal in GEP-1742 * add Duration type as per GEP-2257 * include Request and BackendRequest in HTTPRouteTimeouts * add validation for HTTPRouteTimeouts * add unit tests in httproute_test.go * relates to kubernetes-sigs#1997
Implementation to support POC on HTTRoute Timeouts for GatewayAPI references: * kubernetes-sigs/gateway-api#2013 * https://gateway-api.sigs.k8s.io/geps/gep-1742/ * https://gateway-api.sigs.k8s.io/geps/gep-2257/ * kubernetes-sigs/gateway-api#1742 * kubernetes-sigs/gateway-api#1997
What type of PR is this?
/kind gep
What this PR does / why we need it:
Adds API proposal to GEP-1742.
Related issues: #1742
Which issue(s) this PR fixes:
Does this PR introduce a user-facing change?: