GEP-1742: HTTPRoute Timeouts API #1997

frankbu · 2023-05-04T19:31:59Z

What type of PR is this?

/kind gep

What this PR does / why we need it:

Adds API proposal to GEP-1742.
Related issues: #1742

Which issue(s) this PR fixes:

Does this PR introduce a user-facing change?:

k8s-ci-robot · 2023-05-04T19:32:09Z

Hi @frankbu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

howardjohn · 2023-05-04T19:32:50Z

geps/gep-1742.md

+	// One or more conditions can be specified using a ‘,’ delimited list.
+	RetryOn string `json:"retry_on,omitempty"`


If its a list it should be []string IMO

howardjohn · 2023-05-04T19:33:22Z

geps/gep-1742.md

+	// Specifies the conditions under which retry takes place.
+	// One or more conditions can be specified using a ‘,’ delimited list.
+	RetryOn string `json:"retry_on,omitempty"`
+	// Flag to specify whether the retries should retry to other localities.


This seems under-specified given there is no mention of "localities" anywhere else in the repo. Can we expand on this a bit.

Added another section below.

howardjohn · 2023-05-04T19:35:59Z

geps/gep-1742.md

+```go
+type HTTPRouteRule struct {
+	// Timeout for HTTP requests, disabled by default.
+	Timeout *time.Duration `json:"timeout,omitempty"`


Can we expand the specification for this a bit? Given the GEP already documents probably 20 types of "timeout" in https://gateway-api.sigs.k8s.io/geps/gep-1742/?h=1742#flow-diagrams-with-available-timeouts, having just a single field as "timeout" with no further explanation seems a bit ambiguous.

Added a paragraph above. PTAL

instead of a flat field, should this be tiered, so you can start off with a request timeout and add things like connect timeout in the future ?

@arkodg Good question. I think it depends on whether implementations support configuring other kinds of timeouts at the HTTPRoute level. Based on the diagrams in the previous sections, I think Envoy does not, but unclear about the others.

@youngnick do you have more insight on this?

Note that this GEP is actually defining two separate timeouts here... 🙂

HTTPRouteRule.timeout is defining an end-to-end timeout (a timeout for a single request as measured by the client), and

HTTPRouteRule.retries.perTryTimeout is defining a timeout for a single request as measured by the gateway rather than the client.

Both of these timeouts are useful! though, per the diagrams above, a given gateway may not be able to honor both.

Overall, I would prefer

HTTPRouteRule: timeouts: request: duration # end-to-end as seen by the client workloadRequest: duration # single request as seen by the gateway

with no perTryTimeout in the retry configuration. (And no, I'm not wedded to the names in the timeouts dict above.) I'd further suggest that the workloadRequest timeout not be part of the core.

@kflynn This suggestion sounds quite good to me. The reason that I dragged retries into this GEP in the first place is because perTryTimeout was so tightly coupled to retries, but moving it out and generalizing it like you suggest allows for better separation of concerns and to only focus on Timeouts in this GEP. I'm happy to make this change but will give others some time to comment first.

howardjohn · 2023-05-04T19:36:36Z

geps/gep-1742.md

+	// Timeout per attempt for a given request, including the initial call and any retries.
+	// Default is the value of the HTTPRouteRule request Timeout.
+	PerTryTimeout *time.Duration `json:"per_try_timeout,omitempty"`
+	// Specifies the conditions under which retry takes place.


What conditions are valid?

I explain below that HTTP error codes are valid and must be supported by implementations. Other implementation-specific identifier are also valid, depending on the implementation.

Does that sound reasonable?

Do you think I need to explain it more clearly in the below text?

howardjohn · 2023-05-04T19:37:15Z

geps/gep-1742.md

+
+### Timeout values
+
+Timeout and PerTryTimeout values are formatted as 1h/1m/1s/1ms and MUST BE >= 1ms.


Is this used anywhere else in gateway-api or core k8s? Is there a more formal spec?

is 1h5m valid?

I couldn't find any other use of duration in gateway-api. I couldn't find it in core k8s either, but maybe not looking in the right place? Otherwise, I think this is just the Go time.Duration string serialization that we'd use?

shaneutt · 2023-05-09T14:41:26Z

/cc

Let's hold, to make sure more people get some time to review.

/hold

kflynn

I'd vote for pulling retries into a separate GEP, but for pulling the perTryTimeout up into the main timeouts section. Thanks for diving into this!

kflynn · 2023-05-09T20:13:44Z

geps/gep-1742.md

+```go
+type HTTPRouteRule struct {
+	// Timeout for HTTP requests, disabled by default.
+	Timeout *time.Duration `json:"timeout,omitempty"`


Note that this GEP is actually defining two separate timeouts here... 🙂

HTTPRouteRule.timeout is defining an end-to-end timeout (a timeout for a single request as measured by the client), and

HTTPRouteRule.retries.perTryTimeout is defining a timeout for a single request as measured by the gateway rather than the client.

Both of these timeouts are useful! though, per the diagrams above, a given gateway may not be able to honor both.

Overall, I would prefer

HTTPRouteRule: timeouts: request: duration # end-to-end as seen by the client workloadRequest: duration # single request as seen by the gateway

with no perTryTimeout in the retry configuration. (And no, I'm not wedded to the names in the timeouts dict above.) I'd further suggest that the workloadRequest timeout not be part of the core.

kflynn · 2023-05-09T22:37:08Z

geps/gep-1742.md

+
+type HTTPRetry struct {
+	// Number of retries to be allowed for a given request.
+	Attempts int32 `json:"attempts,omitempty"`


I have quite a few comments on retries, but I really think that the retry portion here needs to be pulled out into a different GEP. 🙂

youngnick · 2023-05-11T16:46:54Z

Just a note to say I have seen this, but need a bit more time to give it a full review. I should hopefully be able to do this later on tomorrow.

frankbu · 2023-05-15T20:38:01Z

@youngnick @howardjohn @kflynn @arkodg and other reviewers:

I have made a number of updates to address reviewer comments and removed retry configuration from this proposal. This PR is now only addressing timeout configuration for GEP issue #1742. Will move discussion of retries to a separate GEP for issue #1731.

howardjohn · 2023-05-15T20:48:43Z

geps/gep-1742.md

+
+There a 2 kinds of timeouts that can be configured in an `HTTPRouteRule`:
+
+1. `timeouts.request` is an end-to-end timeout for a single request as measured by the client.


Pedantic: we cannot measure the client end-to-end time, as there is latency between the client and us.

Since I originally wrote that description... uh... maybe describe it as "an end-to-end timeout for a single request from the client to the Gateway API implementation", contrasted with "a single backend request from the Gateway API implementation to a workload"?

This would probably benefit from repeating the diagrams above to show where the two timeouts should measure.

Actually, this is not so pedantic. At a high level, the "useful" timeout that we're trying to support here is basically the time it takes to get a response. In the various diagrams above, it seems like there is a timeout that corresponds to the Finishes Response arrow, which is more-or-less what we want, but the question of when the timeout starts measuring is important. John pointed out to me that in the case of Envoy, the implementation route timeout (R timeout in the above diagram) does not start until the entire downstream request stream has been received. Not sure about Nginx and other impls, but seems we need to be more flexible with this definition, to make it implementable.

Maybe something like this:

timeouts.request is the timeout for the Gateway API implementation to send a response to a client HTTP request. Whether the gateway starts the timeout before or after the entire client request stream has been received, is implementation dependent.

Maybe we could first suggest the timeout begins after the entire client request stream has been received, and see if that is a problem for non-Envoy implementations?

Thoughts?

Looking at the diagrams I did earlier (needed to recover context myself), I think that it's going to be difficult to define this timeout in such a way that all the implementations can implement it.

I was going to say something like "from when the proxy has finished receiving the request from the client to when it starts sending the response back to the client", but some proxies can't do that, I think.

I'd definitely encourage @frankbu to take the initial source diagram and try and mark where the request timeout would cover on a diagram for this section, and for the backendRequest timeout as well.

It may be that we need to be more vague here and say something like "this timeout covers as close to the whole request->response transaction as is possible for the proxy implementation, given its configurable timeouts." and then leave some of it up to the implementations. This would have the downside of being difficult to test though.

I think that the common timeout we want to handle is for when it's taking too long to get the backend response. We should be able to test that, even if we make the spec a little vague about whether the timeout value includes other request overhead or not.

I'll update the wording to make it more clear, including adding the diagram with the proposed timeouts marked.

Updated wording and added diagram, PTAL.

https://github.com/kubernetes-sigs/gateway-api/blob/90606a864a21c81f9db9c4d99b63e3fa25c37895/geps/gep-1742.md#timeout-values

kflynn · 2023-05-16T02:56:08Z

geps/gep-1742.md

+A client HTTP request to a gateway may result in more than one call to the destination
+backend service, for example, if automatic retries are supported. Some implementations
+support configuring a timeout for the individual backend requests, seperate from the
+overall client request timeout. Adding None-Core support for this would also benefit


s/None-Core/non-Core/ ?

kflynn

@frankbu Thanks much!! I left a couple of small comments about wordsmithing but honestly, I'd be OK merging this. Others may have other thoughts, of course. 🙂

youngnick

Overall, the structure looks okay to me - although I think that as it stands, this proposal will not be able to be combined with a TimeoutPolicy to default or override these fields, as we currently don't allow Policy fields to mess with the values of fields within an object.

That would just mean that this setting would only be configurable on two levels: a default in the implementation somehow, configured via runtime config, environment variables, or similar, and then a HTTPRoute-specific value which must be configured on every HTTPRoute which needs a different timeout to the default. (The additional boilerplate adds up over lots of separate objects over time, in my experience).

frankbu · 2023-05-16T17:41:52Z

I think that as it stands, this proposal will not be able to be combined with a TimeoutPolicy to default or override these fields, as we currently don't allow Policy fields to mess with the values of fields within an object.

@youngnick I understand and agree that a TimeoutPolicy shouldn't be able to override the fields in the route rule, but why couldn't we (in the future) allow a TimeoutPolicy to set the default values at the namespace, gateway, etc. level?

frankbu · 2023-06-13T15:43:24Z

@youngnick

I suggest HTTPRouteRequestTimeout and HTTPRouteBackendTimeout.

SGTM. Should that be included in this document? If so where?

Maybe it should be included in the comments of the Go APIs? #2013

youngnick · 2023-06-14T11:21:07Z

I think we should add a new standard section to GEPs - "Conformance Details", that's a subsection of the "API" section, that lists the feature name(s) so that they're agreed on as part of this, and then can be used in standard ways as part of the conformance tests and profiles (this is particularly important for Extended features.)

@shaneutt @mlavacca and anyone else who's working on the conformance profiles stuff.

frankbu · 2023-06-14T13:10:50Z

@youngnick @shaneutt @mlavacca and others - Added Nick's suggested "Conformance Details" section. PTAL.

The latest proxy-api release, v0.10.0, adds fields to the `OutboundPolicies` API for configuring HTTP request timeouts, based on the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997. This branch updates the proxy-api dependency to v0.10.0 and adds the new timeout configuration fields to the proxy's internal client policy types. In addition, this branch adds a timeout middleware to the HTTP client policy stack, so that the timeout described by the `Rule.request_timeout` field is now applied. Implementing the `RouteBackend.request_timeout` field with semantics as close as possible to those described in GEP-1742 will be somewhat more complex, and will be added in a separate PR.

spacewander · 2023-06-15T07:01:17Z

geps/gep-1742.md

+the following feature names:
+
+- `HTTPRouteRequestTimeout`: supports `rules.timeouts.request` in an `HTTPRoute`.
+- `HTTPRouteBackendTimeout`: supports `rules.timeouts.backendRequest` in an `HTTPRoute`.


Should we call it HTTPRouteBackendRequestTimeout? What if we add other backend timeouts in the future?

I don't think it's necessary if we think of request as the "default" backend timeout. If we add other ones in the future, e.g., rules.timeouts.backendIdle, they need to include the kind explicitly: HTTPRouteBackendIdelTimeout.

Just want to let the feature name matches the field name. But it's fine to keep the current name.

Depends on #2418 The latest proxy-api release, v0.10.0, adds fields to the `OutboundPolicies` API for configuring HTTP request timeouts, based on the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997. PR #2418 updates the proxy to depend on the new proxy-api release, and implements the `Rule.request_timeout` field added to the API. However, that branch does *not* add a timeout for the `RouteBackend.request_timeout` field. This branch changes the proxy to apply the backend request timeout when configured by the policy controller. This branch implements `RouteBackend.request_timeout` by adding an additional timeout layer in the `MatchedBackend` stack. This applies the per-backend timeout once a backend is selected for a route. I've also added stack tests for the interaction between the request and backend request timeouts. Note that once retries are added to client policy stacks, it may be necessary to move the backend request timeout to ensure it occurs "below" retries, depending on where the retry middleware ends up being located in the proxy stack.

* update HTTPRoute interfaces according to latest proposal in GEP-1742 * include Request and BackendRequest in HTTPRouteTimeouts * add validation for HTTPRouteTimeouts * add unit tests in httproute_test.go * relates to kubernetes-sigs#1997

youngnick · 2023-06-26T14:56:02Z

Thanks for the work @frankbu, this LGTM. I'll leave someone else to unhold though.

/lgtm

robscott · 2023-06-26T18:43:38Z

Thanks @frankbu!

/hold cancel

* update HTTPRoute interfaces according to latest proposal in GEP-1742 * add Duration type as per GEP-2257 * include Request and BackendRequest in HTTPRouteTimeouts * add validation for HTTPRouteTimeouts * add unit tests in httproute_test.go * relates to kubernetes-sigs#1997

Implementation to support POC on HTTRoute Timeouts for GatewayAPI references: * kubernetes-sigs/gateway-api#2013 * https://gateway-api.sigs.k8s.io/geps/gep-1742/ * https://gateway-api.sigs.k8s.io/geps/gep-2257/ * kubernetes-sigs/gateway-api#1742 * kubernetes-sigs/gateway-api#1997

GEP-1742: Timeouts API proposal

accce11

k8s-ci-robot added kind/gep PRs related to Gateway Enhancement Proposal(GEP) cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 4, 2023

k8s-ci-robot requested a review from mikemorris May 4, 2023 19:32

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 4, 2023

k8s-ci-robot requested a review from youngnick May 4, 2023 19:32

howardjohn reviewed May 4, 2023

View reviewed changes

address comments

0525880

This was referenced May 5, 2023

gateway-api: add experimental API for Istio extensions istio/api#2770

Closed

gateway-api: add timeouts and retries support istio/istio#44571

Closed

k8s-ci-robot requested a review from shaneutt May 9, 2023 14:41

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2023

shaneutt requested review from kflynn, mlavacca and sunjayBhatia May 9, 2023 15:05

kflynn suggested changes May 9, 2023

View reviewed changes

SRodi mentioned this pull request May 12, 2023

HTTPRoute timeout #2013

Merged

remove retries proposal

7696daa

howardjohn reviewed May 15, 2023

View reviewed changes

kflynn reviewed May 16, 2023

View reviewed changes

kflynn approved these changes May 16, 2023

View reviewed changes

youngnick reviewed May 16, 2023

View reviewed changes

clarify and add diagram

09ca210

k8s-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 16, 2023

add feature names

bad35be

spacewander reviewed Jun 15, 2023

View reviewed changes

fix merge conflict

f0ac5ac

k8s-ci-robot assigned youngnick Jun 26, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2023

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2023

k8s-ci-robot merged commit 592d31a into kubernetes-sigs:main Jun 26, 2023

shaneutt mentioned this pull request Jul 24, 2023

shaneutt/1822 shaneutt/gateway-api#3

Closed

skriss mentioned this pull request Jul 28, 2023

Avoiding HTTP 504 errors for long-running upstream requests, with Gateway API projectcontour/contour#5159

Closed

shaneutt mentioned this pull request Aug 7, 2023

update owner aliases shaneutt/gateway-api#4

Closed

frankbu mentioned this pull request Sep 15, 2023

REQUEST: New membership for @frankbu kubernetes/org#4466

Closed

9 tasks

shaneutt mentioned this pull request Sep 18, 2023

shaneutt/conformance tests for gateway addresses shaneutt/gateway-api#5

Closed

candita mentioned this pull request Sep 22, 2023

REQUEST: New membership for candita kubernetes/org#4477

Closed

9 tasks

frankbu deleted the request-timeout branch September 25, 2023 14:20

SRodi mentioned this pull request Oct 13, 2023

Documentation for HTTPRoute Timeouts #2480

Closed

		// One or more conditions can be specified using a ‘,’ delimited list.
		RetryOn string `json:"retry_on,omitempty"`


		### Timeout values

		Timeout and PerTryTimeout values are formatted as 1h/1m/1s/1ms and MUST BE >= 1ms.


		There a 2 kinds of timeouts that can be configured in an `HTTPRouteRule`:

		1. `timeouts.request` is an end-to-end timeout for a single request as measured by the client.

GEP-1742: HTTPRoute Timeouts API #1997

GEP-1742: HTTPRoute Timeouts API #1997

Conversation

frankbu commented May 4, 2023 • edited Loading

k8s-ci-robot commented May 4, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kflynn May 9, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shaneutt commented May 9, 2023

kflynn left a comment

Choose a reason for hiding this comment

kflynn May 9, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

youngnick commented May 11, 2023

frankbu commented May 15, 2023 • edited Loading

Choose a reason for hiding this comment

kflynn May 16, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kflynn left a comment

Choose a reason for hiding this comment

youngnick left a comment

Choose a reason for hiding this comment

frankbu commented May 16, 2023 • edited Loading

frankbu commented Jun 13, 2023

youngnick commented Jun 14, 2023

frankbu commented Jun 14, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

youngnick commented Jun 26, 2023

robscott commented Jun 26, 2023

frankbu commented May 4, 2023 •

edited

Loading

kflynn May 9, 2023 •

edited

Loading

kflynn May 9, 2023 •

edited

Loading

frankbu commented May 15, 2023 •

edited

Loading

kflynn May 16, 2023 •

edited

Loading

frankbu commented May 16, 2023 •

edited

Loading