Consistent http.Server timeout configurations #30248
Conversation
I'm a bit concerned about the more aggressive timeout causing test flakiness, or worse performance issues at scale. 1 second is almost always too short for our CI environment where many test cases are running in parallel.
What do you think about starting with 2 seconds, and leaving this in master for the v14 performance tests before backporting?
+1. From experience I'd much rather start larger and err on the side of caution and slowly reduce the timeout over time.
> What do you think about starting with 2 seconds, and leaving this in master for the v14 performance tests before backporting?
Sounds like a great plan to me. In commit ff22401ecd6b5c95793e8c03cba3ea77e1c642e2 I updated the configured timeout to 10 seconds. I feel like 2 seconds is likely fine, but we are coming from 30 seconds so even at 10, I feel like this is a notable net gain.
Maybe in Teleport 15 we go to 2 seconds (no concerns with moving slowly to make sure we don't cause any impact).
I have seen two test failures, I worry this could be increasing the flakiness.
The timeouts seem pretty reasonable though, does it make sense to reduce test concurrency so that tests can complete faster? @zmb3 what are your thoughts?
Which tests? It's probably Go 1.21 and not your change
Seen twice:

```
=== Failed
=== FAIL: lib/auth TestAutoRotation (1.75s)
    tls_test.go:418:
        Error Trace:  /__w/teleport/teleport/lib/auth/tls_test.go:418
        Error:        Error "write tcp 127.0.0.1:47596->127.0.0.1:32991: write: broken pipe" does not contain "certificate"
        Test:         TestAutoRotation
```
I assumed it might be the write timeout causing the connection to be closed.
#30253
It may or may not be your change (most likely not), but I don't think we have found the root cause for the above flaky test yet.
This is fine, but it doesn't really matter; this is just an example program and is not part of Teleport.
This commit builds on the work from #30151 in the following ways:

* A couple of additional server configurations that were missing timeouts are now covered
* Timeouts are now configured in a consistent way. This means:
  - Configuring the `ReadTimeout`, which was not covered by only setting `ReadHeaderTimeout`
  - Setting `ReadHeaderTimeout` to the more aggressive (1 second) `defaults.ReadHeadersTimeout`
  - Setting a `WriteTimeout` in cases of potentially large responses
This PR removes the `ReadTimeout` and `WriteTimeout` settings from `kube/proxy.Server`. The revert is required because both settings were terminating watch streams early and causing failures when parsing the long-lived data stream.

Signed-off-by: Tiago Silva <tiago.silva@goteleport.com>
@jentfoo FYI: this change broke not just Kube Access, which @tigrato fixed in #31945, but application access as well (it took us a week to troubleshoot and pinpoint the cause). I removed all timeouts set by this PR in the application access request path in #34843. I think we should carefully reevaluate all the other places where this PR introduced timeouts as well; for example, I see it also sets them in the local proxy, which I think may be prone to the same issue. TBH I'm tempted to just roll this back entirely.
Looks like this was also the cause of #34201.
This PR builds on the work from #30151 in the following ways:

* Configuring the `ReadTimeout`, which was not covered by only setting `ReadHeaderTimeout`
* Setting `ReadHeaderTimeout` to the more aggressive (1 second) `defaults.ReadHeadersTimeout`
* Setting a `WriteTimeout` in cases of potentially large responses