Updated default timeout seconds for probes #2265
kevin85421 merged 2 commits into ray-project:master
Hi @HarshAgarwal11, would you mind fixing the CI error and running unit tests locally? You can read this doc for more details: https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md#running-the-tests
@kevin85421 CI errors have been fixed.
  // Ray FT default readiness probe values
  DefaultReadinessProbeInitialDelaySeconds = 10
- DefaultReadinessProbeTimeoutSeconds = 1
+ DefaultReadinessProbeTimeoutSeconds = 2
I ran into issues with the probe timeout in v1.1 as well.
I think this probe timeout should actually be 4 or 5 seconds for the head Pod, because the head Pod's probe runs both the agent health check and the GCS health check:
wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
&& wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success
Together these can take up to 4 seconds. Thoughts, @kevin85421 @HarshAgarwal11?
Yes, it should be 4 to 5 seconds. Because the two commands are chained with &&, their timeouts can add up. With 2 seconds I was still getting some timeouts, though less often than before. After changing it to 5 seconds I didn't see any timeouts, although there were still some failures.
Opened a separate issue to track exec probe issues: #2355
Why are these changes needed?
Updated the default probe timeout seconds for Ray clusters from 1 second to 2 seconds, matching the 2-second wget timeout used in the probe commands. The wget timeout is now also taken from the default readiness probe timeout seconds.
Previously, the default timeout for both probes was 1 second, so a probe could fail before its wget command had finished.
Related issue number
#2264
Checks