fix: fix too short timeout causing cascading failures #4133
morotti wants to merge 1 commit into ray-project:master
The 2-second timeout on liveness probes is way too short. It causes cascading failures when the container is busy and cannot reply immediately. This is especially bad if you have CPU limits configured on the Ray pods. In addition, TCP takes 2-3 seconds to detect a lost packet and retry. You should NEVER have a timeout below 5 seconds in any production software. Signed-off-by: morotti <[email protected]>
Longer term, we really need to remove the dependency on exec probes. I believe that once we are using HTTP probes, we can use shorter timeouts with significantly better reliability. There's a PR for using HTTP probes (#2360); however, it's blocked on Ray unifying its health check endpoints: ray-project/ray#56204
  - bash
  - -c
- - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success
+ - wget -T 10 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success
It's not enough to increase the wget timeout; you also need to increase the probe timeout (`timeoutSeconds`) in the container's probe config.
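To make that point concrete, here is a minimal sketch of a liveness probe in which the Kubernetes-level timeout is raised together with the wget timeout. Only the command comes from this PR's diff; the `timeoutSeconds`, `periodSeconds`, and `failureThreshold` values are illustrative assumptions, not KubeRay's defaults.

```yaml
# Sketch only: the numeric values below are assumptions for illustration.
livenessProbe:
  exec:
    command:
      - bash
      - -c
      - wget -T 10 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success
  # Must be at least as long as the command can take. Two chained wget
  # calls with -T 10 could run for roughly 20 seconds in the worst case.
  timeoutSeconds: 20
  # Assumed values: keep the period above the timeout so probes do not overlap.
  periodSeconds: 30
  failureThreshold: 3
```

If `timeoutSeconds` stays lower than the wget timeout, the kubelet can fail the probe before wget gives up, so raising `-T` alone would have no effect.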
Btw, this is just an example manifest; you need to update the controller logic if you want to change the default behavior.
@andrewsykim you mentioned that this may be merely an example file and that the settings may be coming from somewhere else? I had a look, but I can't find where the setting comes from. Do you think you can find the source and update Ray?
Hi @morotti, for the KubeRay operator I think it should be here: kuberay/ray-operator/controllers/ray/utils/constant.go, lines 212 to 219 in 530318b. But you can also overwrite
@win5923 could you please follow up and do the bug fix yourself if there are other files to adjust too? I don't mind not getting the credit for the PR.

Why are these changes needed?
Hello,
The 2-second timeout on liveness probes is way too short. It causes cascading failures when the container is busy and cannot reply immediately. This is especially bad if you have CPU limits configured on the Ray pods, which restrict how much CPU the container can use.
In addition, TCP takes 2-3 seconds to detect a lost packet and retry. You should NEVER have a timeout below 5 seconds in any production software.