Use HTTP probes for Ray readiness and liveness probes #2360
andrewsykim wants to merge 1 commit into ray-project:master from
  	FailureThreshold: utils.DefaultLivenessProbeFailureThreshold,
  }
- rayContainer.LivenessProbe.Exec = &corev1.ExecAction{Command: []string{"bash", "-c", strings.Join(commands, " && ")}}
+ rayContainer.LivenessProbe.HTTPGet = &corev1.HTTPGetAction{Path: healthCheckPath, Port: healthCheckPort}
Using HTTP probes means we can only query one endpoint per probe now. For the head pod this would be /api/gcs_healthz, and for worker pods it would be /api/local_raylet_healthz. I'm not sure whether skipping the /api/local_raylet_healthz check in the head pod is problematic; it depends on whether /api/gcs_healthz incorporates raylet health in some way as well.
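A minimal Go sketch of that constraint, using only the standard library. `nodeType` and `healthPath` are hypothetical names for illustration, not KubeRay identifiers; the two paths are the ones named in the comment above.

```go
package main

import "fmt"

// nodeType is a hypothetical stand-in for however the operator
// distinguishes head pods from worker pods.
type nodeType string

const (
	headNode   nodeType = "head"
	workerNode nodeType = "worker"
)

// healthPath returns the single path an HTTP probe would query.
// An HTTP probe takes exactly one path, so a choice has to be made here.
func healthPath(t nodeType) string {
	if t == headNode {
		return "/api/gcs_healthz"
	}
	return "/api/local_raylet_healthz"
}

func main() {
	for _, t := range []nodeType{headNode, workerNode} {
		fmt.Printf("%s pod probes %s\n", t, healthPath(t))
	}
}
```

With this shape, the head pod's raylet endpoint is never queried, which is the open question raised above.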
|
We also face this issue when the workload is high.
|
@andrewsykim do we still need this PR after #2353 has been merged?
|
I think we should still consider using HTTP probes; they are significantly lighter weight. I haven't root-caused the issue I'm seeing, but increasing the timeout did not fully resolve the high load caused by exec probes.
|
I have encountered some bizarre behavior with the exec probes that I think would be solved with HTTP probes. The biggest issue may actually be a k8s bug though. The
|
I'm still in favor of this change and I would like to see it merged for v1.3. However, using HTTP probes means we can only probe one HTTP endpoint per container. Specifically for the head pod, it means probing only the dashboard endpoint and not the raylet agent. Are we okay with that change? @kevin85421 @joshhvulcan
|
I think an HTTP probe on the dashboard would be sufficient for the failures we have experienced.
Force-pushed from 865a01a to e7cdb70
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Force-pushed from e7cdb70 to d106ab9
|
PR updated, PTAL
|
A side benefit is that this does not force custom images to include wget.
|
ray-project/ray#56943 is now merged, so we don't need to pick between endpoints. Maybe it'll make Ray 2.53 and we can feature-gate this.
|
Hi, are you still going to work on this?
|
I'll take a look this week at reviving this |
Why are these changes needed?
HTTP probes are considered lighter-weight than exec probes. However, exec probes have the advantage that a single probe can run multiple health checks. In KubeRay, we use exec probes to execute "wget" commands against multiple endpoints. Use of exec probes seems to be causing some issues, as shown in #2264 and in KubeRay scalability testing.
This PR explores using HTTP probes instead. It needs more consideration, as using HTTP probes means we can only health check one endpoint per probe. Marking WIP for now until that question is resolved.
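The trade-off described above can be sketched in Go with only the standard library. This is an illustration under stated assumptions, not KubeRay code: `probe`, `chained`, and `fakeRay` are hypothetical helpers, the handler is an in-process stand-in that serves both health paths from one place (a simplification), and the success rule assumes the Kubernetes convention that an HTTP probe passes for any status code from 200 up to but not including 400.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// probe mimics one HTTP probe: a single GET against a single path.
// It passes when the status code is >= 200 and < 400.
func probe(h http.Handler, path string) bool {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code >= 200 && rec.Code < 400
}

// chained mimics the exec probe's "wget a && wget b" chain:
// every endpoint must pass, and the first failure stops the chain.
func chained(h http.Handler, paths ...string) bool {
	for _, p := range paths {
		if !probe(h, p) {
			return false
		}
	}
	return true
}

// fakeRay is a hypothetical stand-in for Ray's health endpoints.
// rayletHealthy controls what the raylet endpoint reports.
func fakeRay(rayletHealthy bool) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/gcs_healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/api/local_raylet_healthz", func(w http.ResponseWriter, r *http.Request) {
		if rayletHealthy {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	return mux
}

func main() {
	h := fakeRay(false) // GCS healthy, raylet unhealthy

	fmt.Println("single probe on GCS passes:", probe(h, "/api/gcs_healthz"))
	fmt.Println("chained check passes:", chained(h, "/api/gcs_healthz", "/api/local_raylet_healthz"))
}
```

When the raylet endpoint reports 503, the single probe on /api/gcs_healthz still passes while the chained check fails, which is the coverage gap this PR has to weigh against the lower cost of HTTP probes.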
Related issue number
Checks