Use HTTP probes for Ray readiness and liveness probes #2360

Open
andrewsykim wants to merge 1 commit into ray-project:master from andrewsykim:http-probes

Conversation

@andrewsykim
Member

andrewsykim commented Sep 6, 2024

Why are these changes needed?

HTTP probes are generally lighter-weight than exec probes, since the kubelet issues the request itself instead of starting a process inside the container. However, an exec probe can run multiple health checks in a single command. In KubeRay, we use exec probes to run `wget` commands against multiple endpoints. The exec probes seem to be causing some issues, as shown in #2264 and in KubeRay scalability testing.

This PR explores using HTTP probes instead. It needs more consideration, because an HTTP probe can only health check one endpoint. Marking WIP for now until that question is resolved.
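To make the trade-off concrete, here is a minimal sketch of how an exec probe can chain several `wget` checks into one command, in the style of the `strings.Join(commands, " && ")` call in the diff. The `healthCheck` type, the `execProbeCommand` helper, the `wget` flags chosen, and the port numbers (assumed Ray defaults) are illustrative and not taken from the KubeRay source.

```go
package main

import (
	"fmt"
	"strings"
)

// healthCheck describes one HTTP endpoint to verify.
// Hypothetical type, used only for this illustration.
type healthCheck struct {
	Port int
	Path string
}

// execProbeCommand chains one wget call per endpoint with "&&", so the exec
// probe succeeds only if every wget call exits successfully.
func execProbeCommand(checks []healthCheck, timeoutSeconds int) []string {
	commands := make([]string, 0, len(checks))
	for _, c := range checks {
		url := fmt.Sprintf("http://localhost:%d%s", c.Port, c.Path)
		commands = append(commands, fmt.Sprintf("wget -T %d -q -O- %s", timeoutSeconds, url))
	}
	return []string{"bash", "-c", strings.Join(commands, " && ")}
}

func main() {
	// Ports are assumptions: 52365 for the dashboard agent, 8265 for the dashboard.
	checks := []healthCheck{
		{Port: 52365, Path: "/api/local_raylet_healthz"},
		{Port: 8265, Path: "/api/gcs_healthz"},
	}
	fmt.Println(strings.Join(execProbeCommand(checks, 2), " "))
}
```

An HTTP probe has no equivalent of the `&&` chain: it carries a single path and port, which is the limitation this PR has to work around.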

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

  FailureThreshold: utils.DefaultLivenessProbeFailureThreshold,
  }
- rayContainer.LivenessProbe.Exec = &corev1.ExecAction{Command: []string{"bash", "-c", strings.Join(commands, " && ")}}
+ rayContainer.LivenessProbe.HTTPGet = &corev1.HTTPGetAction{Path: healthCheckPath, Port: healthCheckPort}
Member Author

Using HTTP probes means we can only query one endpoint per probe now. For the head pod this would be /api/gcs_healthz, and for worker pods it would be /api/local_raylet_healthz. I'm not sure whether skipping /api/local_raylet_healthz on the head pod is problematic; it depends on whether /api/gcs_healthz incorporates raylet health in some way as well.
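The per-role choice described above could look roughly like the sketch below. The `nodeType` and `httpProbeTarget` types and the `probeTargetFor` helper are hypothetical names, and the ports are assumed Ray defaults; only the two paths come from this thread.

```go
package main

import "fmt"

// nodeType distinguishes head pods from worker pods (hypothetical type).
type nodeType string

const (
	headNode   nodeType = "head"
	workerNode nodeType = "worker"
)

// httpProbeTarget is the single path and port an HTTP probe can query.
type httpProbeTarget struct {
	Path string
	Port int
}

// probeTargetFor picks the one endpoint to probe for a given pod role.
// Ports are assumptions: 8265 for the dashboard, 52365 for the dashboard agent.
func probeTargetFor(t nodeType) (httpProbeTarget, error) {
	switch t {
	case headNode:
		return httpProbeTarget{Path: "/api/gcs_healthz", Port: 8265}, nil
	case workerNode:
		return httpProbeTarget{Path: "/api/local_raylet_healthz", Port: 52365}, nil
	default:
		return httpProbeTarget{}, fmt.Errorf("unknown node type %q", t)
	}
}

func main() {
	for _, t := range []nodeType{headNode, workerNode} {
		target, err := probeTargetFor(t)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s pod probes port %d at %s\n", t, target.Port, target.Path)
	}
}
```

With this shape, the head pod never queries /api/local_raylet_healthz, which is exactly the open question raised in this comment.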

@YQ-Wang
Contributor

YQ-Wang commented Sep 10, 2024

We also face this issue when the workload is high.

@kevin85421
Member

@andrewsykim do we still need this PR after #2353 has been merged?

@andrewsykim
Member Author

I think we should still consider HTTP probes; they are significantly lighter weight. I haven't root-caused the issue yet, but increasing the timeout did not fully resolve the high load I'm seeing from exec probes.

@joshhvulcan

I have encountered some bizarre behavior with the exec probes that I think would be solved with http probes. The biggest issue may actually be a k8s bug though. The ray-head container had died, but the autoscaler container was still running, so the pod was kept alive by k8s. The probe was erroring because the exec itself could not start, and k8s took no action because the probe failed to fail? (lol). Anyway, http probes likely would have done the needful here.

Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "03e693753f58930cd9bf004e047ff1cf7c26afd30ea916cbe0d291e130ea9d27": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown

@andrewsykim
Member Author

andrewsykim commented Oct 31, 2024

I'm still in favor of this change and I would like to see it merged for v1.3. However, using http probes means we can only probe one HTTP endpoint per probe. Specifically for the head pod, it means probing only the dashboard endpoint and not the raylet agent. Are we okay with that change? @kevin85421 @joshhvulcan

@joshhvulcan

I think an http probe on the dashboard would be sufficient for the failures we have experienced.

andrewsykim changed the title from "[WIP] Use HTTP probes for Ray readiness and liveness probes" to "Use HTTP probes for Ray readiness and liveness probes" on Oct 31, 2024
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
@andrewsykim
Member Author

PR updated, PTAL

@metasyn
Contributor

metasyn commented Dec 10, 2024

A side benefit is that this does not force custom images to include wget

@kevin85421
Member

@spencer-p
Contributor

ray-project/ray#56943 is now merged, so we no longer need to pick between endpoints. Maybe it'll make Ray 2.53 and we can feature-gate this.

@400Ping
Contributor

400Ping commented Jan 17, 2026

Hi, are you still going to work on this?

@spencer-p
Contributor

I'll take a look this week at reviving this
