Add unified /api/healthz #56943

Merged
edoakes merged 14 commits into ray-project:master from spencer-p:unified-health-pr on Dec 4, 2025

@spencer-p
Contributor

Why are these changes needed?

This new endpoint on the HealthzAgent class combines the status of /api/local_raylet_healthz and /api/gcs_healthz into one endpoint for use with Kubernetes.
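
As a rough illustration of the idea, and not the code from this PR, the sketch below folds per-component results into a single probe response. The helper name `combine_health`, the component names, and the response strings are all hypothetical. The status codes are chosen because Kubernetes HTTP probes treat any status from 200 up to (but not including) 400 as success.

```python
from typing import Dict, Optional, Tuple


def combine_health(results: Dict[str, Optional[str]]) -> Tuple[int, str]:
    """Hypothetical helper: fold per-component results into one response.

    A value of None means the component reported healthy; a string is the
    failure reason reported for that component.
    """
    failures = [
        f"{name}: {reason}"
        for name, reason in results.items()
        if reason is not None
    ]
    if failures:
        # Any failing component makes the whole node report unhealthy.
        return 503, "; ".join(failures)
    return 200, "success"


print(combine_health({"raylet": None, "gcs": None}))
print(combine_health({"raylet": None, "gcs": "timed out"}))
```

With a single combined response like this, a Kubernetes probe only needs to be pointed at one path instead of two.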

Related issue number

See #56204.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@spencer-p spencer-p requested a review from a team as a code owner September 25, 2025 23:48
@spencer-p spencer-p marked this pull request as draft September 25, 2025 23:49
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new unified health check endpoint /api/healthz which combines the health status of the local Raylet and the GCS. The implementation is clear and achieves its goal. I have a couple of suggestions to improve code quality: one regarding Python's comparison operators for better correctness and another to remove some unreachable code for better clarity and maintainability.

@spencer-p spencer-p force-pushed the unified-health-pr branch 2 times, most recently from 69de984 to 5ace9dd October 1, 2025 18:47
@edoakes edoakes added the go label (add ONLY when ready to merge, run all tests) Oct 1, 2025
Collaborator

@edoakes edoakes left a comment

Let's add some basic tests. If it's possible to write integration tests for this, that would be great, but not sure how because if the raylet and/or GCS become unhealthy, I think the agent will shortly crash...

Else we can scaffold up some unit tests (without too much mocking please)
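
One way to keep such unit tests light on mocking is to pass the component checks in as plain async callables and substitute hand-written fakes in the test. The sketch below assumes that shape; `unified_health`, `healthy`, and `unhealthy` are hypothetical names for illustration, not functions from Ray or from this PR.

```python
import unittest


async def unified_health(check_raylet, check_gcs) -> bool:
    # Hypothetical function under test: healthy only if both checks pass.
    raylet_ok = await check_raylet()
    gcs_ok = await check_gcs()
    return raylet_ok and gcs_ok


async def healthy() -> bool:
    # Plain fake standing in for a passing component check.
    return True


async def unhealthy() -> bool:
    # Plain fake standing in for a failing component check.
    return False


class UnifiedHealthTest(unittest.IsolatedAsyncioTestCase):
    async def test_both_healthy(self):
        self.assertTrue(await unified_health(healthy, healthy))

    async def test_gcs_unhealthy(self):
        self.assertFalse(await unified_health(healthy, unhealthy))

    async def test_raylet_unhealthy(self):
        self.assertFalse(await unified_health(unhealthy, healthy))


# Run the suite programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(UnifiedHealthTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the fakes are ordinary coroutines, no patching of the raylet or GCS clients is needed to cover the healthy and unhealthy combinations.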

@edoakes
Collaborator

edoakes commented Oct 1, 2025

I also kicked off premerge CI by adding the go label, it'll run on each commit now: https://buildkite.com/ray-project/premerge/builds/50330

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label (The issue is stale. It will be closed within 7 days unless there are further conversation) Oct 16, 2025
@edoakes
Collaborator

edoakes commented Oct 16, 2025

unstale

@github-actions github-actions bot added the unstale label (A PR that has been marked unstale. It will not get marked stale again if this label is on it.) and removed the stale label (The issue is stale. It will be closed within 7 days unless there are further conversation) Oct 17, 2025
@400Ping
Contributor

400Ping commented Nov 19, 2025

Hi any updates on this?

This new endpoint on the HealthzAgent class combines the status of
/api/local_raylet_healthz and /api/gcs_healthz into one endpoint for use
with Kubernetes. See ray-project#56204.

Signed-off-by: Spencer Peterson <spencerjp@google.com>
- imports
- Use asyncio.gather instead of TaskGroup, which is not available on
  3.10
- Fix missing happy path in GCS check

Signed-off-by: Spencer Peterson <spencerjp@google.com>
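
The switch to asyncio.gather mentioned in the commit message above can be sketched as follows. `asyncio.TaskGroup` was added in Python 3.11, while `asyncio.gather` is also available on 3.10. The two check coroutines here are hypothetical stand-ins, and the use of `return_exceptions=True` is an assumption of this sketch; the PR may handle failures differently.

```python
import asyncio


async def check_raylet() -> str:
    # Stand-in for a real raylet liveness check (hypothetical).
    await asyncio.sleep(0)
    return "raylet ok"


async def check_gcs() -> str:
    # Stand-in for a real GCS check (hypothetical); fails on purpose.
    await asyncio.sleep(0)
    raise TimeoutError("gcs timed out")


async def run_checks() -> list:
    # asyncio.gather runs both checks concurrently and works on Python 3.10.
    # With return_exceptions=True, raised exceptions come back as results,
    # in argument order, instead of propagating out of gather.
    return await asyncio.gather(
        check_raylet(), check_gcs(), return_exceptions=True
    )


results = asyncio.run(run_checks())
errors = [r for r in results if isinstance(r, Exception)]
print(results[0], errors)
```

Collecting exceptions as results lets the caller report every failing component rather than only the first one that raised.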
fixes missing await and bad logging references.

Signed-off-by: Spencer Peterson <spencerjp@google.com>
@spencer-p spencer-p marked this pull request as ready for review November 25, 2025 01:55
@spencer-p
Contributor Author

Tests are up; ready for review.

@spencer-p spencer-p changed the title from "[WIP] Add unified /api/healthz" to "Add unified /api/healthz" Nov 25, 2025
@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling), and community-contribution (Contributed by the community) labels Nov 25, 2025
@jjyao jjyao requested a review from edoakes November 25, 2025 23:08
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Contributor

@400Ping 400Ping left a comment

Nice.

Signed-off-by: Spencer Peterson <spencerjp@google.com>
@edoakes edoakes merged commit 888083b into ray-project:master Dec 4, 2025
6 checks passed
Labels

  • community-contribution: Contributed by the community
  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
  • unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

4 participants