
Conversation


@edoakes edoakes commented Oct 28, 2025

`test_gcs_fault_tolerance.py::test_worker_raylet_resubscription` is still flaky in CI despite the earlier timeout bump. This PR makes a few improvements:

- Increases the timeout to `20s` just in case it's a timeout issue (unlikely).
- Switches from `internal_kv` to scheduling an actor as the signal that the GCS is back up; this better indicates that the Raylet has resubscribed.
- Cleans up some system logs.
- Modifies the `ObjectLostError` logs to avoid logging likely-irrelevant plasma usage on owner death.

It's likely that the underlying issue here is that we don't actually reliably resubscribe to all worker death notifications, as indicated in the TODO in the PR.
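The readiness check described above (schedule an actor rather than write to `internal_kv`) boils down to polling a condition until it holds or a timeout expires. A minimal sketch of that polling pattern, assuming a plain callable stands in for "scheduling an actor against the restarted GCS succeeded" (the real test uses Ray's test utilities and a real actor):

```python
import time

def wait_for_condition(predicate, timeout_s=20.0, interval_s=0.1):
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")

# Hypothetical stub that succeeds on the third poll, standing in for
# "the GCS accepted an actor after restarting".
attempts = {"n": 0}

def gcs_back_up():
    attempts["n"] += 1
    return attempts["n"] >= 3

assert wait_for_condition(gcs_back_up, timeout_s=5.0)
```

The point of polling a real scheduling path instead of a KV read is that success exercises the same resubscription machinery the test is trying to verify.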

Signed-off-by: Edward Oakes <[email protected]>
@edoakes edoakes requested a review from a team as a code owner October 28, 2025 15:14
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Oct 28, 2025
@edoakes edoakes changed the title [core] Clean up test_raylet_resubscribe_worker_death and relevant Raylet logs [core] Clean up test_raylet_resubscribe_to_worker_death and relevant Raylet logs Oct 28, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly improves the test_raylet_resubscribe_worker_death test by making it more robust and easier to understand. The changes, including adding explicit checks for OwnerDiedError and ensuring GCS is responsive after a restart, are excellent for test stability. Additionally, the cleanup of log messages across various Raylet components enhances clarity and moves towards more structured logging, which is beneficial for debugging. I've found one minor typo in a log message, but otherwise, the changes are solid.

    : io_service_(io_service) {}

PeriodicalRunner::~PeriodicalRunner() {
  RAY_LOG(DEBUG) << "PeriodicalRunner is destructed";
Collaborator Author


This was unnecessarily noisy when debug logs were on.

std::shared_ptr<WorkerInterface> worker;
if ((worker = worker_pool_.GetRegisteredWorker(client))) {
  // The client is a worker.
Collaborator Author


@dayshah is this frowned upon by C++ enjoyers?

Contributor


No, I kind of like the if-value syntax; it scopes the variable so it gets destroyed earlier and can't be accessed outside the `if`.
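For readers more at home in Python, the closest analogue to the assign-inside-`if` pattern discussed above is the walrus operator, with one caveat: unlike the C++ form, Python does not limit the variable's scope to the `if` body. A hypothetical sketch, where `get_registered_worker` stands in for `worker_pool_.GetRegisteredWorker`:

```python
registered = {"client-1": "worker-1"}

def get_registered_worker(client):
    # Hypothetical stand-in for worker_pool_.GetRegisteredWorker(client):
    # returns the worker for a registered client, else None.
    return registered.get(client)

def classify(client):
    # Assign and test in one expression, mirroring the C++
    # `if ((worker = worker_pool_.GetRegisteredWorker(client)))` pattern.
    if (worker := get_registered_worker(client)) is not None:
        return f"worker:{worker}"
    return "driver-or-unknown"

print(classify("client-1"))  # worker:worker-1
print(classify("client-2"))  # driver-or-unknown
```

Note that `worker` is still visible after the `if` in Python, so the early-destruction benefit mentioned above is specific to C++.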

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 28, 2025
@edoakes edoakes enabled auto-merge (squash) October 30, 2025 15:33
@github-actions github-actions bot disabled auto-merge October 30, 2025 20:42
@edoakes edoakes enabled auto-merge (squash) October 30, 2025 20:46
@edoakes edoakes merged commit fd1e404 into ray-project:master Oct 30, 2025
7 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
…t Raylet logs (ray-project#58244)

landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…t Raylet logs (ray-project#58244)

Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…t Raylet logs (ray-project#58244)

Labels: `core` (Issues that should be addressed in Ray Core), `go` (add ONLY when ready to merge, run all tests)

4 participants