
Conversation


@edoakes edoakes commented Oct 28, 2025

`test_gcs_fault_tolerance.py::test_worker_raylet_resubscription` is still flaky in CI despite the earlier timeout bump. This PR makes a few improvements:

- Increases the timeout to `20s` just in case it's a timeout issue (unlikely).
- Switches from `internal_kv` to scheduling an actor as the signal that the GCS is back up; this better indicates that the Raylet has resubscribed.
- Cleans up some system logs.
- Modifies the `ObjectLostError` logs to avoid logging likely-irrelevant plasma usage on owner death.

It's likely that the underlying issue here is that we don't actually reliably resubscribe to all worker death notifications, as indicated in the TODO in the PR.
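The readiness check described above (schedule an actor rather than write to `internal_kv`) boils down to polling a condition until it holds or a timeout expires. A minimal sketch of that polling pattern, assuming a plain callable stands in for "scheduling an actor against the restarted GCS succeeded" (the real test uses Ray's test utilities and a real actor):

```python
import time

def wait_for_condition(predicate, timeout_s=20.0, interval_s=0.1):
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")

# Hypothetical stub that succeeds on the third poll, standing in for
# "the GCS accepted an actor after restarting".
attempts = {"n": 0}

def gcs_back_up():
    attempts["n"] += 1
    return attempts["n"] >= 3

assert wait_for_condition(gcs_back_up, timeout_s=5.0)
```

The point of polling a real scheduling path instead of a KV read is that success exercises the same resubscription machinery the test is trying to verify.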

Signed-off-by: Edward Oakes <[email protected]>
@edoakes edoakes requested a review from a team as a code owner October 28, 2025 15:14
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Oct 28, 2025
@edoakes edoakes changed the title [core] Clean up test_raylet_resubscribe_worker_death and relevant Raylet logs [core] Clean up test_raylet_resubscribe_to_worker_death and relevant Raylet logs Oct 28, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly improves the test_raylet_resubscribe_worker_death test by making it more robust and easier to understand. The changes, including adding explicit checks for OwnerDiedError and ensuring GCS is responsive after a restart, are excellent for test stability. Additionally, the cleanup of log messages across various Raylet components enhances clarity and moves towards more structured logging, which is beneficial for debugging. I've found one minor typo in a log message, but otherwise, the changes are solid.

    : io_service_(io_service) {}

PeriodicalRunner::~PeriodicalRunner() {
  RAY_LOG(DEBUG) << "PeriodicalRunner is destructed";
Collaborator Author


This was unnecessarily noisy when debug logs were on.

std::shared_ptr<WorkerInterface> worker;
if ((worker = worker_pool_.GetRegisteredWorker(client))) {
  // The client is a worker.
Collaborator Author


@dayshah is this frowned upon by C++ enjoyers?

Contributor


No, I kind of like the if-value syntax; it scopes the variable so it gets destroyed earlier and can't be accessed outside the `if`.
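For readers more at home in Python, the closest analogue to the assign-inside-`if` pattern discussed above is the walrus operator, with one caveat: unlike the C++ form, Python does not limit the variable's scope to the `if` body. A hypothetical sketch, where `get_registered_worker` stands in for `worker_pool_.GetRegisteredWorker`:

```python
registered = {"client-1": "worker-1"}

def get_registered_worker(client):
    # Hypothetical stand-in for worker_pool_.GetRegisteredWorker(client):
    # returns the worker for a registered client, else None.
    return registered.get(client)

def classify(client):
    # Assign and test in one expression, mirroring the C++
    # `if ((worker = worker_pool_.GetRegisteredWorker(client)))` pattern.
    if (worker := get_registered_worker(client)) is not None:
        return f"worker:{worker}"
    return "driver-or-unknown"

print(classify("client-1"))  # worker:worker-1
print(classify("client-2"))  # driver-or-unknown
```

Note that `worker` is still visible after the `if` in Python, so the early-destruction benefit mentioned above is specific to C++.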

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 28, 2025
@edoakes edoakes enabled auto-merge (squash) October 30, 2025 15:33
@github-actions github-actions bot disabled auto-merge October 30, 2025 20:42
@edoakes edoakes enabled auto-merge (squash) October 30, 2025 20:46
@edoakes edoakes merged commit fd1e404 into ray-project:master Oct 30, 2025
7 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
…t Raylet logs (ray-project#58244)

landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…t Raylet logs (ray-project#58244)

Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…t Raylet logs (ray-project#58244)

Labels: `core` (Issues that should be addressed in Ray Core), `go` (add ONLY when ready to merge, run all tests)

4 participants