-
Notifications
You must be signed in to change notification settings - Fork 7k
[core] Clean up test_raylet_resubscribe_to_worker_death and relevant Raylet logs
#58244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
test_raylet_resubscribe_worker_death and relevant Raylet logstest_raylet_resubscribe_to_worker_death and relevant Raylet logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request significantly improves the test_raylet_resubscribe_worker_death test by making it more robust and easier to understand. The changes, including adding explicit checks for OwnerDiedError and ensuring GCS is responsive after a restart, are excellent for test stability. Additionally, the cleanup of log messages across various Raylet components enhances clarity and moves towards more structured logging, which is beneficial for debugging. I've found one minor typo in a log message, but otherwise, the changes are solid.
Signed-off-by: Edward Oakes <[email protected]>
| : io_service_(io_service) {} | ||
|
|
||
| PeriodicalRunner::~PeriodicalRunner() { | ||
| RAY_LOG(DEBUG) << "PeriodicalRunner is destructed"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessarily noisy when debug logs were on
| if (worker) { | ||
| // The client is a worker. | ||
| std::shared_ptr<WorkerInterface> worker; | ||
| if ((worker = worker_pool_.GetRegisteredWorker(client))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dayshah is this frowned upon by c++ enjoyers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, i kinda like the if value syntax, scopes it so it gets destroyed earlier + can't be accessed outside
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
…t Raylet logs (ray-project#58244) `test_gcs_fault_tolerance.py:: test_worker_raylet_resubscription` is still flaky in CI despite bumping up the timeout. Making a few improvements here: - Increasing the timeout to `20s` just in case it's a timeout issue (unlikely). - Changing to scheduling an actor instead of using `internal_kv` for our signal that the GCS is back up. This should better indicate that the Raylet is resubscribed. - Cleaning up some system logs. - Modifying the `ObjectLostError` logs to avoid logging likely-irrelevant plasma usage on owner death. It's likely that the underlying issue here is that we don't actually reliably resubscribe to all worker death notifications, as indicated in the TODO in the PR. --------- Signed-off-by: Edward Oakes <[email protected]>
…t Raylet logs (ray-project#58244) `test_gcs_fault_tolerance.py:: test_worker_raylet_resubscription` is still flaky in CI despite bumping up the timeout. Making a few improvements here: - Increasing the timeout to `20s` just in case it's a timeout issue (unlikely). - Changing to scheduling an actor instead of using `internal_kv` for our signal that the GCS is back up. This should better indicate that the Raylet is resubscribed. - Cleaning up some system logs. - Modifying the `ObjectLostError` logs to avoid logging likely-irrelevant plasma usage on owner death. It's likely that the underlying issue here is that we don't actually reliably resubscribe to all worker death notifications, as indicated in the TODO in the PR. --------- Signed-off-by: Edward Oakes <[email protected]>
…t Raylet logs (ray-project#58244) `test_gcs_fault_tolerance.py:: test_worker_raylet_resubscription` is still flaky in CI despite bumping up the timeout. Making a few improvements here: - Increasing the timeout to `20s` just in case it's a timeout issue (unlikely). - Changing to scheduling an actor instead of using `internal_kv` for our signal that the GCS is back up. This should better indicate that the Raylet is resubscribed. - Cleaning up some system logs. - Modifying the `ObjectLostError` logs to avoid logging likely-irrelevant plasma usage on owner death. It's likely that the underlying issue here is that we don't actually reliably resubscribe to all worker death notifications, as indicated in the TODO in the PR. --------- Signed-off-by: Edward Oakes <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
test_gcs_fault_tolerance.py:: test_worker_raylet_resubscriptionis still flaky in CI despite bumping up the timeout. Making a few improvements here:20sjust in case it's a timeout issue (unlikely).internal_kvfor our signal that the GCS is back up. This should better indicate that the Raylet is resubscribed.ObjectLostErrorlogs to avoid logging likely-irrelevant plasma usage on owner death.It's likely that the underlying issue here is that we don't actually reliably resubscribe to all worker death notifications, as indicated in the TODO in the PR.