[core] Use graceful actor shutdown when GCS polling detects actor ref deleted #58605

codope · 2025-11-13T22:10:57Z

The actor failure tests that exercise actors going out of scope, such as test_actor_ray_shutdown_called_on_del and test_actor_ray_shutdown_called_on_scope_exit, were flaky. This PR fixes the flakiness by ensuring actors use graceful shutdown when GCS polling detects that actor refs are deleted.

Problem
When actors go out of scope, GCS uses two mechanisms to detect reference deletion:

1. Push model (GcsActorManager::HandleReportActorOutOfScope) - already fixed in #57090, "[core] Use graceful shutdown path when actor OUT_OF_SCOPE (del actor)"
2. Pull model (GcsActorManager::PollOwnerForActorRefDeleted) - was still using force kill

The pull model was calling DestroyActor(..., force_kill=true), which skips __ray_shutdown__ and immediately terminates the actor. This created a race condition: whichever mechanism completed first determined whether cleanup callbacks ran, causing test flakiness.

To fix the issue, I changed PollOwnerForActorRefDeleted to use graceful shutdown with a timeout (same as HandleReportActorOutOfScope). I ran all the actor failure tests that exercise this shutdown path 20 times locally. Before the fix they failed in 3 of 20 runs; after the fix they passed every time.
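The race described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Ray's implementation: `ToyActor`, `ToyActorManager`, `destroy_actor`, and `run_scenario` are hypothetical names. The only behavior taken from the description is that a force kill skips `__ray_shutdown__`. The sketch additionally assumes that a destroy request for an already-dead actor is a no-op, and it does not model the graceful shutdown timeout.

```python
from itertools import permutations


class ToyActor:
    """Stand-in for an actor that defines a cleanup callback."""

    def __init__(self):
        self.alive = True
        self.cleanup_ran = False

    def __ray_shutdown__(self):
        self.cleanup_ran = True


class ToyActorManager:
    """Hypothetical, simplified model of the destroy path."""

    def __init__(self, actor):
        self.actor = actor

    def destroy_actor(self, force_kill):
        # Assumption: destroying an already-dead actor does nothing,
        # so whichever request arrives first decides the outcome.
        if not self.actor.alive:
            return
        if not force_kill:
            # Graceful path: the cleanup callback runs before termination.
            self.actor.__ray_shutdown__()
        self.actor.alive = False


def run_scenario(pull_force_kill):
    """Return the set of cleanup outcomes over both arrival orders."""
    # Maps mechanism name -> force_kill flag used by that mechanism.
    requests = {"push": False, "pull": pull_force_kill}
    outcomes = set()
    for order in permutations(requests):
        actor = ToyActor()
        manager = ToyActorManager(actor)
        for name in order:
            manager.destroy_actor(force_kill=requests[name])
        outcomes.add(actor.cleanup_ran)
    return outcomes


before_fix = run_scenario(pull_force_kill=True)
after_fix = run_scenario(pull_force_kill=False)
print("before fix:", before_fix)
print("after fix:", after_fix)
```

With the pull path force killing, the outcome depends on arrival order, which is the flakiness pattern the tests showed. With both paths graceful, the cleanup callback runs regardless of which mechanism fires first.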

gemini-code-assist

Code Review

This pull request addresses flakiness in actor failure tests by ensuring graceful shutdown for actors when GCS polling detects their references are deleted. The change in GcsActorManager::PollOwnerForActorRefDeleted correctly modifies the call to DestroyActor, setting force_kill to false and providing a graceful shutdown timeout. This aligns the behavior of the pull-based reference deletion mechanism with the push-based one, fixing the described race condition. The change is clear, well-targeted, and looks correct.

codope · 2025-11-13T22:29:21Z

@dayshah @edoakes I have a more basic question: why do we need two different mechanisms for the same thing? It feels like the polling mechanism is only there as a backup (say when owner crashes before sending the notification).

dayshah · 2025-11-13T22:58:26Z

ohh ok I remember this now
So ref deleted is for when the ref is totally deleted from the ref counter, e.g. there's no lineage ref count either, the actor is not restartable, etc., so the actor is dead dead.

The out of scope is for when the ref is out of scope (people are done using it but it can still be lineage reconstructed). This means the actor can be destroyed right now but could be restarted later.
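The distinction between the two events can be written down as a small decision function. This is a rough sketch of the explanation above, not Ray's reference counter: `ActorRefEvent`, `classify`, `may_restart_later`, and the two integer counts are hypothetical, and real restartability also depends on conditions this sketch ignores.

```python
from enum import Enum


class ActorRefEvent(Enum):
    IN_USE = "in_use"
    OUT_OF_SCOPE = "out_of_scope"
    REF_DELETED = "ref_deleted"


def classify(handle_refs: int, lineage_refs: int) -> ActorRefEvent:
    """Classify an actor's reference state (hypothetical simplification)."""
    if handle_refs > 0:
        # Someone still holds a handle, so nothing should be reported.
        return ActorRefEvent.IN_USE
    if lineage_refs > 0:
        # Callers are done with it, but lineage can still reconstruct it.
        return ActorRefEvent.OUT_OF_SCOPE
    # Nothing refers to the actor at all: it is gone for good.
    return ActorRefEvent.REF_DELETED


def may_restart_later(event: ActorRefEvent) -> bool:
    """Only an out-of-scope actor can be destroyed now and restarted later."""
    return event is ActorRefEvent.OUT_OF_SCOPE


for counts in [(1, 0), (0, 2), (0, 0)]:
    event = classify(*counts)
    print(counts, event.value, may_restart_later(event))
```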

It doesn't make that much sense to have the two directions. We should just always do worker -> GCS and not have the long-polling RPC open the whole time. The GCS also knows when workers die and when the owner dies, and it'll destroy the actor then too. There's also some core worker ref counting pubsub for object deletion that's a little repetitive with this. Also, we don't need to send the out-of-scope report if we send the delete one, so if it's always from the worker we can deduplicate that and have 1 less RPC per non-restartable actor.

Maybe @Sparks0219 can clean this up with the actor restart project 😃

dayshah · 2025-11-13T23:17:46Z

src/ray/gcs/gcs_actor_manager.cc

          DestroyActor(actor_id,
                       GenActorRefDeletedCause(GetActor(actor_id)),
-                       /*force_kill=*/true);
+                       /*force_kill=*/false,


It looks like we never set force_kill to true with the new timeout changes. Can we just kill the parameter completely?

Makes sense; can I do that in a separate PR?

… deleted (ray-project#58605)

The tests that exercised actor failures when they go out of scope, such as `test_actor_ray_shutdown_called_on_del` and `test_actor_ray_shutdown_called_on_scope_exit` [were flaky](https://buildkite.com/ray-project/postmerge/builds/14336#019a7abe-73d3-46e0-8dc2-13351e12b7c3/613-1919). This PR fixes the flakiness by ensuring actors use graceful shutdown when GCS polling detects actor refs are deleted.

**Problem**

When actors go out of scope, GCS uses two mechanisms to detect reference deletion:

1. Push model (`GcsActorManager::HandleReportActorOutOfScope`) - already fixed in ray-project#57090
2. Pull model (`GcsActorManager::PollOwnerForActorRefDeleted`) - was still using force kill

The pull model was calling DestroyActor(..., force_kill=true), which skips `__ray_shutdown__` and immediately terminates the actor. This created a race condition: whichever mechanism completed first determined whether cleanup callbacks ran, causing test flakiness.

To fix the issue, changed `PollOwnerForActorRefDeleted` to use graceful shutdown with timeout (same as `HandleReportActorOutOfScope`). I ran all the actor failure tests that exercise this shutdown path 20 times locally, and where they failed 3/20 previously, they succeeded every time after the fix.

Signed-off-by: Sagar Sumit <[email protected]>

[core] Use graceful actor shutdown when GCS polling detects actor ref…

82671a7

… deleted Signed-off-by: Sagar Sumit <[email protected]>

codope requested a review from a team as a code owner November 13, 2025 22:10

codope requested a review from dayshah November 13, 2025 22:11

gemini-code-assist bot reviewed Nov 13, 2025

codope added the "go" label (add ONLY when ready to merge, run all tests) Nov 13, 2025

dayshah reviewed Nov 13, 2025

ray-gardener bot added the "core" label (Issues that should be addressed in Ray Core) Nov 14, 2025

israbbani approved these changes Nov 14, 2025

edoakes merged commit e2dce38 into ray-project:master Nov 14, 2025
6 checks passed

edoakes mentioned this pull request Nov 17, 2025

CI test linux://python/ray/tests:test_actor_failures is flaky #58604

Closed
