Commit e2dce38

[core] Use graceful actor shutdown when GCS polling detects actor ref deleted (#58605)
The tests that exercise actor failures when actors go out of scope, such as `test_actor_ray_shutdown_called_on_del` and `test_actor_ray_shutdown_called_on_scope_exit`, [were flaky](https://buildkite.com/ray-project/postmerge/builds/14336#019a7abe-73d3-46e0-8dc2-13351e12b7c3/613-1919). This PR fixes the flakiness by ensuring actors use graceful shutdown when GCS polling detects that actor refs are deleted.

**Problem**

When actors go out of scope, GCS uses two mechanisms to detect reference deletion:

1. Push model (`GcsActorManager::HandleReportActorOutOfScope`) - already fixed in #57090
2. Pull model (`GcsActorManager::PollOwnerForActorRefDeleted`) - was still using force kill

The pull model was calling `DestroyActor(..., force_kill=true)`, which skips `__ray_shutdown__` and immediately terminates the actor. This created a race condition: whichever mechanism completed first determined whether cleanup callbacks ran, causing test flakiness.

To fix the issue, `PollOwnerForActorRefDeleted` now uses graceful shutdown with a timeout (same as `HandleReportActorOutOfScope`).

I ran all the actor failure tests that exercise this shutdown path 20 times locally. Before the fix they failed in 3 of 20 runs; after the fix they passed every time.

Signed-off-by: Sagar Sumit <[email protected]>
1 parent 88a6cee commit e2dce38

1 file changed: +4 -1 lines changed
src/ray/gcs/gcs_actor_manager.cc

Lines changed: 4 additions & 1 deletion
@@ -970,9 +970,12 @@ void GcsActorManager::PollOwnerForActorRefDeleted(
         if (node_it != owners_.end() && node_it->second.count(owner_id)) {
           // Only destroy the actor if its owner is still alive. The actor may
           // have already been destroyed if the owner died.
+          int64_t timeout_ms = RayConfig::instance().actor_graceful_shutdown_timeout_ms();
           DestroyActor(actor_id,
                        GenActorRefDeletedCause(GetActor(actor_id)),
-                       /*force_kill=*/true);
+                       /*force_kill=*/false,
+                       nullptr,
+                       timeout_ms);
         }
       });
 }
