Commit e2dce38
[core] Use graceful actor shutdown when GCS polling detects actor ref deleted (#58605)
The tests that exercise actor failures when actors go out of scope, such
as `test_actor_ray_shutdown_called_on_del` and
`test_actor_ray_shutdown_called_on_scope_exit` [were
flaky](https://buildkite.com/ray-project/postmerge/builds/14336#019a7abe-73d3-46e0-8dc2-13351e12b7c3/613-1919).
This PR fixes the flakiness by ensuring actors use graceful shutdown
when GCS polling detects actor refs are deleted.
**Problem**
When actors go out of scope, GCS uses two mechanisms to detect reference
deletion:
1. Push model (`GcsActorManager::HandleReportActorOutOfScope`) - already
fixed in #57090
2. Pull model (`GcsActorManager::PollOwnerForActorRefDeleted`) - was
still using force kill
The pull model was calling `DestroyActor(..., force_kill=true)`, which
skips `__ray_shutdown__` and immediately terminates the actor. This
created a race condition: whichever mechanism completed first determined
whether cleanup callbacks ran, causing test flakiness.
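
The race described above can be illustrated with a toy model. This is only a sketch, not Ray's implementation: `ToyActor`, `destroy`, and `simulate` are hypothetical names, and it assumes that the first destroy call wins while any later call is a no-op.

```python
from dataclasses import dataclass


@dataclass
class ToyActor:
    """Hypothetical stand-in for an actor tracked by GCS."""
    alive: bool = True
    shutdown_hook_ran: bool = False

    def destroy(self, force_kill: bool) -> None:
        if not self.alive:
            return  # assumption: a second destroy request is a no-op
        if not force_kill:
            # Stands in for the actor's __ray_shutdown__ callback.
            self.shutdown_hook_ran = True
        self.alive = False


def simulate(first_path: str, pull_force_kill: bool) -> bool:
    """Return True if the cleanup hook ran.

    first_path is "push" or "pull": whichever detection path fires first.
    The push path is modeled as graceful; the pull path is configurable.
    """
    force_kill_by_path = {"push": False, "pull": pull_force_kill}
    order = [first_path] + [p for p in force_kill_by_path if p != first_path]
    actor = ToyActor()
    for path in order:
        actor.destroy(force_kill=force_kill_by_path[path])
    return actor.shutdown_hook_ran
```

With `pull_force_kill=True` the result depends on which path fires first; with `pull_force_kill=False` the hook runs in either order, which is the behavior the fix aims for.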
To fix the issue, this PR changes `PollOwnerForActorRefDeleted` to use
graceful shutdown with a timeout (the same as
`HandleReportActorOutOfScope`). I ran all the actor failure tests that
exercise this shutdown path 20 times locally; where they previously
failed 3 out of 20 runs, they succeeded every time after the fix.
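
As a rough illustration of "graceful shutdown with timeout", here is a minimal sketch in plain Python. It is not the GCS code: `graceful_shutdown` is a hypothetical helper, and it assumes that a cleanup callback which exceeds the timeout is abandoned in favor of a forced kill, which the description above does not spell out.

```python
import threading
from typing import Callable


def graceful_shutdown(cleanup: Callable[[], None], timeout_s: float) -> str:
    """Run cleanup in a worker thread and wait up to timeout_s for it.

    Returns "graceful" if cleanup finished in time, otherwise
    "force_killed" (assumed fallback once the timeout expires).
    """
    done = threading.Event()

    def run() -> None:
        cleanup()
        done.set()

    # Daemon thread so a stuck cleanup cannot keep the process alive.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    finished = done.wait(timeout_s)
    return "graceful" if finished else "force_killed"
```

A fast callback returns `"graceful"`; a callback that blocks longer than `timeout_s` returns `"force_killed"`.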
Signed-off-by: Sagar Sumit <[email protected]>

1 parent 88a6cee, commit e2dce38
1 file changed: +4 -1 lines (hunk spanning original lines 970-978, new lines 970-981).