You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The issue with the current implementation of core worker HandleKillActor
is that it won't send a reply when the RPC completes because the worker
is dead. The application code from the GCS doesn't really care since it
just logs the response if one is received, a response is only sent if
the actor ID of the actor on the worker and in the RPC don't match, and
the GCS will just log it and move on with its life.
Hence we can't differentiate in the case of a transient network failure
whether there was a network issue, or the actor was successfully killed.
What I think is the most straightforward approach is instead of the GCS
directly calling core worker KillActor, we have the GCS talk to the
raylet instead and call a new RPC KillLocalActor that in turn calls
KillActor. Since the raylet that receives KillLocalActor is local to the
worker that the actor is on, we're guaranteed to kill it at that point
(either through using KillActor, or if it hangs falling back to
SIGKILL).
Thus the main intuition is that the GCS now talks to the raylet, and
this layer implements retries. Once the raylet receives the
KillLocalActor request, it routes this to KillActor. This layer between
the raylet and core worker does not have retries enabled because we can
assume that RPCs between the local raylet and worker won't fail (same
machine). We then check on the status of the worker after a while (5
seconds via kill_worker_timeout_milliseconds) and if it still hasn't
been killed then we call DestroyWorker that in turn sends the SIGKILL.
---------
Signed-off-by: joshlee <[email protected]>
0 commit comments