[core] Make KillActor RPC Fault Tolerant #57648
Conversation
Couple of comments about my design and some concerns I have. What are your thoughts, @codope?
src/ray/raylet/node_manager.cc
Outdated
worker->rpc_client()->KillActor(
    kill_actor_request,
    [kill_actor_error, timer](const ray::Status &status, const rpc::KillActorReply &) {
      *kill_actor_error = status;
Not sure how to do this a bit more elegantly; the only possible case where KillActor responds is if we hit this error check at
ray/src/ray/core_worker/core_worker.cc, line 3964 in 237268b:
if (intended_actor_id != worker_context_->GetCurrentActorID()) {
imo don't need to propagate back up to the gcs, since it throws it away
src/ray/gcs/gcs_actor.cc
Outdated
const rpc::Address &GcsActor::GetLocalRayletAddress() const {
  return local_raylet_address_;
}

void GcsActor::UpdateLocalRayletAddress(const rpc::Address &address) {
  local_raylet_address_.CopyFrom(address);
}
Nit, but usually C++ getters and setters look like this:
Current:
const rpc::Address &GcsActor::GetLocalRayletAddress() const {
  return local_raylet_address_;
}
void GcsActor::UpdateLocalRayletAddress(const rpc::Address &address) {
  local_raylet_address_.CopyFrom(address);
}

Suggested:
const rpc::Address &GcsActor::LocalRayletAddress() const {
  return local_raylet_address_;
}
rpc::Address &GcsActor::LocalRayletAddress() {
  return local_raylet_address_;
}
Updated the getter, but for the setter should I keep it as is?
Yeah that looks like a typo
Oh, it's not a typo: the "setter" usually just returns a non-const lvalue ref so you can modify the data member in place, however you want, without extra copies.
An UpdateMember-style function is limited to replacing the member rather than modifying it in place, and it needs extra move constructs or separate rvalue/lvalue overloads if you want flexibility in how it's replaced.
oh wait but at this point why isn't the member public ☠️ it doesn't matter lol
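For illustration, a standalone sketch of the two accessor styles discussed in this thread; `Address` here is a stand-in for `rpc::Address`, and none of this is the PR's actual code:

```cpp
#include <string>
#include <utility>

// Stand-in for rpc::Address (illustrative only).
struct Address {
  std::string ip_address;
  int port = 0;
};

class GcsActorSketch {
 public:
  // Style 1: const + non-const lvalue-ref accessors. Callers can read or
  // mutate the member in place without an extra copy.
  const Address &LocalRayletAddress() const { return local_raylet_address_; }
  Address &LocalRayletAddress() { return local_raylet_address_; }

  // Style 2: Update-style setter. Replaces the whole member; an rvalue
  // overload avoids a copy when the caller hands over a temporary.
  void UpdateLocalRayletAddress(const Address &address) {
    local_raylet_address_ = address;
  }
  void UpdateLocalRayletAddress(Address &&address) {
    local_raylet_address_ = std::move(address);
  }

 private:
  Address local_raylet_address_;
};

// Usage: in-place mutation vs. wholesale replacement.
// GcsActorSketch actor;
// actor.LocalRayletAddress().port = 1234;              // modify one field
// actor.UpdateLocalRayletAddress({"10.0.0.1", 1234});  // replace the whole member
```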
actor_local_raylet_address.set_node_id(node->node_id());
actor_local_raylet_address.set_ip_address(node->node_manager_address());
actor_local_raylet_address.set_port(node->node_manager_port());
actor->UpdateLocalRayletAddress(actor_local_raylet_address);
On actor death, in between restarts, there's probably a point where the actor doesn't have a local raylet.
The actor also doesn't have a local raylet at registration time and before creation completes.
local_raylet_address should probably be an optional, and we shouldn't make the RPC if it's nullopt.
Ahh, good catch, thanks. Yup, I modified it so that we don't make the RPC if it's nullopt.
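A minimal sketch of that guard with stand-in types (`Address` and `RayletClientStub` are illustrative, not Ray's actual classes):

```cpp
#include <iostream>
#include <optional>
#include <string>

// Stand-ins for rpc::Address and the raylet client; illustrative only.
struct Address {
  std::string ip_address;
  int port = 0;
};

struct RayletClientStub {
  void KillLocalActor(const std::string &actor_id) {
    std::cout << "KillLocalActor(" << actor_id << ")\n";
  }
};

// The guard suggested above: only issue the RPC when the actor currently
// has a local raylet (i.e. it is placed on a node and not mid-restart).
void MaybeKillActor(const std::optional<Address> &local_raylet_address,
                    RayletClientStub &client,
                    const std::string &actor_id) {
  if (!local_raylet_address.has_value()) {
    // The actor is between restarts, or registered but not yet created:
    // there is no raylet to send the request to, so skip the RPC.
    return;
  }
  client.KillLocalActor(actor_id);
}
```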
src/ray/raylet/node_manager.cc
Outdated
timer->expires_from_now(boost::posix_time::milliseconds(
    RayConfig::instance().kill_worker_timeout_milliseconds()));
timer->async_wait([this, send_reply_callback, kill_actor_error, worker_id, timer](
I think execute_after does some of this boilerplate for you.
Also, this is new functionality we're adding to KillActor, right? In what cases will the worker survive after it gets the KillActor request?
Yeah, agreed that execute_after is a lot cleaner, and yup, this is new functionality we're adding. I think the issue is that graceful shutdown could end up getting stuck, and I don't think KillActor takes this into account. There are multiple places where we use KillAsync, which sends a SIGTERM (graceful shutdown) and then a SIGKILL, hence I think we should follow the same pattern here.
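For reference, a standalone model of the graceful-then-force-kill pattern being discussed, using the same boost::asio timer primitives as the quoted code; function and parameter names are illustrative, and the execute_after helper mentioned above would wrap the timer boilerplate shown here:

```cpp
#include <boost/asio.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <functional>
#include <memory>

// Standalone sketch: issue the graceful kill immediately, then schedule a
// force kill that only runs if the worker is still alive after the timeout.
void KillWithTimeout(boost::asio::io_context &io,
                     std::function<void()> graceful_kill,
                     std::function<void()> force_kill,
                     std::shared_ptr<bool> worker_dead,
                     int timeout_ms) {
  graceful_kill();  // e.g. the KillActor RPC / SIGTERM path

  auto timer = std::make_shared<boost::asio::deadline_timer>(io);
  timer->expires_from_now(boost::posix_time::milliseconds(timeout_ms));
  timer->async_wait([timer, worker_dead, force_kill](
                        const boost::system::error_code &ec) {
    if (ec == boost::asio::error::operation_aborted) {
      return;  // timer was cancelled: the worker exited gracefully
    }
    if (!*worker_dead) {
      force_kill();  // graceful shutdown got stuck: fall back to SIGKILL
    }
  });
}
```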
codope left a comment
@Sparks0219 Thanks for the PR; I'm aligned with your design. A few comments:

> 1.) I decided to keep KillActor rather than have the Raylet use core worker Exit since I think it would get a bit bloated.

The decision to keep KillActor is correct. It provides actor ID validation to prevent killing the wrong actor after a restart. Plus, there is a separation between worker lifecycle (Exit) and actor termination (KillActor).

> 2.) From what I see, KillActor should already be idempotent because of the changes in d63f464

Yes, that commit makes it idempotent at the core worker level: ShutdownCoordinator tracks state, and subsequent calls are a no-op. Your code returns early if worker is nullptr; I suggested also adding a worker->IsDead() check below.

> In the current behavior of KillActor, we don't really have any mitigations if graceful shutdown gets stuck. Hence I added a timer and an async_wait call that does a force kill after a certain amount of time.

Good callout! But I noticed that the current changes always log a mismatch and report an error. Please check my comments below in NodeManager::HandleKillLocalActor.
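A minimal sketch of the suggested idempotency guard, with a stand-in Worker type (illustrative, not the PR's actual code):

```cpp
#include <memory>

// Stand-in for the raylet's worker handle; illustrative only.
struct Worker {
  bool IsDead() const { return dead_; }
  bool dead_ = false;
};

// Guard sketched from the review comment above: a duplicate kill request for
// a worker that is already gone (nullptr) or already shutting down (IsDead)
// should be treated as success rather than re-issuing the kill.
bool ShouldSkipKill(const std::shared_ptr<Worker> &worker) {
  return worker == nullptr || worker->IsDead();
}
```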
src/ray/gcs/gcs_actor_scheduler.cc
Outdated
ActorID actor_id) {
if (worker_address.node_id().empty()) {
  RAY_LOG(DEBUG) << "Invalid worker address, skip the killing of actor " << actor_id;
bool GcsActorScheduler::CleanupWorkerForActor(const rpc::Address &raylet_address,
Here I renamed it from KillActorOnWorker to CleanupWorkerForActor, since I think the latter makes more sense. This method is only used to clean up workers allocated for actors that the GCS has marked dead or whose creation failed, hence KillActorOnWorker didn't make much sense to me.
I think DestroyActor is meant to prevent the killed actor from restarting, while KillActor allows restarting... bad naming, and I think they could probably be unified by passing a flag to differentiate the restart behavior. Not entirely sure though; what do you think @codope?
yes, we should unify the two
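A hypothetical shape for the unified call, purely illustrative (the function name and flag semantics are not part of the PR):

```cpp
#include <iostream>
#include <string>

// Stand-in for ray::ActorID; illustrative only.
using ActorID = std::string;

// no_restart = true  -> DestroyActor semantics (the actor stays dead)
// no_restart = false -> KillActor semantics (the actor may restart)
void KillActorUnified(const ActorID &actor_id, bool force_kill, bool no_restart) {
  std::cout << "Killing actor " << actor_id << " force_kill=" << force_kill
            << " no_restart=" << no_restart << "\n";
  // A real implementation would dispatch to the existing kill path and, when
  // no_restart is set, also mark the actor as permanently dead in the GCS.
}
```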
src/ray/gcs/gcs_actor_scheduler.cc
Outdated
| RAY_LOG(DEBUG) << "Killing actor " << actor_id | ||
| << " with return status: " << status.ToString(); |
hm, shouldn't we be retrying here if we get a non-OK status?
at a minimum we should have an ERROR log here
We're already retrying implicitly via the retryable gRPC client for UNAVAILABLE/UNKNOWN statuses; I think for the other cases, like the actor ID mismatch in HandleKillActor, we don't want to retry.
Added an ERROR log to report this, though.
src/ray/gcs/gcs_actor_scheduler.cc
Outdated
| RAY_LOG(DEBUG) << "Invalid worker address, skip the killing of actor " << actor_id; | ||
| bool GcsActorScheduler::CleanupWorkerForActor(const rpc::Address &raylet_address, | ||
| const rpc::Address &worker_address, | ||
| ActorID actor_id) { |
please add an INFO level log that an actor is being killed (unless it's already logged in the callsite)
Done, sounds good
src/ray/gcs/gcs_actor_scheduler.h
Outdated
/// Cleanup the worker for an actor
/// \param raylet_address The address of the local raylet of the worker
/// \param worker_address The address of the worker to clean up
/// \param actor_id ID of the actor (may be Nil if actor setup failed)
bool CleanupWorkerForActor(const rpc::Address &raylet_address,
                           const rpc::Address &worker_address,
                           ActorID actor_id);
It's very unclear what "cleanup" means. Let's pick a better name and also clarify the exact behavior in the header comment.
I renamed it to KillLeasedWorkerForActor and updated the header comments
// NOTE: on a successful kill, we don't expect a reply back from the dead actor.
// The only case where we receive a reply is if the mismatched actor ID check is
// triggered.
});
Bug: Race condition in HandleKillLocalActor. The replied flag is a std::shared_ptr<bool> but is accessed concurrently from both the timer callback (running on the io_service thread) and the KillActor RPC callback (running on a gRPC thread). Additionally, the KillActor callback doesn't set *replied = true after sending an error reply, which could allow the timer to also fire and send a duplicate reply.
The callback checks if (!status.ok() && !*replied) and then calls timer->cancel() and send_reply_callback(), but never sets *replied = true. This creates two problems:
- Without atomic operations, there's a data race when both callbacks access replied concurrently.
- Even if the callback sends an error reply, the timer could still see *replied == false and send another reply.
The fix should be: (a) change replied to std::atomic<bool> or protect it with a mutex, and (b) set *replied = true immediately after checking it in the callback, using an atomic compare-and-swap or under a lock.
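For reference, a sketch of the reply-exactly-once pattern this comment proposes; note the author's reply below argues it isn't needed in the PR, because both callbacks are posted to the same io_service thread:

```cpp
#include <atomic>
#include <memory>

// "Reply exactly once" helper: TryClaim() returns true for exactly one
// caller, even if the timer callback and the RPC callback race.
struct ReplyOnce {
  std::atomic<bool> replied{false};

  bool TryClaim() {
    bool expected = false;
    return replied.compare_exchange_strong(expected, true);
  }
};

// Usage sketch:
//   auto once = std::make_shared<ReplyOnce>();
//   KillActor callback: if (!status.ok() && once->TryClaim()) { timer->cancel(); reply(status); }
//   Timer callback:     if (once->TryClaim()) { force_kill(); reply(ok_status); }
```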
1.) The KillActor callback is posted onto the main io_service thread, so it's single-threaded; there's no race condition.
2.) We don't need to set *replied = true when timer->cancel() is triggered, since the timer callback is never executed if the timer hasn't fired. If the timer has fired, the timer callback will have already set *replied = true, so we'll skip sending the reply.
Why are these changes needed?
The issue with the current implementation of core worker HandleKillActor is that it won't send a reply when the RPC completes, because the worker is dead. The application code in the GCS doesn't really care, since it just logs the response if one is received; a response is only sent if the actor ID of the actor on the worker and the one in the RPC don't match, and the GCS just logs it and moves on with its life.
Hence, in the case of a transient network failure, we can't differentiate whether there was a network issue or the actor was successfully killed. I think the most straightforward approach is that instead of the GCS directly calling core worker KillActor, the GCS talks to the raylet and calls a new RPC, KillLocalActor, which in turn calls KillActor. Since the raylet that receives KillLocalActor is local to the worker the actor is on, we're guaranteed to kill it at that point (either through KillActor, or by falling back to SIGKILL if it hangs).
Thus the main intuition is that the GCS now talks to the raylet, and this layer implements retries. Once the raylet receives the KillLocalActor request, it routes it to KillActor. This layer between the raylet and the core worker does not have retries enabled, because we can assume that RPCs between the local raylet and its workers won't fail (same machine). We then check on the status of the worker after a while (5 seconds via kill_worker_timeout_milliseconds), and if it still hasn't been killed, we call DestroyWorker, which in turn sends the SIGKILL.
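As a rough illustration of the GCS-side leg, here is a standalone sketch with stand-in types (none of these names are the PR's actual classes); the raylet client is assumed to retry transient failures internally, so the callback only sees a final status:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Stand-in status type; illustrative only.
struct Status {
  bool ok = true;
  std::string message;
};

// Stand-in for "send KillLocalActor to the actor's local raylet and invoke
// the callback with the final status once retries are exhausted".
using KillLocalActorRpc = std::function<void(std::function<void(const Status &)>)>;

// The GCS asks the actor's *local* raylet to perform the kill. By the time
// the callback fires, either the raylet has taken ownership of the kill
// (including the SIGKILL fallback), or the node itself is unreachable/dead.
void KillLeasedWorkerForActor(const std::string &actor_id,
                              const KillLocalActorRpc &kill_local_actor) {
  std::cout << "Requesting kill of actor " << actor_id << " from its raylet\n";
  kill_local_actor([actor_id](const Status &status) {
    if (!status.ok) {
      // Nothing left to retry at this layer; just report the failure.
      std::cerr << "KillLocalActor for " << actor_id
                << " failed after retries: " << status.message << "\n";
    }
  });
}
```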
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- If I have added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.