
Conversation

@Sparks0219
Contributor

Why are these changes needed?

The issue with the current implementation of core worker HandleKillActor is that on a successful kill it never sends a reply, because the worker is dead by the time the RPC completes. The application code in the GCS doesn't really care: a reply is only sent if the actor ID on the worker doesn't match the one in the RPC, and the GCS just logs whatever response it receives and moves on with its life.

Hence, when the RPC fails, we can't tell whether it was a transient network issue or the actor was actually killed successfully. The most straightforward approach, I think, is to have the GCS talk to the raylet instead of calling core worker KillActor directly: the GCS calls a new RPC, KillLocalActor, which in turn calls KillActor. Since the raylet that receives KillLocalActor is local to the worker the actor is on, we're guaranteed to kill it at that point (either through KillActor, or by falling back to SIGKILL if the graceful path hangs).

Thus the main intuition is that the GCS now talks to the raylet, and this layer implements retries. Once the raylet receives the KillLocalActor request, it routes it to KillActor. The raylet-to-core-worker layer does not have retries enabled, because we can assume RPCs between the local raylet and its worker won't fail (same machine). We then check on the worker after a delay (5 seconds via kill_worker_timeout_milliseconds), and if it still hasn't been killed we call DestroyWorker, which sends the SIGKILL.
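As a rough illustration of the timeout path (not the actual Ray code), here is a minimal boost::asio sketch of the "graceful kill, then force kill after a delay" pattern; SendKillActorRpc, WorkerStillAlive, and ForceKillWorker are hypothetical stand-ins for the KillActor RPC, the liveness check, and the DestroyWorker/SIGKILL fallback:

// Minimal sketch of graceful-kill-then-SIGKILL with a timeout.
#include <boost/asio.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>
#include <memory>

void SendKillActorRpc() { std::cout << "sent KillActor to the worker\n"; }
bool WorkerStillAlive() { return true; }  // pretend graceful shutdown hung
void ForceKillWorker() { std::cout << "still alive; falling back to SIGKILL\n"; }

int main() {
  boost::asio::io_context io;
  SendKillActorRpc();
  // Mirrors kill_worker_timeout_milliseconds: check back after the delay and
  // escalate if the worker has not exited yet.
  auto timer = std::make_shared<boost::asio::deadline_timer>(
      io, boost::posix_time::milliseconds(5000));
  timer->async_wait([timer](const boost::system::error_code &ec) {
    if (!ec && WorkerStillAlive()) {
      ForceKillWorker();
    }
  });
  io.run();
  return 0;
}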

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
@Sparks0219
Contributor Author

Couple comments about my design/concerns I have:
1.) I decided to keep KillActor rather than have the raylet use core worker Exit, since I think that would get a bit bloated. From what I can see, they both call into the shutdown coordinator, but they juggle different sets of RPC-provided params to set various fields on it. So I think using Exit for KillLocalActor would just get messier with not much benefit.
2.) From what I can see, KillActor should already be idempotent because of the changes in d63f464, i.e. we track whether there's an ongoing shutdown request and subsequent ones are no-ops (see the sketch after this list).
3.) In the current behavior of KillActor, we don't have any mitigation if graceful shutdown gets stuck. Hence I added a timer and an async_wait call that force-kills after a certain amount of time (5s). This isn't ideal, because if we get a DisconnectClient message we should immediately cancel the timer, but I don't think it's too important since the GCS isn't blocked waiting for the response.
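As a simplified stand-in for point 2 (not the actual ShutdownCoordinator from d63f464), this is the shape of the idempotency being relied on: the first shutdown request wins and later ones are no-ops.

#include <iostream>

class ShutdownCoordinator {
 public:
  // Returns true only for the first caller; duplicates are ignored.
  bool RequestShutdown() {
    if (shutdown_requested_) {
      return false;
    }
    shutdown_requested_ = true;
    return true;
  }

 private:
  bool shutdown_requested_ = false;
};

int main() {
  ShutdownCoordinator coordinator;
  std::cout << coordinator.RequestShutdown() << "\n";  // 1: shutdown starts
  std::cout << coordinator.RequestShutdown() << "\n";  // 0: duplicate, no-op
  return 0;
}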

What are your thoughts @codope ?

@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Oct 11, 2025
@Sparks0219 Sparks0219 requested review from codope and dayshah October 11, 2025 23:29
@Sparks0219 Sparks0219 marked this pull request as ready for review October 11, 2025 23:30
@Sparks0219 Sparks0219 requested a review from a team as a code owner October 11, 2025 23:30
worker->rpc_client()->KillActor(
kill_actor_request,
[kill_actor_error, timer](const ray::Status &status, const rpc::KillActorReply &) {
*kill_actor_error = status;
Contributor Author

Not sure how to do this more elegantly; the only possible case where KillActor responds is if we hit this error check at

if (intended_actor_id != worker_context_->GetCurrentActorID()) {
and I want to propagate this error to the callback below if cancellation happens

Contributor

imo don't need to propagate back up to the gcs, since it throws it away


@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 12, 2025
Comment on lines 63 to 70
const rpc::Address &GcsActor::GetLocalRayletAddress() const {
return local_raylet_address_;
}

void GcsActor::UpdateLocalRayletAddress(const rpc::Address &address) {
local_raylet_address_.CopyFrom(address);
}

Contributor

nit but usually C++ getters and setters look like this

Suggested change
-const rpc::Address &GcsActor::GetLocalRayletAddress() const {
-  return local_raylet_address_;
-}
-void GcsActor::UpdateLocalRayletAddress(const rpc::Address &address) {
-  local_raylet_address_.CopyFrom(address);
-}
+const rpc::Address &GcsActor::LocalRayletAddress() const {
+  return local_raylet_address_;
+}
+rpc::Address &GcsActor::LocalRayletAddress() {
+  return local_raylet_address_;
+}

Contributor Author

Updated the getter, but for the setter should I keep it as is?

Contributor

Yeah that looks like a typo

Contributor

oh not a typo, the "setter" usually just returns a non-const lvalue ref so you can more efficiently make changes however you want to the data member.

An UpdateMember func is limited to replacing the value rather than modifying it in place, and needs extra move constructs or separate rvalue/lvalue overloads, etc., if you want to give flexibility on how it's replaced.
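A quick sketch of the accessor pattern being described (a generic example, not the GcsActor code): the "setter" is just a non-const overload returning a mutable reference, so callers modify the member in place instead of replacing it.

#include <iostream>
#include <string>

class Example {
 public:
  const std::string &Name() const { return name_; }  // read-only access
  std::string &Name() { return name_; }              // mutable access

 private:
  std::string name_;
};

int main() {
  Example ex;
  ex.Name() = "actor";     // assign through the mutable overload
  ex.Name().append("-0");  // or modify in place without a copy
  std::cout << ex.Name() << "\n";
  return 0;
}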

Contributor

oh wait but at this point why isn't the member public ☠️ it doesn't matter lol

actor_local_raylet_address.set_node_id(node->node_id());
actor_local_raylet_address.set_ip_address(node->node_manager_address());
actor_local_raylet_address.set_port(node->node_manager_port());
actor->UpdateLocalRayletAddress(actor_local_raylet_address);
Contributor

On actor death, in between restarts, there's probably a point where the actor doesn't have a local raylet.
The actor also doesn't have a local raylet at registration time or before creation completes.

local_raylet_address should probably be an optional and we shouldn't make the rpc if it's nullopt
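A hedged sketch of the suggested guard (Address is a stand-in struct, not the real rpc::Address): keep the raylet address in a std::optional and skip the RPC while it is nullopt, e.g. between restarts or before creation completes.

#include <iostream>
#include <optional>
#include <string>

struct Address {
  std::string ip;
  int port = 0;
};

void SendKillLocalActor(const Address &raylet) {
  std::cout << "KillLocalActor -> " << raylet.ip << ":" << raylet.port << "\n";
}

int main() {
  std::optional<Address> local_raylet_address;  // nullopt until the actor is placed
  if (local_raylet_address.has_value()) {
    SendKillLocalActor(*local_raylet_address);
  } else {
    std::cout << "no local raylet yet; skipping the kill RPC\n";
  }
  return 0;
}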

Contributor Author

ahhh good catch thanks, yup I modified it so that we don't make the rpc if it's nullopt.


timer->expires_from_now(boost::posix_time::milliseconds(
RayConfig::instance().kill_worker_timeout_milliseconds()));
timer->async_wait([this, send_reply_callback, kill_actor_error, worker_id, timer](
Contributor

i think execute_after does some of this boilerplate for you

Also this is new functionality we're adding to kill actor right? In what cases will the worker survive after it got the kill actor req?

Contributor Author

Yea, agreed execute_after is a lot cleaner, and yup, this is new functionality we're adding. The issue is that graceful shutdown could get stuck, and I don't think KillActor takes this into account. There are multiple places where we use KillAsync, which sends a SIGTERM (graceful shutdown) and then a SIGKILL, so I think we should follow the same pattern here.



Signed-off-by: joshlee <[email protected]>
@Sparks0219 Sparks0219 requested a review from dayshah October 14, 2025 22:07
Signed-off-by: joshlee <[email protected]>

Contributor

@codope codope left a comment

@Sparks0219 Thanks for the PR, and I'm aligned with your design. A few comments:

1.) I decided to keep KillActor rather than have the Raylet use core worker Exit since I think it would get a bit bloated.

The decision to keep KillActor is correct. It provides actor ID validation to prevent killing the wrong actor after a restart. Plus, it preserves the separation between worker lifecycle (Exit) and actor termination (KillActor).

2.) From what I see, KillActor should already be idempotent because of the changes in d63f464

Yes, that commit makes it idempotent at the core worker level: ShutdownCoordinator tracks the state, and subsequent calls are no-ops. Your code returns early if the worker is nullptr; I suggested also adding a worker->IsDead() check below.

In the current behavior of KillActor, we don't really have any mitigations if in graceful shutdown it gets stuck. Hence I added a timer and an async_wait call that does a force kill after a certain amount of time

Good callout! But I noticed that the current changes always log a mismatch and report an error. Please check my comments below in NodeManager::HandleKillLocalActor.

Signed-off-by: joshlee <[email protected]>
@Sparks0219 Sparks0219 requested a review from codope October 15, 2025 22:11
ActorID actor_id) {
if (worker_address.node_id().empty()) {
RAY_LOG(DEBUG) << "Invalid worker address, skip the killing of actor " << actor_id;
bool GcsActorScheduler::CleanupWorkerForActor(const rpc::Address &raylet_address,
Collaborator

how does this fit in with @codope's changes in #57090?

seems we have three separate paths for killing an actor? DestroyActor, KillActor, CleanupWorkerForActor... there should be only one code path in the GCS for this if possible

Contributor Author

@Sparks0219 Sparks0219 Oct 17, 2025

Here I renamed it from KillActorOnWorker to CleanupWorkerForActor since I think the latter makes more sense. This method is only used to clean up workers allocated for actors that the GCS has marked dead or whose creation failed, hence KillActorOnWorker didn't make much sense to me.

I think DestroyActor is meant to prevent the killed actor from restarting, while KillActor allows restarting... bad naming, and I think they could probably be unified by passing a flag to differentiate the restart behavior. Not entirely sure though, what do you think @codope?

Contributor

yes, we should unify the two

Signed-off-by: joshlee <[email protected]>
@Sparks0219 Sparks0219 requested review from codope and edoakes October 21, 2025 04:31
Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
Comment on lines 553 to 554
RAY_LOG(DEBUG) << "Killing actor " << actor_id
<< " with return status: " << status.ToString();
Collaborator

hm, shouldn't we be retrying here if we get a non-OK status?

at a minimum we should have an ERROR log here

Contributor Author

We're already retrying implicitly via the retryable gRPC client for status UNAVAILABLE/UNKNOWN; for the other cases, like the actor ID mismatch in HandleKillActor, I don't think we want to retry.
Added an ERROR log to report this though
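For illustration only (the status codes below are stand-ins, not ray::StatusCode), the retry split described above looks roughly like this: transient transport errors are retried by the retryable gRPC client, while application-level errors such as the actor ID mismatch are surfaced and not retried.

enum class StatusCode { OK, Unavailable, Unknown, InvalidArgument };

// Transient transport failures are worth retrying; logical errors are not.
bool IsRetryableStatus(StatusCode code) {
  return code == StatusCode::Unavailable || code == StatusCode::Unknown;
}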

RAY_LOG(DEBUG) << "Invalid worker address, skip the killing of actor " << actor_id;
bool GcsActorScheduler::CleanupWorkerForActor(const rpc::Address &raylet_address,
const rpc::Address &worker_address,
ActorID actor_id) {
Collaborator

please add an INFO level log that an actor is being killed (unless it's already logged in the callsite)

Contributor Author

Done, sounds good

Comment on lines 337 to 343
/// Cleanup the worker for an actor
/// \param raylet_address The address of the local raylet of the worker
/// \param worker_address The address of the worker to clean up
/// \param actor_id ID of the actor (may be Nil if actor setup failed)
bool CleanupWorkerForActor(const rpc::Address &raylet_address,
const rpc::Address &worker_address,
ActorID actor_id);
Collaborator

it's very unclear what "cleanup" means. let's pick a better name and also clarify the exact behavior in the header comment

Contributor Author

I renamed it to KillLeasedWorkerForActor and updated the header comments

Signed-off-by: joshlee <[email protected]>
@Sparks0219 Sparks0219 requested a review from edoakes October 30, 2025 22:31
Signed-off-by: joshlee <[email protected]>
// NOTE: on a successful kill, we don't expect a reply back from the dead actor.
// The only case where we receive a reply is if the mismatched actor ID check is
// triggered.
});

Bug:

Race condition in HandleKillLocalActor: The replied flag is a std::shared_ptr<bool> but is accessed concurrently from both the timer callback (running on io_service thread) and the KillActor RPC callback (running on a gRPC thread). Additionally, the KillActor callback doesn't set *replied = true after sending an error reply, which could allow the timer to also fire and send a duplicate reply.

The callback checks if (!status.ok() && !*replied) and then calls timer->cancel() and send_reply_callback(), but never sets *replied = true. This creates two problems:

  1. Without atomic operations, there's a data race when both callbacks access replied concurrently
  2. Even if the callback sends an error reply, the timer could still see *replied == false and send another reply

The fix should be: (a) change replied to std::atomic<bool> or protect it with a mutex, and (b) set *replied = true immediately after checking it in the callback using atomic compare-and-swap or under a lock.

Fix in Cursor Fix in Web

Contributor Author

1.) KillActorCallback is posted onto the main io_service thread, so it's single-threaded; there's no race condition.
2.) We don't need to set *replied = true when timer->cancel() is triggered, since the timer callback is never executed if the timer hasn't fired. If the timer has already fired, the timer callback will have set *replied = true, so we'll skip sending the reply.
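A minimal sketch of the single-threaded argument (plain boost::asio, not the Ray internals): when both callbacks are posted onto one io_context drained by one thread, they can never run concurrently, so a shared bool flag is enough to ensure only one reply is sent.

#include <boost/asio.hpp>
#include <iostream>
#include <memory>

int main() {
  boost::asio::io_context io;
  auto replied = std::make_shared<bool>(false);

  // Stand-in for the KillActor reply callback.
  boost::asio::post(io, [replied]() {
    if (!*replied) {
      *replied = true;
      std::cout << "reply sent from the RPC callback\n";
    }
  });

  // Stand-in for the timeout callback.
  boost::asio::post(io, [replied]() {
    if (!*replied) {
      *replied = true;
      std::cout << "reply sent from the timer callback\n";
    }
  });

  io.run();  // a single thread runs both handlers sequentially
  return 0;
}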

@edoakes edoakes merged commit 5e2b29d into ray-project:master Nov 4, 2025
6 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
