[core] Make KillActor RPC Fault Tolerant #57648
Conversation
Couple of comments about my design and some concerns I have. What are your thoughts, @codope?
src/ray/raylet/node_manager.cc
Outdated
worker->rpc_client()->KillActor(
    kill_actor_request,
    [kill_actor_error, timer](const ray::Status &status, const rpc::KillActorReply &) {
      *kill_actor_error = status;
Not sure how to do this a bit more elegantly; the only possible case where KillActor responds is if we hit this error check at
ray/src/ray/core_worker/core_worker.cc, line 3964 in 237268b:
if (intended_actor_id != worker_context_->GetCurrentActorID()) {
imo don't need to propagate back up to the gcs, since it throws it away
src/ray/gcs/gcs_actor.cc
Outdated
const rpc::Address &GcsActor::GetLocalRayletAddress() const {
  return local_raylet_address_;
}

void GcsActor::UpdateLocalRayletAddress(const rpc::Address &address) {
  local_raylet_address_.CopyFrom(address);
}
Nit, but usually C++ getters and setters look like this:
Current:
const rpc::Address &GcsActor::GetLocalRayletAddress() const {
  return local_raylet_address_;
}
void GcsActor::UpdateLocalRayletAddress(const rpc::Address &address) {
  local_raylet_address_.CopyFrom(address);
}

Suggested:
const rpc::Address &GcsActor::LocalRayletAddress() const {
  return local_raylet_address_;
}
rpc::Address &GcsActor::LocalRayletAddress() {
  return local_raylet_address_;
}
Updated the getter, but for the setter should I keep it as is?
Yeah that looks like a typo
Oh, it's not a typo: the "setter" usually just returns a non-const lvalue ref so you can modify the data member in place, however you want, without extra copies.
An UpdateMember-style function is limited to replacing the member rather than modifying it in place, and it needs extra move constructs or separate rvalue/lvalue overloads if you want flexibility in how it's replaced.
oh wait but at this point why isn't the member public ☠️ it doesn't matter lol
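For illustration, a standalone sketch of the two accessor styles discussed in this thread; `Address` here is a stand-in for `rpc::Address`, and none of this is the PR's actual code:

```cpp
#include <string>
#include <utility>

// Stand-in for rpc::Address (illustrative only).
struct Address {
  std::string ip_address;
  int port = 0;
};

class GcsActorSketch {
 public:
  // Style 1: const + non-const lvalue-ref accessors. Callers can read or
  // mutate the member in place without an extra copy.
  const Address &LocalRayletAddress() const { return local_raylet_address_; }
  Address &LocalRayletAddress() { return local_raylet_address_; }

  // Style 2: Update-style setter. Replaces the whole member; an rvalue
  // overload avoids a copy when the caller hands over a temporary.
  void UpdateLocalRayletAddress(const Address &address) {
    local_raylet_address_ = address;
  }
  void UpdateLocalRayletAddress(Address &&address) {
    local_raylet_address_ = std::move(address);
  }

 private:
  Address local_raylet_address_;
};

// Usage: in-place mutation vs. wholesale replacement.
// GcsActorSketch actor;
// actor.LocalRayletAddress().port = 1234;              // modify one field
// actor.UpdateLocalRayletAddress({"10.0.0.1", 1234});  // replace the whole member
```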
actor_local_raylet_address.set_node_id(node->node_id());
actor_local_raylet_address.set_ip_address(node->node_manager_address());
actor_local_raylet_address.set_port(node->node_manager_port());
actor->UpdateLocalRayletAddress(actor_local_raylet_address);
On actor death, in between restarts, there's probably a point where the actor doesn't have a local raylet.
The actor also doesn't have a local raylet at registration time and before creation completes.
local_raylet_address should probably be an optional, and we shouldn't make the RPC if it's nullopt.
Ahh, good catch, thanks. Yup, I modified it so that we don't make the RPC if it's nullopt.
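A minimal sketch of that guard with stand-in types (`Address` and `RayletClientStub` are illustrative, not Ray's actual classes):

```cpp
#include <iostream>
#include <optional>
#include <string>

// Stand-ins for rpc::Address and the raylet client; illustrative only.
struct Address {
  std::string ip_address;
  int port = 0;
};

struct RayletClientStub {
  void KillLocalActor(const std::string &actor_id) {
    std::cout << "KillLocalActor(" << actor_id << ")\n";
  }
};

// The guard suggested above: only issue the RPC when the actor currently
// has a local raylet (i.e. it is placed on a node and not mid-restart).
void MaybeKillActor(const std::optional<Address> &local_raylet_address,
                    RayletClientStub &client,
                    const std::string &actor_id) {
  if (!local_raylet_address.has_value()) {
    // The actor is between restarts, or registered but not yet created:
    // there is no raylet to send the request to, so skip the RPC.
    return;
  }
  client.KillLocalActor(actor_id);
}
```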
src/ray/raylet/node_manager.cc
Outdated
timer->expires_from_now(boost::posix_time::milliseconds(
    RayConfig::instance().kill_worker_timeout_milliseconds()));
timer->async_wait([this, send_reply_callback, kill_actor_error, worker_id, timer](
I think execute_after does some of this boilerplate for you.
Also, this is new functionality we're adding to KillActor, right? In what cases will the worker survive after it gets the KillActor request?
Yeah, agreed that execute_after is a lot cleaner, and yup, this is new functionality we're adding. I think the issue is that graceful shutdown could end up getting stuck, and I don't think KillActor takes this into account. There are multiple places where we use KillAsync, which sends a SIGTERM (graceful shutdown) and then a SIGKILL, hence I think we should follow the same pattern here.
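For reference, a standalone model of the graceful-then-force-kill pattern being discussed, using the same boost::asio timer primitives as the quoted code; function and parameter names are illustrative, and the execute_after helper mentioned above would wrap the timer boilerplate shown here:

```cpp
#include <boost/asio.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <functional>
#include <memory>

// Standalone sketch: issue the graceful kill immediately, then schedule a
// force kill that only runs if the worker is still alive after the timeout.
void KillWithTimeout(boost::asio::io_context &io,
                     std::function<void()> graceful_kill,
                     std::function<void()> force_kill,
                     std::shared_ptr<bool> worker_dead,
                     int timeout_ms) {
  graceful_kill();  // e.g. the KillActor RPC / SIGTERM path

  auto timer = std::make_shared<boost::asio::deadline_timer>(io);
  timer->expires_from_now(boost::posix_time::milliseconds(timeout_ms));
  timer->async_wait([timer, worker_dead, force_kill](
                        const boost::system::error_code &ec) {
    if (ec == boost::asio::error::operation_aborted) {
      return;  // timer was cancelled: the worker exited gracefully
    }
    if (!*worker_dead) {
      force_kill();  // graceful shutdown got stuck: fall back to SIGKILL
    }
  });
}
```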
codope left a comment
@Sparks0219 Thanks for the PR; I'm aligned with your design. A few comments:

> 1.) I decided to keep KillActor rather than have the Raylet use core worker Exit since I think it would get a bit bloated.

The decision to keep KillActor is correct. It provides actor ID validation to prevent killing the wrong actor after a restart. Plus, there is a separation between worker lifecycle (Exit) and actor termination (KillActor).

> 2.) From what I see, KillActor should already be idempotent because of the changes in d63f464

Yes, that commit makes it idempotent at the core worker level: ShutdownCoordinator tracks state, and subsequent calls are a no-op. Your code returns early if worker is nullptr; I suggested also adding a worker->IsDead() check below.

> In the current behavior of KillActor, we don't really have any mitigations if graceful shutdown gets stuck. Hence I added a timer and an async_wait call that does a force kill after a certain amount of time.

Good callout! But I noticed that the current changes always log a mismatch and report an error. Please check my comments below in NodeManager::HandleKillLocalActor.
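A minimal sketch of the suggested idempotency guard, with a stand-in Worker type (illustrative, not the PR's actual code):

```cpp
#include <memory>

// Stand-in for the raylet's worker handle; illustrative only.
struct Worker {
  bool IsDead() const { return dead_; }
  bool dead_ = false;
};

// Guard sketched from the review comment above: a duplicate kill request for
// a worker that is already gone (nullptr) or already shutting down (IsDead)
// should be treated as success rather than re-issuing the kill.
bool ShouldSkipKill(const std::shared_ptr<Worker> &worker) {
  return worker == nullptr || worker->IsDead();
}
```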
src/ray/gcs/gcs_actor_scheduler.cc
Outdated
ActorID actor_id) {
if (worker_address.node_id().empty()) {
  RAY_LOG(DEBUG) << "Invalid worker address, skip the killing of actor " << actor_id;
bool GcsActorScheduler::CleanupWorkerForActor(const rpc::Address &raylet_address,
Here I renamed it from KillActorOnWorker to CleanupWorkerForActor, since I think the latter makes more sense. This method is only used to clean up workers allocated for actors that the GCS has marked dead or whose creation failed, hence KillActorOnWorker didn't make much sense to me.
I think DestroyActor is meant to prevent the killed actor from restarting, while KillActor allows restarting... bad naming, and I think they could probably be unified by passing a flag to differentiate the restart behavior. Not entirely sure though; what do you think @codope?
yes, we should unify the two
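A hypothetical shape for the unified call, purely illustrative (the function name and flag semantics are not part of the PR):

```cpp
#include <iostream>
#include <string>

// Stand-in for ray::ActorID; illustrative only.
using ActorID = std::string;

// no_restart = true  -> DestroyActor semantics (the actor stays dead)
// no_restart = false -> KillActor semantics (the actor may restart)
void KillActorUnified(const ActorID &actor_id, bool force_kill, bool no_restart) {
  std::cout << "Killing actor " << actor_id << " force_kill=" << force_kill
            << " no_restart=" << no_restart << "\n";
  // A real implementation would dispatch to the existing kill path and, when
  // no_restart is set, also mark the actor as permanently dead in the GCS.
}
```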
src/ray/gcs/gcs_actor_scheduler.cc
Outdated
| RAY_LOG(DEBUG) << "Killing actor " << actor_id | ||
| << " with return status: " << status.ToString(); |
hm, shouldn't we be retrying here if we get a non-OK status?
at a minimum we should have an ERROR log here
We're already retrying implicitly via the retryable gRPC client for UNAVAILABLE/UNKNOWN statuses; I think for the other cases, like the actor ID mismatch in HandleKillActor, we don't want to retry.
Added an ERROR log to report this, though.
src/ray/gcs/gcs_actor_scheduler.cc
Outdated
| RAY_LOG(DEBUG) << "Invalid worker address, skip the killing of actor " << actor_id; | ||
| bool GcsActorScheduler::CleanupWorkerForActor(const rpc::Address &raylet_address, | ||
| const rpc::Address &worker_address, | ||
| ActorID actor_id) { |
please add an INFO level log that an actor is being killed (unless it's already logged in the callsite)
Done, sounds good
src/ray/gcs/gcs_actor_scheduler.h
Outdated
/// Cleanup the worker for an actor
/// \param raylet_address The address of the local raylet of the worker
/// \param worker_address The address of the worker to clean up
/// \param actor_id ID of the actor (may be Nil if actor setup failed)
bool CleanupWorkerForActor(const rpc::Address &raylet_address,
                           const rpc::Address &worker_address,
                           ActorID actor_id);
It's very unclear what "cleanup" means. Let's pick a better name and also clarify the exact behavior in the header comment.
I renamed it to KillLeasedWorkerForActor and updated the header comments
// NOTE: on a successful kill, we don't expect a reply back from the dead actor.
// The only case where we receive a reply is if the mismatched actor ID check is
// triggered.
});
Bug: Race condition in HandleKillLocalActor. The replied flag is a std::shared_ptr<bool> but is accessed concurrently from both the timer callback (running on the io_service thread) and the KillActor RPC callback (running on a gRPC thread). Additionally, the KillActor callback doesn't set *replied = true after sending an error reply, which could allow the timer to also fire and send a duplicate reply.
The callback checks if (!status.ok() && !*replied) and then calls timer->cancel() and send_reply_callback(), but never sets *replied = true. This creates two problems:
- Without atomic operations, there's a data race when both callbacks access replied concurrently.
- Even if the callback sends an error reply, the timer could still see *replied == false and send another reply.
The fix should be: (a) change replied to std::atomic<bool> or protect it with a mutex, and (b) set *replied = true immediately after checking it in the callback, using an atomic compare-and-swap or under a lock.
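For reference, a sketch of the reply-exactly-once pattern this comment proposes; note the author's reply below argues it isn't needed in the PR, because both callbacks are posted to the same io_service thread:

```cpp
#include <atomic>
#include <memory>

// "Reply exactly once" helper: TryClaim() returns true for exactly one
// caller, even if the timer callback and the RPC callback race.
struct ReplyOnce {
  std::atomic<bool> replied{false};

  bool TryClaim() {
    bool expected = false;
    return replied.compare_exchange_strong(expected, true);
  }
};

// Usage sketch:
//   auto once = std::make_shared<ReplyOnce>();
//   KillActor callback: if (!status.ok() && once->TryClaim()) { timer->cancel(); reply(status); }
//   Timer callback:     if (once->TryClaim()) { force_kill(); reply(ok_status); }
```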
1.) The KillActor callback is posted onto the main io_service thread, so it's single-threaded; there's no race condition.
2.) We don't need to set *replied = true when timer->cancel() is triggered, since the timer callback is never executed if the timer hasn't fired. If the timer has fired, the timer callback will have already set *replied = true, so we'll skip sending the reply.
Why are these changes needed?
The issue with the current implementation of core worker HandleKillActor is that it won't send a reply when the RPC completes, because the worker is dead. The application code in the GCS doesn't really care, since it just logs the response if one is received; a response is only sent if the actor ID of the actor on the worker and the one in the RPC don't match, and the GCS just logs it and moves on with its life.
Hence, in the case of a transient network failure, we can't differentiate whether there was a network issue or the actor was successfully killed. I think the most straightforward approach is that instead of the GCS directly calling core worker KillActor, the GCS talks to the raylet and calls a new RPC, KillLocalActor, which in turn calls KillActor. Since the raylet that receives KillLocalActor is local to the worker the actor is on, we're guaranteed to kill it at that point (either through KillActor, or by falling back to SIGKILL if it hangs).
Thus the main intuition is that the GCS now talks to the raylet, and this layer implements retries. Once the raylet receives the KillLocalActor request, it routes it to KillActor. This layer between the raylet and the core worker does not have retries enabled, because we can assume that RPCs between the local raylet and its workers won't fail (same machine). We then check on the status of the worker after a while (5 seconds via kill_worker_timeout_milliseconds), and if it still hasn't been killed, we call DestroyWorker, which in turn sends the SIGKILL.
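As a rough illustration of the GCS-side leg, here is a standalone sketch with stand-in types (none of these names are the PR's actual classes); the raylet client is assumed to retry transient failures internally, so the callback only sees a final status:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Stand-in status type; illustrative only.
struct Status {
  bool ok = true;
  std::string message;
};

// Stand-in for "send KillLocalActor to the actor's local raylet and invoke
// the callback with the final status once retries are exhausted".
using KillLocalActorRpc = std::function<void(std::function<void(const Status &)>)>;

// The GCS asks the actor's *local* raylet to perform the kill. By the time
// the callback fires, either the raylet has taken ownership of the kill
// (including the SIGKILL fallback), or the node itself is unreachable/dead.
void KillLeasedWorkerForActor(const std::string &actor_id,
                              const KillLocalActorRpc &kill_local_actor) {
  std::cout << "Requesting kill of actor " << actor_id << " from its raylet\n";
  kill_local_actor([actor_id](const Status &status) {
    if (!status.ok) {
      // Nothing left to retry at this layer; just report the failure.
      std::cerr << "KillLocalActor for " << actor_id
                << " failed after retries: " << status.message << "\n";
    }
  });
}
```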
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- If I have added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.