[core] Make KillActor RPC Fault Tolerant #57648
@@ -350,7 +350,9 @@ void GcsActorScheduler::DoRetryLeasingWorkerFromNode(
 }

 void GcsActorScheduler::HandleWorkerLeaseGrantedReply(
-    std::shared_ptr<GcsActor> actor, const ray::rpc::RequestWorkerLeaseReply &reply) {
+    std::shared_ptr<GcsActor> actor,
+    const ray::rpc::RequestWorkerLeaseReply &reply,
+    std::shared_ptr<const rpc::GcsNodeInfo> node) {
   const auto &retry_at_raylet_address = reply.retry_at_raylet_address();
   const auto &worker_address = reply.worker_address();
   if (worker_address.node_id().empty()) {
@@ -390,6 +392,11 @@ void GcsActorScheduler::HandleWorkerLeaseGrantedReply(
     RAY_CHECK(node_to_workers_when_creating_[node_id]
                   .emplace(leased_worker->GetWorkerID(), leased_worker)
                   .second);
+    rpc::Address actor_local_raylet_address;
+    actor_local_raylet_address.set_node_id(node->node_id());
+    actor_local_raylet_address.set_ip_address(node->node_manager_address());
+    actor_local_raylet_address.set_port(node->node_manager_port());
+    actor->UpdateLocalRayletAddress(actor_local_raylet_address);
Contributor
On actor death in between restarts, there's probably a point where the actor doesn't have a local raylet. local_raylet_address should probably be an optional, and we shouldn't make the RPC if it's nullopt.
Contributor, Author
Ahh, good catch, thanks. Yup, I modified it so that we don't make the RPC if it's nullopt.
     actor->UpdateAddress(leased_worker->GetAddress());
     actor->GetMutableActorTableData()->set_pid(reply.worker_pid());
     actor->GetMutableTaskSpec()->set_lease_grant_timestamp_ms(current_sys_time_ms());
@@ -621,7 +628,7 @@ void GcsActorScheduler::HandleWorkerLeaseReply(
         RAY_LOG(INFO) << "Finished leasing worker from " << node_id << " for actor "
                       << actor->GetActorID()
                       << ", job id = " << actor->GetActorID().JobId();
-        HandleWorkerLeaseGrantedReply(actor, reply);
+        HandleWorkerLeaseGrantedReply(actor, reply, node);
       }
     } else {
       RetryLeasingWorkerFromNode(actor, node);
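The review thread on UpdateLocalRayletAddress asks for the local raylet address to be an optional and for the RPC to be skipped when it is nullopt. A minimal sketch of that shape follows. It is not the code from this PR: RayletAddress, ActorState, and KillActorViaLocalRaylet are hypothetical stand-ins for rpc::Address, GcsActor, and whatever call site issues the KillActor RPC, and the send callback stands in for the real RPC client.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for rpc::Address (not Ray's real type).
struct RayletAddress {
  std::string node_id;
  std::string ip_address;
  int port = 0;
};

// Hypothetical stand-in for GcsActor, reduced to the field under discussion.
class ActorState {
 public:
  const std::optional<RayletAddress> &LocalRayletAddress() const {
    return local_raylet_address_;
  }

  // Passing std::nullopt clears the address, e.g. while the actor is restarting.
  void UpdateLocalRayletAddress(std::optional<RayletAddress> address) {
    local_raylet_address_ = std::move(address);
  }

 private:
  std::optional<RayletAddress> local_raylet_address_;
};

// Calls `send` with the local raylet address only when one is set.
// Returns true if `send` was invoked, false if the request was skipped.
template <typename SendFn>
bool KillActorViaLocalRaylet(const ActorState &actor, SendFn &&send) {
  const auto &address = actor.LocalRayletAddress();
  if (!address.has_value()) {
    return false;  // No local raylet right now, so there is nobody to call.
  }
  send(*address);
  return true;
}

// Usage sketch: the request is skipped until an address has been recorded.
inline void RunOptionalAddressExample() {
  ActorState actor;
  int sent = 0;
  auto send = [&sent](const RayletAddress &) { ++sent; };

  assert(!KillActorViaLocalRaylet(actor, send));
  actor.UpdateLocalRayletAddress(RayletAddress{"node-1", "10.0.0.1", 1234});
  assert(KillActorViaLocalRaylet(actor, send));
  assert(sent == 1);
}
```

Holding the address as std::optional makes the "no local raylet" window an explicit state the caller has to check, rather than a default-constructed address that might be sent by mistake.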
nit, but usually C++ getters and setters look like this

Updated the getter, but for the setter should I keep it as is?

Yeah that looks like a typo

Oh, not a typo: the "setter" usually just returns a non-const lvalue ref, so you can make whatever changes you want to the data member more efficiently.
An UpdateMember func is bound to change the member by replacing it rather than modifying it in place, plus extra move constructions or separate rvalue/lvalue overloads, etc., if you want to give flexibility on replacing.

oh wait, but at this point why isn't the member public ☠️ it doesn't matter lol
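
The reviewer's original snippet did not survive on this page, so the following is only a sketch of the convention as described in the thread: a const getter, a "setter" that returns a non-const lvalue reference, and an Update-style function that replaces the member wholesale. NodeAddress, ActorRecord, and the accessor names are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical address type used only for this illustration.
struct NodeAddress {
  std::string ip_address;
  int port = 0;
};

class ActorRecord {
 public:
  // Getter: read-only access through a const reference.
  const NodeAddress &local_raylet_address() const { return local_raylet_address_; }

  // "Setter" in the style described: a non-const lvalue reference, so the
  // caller can modify individual fields in place.
  NodeAddress &local_raylet_address() { return local_raylet_address_; }

  // Update-style alternative: always replaces the whole member.
  void UpdateLocalRayletAddress(NodeAddress address) {
    local_raylet_address_ = std::move(address);
  }

 private:
  NodeAddress local_raylet_address_;
};

// Usage sketch contrasting in-place modification with full replacement.
inline void RunAccessorExample() {
  ActorRecord record;

  // In-place change of a single field through the non-const accessor.
  record.local_raylet_address().port = 8076;
  assert(record.local_raylet_address().port == 8076);

  // Whole-object replacement through the Update-style function.
  record.UpdateLocalRayletAddress(NodeAddress{"10.0.0.2", 9000});
  const ActorRecord &view = record;
  assert(view.local_raylet_address().ip_address == "10.0.0.2");
  assert(view.local_raylet_address().port == 9000);
}
```

As the last comment in the thread points out, a non-const reference accessor gives callers the same access a public member would, so the choice between the two styles is mostly one of convention.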