[Core] Reschedule leases in local lease manager when draining the node #57834
Conversation
Signed-off-by: Jiajun Yao <[email protected]>
namespace {
void ReplyCancelled(const std::shared_ptr<internal::Work> &work,
                    rpc::RequestWorkerLeaseReply::SchedulingFailureType failure_type,
                    const std::string &scheduling_failure_message) {
  auto reply = work->reply_;
  reply->set_canceled(true);
  reply->set_failure_type(failure_type);
  reply->set_scheduling_failure_message(scheduling_failure_message);
  work->send_reply_callback_(Status::OK(), nullptr, nullptr);
}
}  // namespace
Pure move.
src/ray/raylet/node_manager.cc
if (reply->is_accepted()) {
  // Fail fast on the leases in the local lease manager
  // and add them back to the cluster lease manager so a new node
  // can be selected by the scheduler.
  auto cancelled_works = local_lease_manager_.CancelLeasesWithoutReply(
      [&](const std::shared_ptr<internal::Work> &work) { return true; });
  for (const auto &work : cancelled_works) {
    cluster_lease_manager_.QueueAndScheduleLease(work->lease_,
                                                 work->grant_or_reject_,
                                                 work->is_selected_based_on_locality_,
                                                 work->reply_,
                                                 work->send_reply_callback_);
  }
}
Actual fix
dayshah left a comment:
logic lgtm, nits
Python integration test looks a little prone to flakiness, probably not necessary to test this from python
Description
Symptom
A node is being drained and a task fails with TaskUnschedulableError.
Root Cause
On master, we assumed that once a lease request is added to the local lease manager, the lease cannot become infeasible. This turns out not to be true: a node can be drained, and once a node is in the draining state it is excluded from scheduling decisions, so a lease can become infeasible (if the draining node was the only feasible node in the cluster).
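For illustration only, here is a minimal, hypothetical sketch (not the actual Ray scheduler code; all names are made up) of why excluding a draining node from the feasibility scan can turn a previously feasible lease into an infeasible one:

```cpp
// Hypothetical sketch: a feasibility scan that skips draining nodes.
// If the only node that could ever satisfy the request is draining,
// no candidate is found and the queued lease is now infeasible.
#include <optional>
#include <string>
#include <unordered_map>

struct NodeState {
  double available_cpus = 0.0;
  bool is_draining = false;
};

std::optional<std::string> FindFeasibleNode(
    const std::unordered_map<std::string, NodeState> &nodes, double required_cpus) {
  for (const auto &[node_id, state] : nodes) {
    if (state.is_draining) {
      continue;  // Draining nodes are excluded from scheduling decisions.
    }
    if (state.available_cpus >= required_cpus) {
      return node_id;
    }
  }
  return std::nullopt;  // No feasible node left.
}
```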
Fix
When a node is being drained, fail fast on all leases in the local lease manager, regardless of whether they are waiting for workers, pulling arguments, or in any other state, and add them back to the cluster lease manager so the scheduler can run the scheduling logic again and pick a new node.
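As a rough sketch of the mechanism (assuming, not quoting, the internals of the new CancelLeasesWithoutReply helper), the idea is to pull matching work items out of the local queue without invoking their reply callbacks, so the caller can requeue them in the cluster lease manager:

```cpp
// Hedged sketch, not the actual Ray implementation: remove every queued work item
// that matches the predicate and return it to the caller without sending a reply,
// so the caller can requeue it for scheduling on another node.
#include <deque>
#include <functional>
#include <memory>
#include <vector>

template <typename Work>
std::vector<std::shared_ptr<Work>> CancelWithoutReply(
    std::deque<std::shared_ptr<Work>> &queue,
    const std::function<bool(const std::shared_ptr<Work> &)> &predicate) {
  std::vector<std::shared_ptr<Work>> cancelled;
  for (auto it = queue.begin(); it != queue.end();) {
    if (predicate(*it)) {
      cancelled.push_back(*it);  // Keep the work so the caller can requeue it.
      it = queue.erase(it);      // Drop it from the local queue; no reply is sent here.
    } else {
      ++it;
    }
  }
  return cancelled;
}
```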