[Core] Update cluster scheduler to handle label selector hard node id constraint #56235
Conversation
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
cc: @MengjinYan
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
jjyao left a comment:
LGTM
```cpp
if (auto node_id_values = GetHardNodeAffinityValues(spec.GetLabelSelector())) {
  for (const auto &node_id_hex : *node_id_values) {
    if (auto addr = node_addr_factory_(NodeID::FromHex(node_id_hex))) {
      return std::make_pair(addr.value(), false);
    }
  }
}
```
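For context, here is a minimal sketch of what `GetHardNodeAffinityValues` plausibly does, assuming the selector stores hard in-constraints keyed by label name. The struct layout and signature here are illustrative stand-ins, not the actual Ray sources:

```cpp
// Hypothetical sketch, not the real Ray implementation: extract the value
// list of a hard constraint on the ray.io/node-id label, if one exists.
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct LabelSelector {
  // key -> allowed values for hard (must-match) constraints; simplified.
  std::unordered_map<std::string, std::vector<std::string>> hard_constraints;
};

std::optional<std::vector<std::string>> GetHardNodeAffinityValues(
    const LabelSelector &selector) {
  auto it = selector.hard_constraints.find("ray.io/node-id");
  if (it == selector.hard_constraints.end()) {
    return std::nullopt;  // no hard node-id constraint on this task
  }
  return it->second;  // hex node IDs the task is required to run on
}
```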
Something we can do smarter in a follow-up: if we have a list of nodes here, we should pick the node with the most arguments already local (see the sketch below).
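A hedged sketch of that follow-up idea, assuming the owner can supply per-object location sets. `PickNodeWithMostArguments`, the string-based IDs, and the location map are all illustrative, not Ray's actual types:

```cpp
// Hypothetical follow-up (not in this PR): given several candidate nodes
// from the hard node-id constraint, prefer the one that already holds the
// most of the task's arguments.
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using NodeID = std::string;
using ObjectID = std::string;

// Assumes `candidates` is non-empty.
NodeID PickNodeWithMostArguments(
    const std::vector<NodeID> &candidates,
    const std::vector<ObjectID> &task_args,
    const std::unordered_map<ObjectID, std::unordered_set<NodeID>> &locations) {
  NodeID best = candidates.front();
  size_t best_count = 0;
  for (const auto &node : candidates) {
    size_t count = 0;
    for (const auto &arg : task_args) {
      auto it = locations.find(arg);
      if (it != locations.end() && it->second.count(node)) {
        ++count;  // this argument is already local to `node`
      }
    }
    if (count > best_count) {
      best = node;
      best_count = count;
    }
  }
  return best;
}
```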
Actually, I'm wondering whether this logic should live here, where the core worker finds the raylet to send the request to, or in the raylet logic that finds the best node for the task?
It's a good question. Currently it has to be here, since the raylet doesn't have the location information of objects (the owner does).
src/ray/raylet/scheduling/tests/cluster_resource_scheduler_test.cc
…t.cc Co-authored-by: Mengjin Yan <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Seems to be some compilation error.
Signed-off-by: Ryan O'Leary <[email protected]>
Looks like the …
Signed-off-by: Ryan O'Leary <[email protected]>
This should be passing now.
The Java test failure should be unrelated. cc: @jjyao
… constraint (ray-project#56235) Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Mengjin Yan <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>
… constraint (ray-project#56235) Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Mengjin Yan <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: zac <[email protected]>
… constraint (ray-project#56235) Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Mengjin Yan <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: Marco Stephan <[email protected]>
… constraint (#56235) Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Mengjin Yan <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: Douglas Strodtman <[email protected]>
Why are these changes needed?
This PR updates the cluster scheduler to check for `ray.io/node-id` label selectors when calling `GetBestSchedulableNode`. If a feasible node with the desired ID is found it's returned; otherwise the resource demand is marked infeasible and a nil `NodeID` is returned. Then, in the `ClusterLeaseManager`, we can check for node ID label constraints and return an unschedulable error when `scheduling_node_id.IsNil()` (see the sketch below).
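A minimal standalone sketch of that control flow, using simplified stand-ins for `NodeID`, the label selector, and the cluster membership map; this is an illustration of the described behavior, not the actual Ray sources:

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

using NodeID = std::string;          // stand-in for ray::NodeID
static const NodeID kNilNodeID = ""; // stand-in for NodeID::Nil()

struct LabelSelector {
  // Only the hard ray.io/node-id constraint is modeled here.
  std::optional<std::string> hard_node_id;
};

// Stand-in for GetBestSchedulableNode: honor the hard node-id constraint
// if present; normal scoring for unconstrained tasks is elided.
NodeID GetBestSchedulableNode(
    const LabelSelector &selector,
    const std::unordered_map<NodeID, bool> &alive_nodes) {
  if (selector.hard_node_id) {
    auto it = alive_nodes.find(*selector.hard_node_id);
    if (it != alive_nodes.end() && it->second) {
      return it->first;   // feasible node with the desired ID
    }
    return kNilNodeID;    // demand is infeasible: no such alive node
  }
  return kNilNodeID;      // normal scheduling path elided in this sketch
}

int main() {
  std::unordered_map<NodeID, bool> alive_nodes{{"node-a", true}};
  LabelSelector selector;
  selector.hard_node_id = "node-b";  // no such node in the cluster

  NodeID scheduling_node_id = GetBestSchedulableNode(selector, alive_nodes);
  // Mirrors the ClusterLeaseManager check: a nil node ID plus a node-id
  // label constraint fails the request instead of queueing it forever.
  if (scheduling_node_id == kNilNodeID && selector.hard_node_id) {
    std::cout << "unschedulable: node " << *selector.hard_node_id
              << " does not match any alive node\n";
  }
  return 0;
}
```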
This behavior matches exactly how `NodeAffinitySchedulingPolicy` handles infeasible nodes when `soft=False`. This change is necessary for #54940, which replaces usages of `NodeAffinitySchedulingPolicy`, since otherwise an unsatisfiable `ray.io/node-id` label selector would remain pending indefinitely.

Related issue number
#51564
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.