-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[Core] Update V2 Autoscaler to support scheduling using Node labels and LabelSelector API #53578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…pute_score Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
|
@MengjinYan I ended up implementing the logic to pass labels from GCS to the autoscaler and consider labels in the autoscaler when scheduling in the same PR because it made it easier to test the change. |
can-anyscale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some tests are failing, could be related @ryanaoleary
Signed-off-by: Ryan O'Leary <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we should also add the logic to parse the LabelSelector from the GCS GetClusterStatusReply in autoscaler to populate the label selector information.
Also, is it possible to add end to end test with label selector?
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
|
@rueian Can you take a look at the PR as well from autoscaler's perspective to see if we missed anything? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we should also change the _sort_resource_request logic in _try_schedule to add labels to the sorting mechanism, as there can resource requests with exact same resource requirements but different label selectors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, done in 8cb1816. I had it sort by the length of the constraints of the first selector, similar to how we sort based on the length of the placement constraints.
Signed-off-by: Ryan O'Leary <[email protected]>
| // raylet knows about. | ||
| int64 backlog_size = 4; | ||
| // The label selector constraints for this Resource shape on a node. | ||
| repeated LabelSelector label_selectors = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a repeated LabelSelectorConstraint field in the LabelSelector. Why do we need repeated here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LabelSelectorConstraints are the conditions associated with one label selector with AND semantics (i.e. we require a node to have labels accelerator-type="TPU" and market-type="SPOT" if those two respective constraints are specified). The repeated label selector is so that we can support fallback with label selectors (i.e. the user specifies two separate label selectors with their own constraint(s), and we fallback to the second one if the first is unsatisfiable).
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
dayshah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally just minor nits, not really blockers
src/ray/gcs/gcs_server/test/gcs_autoscaler_state_manager_test.cc
Outdated
Show resolved
Hide resolved
|
|
||
| namespace std { | ||
| template <> | ||
| struct hash<ray::LabelSelector> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we using std sets/maps with this somewhere instead of absl ones. Would prefer sticking to absl and just defining the absl hash functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using std::hash in task_spec.h here for the scheduling class descriptor. I think since we're using absl::flat_hash_map in that class I can just go ahead and switch it all to use absl hash functions instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 9b3bbbe.
Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
…nd LabelSelector API (ray-project#53578) Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Rueian <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: Krishna Kalyan <[email protected]>
…nd LabelSelector API (ray-project#53578) Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Rueian <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: avigyabb <[email protected]>
…nd LabelSelector API (ray-project#53578) Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Rueian <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: avigyabb <[email protected]>
…nd LabelSelector API (#53578) Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Rueian <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: Kamil Kaczmarek <[email protected]>
…nd LabelSelector API (ray-project#53578) Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Rueian <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: jugalshah291 <[email protected]>
…nd LabelSelector API (ray-project#53578) Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Rueian <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Signed-off-by: Douglas Strodtman <[email protected]>
Why are these changes needed?
This PR updates the GCS state manager to append the labels set on a Node and retrieved from GCS to the
dynamic_node_labelson the NodeState passed to the autoscaler. This PR also updates the V2 autoscaler to consider labels when calling_compute_scoreto schedule aResourceRequeston a Node, assigning a higher score to a node type with labels that satisfy a LabelSelector.Related issue number
#51564
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.