@ryanaoleary
Contributor

Why are these changes needed?

This PR updates the cluster scheduler to check for ray.io/node-id label selectors when calling GetBestSchedulableNode. If a feasible node with the desired ID is found, it is returned; otherwise the resource demand is marked infeasible and a nil NodeID is returned. The ClusterLeaseManager can then check for node ID label constraints and return an unschedulable error when scheduling_node_id.IsNil().

This behavior exactly matches how NodeAffinitySchedulingPolicy handles infeasible nodes when soft=False. This change is necessary for #54940, which replaces usages of NodeAffinitySchedulingPolicy, since otherwise a request with an unsatisfiable ray.io/node-id label selector would remain pending indefinitely.
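
The behavior described above can be illustrated with a minimal, self-contained C++ sketch. Everything in it is a hypothetical stand-in rather than Ray's actual code: SketchNode, SketchRequest, PickNodeSketch, and ScheduleSketch are invented for illustration, node IDs are plain strings, an empty string plays the role of a nil NodeID, and the fallback for requests without a selector is a placeholder assumption.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for Ray's internal types.
struct SketchNode {
  bool feasible;  // Whether this node could ever satisfy the request.
};
using SketchCluster = std::unordered_map<std::string, SketchNode>;

struct SketchRequest {
  // Values of a hard ray.io/node-id selector; empty means "no selector".
  std::vector<std::string> hard_node_ids;
  bool marked_infeasible = false;
};

enum class SketchResult { kScheduled, kUnschedulable, kPending };

// Returns the chosen node ID, or an empty string (standing in for a nil NodeID).
std::string PickNodeSketch(const SketchCluster &cluster, SketchRequest &request) {
  if (!request.hard_node_ids.empty()) {
    for (const auto &node_id : request.hard_node_ids) {
      auto it = cluster.find(node_id);
      if (it != cluster.end() && it->second.feasible) {
        return node_id;  // A feasible node with a desired ID was found.
      }
    }
    // No desired node is feasible: mark the demand infeasible, return "nil".
    request.marked_infeasible = true;
    return std::string();
  }
  // Placeholder for the selector-free path: take any feasible node.
  for (const auto &entry : cluster) {
    if (entry.second.feasible) {
      return entry.first;
    }
  }
  return std::string();
}

// Mirrors the check described for the lease manager: a nil node ID combined
// with a node-id constraint is reported as unschedulable instead of pending.
SketchResult ScheduleSketch(const SketchCluster &cluster, SketchRequest &request) {
  const std::string node_id = PickNodeSketch(cluster, request);
  if (!node_id.empty()) {
    return SketchResult::kScheduled;
  }
  if (!request.hard_node_ids.empty()) {
    return SketchResult::kUnschedulable;
  }
  return SketchResult::kPending;
}
```

In this sketch, a request whose hard_node_ids names only unknown or infeasible nodes yields kUnschedulable right away, which is the analogue of failing fast rather than waiting indefinitely.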

Related issue number

#51564

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary ryanaoleary marked this pull request as ready for review September 9, 2025 00:59
@ryanaoleary ryanaoleary requested a review from a team as a code owner September 9, 2025 00:59
@ryanaoleary
Contributor Author

cc: @MengjinYan

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Sep 9, 2025
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
@jjyao jjyao added the go (add ONLY when ready to merge, run all tests) label Sep 13, 2025
Collaborator

@jjyao jjyao left a comment

LGTM

if (auto node_id_values = GetHardNodeAffinityValues(spec.GetLabelSelector())) {
  for (const auto &node_id_hex : *node_id_values) {
    if (auto addr = node_addr_factory_(NodeID::FromHex(node_id_hex))) {
      return std::make_pair(addr.value(), false);
Collaborator

Something smarter we can do in a follow-up: if we have a list of nodes here, we should pick the node that holds the most arguments.

Contributor

Actually, I'm wondering whether the logic should be here, where the core_worker finds the raylet to send the requests to, or in the raylet logic, where it finds the best node to send the task to?

Collaborator

It's a good question. Currently it has to be here since raylet doesn't have location information of objects (owner does).
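
The follow-up idea in this thread (choosing, among several candidate nodes, the one that holds the most arguments) could look roughly like the sketch below. It is only an illustration under stated assumptions: ObjectLocationsSketch and PickNodeWithMostArgsSketch are hypothetical names, IDs are plain strings, and the location map stands in for whatever object location information the owner has. It does not reflect an existing Ray API, and it omits the address lookup that the reviewed code performs for each node.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical owner-side view: object ID -> IDs of nodes holding a copy.
using ObjectLocationsSketch =
    std::unordered_map<std::string, std::unordered_set<std::string>>;

// Returns the candidate node that holds the most of the given argument
// objects. Ties go to the earliest candidate; an empty string is returned
// when there are no candidates.
std::string PickNodeWithMostArgsSketch(
    const std::vector<std::string> &candidate_node_ids,
    const std::vector<std::string> &arg_object_ids,
    const ObjectLocationsSketch &locations) {
  std::string best_node_id;
  std::size_t best_count = 0;
  for (const auto &node_id : candidate_node_ids) {
    // Count how many argument objects have a copy on this node.
    std::size_t count = 0;
    for (const auto &object_id : arg_object_ids) {
      auto it = locations.find(object_id);
      if (it != locations.end() && it->second.count(node_id) > 0) {
        ++count;
      }
    }
    // Strictly greater keeps the earliest candidate on ties.
    if (best_node_id.empty() || count > best_count) {
      best_node_id = node_id;
      best_count = count;
    }
  }
  return best_node_id;
}
```

Keeping the first candidate on ties means the sketch degrades to the "first resolvable node" behavior of the reviewed loop when no argument locations are known.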

@MengjinYan
Contributor


[2025-09-15T20:14:44Z] In file included from src/ray/common/scheduling/label_selector.cc:15:
[2025-09-15T20:14:44Z] bazel-out/k8-opt/bin/src/ray/common/scheduling/_virtual_includes/label_selector/ray/common/scheduling/label_selector.h:24:10: fatal error: ray/common/constants.h: No such file or directory
[2025-09-15T20:14:44Z]    24 | #include "ray/common/constants.h"
[2025-09-15T20:14:44Z]       |          ^~~~~~~~~~~~~~~~~~~~~~~~
[2025-09-15T20:14:44Z] compilation terminated.

Seems to be some compilation error.

@ryanaoleary
Contributor Author

[2d29039](/ray-project/ray/pull/56235/commits/2d290393bac93ff41b001914ca70c7bba542ab36)

Looks like the BUILD.bazel file recently changed; 2d29039 should fix this by updating the deps.

@jjyao jjyao enabled auto-merge (squash) September 15, 2025 22:44
@github-actions github-actions bot disabled auto-merge September 15, 2025 23:45
@jjyao
Collaborator

jjyao commented Sep 16, 2025


bazel-out/k8-opt/bin/src/ray/common/scheduling/_virtual_includes/label_selector/ray/common/scheduling/label_selector.h:24:10: fatal error: ray/common/constants.h: No such file or directory
[2025-09-16T00:18:42Z]    24 | #include "ray/common/constants.h"
[2025-09-16T00:18:42Z]       |          ^~~~~~~~~~~~~~~~~~~~~~~~

@ryanaoleary
Contributor Author


bazel-out/k8-opt/bin/src/ray/common/scheduling/_virtual_includes/label_selector/ray/common/scheduling/label_selector.h:24:10: fatal error: ray/common/constants.h: No such file or directory
[2025-09-16T00:18:42Z]    24 | #include "ray/common/constants.h"
[2025-09-16T00:18:42Z]       |          ^~~~~~~~~~~~~~~~~~~~~~~~

This should be passing now.

@jjyao jjyao enabled auto-merge (squash) September 16, 2025 16:46
@github-actions github-actions bot disabled auto-merge September 16, 2025 20:09
@MengjinYan
Contributor

The Java test failure should be unrelated. cc: @jjyao

@jjyao jjyao merged commit bfe8139 into ray-project:master Sep 16, 2025
3 of 5 checks passed
jmajety-dev pushed a commit to jmajety-dev/ray that referenced this pull request Sep 16, 2025
… constraint (ray-project#56235)

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Mengjin Yan <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
… constraint (ray-project#56235)

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Mengjin Yan <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: zac <[email protected]>
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
… constraint (ray-project#56235)

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Mengjin Yan <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Marco Stephan <[email protected]>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
… constraint (#56235)

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Mengjin Yan <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Douglas Strodtman <[email protected]>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
… constraint (ray-project#56235)

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Mengjin Yan <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
… constraint (ray-project#56235)

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Mengjin Yan <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>