
Conversation

@fogerd

@fogerd fogerd commented Jan 21, 2025

Why are these changes needed?

When using spot instance types, we noticed the Autoscaler would get stuck attempting to scale up a node type with no availability and never recover or move on to other node types. This change introduces weighted randomness into node-type selection so that the Autoscaler can try different node types across loops and avoid the scenario described above.
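A minimal sketch of the mechanism, with made-up node types and weights (how the weights should be derived is discussed in the review comments below):

```python
import random

# Hypothetical candidate node types and weights; in the PR the candidates
# and weights come from the scheduler's node-type scoring.
candidates = ["spot-m5.large", "spot-m5a.large", "spot-r5.4xlarge"]
weights = [3, 3, 1]

# Pick one candidate per scaling loop, proportional to its weight, so a
# single unavailable node type no longer blocks every scale-up attempt.
chosen = random.choices(candidates, weights=weights, k=1)[0]
```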

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fogerd fogerd requested review from a team and hongchaodeng as code owners January 21, 2025 15:48
@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Jan 21, 2025
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Jan 28, 2025

utilization_scores = sorted(utilization_scores, reverse=True)
best_node_type = utilization_scores[0][1]
weights = [node_types[node_type[1]].get("max_workers", 0) for node_type in utilization_scores]
Collaborator

The weights should be based on utilization_scores instead of max_workers: we don't want to launch a big machine for a 1-CPU task.
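A rough sketch of the suggested change, assuming the scores are (or have been reduced to) non-negative scalars that can be used directly as weights:

```python
import random

# Hypothetical (score, node_type) pairs as produced by the scheduler's scoring.
utilization_scores = sorted(
    [(0.9, "small"), (0.9, "medium"), (0.2, "xlarge")], reverse=True
)

# Weight by the utilization score itself, so a well-fitting small node type
# is preferred over a large one, instead of weighting by max_workers.
weights = [score for score, _ in utilization_scores]
best_node_type = random.choices(utilization_scores, weights=weights, k=1)[0][1]
```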

Author

done

utilization_scores = sorted(utilization_scores, reverse=True)
best_node_type = utilization_scores[0][1]
weights = [node_types[node_type[1]].get("max_workers", 0) for node_type in utilization_scores]
best_node_type = random.choices(utilization_scores, weights=weights, k=1)[0][1]
Collaborator

Ideally we should remember the node type that has no availability and skip it next time.
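A rough sketch of what that bookkeeping could look like (hypothetical helper, not part of this PR):

```python
import time

# Hypothetical tracker: remember node types whose launches recently failed
# and skip them for a cooldown period before retrying.
UNAVAILABLE_COOLDOWN_S = 300
_unavailable_until = {}


def mark_unavailable(node_type: str) -> None:
    _unavailable_until[node_type] = time.time() + UNAVAILABLE_COOLDOWN_S


def is_available(node_type: str) -> bool:
    return time.time() >= _unavailable_until.get(node_type, 0)


# When building candidates, drop node types still inside their cooldown, e.g.:
# candidates = [(s, nt) for s, nt in utilization_scores if is_available(nt)]
```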

Author

Agree that would be ideal, but it works without it. It cycles through node types quickly, so we wouldn't save much time compared to the extra code for state management.

@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label Apr 4, 2025
@jjyao
Collaborator

jjyao commented Apr 29, 2025

@fogerd, there are test failures, could you take a look?

@github-actions

github-actions bot commented Jun 6, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 6, 2025
@github-actions

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Jun 20, 2025
edoakes pushed a commit that referenced this pull request Nov 25, 2025
## Description
When the Autoscaler receives a resource request and decides which type
of node to scale up, only the `UtilizationScore` is considered (that
is, Ray tries to avoid launching a large node for a small resource
request, which would lead to resource waste). If multiple node types in
the cluster have the same `UtilizationScore`, Ray always requests the
same node type.

In Spot scenarios, cloud resources are dynamically changing. Therefore,
we want the Autoscaler to be aware of cloud resource availability — if a
certain node type becomes unavailable, the Autoscaler should be able to
automatically switch to requesting other node types.

In this PR, I added the `CloudResourceMonitor` class, which records node
types that have failed resource allocation and reduces the weight of
these node types in future scaling events.

## Related issues
Related to #49983 
Fixes #53636  #39788 #39789 


## Implementation details
1. `CloudResourceMonitor`
This is a subscriber of Instances. When an Instance reaches the
`ALLOCATION_FAILED` status, `CloudResourceMonitor` records its node_type and
lowers that node type's availability score (see the sketch after this list).
2. `ResourceDemandScheduler`
This class determines how to select the best node_type to handle a
resource request. I modified the part that selects the best node type:
```python
# Sort by utilization score, breaking ties by cloud resource availability.
results = sorted(
    results,
    key=lambda r: (
        r.score,
        cloud_resource_availabilities.get(r.node.node_type, 1),
    ),
    reverse=True
)
```
The sort key combines:
2.1. UtilizationScore: to maximize resource utilization.
2.2. Cloud resource availability: as a tie-breaker when scores are equal,
prioritize node types with the most available cloud resources, in order to
minimize allocation failures.
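A simplified sketch of the monitor's bookkeeping (the class name matches the PR, but the method names and the penalty/recovery values are illustrative assumptions, not the exact implementation):

```python
class CloudResourceMonitor:
    """Tracks a per-node-type cloud availability score in [0, 1]."""

    def __init__(self, penalty: float = 0.5, recovery: float = 0.05):
        self._availability = {}  # node_type -> availability score
        self._penalty = penalty
        self._recovery = recovery

    def notify(self, node_type: str, status: str) -> None:
        # Subscriber callback: penalize node types whose instances failed.
        if status == "ALLOCATION_FAILED":
            current = self._availability.get(node_type, 1.0)
            self._availability[node_type] = max(0.0, current - self._penalty)

    def tick(self) -> None:
        # Gradually restore availability so penalized node types are retried.
        for node_type, score in self._availability.items():
            self._availability[node_type] = min(1.0, score + self._recovery)

    def availabilities(self) -> dict:
        return dict(self._availability)
```

The scheduler then feeds these availabilities into the sort key shown above, defaulting unseen node types to 1 (fully available).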

---------

Signed-off-by: xiaowen.wxw <[email protected]>
Co-authored-by: 行筠 <[email protected]>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025