
Conversation

@fogerd

@fogerd fogerd commented Jan 21, 2025

Why are these changes needed?

When using spot instance types, we noticed the Autoscaler would get stuck attempting to scale up a node type with no availability and never recover or move on to other node types. This change introduces weighted randomness into node-type selection so that the Autoscaler can try different node types across loops and avoid the scenario described above.
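A minimal sketch of the mechanism, with made-up node types and weights (how the weights should be derived is discussed in the review comments below):

```python
import random

# Hypothetical candidate node types and weights; in the PR the candidates
# and weights come from the scheduler's node-type scoring.
candidates = ["spot-m5.large", "spot-m5a.large", "spot-r5.4xlarge"]
weights = [3, 3, 1]

# Pick one candidate per scaling loop, proportional to its weight, so a
# single unavailable node type no longer blocks every scale-up attempt.
chosen = random.choices(candidates, weights=weights, k=1)[0]
```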

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fogerd fogerd requested review from a team and hongchaodeng as code owners January 21, 2025 15:48
@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Jan 21, 2025
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Jan 28, 2025

utilization_scores = sorted(utilization_scores, reverse=True)
best_node_type = utilization_scores[0][1]
weights = [node_types[node_type[1]].get("max_workers", 0) for node_type in utilization_scores]
Collaborator

The weights should be based on utilization_scores instead of max_workers: we don't want to launch a big machine for a 1-CPU task.
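A rough sketch of the suggested change, assuming the scores are (or have been reduced to) non-negative scalars that can be used directly as weights:

```python
import random

# Hypothetical (score, node_type) pairs as produced by the scheduler's scoring.
utilization_scores = sorted(
    [(0.9, "small"), (0.9, "medium"), (0.2, "xlarge")], reverse=True
)

# Weight by the utilization score itself, so a well-fitting small node type
# is preferred over a large one, instead of weighting by max_workers.
weights = [score for score, _ in utilization_scores]
best_node_type = random.choices(utilization_scores, weights=weights, k=1)[0][1]
```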

Author

done

utilization_scores = sorted(utilization_scores, reverse=True)
best_node_type = utilization_scores[0][1]
weights = [node_types[node_type[1]].get("max_workers", 0) for node_type in utilization_scores]
best_node_type = random.choices(utilization_scores, weights=weights, k=1)[0][1]
Collaborator

Ideally we should remember the node type that has no availability and skip it next time.
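A rough sketch of what that bookkeeping could look like (hypothetical helper, not part of this PR):

```python
import time

# Hypothetical tracker: remember node types whose launches recently failed
# and skip them for a cooldown period before retrying.
UNAVAILABLE_COOLDOWN_S = 300
_unavailable_until = {}


def mark_unavailable(node_type: str) -> None:
    _unavailable_until[node_type] = time.time() + UNAVAILABLE_COOLDOWN_S


def is_available(node_type: str) -> bool:
    return time.time() >= _unavailable_until.get(node_type, 0)


# When building candidates, drop node types still inside their cooldown, e.g.:
# candidates = [(s, nt) for s, nt in utilization_scores if is_available(nt)]
```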

Author

Agree that would be ideal, but it works without it. It cycles through node types quickly, so we wouldn't save much time compared to the extra code for state management.

@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label Apr 4, 2025
@jjyao
Collaborator

jjyao commented Apr 29, 2025

@fogerd, there are test failures, could you take a look?

@github-actions

github-actions bot commented Jun 6, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 6, 2025
@github-actions

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Jun 20, 2025
edoakes pushed a commit that referenced this pull request Nov 25, 2025
## Description
When the Autoscaler receives a resource request and decides which type
of node to scale up, only the `UtilizationScore` is considered (that
is, Ray tries to avoid launching a large node for a small resource
request, which would lead to resource waste). If multiple node types in
the cluster have the same `UtilizationScore`, Ray always requests the
same node type.

In Spot scenarios, cloud resources are dynamically changing. Therefore,
we want the Autoscaler to be aware of cloud resource availability — if a
certain node type becomes unavailable, the Autoscaler should be able to
automatically switch to requesting other node types.

In this PR, I added the `CloudResourceMonitor` class, which records node
types that have failed resource allocation and reduces the weight of
these node types in future scaling events.

## Related issues
Related to #49983 
Fixes #53636  #39788 #39789 


## Implementation details
1. `CloudResourceMonitor`
This is a subscriber of Instances. When an Instance reaches the
`ALLOCATION_FAILED` status, `CloudResourceMonitor` records its node_type and
lowers that node type's availability score (see the sketch after this list).
2. `ResourceDemandScheduler`
This class determines how to select the best node_type to handle a
resource request. I modified the part that selects the best node type:
```python
# Sort by utilization score, breaking ties by cloud resource availability.
results = sorted(
    results,
    key=lambda r: (
        r.score,
        cloud_resource_availabilities.get(r.node.node_type, 1),
    ),
    reverse=True
)
```
The sort key combines:
2.1. UtilizationScore: to maximize resource utilization.
2.2. Cloud resource availability: as a tie-breaker when scores are equal,
prioritize node types with the most available cloud resources, in order to
minimize allocation failures.
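A simplified sketch of the monitor's bookkeeping (the class name matches the PR, but the method names and the penalty/recovery values are illustrative assumptions, not the exact implementation):

```python
class CloudResourceMonitor:
    """Tracks a per-node-type cloud availability score in [0, 1]."""

    def __init__(self, penalty: float = 0.5, recovery: float = 0.05):
        self._availability = {}  # node_type -> availability score
        self._penalty = penalty
        self._recovery = recovery

    def notify(self, node_type: str, status: str) -> None:
        # Subscriber callback: penalize node types whose instances failed.
        if status == "ALLOCATION_FAILED":
            current = self._availability.get(node_type, 1.0)
            self._availability[node_type] = max(0.0, current - self._penalty)

    def tick(self) -> None:
        # Gradually restore availability so penalized node types are retried.
        for node_type, score in self._availability.items():
            self._availability[node_type] = min(1.0, score + self._recovery)

    def availabilities(self) -> dict:
        return dict(self._availability)
```

The scheduler then feeds these availabilities into the sort key shown above, defaulting unseen node types to 1 (fully available).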

---------

Signed-off-by: xiaowen.wxw <[email protected]>
Co-authored-by: 行筠 <[email protected]>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025