
Conversation

Contributor

@wxwmd wxwmd commented Nov 14, 2025

Description

When the Autoscaler receives a resource request and decides which type of node to scale up, only the UtilizationScore is considered (that is, Ray tries to avoid launching a large node for a small resource request, which would waste resources). If multiple node types in the cluster have the same UtilizationScore, Ray always requests the same node type.

In Spot scenarios, cloud resources are dynamically changing. Therefore, we want the Autoscaler to be aware of cloud resource availability — if a certain node type becomes unavailable, the Autoscaler should be able to automatically switch to requesting other node types.

In this PR, I added the CloudResourceMonitor class, which records node types that have failed resource allocation and lowers their weight in future scaling events.
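To make the tie-breaking problem concrete, here is a small self-contained illustration (NodeCandidate and the scores are hypothetical stand-ins, not the scheduler's real types): Python's stable sort returns equal-scored candidates in input order, so the same node type wins every time, even if that type keeps failing to allocate.

from dataclasses import dataclass

@dataclass
class NodeCandidate:  # hypothetical stand-in for the scheduler's result type
    node_type: str
    score: float  # UtilizationScore

candidates = [
    NodeCandidate("spot-m5.xlarge", 0.9),
    NodeCandidate("spot-m5a.xlarge", 0.9),  # same score, different capacity pool
]

# Sorting by score alone: the stable sort preserves input order on ties,
# so "spot-m5.xlarge" is always picked first, even if it keeps failing.
best = sorted(candidates, key=lambda c: c.score, reverse=True)[0]
print(best.node_type)  # -> spot-m5.xlarge, every time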

Related issues

Related to #49983
Fixes #53636 #39788 #39789

Implementation details

  1. CloudResourceMonitor
    This is a subscriber to instance state updates. When an Instance reaches the ALLOCATION_FAILED status, CloudResourceMonitor records its node_type and lowers that node type's availability score (see the sketch after this section).
  2. ResourceDemandScheduler
    This class determines how to select the best node_type to handle a resource request. I modified the node-type selection logic:
# Sort the results by (UtilizationScore, cloud resource availability), descending.
results = sorted(
    results,
    key=lambda r: (
        r.score,
        cloud_resource_availabilities.get(r.node.node_type, 1),
    ),
    reverse=True,
)

The sorting key includes:
2.1. UtilizationScore: to maximize resource utilization.
2.2. Cloud resource availability: prioritize node types with the most available cloud resources, to minimize allocation failures.
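For illustration, here is a minimal sketch of what such a monitor could look like (the class, method, and field names are hypothetical, and the cooldown-based scoring is an assumption, not necessarily the PR's exact policy):

import time

class CloudResourceMonitorSketch:
    """Illustrative only: tracks allocation failures per node type and
    derives an availability score that recovers to 1.0 after a cooldown."""

    def __init__(self, cooldown_s: float = 300.0):
        self._last_failure_ts = {}  # node_type -> unix timestamp of last failure
        self._cooldown_s = cooldown_s

    def on_instance_update(self, node_type: str, status: str) -> None:
        # Called as a subscriber when an instance changes status.
        if status == "ALLOCATION_FAILED":
            self._last_failure_ts[node_type] = time.time()

    def availabilities(self) -> dict:
        # Node types with a recent failure score 0.0; after the cooldown
        # they recover to 1.0 so they can be retried.
        now = time.time()
        return {
            node_type: 0.0 if now - ts < self._cooldown_s else 1.0
            for node_type, ts in self._last_failure_ts.items()
        }

With an availability map like this, the sort key above deprioritizes a node type while its pool is cold and naturally retries it once the cooldown expires; node types that have never failed are simply absent from the map, which is why the sort key falls back to a default of 1 via .get(r.node.node_type, 1).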

@wxwmd wxwmd requested a review from a team as a code owner November 14, 2025 07:50
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a CloudResourceMonitor to enhance the autoscaler's node selection strategy by considering cloud resource availability. This is a valuable addition, especially for spot instances, as it allows the autoscaler to dynamically adapt to changing cloud conditions and minimize allocation failures. The implementation correctly integrates the monitor into the Autoscaler and Reconciler components, and updates the ResourceDemandScheduler to incorporate availability scores in its node sorting logic. The addition of random.random() to the sorting key is a good approach to diversify requests when other scores are equal. However, there is an inconsistency in time unit handling within the CloudResourceMonitor and a potential issue with a test case's expectation regarding the new sorting logic.
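For context, the random tiebreaker the review refers to behaves roughly like this (a self-contained sketch with made-up candidates, not the PR's actual code):

import random

# Hypothetical candidates: (node_type, utilization_score, availability)
candidates = [
    ("spot-a", 0.9, 1.0),
    ("spot-b", 0.9, 1.0),
    ("spot-c", 0.9, 1.0),
]

# With equal score and availability, the trailing random.random() term
# spreads requests across node types instead of always picking the first.
best = max(
    candidates,
    key=lambda c: (c[1], c[2], random.random()),
)
print(best[0])  # varies run to run among spot-a / spot-b / spot-c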

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Nov 14, 2025
Collaborator

edoakes commented Nov 14, 2025

@rueian PTAL when you have a chance

@wxwmd wxwmd force-pushed the feature/autoscale_with_resource_availability branch from 0a213fe to d2dd29a Compare November 17, 2025 04:03

@cursor cursor bot left a comment


Bug: Cloud Allocation Timeout: Resources Misidentified

The is_cloud_instance_allocated() function doesn't include ALLOCATION_TIMEOUT in its set of statuses, but ALLOCATION_TIMEOUT represents an instance with a cloud instance allocated (stuck/pending). Code relying on this function will incorrectly identify ALLOCATION_TIMEOUT instances as not having cloud resources.

python/ray/autoscaler/v2/instance_manager/common.py#L52-L68

@staticmethod
def is_cloud_instance_allocated(instance_status: Instance.InstanceStatus) -> bool:
    """
    Returns True if the instance is in a status where there could exist
    a cloud instance allocated by the cloud provider.
    """
    assert instance_status != Instance.UNKNOWN
    return instance_status in {
        Instance.ALLOCATED,
        Instance.RAY_INSTALLING,
        Instance.RAY_RUNNING,
        Instance.RAY_STOPPING,
        Instance.RAY_STOP_REQUESTED,
        Instance.RAY_STOPPED,
        Instance.TERMINATING,
        Instance.RAY_INSTALL_FAILED,
        Instance.TERMINATION_FAILED,
    }
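If the bot's reading is correct, the fix is a one-line addition to that set. A sketch of the amended function (written at module level for brevity; it assumes Instance.ALLOCATION_TIMEOUT is a defined status, as the report implies, and the Instance proto import used by the v2 instance manager):

from ray.core.generated.instance_manager_pb2 import Instance  # assumed import path

def is_cloud_instance_allocated(instance_status: "Instance.InstanceStatus") -> bool:
    """Sketch of the suggested fix: ALLOCATION_TIMEOUT counts as possibly allocated."""
    assert instance_status != Instance.UNKNOWN
    return instance_status in {
        Instance.ALLOCATED,
        Instance.RAY_INSTALLING,
        Instance.RAY_RUNNING,
        Instance.RAY_STOPPING,
        Instance.RAY_STOP_REQUESTED,
        Instance.RAY_STOPPED,
        Instance.TERMINATING,
        Instance.RAY_INSTALL_FAILED,
        Instance.TERMINATION_FAILED,
        Instance.ALLOCATION_TIMEOUT,  # added: a timed-out allocation may still hold a cloud instance
    }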


Contributor Author

wxwmd commented Nov 18, 2025

@rueian I think this PR is ready to go, please review again

@wxwmd wxwmd requested a review from rueian November 20, 2025 01:46
@wxwmd wxwmd force-pushed the feature/autoscale_with_resource_availability branch from 80e20a3 to 5f9838e Compare November 24, 2025 11:56
@wxwmd wxwmd requested a review from a team as a code owner November 24, 2025 11:56
@wxwmd wxwmd force-pushed the feature/autoscale_with_resource_availability branch 2 times, most recently from dc5150f to eaaffa8 Compare November 24, 2025 13:12
行筠 added 4 commits November 24, 2025 21:20
xiaowen wei added 16 commits November 24, 2025 21:20
(each commit signed off by: xiaowen wei <[email protected]> and xiaowen.wxw <[email protected]>)
@wxwmd wxwmd force-pushed the feature/autoscale_with_resource_availability branch from 7bec773 to 1035b89 Compare November 24, 2025 13:21
@rueian rueian added the go add ONLY when ready to merge, run all tests label Nov 24, 2025
@rueian
Copy link
Contributor

rueian commented Nov 25, 2025

Hi @edoakes and @jjyao, this PR looks good to me. Please also review this when you have a chance. Thanks!

@edoakes edoakes merged commit 2de1c49 into ray-project:master Nov 25, 2025
7 checks passed
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025