
Conversation

Contributor

@wxwmd wxwmd commented Nov 14, 2025

Description

When the Autoscaler receives a resource request and decides which type of node to scale up, only the UtilizationScore is considered (that is, Ray tries to avoid launching a large node for a small resource request, which would waste resources). If multiple node types in the cluster have the same UtilizationScore, Ray always requests the same node type.

In Spot scenarios, cloud resources are dynamically changing. Therefore, we want the Autoscaler to be aware of cloud resource availability — if a certain node type becomes unavailable, the Autoscaler should be able to automatically switch to requesting other node types.

In this PR, I added the CloudResourceMonitor class, which records node types that have failed resource allocation and lowers their weight in future scaling events.
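To make the tie-breaking problem concrete, here is a small self-contained illustration (NodeCandidate and the scores are hypothetical stand-ins, not the scheduler's real types): Python's stable sort returns equal-scored candidates in input order, so the same node type wins every time, even if that type keeps failing to allocate.

from dataclasses import dataclass

@dataclass
class NodeCandidate:  # hypothetical stand-in for the scheduler's result type
    node_type: str
    score: float  # UtilizationScore

candidates = [
    NodeCandidate("spot-m5.xlarge", 0.9),
    NodeCandidate("spot-m5a.xlarge", 0.9),  # same score, different capacity pool
]

# Sorting by score alone: the stable sort preserves input order on ties,
# so "spot-m5.xlarge" is always picked first, even if it keeps failing.
best = sorted(candidates, key=lambda c: c.score, reverse=True)[0]
print(best.node_type)  # -> spot-m5.xlarge, every time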

Related issues

Related to #49983
Fixes #53636 #39788 #39789

Implementation details

  1. CloudResourceMonitor
    This is a subscriber to instance state updates. When an Instance reaches the ALLOCATION_FAILED status, CloudResourceMonitor records its node_type and lowers that node type's availability score (see the sketch after this section).
  2. ResourceDemandScheduler
    This class determines how to select the best node_type to handle a resource request. I modified the node-type selection logic:
# Sort the results by (UtilizationScore, cloud resource availability), descending.
results = sorted(
    results,
    key=lambda r: (
        r.score,
        cloud_resource_availabilities.get(r.node.node_type, 1),
    ),
    reverse=True,
)

The sorting key includes:
2.1. UtilizationScore: to maximize resource utilization.
2.2. Cloud resource availability: prioritize node types with the most available cloud resources, to minimize allocation failures.
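For illustration, here is a minimal sketch of what such a monitor could look like (the class, method, and field names are hypothetical, and the cooldown-based scoring is an assumption, not necessarily the PR's exact policy):

import time

class CloudResourceMonitorSketch:
    """Illustrative only: tracks allocation failures per node type and
    derives an availability score that recovers to 1.0 after a cooldown."""

    def __init__(self, cooldown_s: float = 300.0):
        self._last_failure_ts = {}  # node_type -> unix timestamp of last failure
        self._cooldown_s = cooldown_s

    def on_instance_update(self, node_type: str, status: str) -> None:
        # Called as a subscriber when an instance changes status.
        if status == "ALLOCATION_FAILED":
            self._last_failure_ts[node_type] = time.time()

    def availabilities(self) -> dict:
        # Node types with a recent failure score 0.0; after the cooldown
        # they recover to 1.0 so they can be retried.
        now = time.time()
        return {
            node_type: 0.0 if now - ts < self._cooldown_s else 1.0
            for node_type, ts in self._last_failure_ts.items()
        }

With an availability map like this, the sort key above deprioritizes a node type while its pool is cold and naturally retries it once the cooldown expires; node types that have never failed are simply absent from the map, which is why the sort key falls back to a default of 1 via .get(r.node.node_type, 1).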

@wxwmd wxwmd requested a review from a team as a code owner November 14, 2025 07:50
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a CloudResourceMonitor to enhance the autoscaler's node selection strategy by considering cloud resource availability. This is a valuable addition, especially for spot instances, as it allows the autoscaler to dynamically adapt to changing cloud conditions and minimize allocation failures. The implementation correctly integrates the monitor into the Autoscaler and Reconciler components, and updates the ResourceDemandScheduler to incorporate availability scores in its node sorting logic. The addition of random.random() to the sorting key is a good approach to diversify requests when other scores are equal. However, there is an inconsistency in time unit handling within the CloudResourceMonitor and a potential issue with a test case's expectation regarding the new sorting logic.
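For context, the random tiebreaker the review refers to behaves roughly like this (a self-contained sketch with made-up candidates, not the PR's actual code):

import random

# Hypothetical candidates: (node_type, utilization_score, availability)
candidates = [
    ("spot-a", 0.9, 1.0),
    ("spot-b", 0.9, 1.0),
    ("spot-c", 0.9, 1.0),
]

# With equal score and availability, the trailing random.random() term
# spreads requests across node types instead of always picking the first.
best = max(
    candidates,
    key=lambda c: (c[1], c[2], random.random()),
)
print(best[0])  # varies run to run among spot-a / spot-b / spot-c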

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Nov 14, 2025
Collaborator

edoakes commented Nov 14, 2025

@rueian PTAL when you have a chance

@wxwmd wxwmd force-pushed the feature/autoscale_with_resource_availability branch from 0a213fe to d2dd29a Compare November 17, 2025 04:03

@cursor cursor bot left a comment


Bug: Cloud Allocation Timeout: Resources Misidentified

The is_cloud_instance_allocated() function doesn't include ALLOCATION_TIMEOUT in its set of statuses, but ALLOCATION_TIMEOUT represents an instance with a cloud instance allocated (stuck/pending). Code relying on this function will incorrectly identify ALLOCATION_TIMEOUT instances as not having cloud resources.

python/ray/autoscaler/v2/instance_manager/common.py#L52-L68

@staticmethod
def is_cloud_instance_allocated(instance_status: Instance.InstanceStatus) -> bool:
    """
    Returns True if the instance is in a status where there could exist
    a cloud instance allocated by the cloud provider.
    """
    assert instance_status != Instance.UNKNOWN
    return instance_status in {
        Instance.ALLOCATED,
        Instance.RAY_INSTALLING,
        Instance.RAY_RUNNING,
        Instance.RAY_STOPPING,
        Instance.RAY_STOP_REQUESTED,
        Instance.RAY_STOPPED,
        Instance.TERMINATING,
        Instance.RAY_INSTALL_FAILED,
        Instance.TERMINATION_FAILED,
    }
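If the bot's reading is correct, the fix is a one-line addition to that set. A sketch of the amended function (written at module level for brevity; it assumes Instance.ALLOCATION_TIMEOUT is a defined status, as the report implies, and the Instance proto import used by the v2 instance manager):

from ray.core.generated.instance_manager_pb2 import Instance  # assumed import path

def is_cloud_instance_allocated(instance_status: "Instance.InstanceStatus") -> bool:
    """Sketch of the suggested fix: ALLOCATION_TIMEOUT counts as possibly allocated."""
    assert instance_status != Instance.UNKNOWN
    return instance_status in {
        Instance.ALLOCATED,
        Instance.RAY_INSTALLING,
        Instance.RAY_RUNNING,
        Instance.RAY_STOPPING,
        Instance.RAY_STOP_REQUESTED,
        Instance.RAY_STOPPED,
        Instance.TERMINATING,
        Instance.RAY_INSTALL_FAILED,
        Instance.TERMINATION_FAILED,
        Instance.ALLOCATION_TIMEOUT,  # added: a timed-out allocation may still hold a cloud instance
    }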


Contributor Author

wxwmd commented Nov 18, 2025

@rueian I think this PR is ready to go, please review again

@wxwmd wxwmd requested a review from rueian November 20, 2025 01:46
@wxwmd wxwmd force-pushed the feature/autoscale_with_resource_availability branch from 80e20a3 to 5f9838e Compare November 24, 2025 11:56
@wxwmd wxwmd requested a review from a team as a code owner November 24, 2025 11:56
@wxwmd wxwmd force-pushed the feature/autoscale_with_resource_availability branch 2 times, most recently from dc5150f to eaaffa8 Compare November 24, 2025 13:12
行筠 added 4 commits November 24, 2025 21:20
xiaowen wei added 16 commits November 24, 2025 21:20
(each commit signed off by: xiaowen wei <[email protected]> and xiaowen.wxw <[email protected]>)
@wxwmd wxwmd force-pushed the feature/autoscale_with_resource_availability branch from 7bec773 to 1035b89 Compare November 24, 2025 13:21
@rueian rueian added the go add ONLY when ready to merge, run all tests label Nov 24, 2025
@rueian
Copy link
Contributor

rueian commented Nov 25, 2025

Hi @edoakes and @jjyao, this PR looks good to me. Please also review this when you have a chance. Thanks!

@edoakes edoakes merged commit 2de1c49 into ray-project:master Nov 25, 2025
7 checks passed
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025