[core] Autoscaler with resource availability #58623
Code Review
This pull request introduces a CloudResourceMonitor to enhance the autoscaler's node selection strategy by considering cloud resource availability. This is a valuable addition, especially for spot instances, as it allows the autoscaler to dynamically adapt to changing cloud conditions and minimize allocation failures. The implementation correctly integrates the monitor into the Autoscaler and Reconciler components, and updates the ResourceDemandScheduler to incorporate availability scores in its node sorting logic. The addition of random.random() to the sorting key is a good approach to diversify requests when other scores are equal. However, there is an inconsistency in time unit handling within the CloudResourceMonitor and a potential issue with a test case's expectation regarding the new sorting logic.
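The review mentions that a `random.random()` tiebreaker diversifies requests when other scores are equal. A minimal sketch of such a sort key, using hypothetical stand-in types (`Node`, `SchedulingResult`, and `sort_candidates` are illustrative names, not the PR's actual API):

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    node_type: str


@dataclass
class SchedulingResult:
    node: Node
    score: float  # the UtilizationScore


def sort_candidates(results, availabilities):
    """Order candidates by utilization score, then cloud availability,
    with a random tiebreak so equally-ranked node types are diversified."""
    return sorted(
        results,
        key=lambda r: (
            r.score,
            availabilities.get(r.node.node_type, 1.0),  # unknown types assumed available
            random.random(),  # randomize among fully tied candidates
        ),
        reverse=True,
    )
```

With this key, a node type whose allocations recently failed (low availability) only loses to a same-`UtilizationScore` alternative; it still wins against strictly worse-scoring types.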
@rueian PTAL when you have a chance
Bug: Cloud Allocation Timeout: Resources Misidentified
The is_cloud_instance_allocated() function omits ALLOCATION_TIMEOUT from its set of statuses, yet an instance in ALLOCATION_TIMEOUT may still have a cloud instance allocated (stuck or pending). Code relying on this function will therefore incorrectly treat ALLOCATION_TIMEOUT instances as having no cloud resources.
python/ray/autoscaler/v2/instance_manager/common.py, lines 52 to 68 at 5b1dcd1:

```python
@staticmethod
def is_cloud_instance_allocated(instance_status: Instance.InstanceStatus) -> bool:
    """
    Returns True if the instance is in a status where there could exist
    a cloud instance allocated by the cloud provider.
    """
    assert instance_status != Instance.UNKNOWN
    return instance_status in {
        Instance.ALLOCATED,
        Instance.RAY_INSTALLING,
        Instance.RAY_RUNNING,
        Instance.RAY_STOPPING,
        Instance.RAY_STOP_REQUESTED,
        Instance.RAY_STOPPED,
        Instance.TERMINATING,
        Instance.RAY_INSTALL_FAILED,
        Instance.TERMINATION_FAILED,
```

(The excerpt stops at line 68; the set continues past the quoted range.)
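One way to address the reviewer's point is to add ALLOCATION_TIMEOUT to the status set. The sketch below uses a minimal stand-in enum purely for illustration; the real statuses live on the protobuf-generated `Instance` type, so this is not the actual fix, just the shape of it:

```python
from enum import Enum, auto


class Status(Enum):
    # Minimal stand-in for the Instance status enum, for illustration only.
    UNKNOWN = auto()
    ALLOCATED = auto()
    RAY_INSTALLING = auto()
    RAY_RUNNING = auto()
    RAY_STOPPING = auto()
    RAY_STOP_REQUESTED = auto()
    RAY_STOPPED = auto()
    TERMINATING = auto()
    RAY_INSTALL_FAILED = auto()
    TERMINATION_FAILED = auto()
    ALLOCATION_TIMEOUT = auto()


def is_cloud_instance_allocated(status: Status) -> bool:
    """Variant of the predicate that treats a timed-out allocation as
    possibly still holding a cloud instance (stuck or pending)."""
    assert status != Status.UNKNOWN
    return status in {
        Status.ALLOCATED,
        Status.RAY_INSTALLING,
        Status.RAY_RUNNING,
        Status.RAY_STOPPING,
        Status.RAY_STOP_REQUESTED,
        Status.RAY_STOPPED,
        Status.TERMINATING,
        Status.RAY_INSTALL_FAILED,
        Status.TERMINATION_FAILED,
        Status.ALLOCATION_TIMEOUT,  # proposed addition per the review comment
    }
```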
@rueian I think this PR is ready to go, please review again
Signed-off-by: xiaowen wei <[email protected]> Signed-off-by: xiaowen.wxw <[email protected]>
## Description

When the Autoscaler receives a resource request and decides which type of node to scale up, only the `UtilizationScore` is considered (that is, Ray tries to avoid launching a large node for a small resource request, which would waste resources). If multiple node types in the cluster have the same `UtilizationScore`, Ray always requests the same node type.

In spot scenarios, cloud resources change dynamically, so we want the Autoscaler to be aware of cloud resource availability: if a certain node type becomes unavailable, the Autoscaler should automatically switch to requesting other node types.

This PR adds the `CloudResourceMonitor` class, which records node types that have failed resource allocation and, in future scaling events, reduces the weight of those node types.

## Related issues

Related to #49983
Fixes #53636 #39788 #39789

## Implementation details

1. `CloudResourceMonitor`: a subscriber of instance state changes. When an instance reaches the `ALLOCATION_FAILED` status, `CloudResourceMonitor` records its node type and lowers that type's availability score.

2. `ResourceDemandScheduler`: this class determines how to select the best node type to handle a resource request. The node-type selection now sorts by availability as well:

```python
# Sort the results by score.
results = sorted(
    results,
    key=lambda r: (
        r.score,
        cloud_resource_availabilities.get(r.node.node_type, 1),
    ),
    reverse=True,
)
```

The sorting considers:

1. `UtilizationScore`: to maximize resource utilization.
2. Cloud resource availabilities: prioritize node types with the most available cloud resources, to minimize allocation failures.

---------

Signed-off-by: xiaowen.wxw <[email protected]>
Co-authored-by: 行筠 <[email protected]>
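The monitor described above can be sketched as follows. This is a hypothetical illustration of the idea, not the PR's actual class: names, the penalty value, and the time-based recovery are assumptions (the PR text only says failed node types get a lower availability score).

```python
import time


class CloudResourceMonitor:
    """Illustrative sketch: track node types whose cloud allocations failed
    and report a lowered availability score until a recovery window passes."""

    def __init__(self, penalty: float = 0.0, recovery_s: float = 600.0):
        self._penalty = penalty        # score reported while a type is penalized
        self._recovery_s = recovery_s  # seconds before the type is trusted again
        self._failed_at: dict[str, float] = {}

    def notify_allocation_failed(self, node_type: str) -> None:
        # Called when an instance of this type hits ALLOCATION_FAILED.
        self._failed_at[node_type] = time.monotonic()

    def availability(self, node_type: str) -> float:
        # 1.0 for types with no recent failure; penalized score otherwise.
        failed = self._failed_at.get(node_type)
        if failed is None:
            return 1.0
        if time.monotonic() - failed >= self._recovery_s:
            del self._failed_at[node_type]  # penalty expired, trust the type again
            return 1.0
        return self._penalty
```

A scheduler could then pass `{t: monitor.availability(t) for t in node_types}` as the `cloud_resource_availabilities` mapping used in the sort key, so penalized types only lose ties against equally utilization-scored alternatives.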