
[Autoscaler] Autoscaler gets into infinite cycle of removing and adding nodes, never satisfies placement group #50783

Open
Tracked by #2600
jleben opened this issue Feb 21, 2025 · 0 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-autoscaler: autoscaler related issues
P1: Issue that should be fixed within a few weeks

Comments

jleben commented Feb 21, 2025

What happened + What you expected to happen

The autoscaler often removes existing nodes while it is adding nodes to meet the demand of a placement group. This can put it into an endless cycle of removing and adding nodes that never satisfies the placement group demand.

I have noticed that this issue is more prominent when idle_timeout_minutes is smaller. However, I do not want to increase the idle timeout, because that wastes resources: to avoid this issue I found I have to make the timeout significantly longer than the time it takes to start up a node, which can be many minutes in the case of a large Docker image. I also do not think I should have to increase the timeout: while the autoscaler is trying to satisfy a placement group, it should not treat existing nodes that could help satisfy it as idle.
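
As an illustration of the workaround described above (not a recommended fix), the only cluster-config change involved is raising the idle timeout well past the worker startup time, for example:

idle_timeout_minutes: 10   # illustrative value only; must exceed node startup time (image pull + setup)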

I have observed the same issue when using KubeRay as well.

Versions / Dependencies

Ray version: 2.42.1

Reproduction script

Cluster config:

cluster_name: 'debug'

max_workers: 10
upscaling_speed: 1.0
idle_timeout_minutes: 1

docker:
    image: 'rayproject/ray:latest'
    container_name: "ray_container"
    pull_before_run: True
    run_options:
        - "--ulimit nofile=65536:65536"

provider:
    type: aws
    region: us-east-1
    cache_stopped_nodes: True
    use_internal_ips: True

auth:
    ssh_user: ubuntu
    ssh_private_key: '~/.ssh/jakobleben-ec2.pem'

available_node_types:
    ray.head.default:
        node_config: &node_config
            InstanceType: m5.large
            ImageId: ami-0dd6adfad4ad37eec
            SecurityGroupIds: ['sg-0c15118bc2b72e433']
            SubnetId: subnet-05324181c6015696c
            IamInstanceProfile:
                Name: 'jakobhawkeyeHawkeyeRayNodeProfile'
            KeyName: 'jakobleben'
            BlockDeviceMappings:
                - DeviceName: /dev/xvda
                  Ebs:
                      VolumeSize: 80
                      VolumeType: gp3
    ray.worker:
        min_workers: 0
        max_workers: 10
        resources: {}
        node_config:
            <<: *node_config

head_node_type: ray.head.default

file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: False

rsync_exclude:
    - "__pycache__"
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []
setup_commands: []
head_setup_commands: []
worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Reproducing application (autoscaling.py):

import ray
from ray.util.placement_group import placement_group

ray.init()

print("Creating placement group...")
pg = placement_group([{"CPU": 1}] * 4)  # Use this in first job
# pg = placement_group([{"CPU": 1}] * 10)  # Use this in second job

created = pg.wait(timeout_seconds=60 * 60)  # Allow 60 minutes
print(f"Finished. Created = {created}")

Submit the first job:

ray job submit --working-dir . -- python3 autoscaling.py

It will start 2 worker nodes (since each of the head and worker nodes has 2 CPUs) and complete with a log like this:

...
Creating placement group...
Finished. Created = True
...

Edit the application to comment out the line ending in "Use this in first job" and uncomment the line ending in "Use this in second job"; this increases the placement group from 4 to 10 one-CPU bundles, as shown below.
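
After the edit, the two placement_group lines in autoscaling.py read:

# pg = placement_group([{"CPU": 1}] * 4)  # Use this in first job
pg = placement_group([{"CPU": 1}] * 10)  # Use this in second job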

At least 30 seconds after the first job completes, but while the 2 worker nodes are still up, submit the second job:

ray job submit --working-dir . -- python3 autoscaling.py

It will go into a cycle like this and may never complete:

(autoscaler +6s) Adding 2 node(s) of type ray.worker.
(autoscaler +22s) Removing 2 nodes of type ray.worker (idle).
(autoscaler +33s) Resized to 2 CPUs.
(autoscaler +2m31s) Resized to 6 CPUs.
(autoscaler +3m28s) Removing 2 nodes of type ray.worker (idle).
(autoscaler +3m33s) Adding 2 node(s) of type ray.worker.
(autoscaler +3m38s) Resized to 2 CPUs.
(autoscaler +5m20s) Resized to 6 CPUs.
(autoscaler +6m17s) Removing 2 nodes of type ray.worker (idle).
(autoscaler +6m27s) Resized to 2 CPUs.
(autoscaler +6m43s) Adding 2 node(s) of type ray.worker.
(autoscaler +6m53s) Adding 2 node(s) of type ray.worker.
(autoscaler +7m35s) Resized to 6 CPUs.
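
While the cycle is running, the pending placement group demand can be checked in the autoscaler status report on the head node (the exact output format varies by Ray version):

ray status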

Issue Severity

Medium

jleben added the bug and triage labels on Feb 21, 2025
jcotant1 added the core label on Feb 21, 2025
jjyao added the core-autoscaler and P1 labels and removed the triage label on Feb 24, 2025