
[Autoscaler] Autoscaler gets into infinite cycle of removing and adding nodes, never satisfies placement group #50783

Open
Tracked by #2600
jleben opened this issue Feb 21, 2025 · 0 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-autoscaler: autoscaler related issues
P1: Issue that should be fixed within a few weeks

Comments

jleben commented Feb 21, 2025

What happened + What you expected to happen

The autoscaler often removes existing nodes while it is adding nodes to meet the demand of a placement group. This can put it into an endless cycle of removing and adding nodes that never satisfies the placement group demand.

I have noticed that this issue is more prominent when idle_timeout_minutes is smaller. However, I do not want to increase the idle timeout, because that wastes resources: to avoid this issue I found I have to make the timeout significantly longer than the time it takes to start up a node, which can be many minutes in the case of a large Docker image. I also do not think I should have to increase the timeout: while the autoscaler is trying to satisfy a placement group, it should not treat existing nodes that could help satisfy it as idle.
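
As an illustration of the workaround described above (not a recommended fix), the only cluster-config change involved is raising the idle timeout well past the worker startup time, for example:

idle_timeout_minutes: 10   # illustrative value only; must exceed node startup time (image pull + setup)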

I have observed the same issue when using KubeRay as well.

Versions / Dependencies

Ray version: 2.42.1

Reproduction script

Cluster config:

cluster_name: 'debug'

max_workers: 10
upscaling_speed: 1.0
idle_timeout_minutes: 1

docker:
    image: 'rayproject/ray:latest'
    container_name: "ray_container"
    pull_before_run: True
    run_options:
        - "--ulimit nofile=65536:65536"

provider:
    type: aws
    region: us-east-1
    cache_stopped_nodes: True
    use_internal_ips: True

auth:
    ssh_user: ubuntu
    ssh_private_key: '~/.ssh/jakobleben-ec2.pem'

available_node_types:
    ray.head.default:
        node_config: &node_config
            InstanceType: m5.large
            ImageId: ami-0dd6adfad4ad37eec
            SecurityGroupIds: ['sg-0c15118bc2b72e433']
            SubnetId: subnet-05324181c6015696c
            IamInstanceProfile:
                Name: 'jakobhawkeyeHawkeyeRayNodeProfile'
            KeyName: 'jakobleben'
            BlockDeviceMappings:
                - DeviceName: /dev/xvda
                  Ebs:
                      VolumeSize: 80
                      VolumeType: gp3
    ray.worker:
        min_workers: 0
        max_workers: 10
        resources: {}
        node_config:
            <<: *node_config

head_node_type: ray.head.default

file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: False

rsync_exclude:
    - "__pycache__"
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []
setup_commands: []
head_setup_commands: []
worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Reproducing application (autoscaling.py):

import ray
from ray.util.placement_group import placement_group

ray.init()

print("Creating placement group...")
pg = placement_group([{"CPU": 1}] * 4)  # Use this in first job
# pg = placement_group([{"CPU": 1}] * 10)  # Use this in second job

created = pg.wait(timeout_seconds=60 * 60)  # Allow 60 minutes
print(f"Finished. Created = {created}")

Submit the first job:

ray job submit --working-dir . -- python3 autoscaling.py

It will start 2 worker nodes (since each of the head and worker nodes has 2 CPUs) and complete with a log like this:

...
Creating placement group...
Finished. Created = True
...

Edit the application to comment out the line ending in "Use this in first job" and uncomment the line ending in "Use this in second job"; this increases the placement group from 4 to 10 one-CPU bundles, as shown below.
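
After the edit, the two placement_group lines in autoscaling.py read:

# pg = placement_group([{"CPU": 1}] * 4)  # Use this in first job
pg = placement_group([{"CPU": 1}] * 10)  # Use this in second job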

At least 30 seconds after the first job completes, but while the 2 worker nodes are still up, submit the second job:

ray job submit --working-dir . -- python3 autoscaling.py

It will go into a cycle like this and may never complete:

(autoscaler +6s) Adding 2 node(s) of type ray.worker.
(autoscaler +22s) Removing 2 nodes of type ray.worker (idle).
(autoscaler +33s) Resized to 2 CPUs.
(autoscaler +2m31s) Resized to 6 CPUs.
(autoscaler +3m28s) Removing 2 nodes of type ray.worker (idle).
(autoscaler +3m33s) Adding 2 node(s) of type ray.worker.
(autoscaler +3m38s) Resized to 2 CPUs.
(autoscaler +5m20s) Resized to 6 CPUs.
(autoscaler +6m17s) Removing 2 nodes of type ray.worker (idle).
(autoscaler +6m27s) Resized to 2 CPUs.
(autoscaler +6m43s) Adding 2 node(s) of type ray.worker.
(autoscaler +6m53s) Adding 2 node(s) of type ray.worker.
(autoscaler +7m35s) Resized to 6 CPUs.
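
While the cycle is running, the pending placement group demand can be checked in the autoscaler status report on the head node (the exact output format varies by Ray version):

ray status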

Issue Severity

Medium

jleben added the bug and triage labels on Feb 21, 2025
jcotant1 added the core label on Feb 21, 2025
jjyao added the core-autoscaler and P1 labels and removed the triage label on Feb 24, 2025