[Autoscaler] Autoscaler gets into infinite cycle of removing and adding nodes, never satisfies placement group #50783
Labels
- bug: Something that is supposed to be working, but isn't
- core: Issues that should be addressed in Ray Core
- core-autoscaler: autoscaler related issues
- P1: Issue that should be fixed within a few weeks
What happened + What you expected to happen
The autoscaler often removes existing nodes while it is adding nodes to meet the demand of a placement group. This can put it into an endless cycle of removing and adding nodes in which the placement group demand is never satisfied.
I have noticed that this issue is more prominent when `idle_timeout_minutes` is smaller. However, I do not want to increase the idle timeout, because that wastes resources: to avoid this issue I found I need to make the timeout significantly longer than the time it takes to start up a node, which can be many minutes in the case of a large Docker image. I also do not think I should have to increase the timeout: while the autoscaler is trying to satisfy a placement group, it should not consider any existing nodes that would help satisfy it as idle.

I have observed the same issue when using KubeRay.
Versions / Dependencies
Ray version: 2.42.1
Reproduction script
Cluster config:
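A minimal cluster-launcher config along these lines matches the setup described (2-CPU head and worker nodes, short idle timeout); the provider, region, instance types, and worker limits below are assumptions, not the exact config used:

```yaml
# Minimal sketch consistent with the report; provider settings, instance
# types, and worker limits are assumptions.
cluster_name: autoscaler-pg-repro

# A short idle timeout makes the remove/add cycle easier to reproduce.
idle_timeout_minutes: 1
max_workers: 4

provider:
  type: aws          # assumption; the same behavior is also seen on KubeRay
  region: us-west-2  # assumption

available_node_types:
  head_node:
    resources: {"CPU": 2}
    node_config:
      InstanceType: m5.large   # 2 vCPUs; assumption
  worker_node:
    min_workers: 0
    max_workers: 4
    resources: {"CPU": 2}
    node_config:
      InstanceType: m5.large   # 2 vCPUs; assumption

head_node_type: head_node
```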
Reproducing application (autoscaling.py):
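A minimal autoscaling.py along these lines matches the steps described: it creates a placement group whose size is toggled between the two jobs by commenting/uncommenting one line, then blocks until the group is ready. The bundle counts and the placement strategy are assumptions:

```python
# Hypothetical sketch of autoscaling.py; bundle counts and strategy are
# assumptions chosen to match the 2-CPU node sizes described above.
import ray
from ray.util.placement_group import placement_group

ray.init()

num_bundles = 6    # Use this in first job
# num_bundles = 10   # Use this in second job

# Request num_bundles 1-CPU bundles; with 2-CPU nodes this forces the
# autoscaler to add worker nodes before the placement group can be scheduled.
pg = placement_group([{"CPU": 1}] * num_bundles, strategy="PACK")
print(f"Waiting for placement group with {num_bundles} bundles...")
ray.get(pg.ready())
print("Placement group is ready.")
```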
Submit first job:
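Assuming the script is submitted with the Ray Jobs CLI (submitting it any other way against the same cluster should behave the same):

```bash
# Assumed submission command
ray job submit --working-dir . -- python autoscaling.py
```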
It will start 2 worker nodes (the head node and each worker node have 2 CPUs each) and then complete once the placement group is ready.
Edit the application to comment out the line ending in "Use this in first job" and uncomment the line ending in "Use this in second job"; this increases the size of the placement group.
At least 30 seconds after the first job completes, but while the 2 worker nodes are still up, submit the second job:
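Again assuming the Ray Jobs CLI, the same command as before resubmits the script, now requesting the larger placement group:

```bash
# Same assumed command as for the first job
ray job submit --working-dir . -- python autoscaling.py
```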
It will get stuck in a cycle of removing and adding worker nodes and may never complete.
Issue Severity
Medium