[core] autoscaler occasionally goes into exception loop when using preemptible GCP instances #29698
Comments
@wuisawesome could you help triage this?
Same story as OP 😢
More details: the issue also replicates with TPUs (just a couple is fine; maybe 4-8). UPDATE:
The issue is a known bug in the GCP provider of the cluster launcher. The Ray autoscaler performs two primary functions:

1. Launching new worker instances when there is demand for more resources.
2. Terminating worker instances that have been idle past the timeout.
The problem arises during the first step. The cluster launcher code assumes that instances remain available once created; however, external actions such as manual termination or spot preemption break this assumption. When that happens, the cluster launcher does not handle the unexpected exception properly and retries the operation indefinitely. This behavior is a consequence of the cluster launcher being designed primarily for bootstrapping and prototyping Ray projects. Note that this issue does not affect the Anyscale platform, which uses a different, proprietary autoscaler. To avoid the problem, you may consider leveraging Anyscale's autoscaling capabilities; otherwise, you would need to implement additional handling to manage autoscaling reliably.
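For illustration only (this is not the autoscaler's actual code), the kind of defensive handling that would break the retry loop is to treat a 404 from the GCP API as "this node is gone" rather than as a fatal error; the function and its arguments below are hypothetical:

```python
from googleapiclient import discovery
from googleapiclient.errors import HttpError

compute = discovery.build("compute", "v1")

def instance_still_exists(project: str, zone: str, name: str) -> bool:
    """Return False, instead of raising, if the instance was preempted or deleted."""
    try:
        compute.instances().get(project=project, zone=zone, instance=name).execute()
        return True
    except HttpError as e:
        if e.resp.status == 404:
            # The instance was removed outside the launcher's control
            # (e.g. a preemptible/spot reclaim); treat it as terminated.
            return False
        raise
```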
What happened + What you expected to happen
I use a Ray cluster on Google Cloud Platform for my tasks. One thing to note is that I use preemptible instances for the workers (thus, Google may stop them at any time).
After a while (about 30-40 minutes of active usage), scaling stops working: no new workers come up, and no old workers are destroyed after the idle timeout (moreover, some workers are up but not initialized). I've debugged the issue down to what looks like an infinite exception-restart loop in `/tmp/ray/session_latest/logs/monitor.log` on the head node; the relevant part of the log shows the same exception repeating again and again with the same worker id `ray-research-worker-cbcbb628-compute`.

The `ray-research-worker-cbcbb628-compute` instance seems to have indeed existed, but it no longer exists at the moment of the exception (so a 404 response from GCP is justified). I believe (though I'm not sure) that the situation is something like this: the autoscaler creates a preemptible worker instance, GCP preempts and deletes it shortly afterwards, and the autoscaler then keeps requesting the now-missing instance, getting a 404, raising the exception, and retrying in a loop.
The expected behavior is that the autoscaler should handle this case and continue to set up other workers, shut down idle ones, etc.
Versions / Dependencies
Google Cloud Platform is used, and preemptible instances are used for the workers (see the config below).
Reproduction script
Config:
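The original config is not included here. For illustration only, a sketch of the kind of cluster config involved is below; the project, zone, machine types, resources, and worker counts are placeholders, and the disk/image settings are omitted. The key part is the `scheduling` block under the worker `node_config`, which makes the workers preemptible:

```yaml
cluster_name: research

provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-a
    project_id: my-project          # placeholder

max_workers: 8

available_node_types:
    head_node:
        resources: {"CPU": 4}
        node_config:
            machineType: n1-standard-4
    worker_node:
        min_workers: 0
        max_workers: 8
        resources: {"CPU": 4}
        node_config:
            machineType: n1-standard-4
            # Run workers on preemptible instances; GCP may reclaim them at any time.
            scheduling:
              - preemptible: true

head_node_type: head_node
```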
Script:
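The original script is likewise not shown. As a stand-in, a minimal load-generating script of the following kind is enough to make the autoscaler bring up (and later tear down) preemptible workers; the task duration and task count are arbitrary:

```python
import time

import ray

ray.init(address="auto")  # connect to the existing cluster

@ray.remote(num_cpus=1)
def work(i: int) -> int:
    # Long enough that preemptions can happen while workers are busy or still
    # being set up by the autoscaler.
    time.sleep(120)
    return i

if __name__ == "__main__":
    # Request more parallel work than the current nodes can handle, forcing scale-up.
    results = ray.get([work.remote(i) for i in range(64)])
    print(f"finished {len(results)} tasks")
```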
In order to reproduce the issue, you may have to submit the script to the cluster several times so that an instance shutdown is caught in the right state.
Issue Severity
High: It blocks me from completing my task.