Remote calls hang when requesting GPU resources from an autoscaling EC2 cluster #2192
Comments
Thanks for filing the issue. Is this the same as #2106?
Yes, it seems like the same issue. In my application I am just using the autoscaler to launch GPU instances when work needs to be done, and to kill them when the run ends. Monitoring utilization and periodically rescaling isn't really necessary.
It also seems problematic that the autoscaling GPU cluster config provided in the docs fails in this way.
A (nasty) workaround: does your application absolutely need GPUs? If you are using TensorFlow, you can normally run code designed for a GPU on a CPU; it will just run much more slowly. In that case, you could adjust the config to make Ray believe that the head_node actually has 1 GPU.
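A minimal sketch of that fake-GPU trick, assuming the driver starts its own Ray instance (on a cluster, the equivalent would presumably be passing --num-gpus=1 to ray start on the head node); the task name and body are placeholders:

```python
import ray

# Tell Ray that this node has one GPU even though none is physically present;
# Ray does not verify the hardware, so tasks that request a GPU can be scheduled here.
ray.init(num_gpus=1)

@ray.remote(num_gpus=1)
def train_step():
    # Placeholder body; the real workload would fall back to the (much slower) CPU path.
    return "ran against the faked GPU slot"

print(ray.get(train_step.remote()))
```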
CPU is about 20-30x slower for my application (comparing a V100 GPU to an E5-1650 v4 CPU), so I really want to stick with the GPU. Faking a single GPU on the driver sounds promising! How would I go about doing that? To get the right behavior I would then need to convince the autoscaler that the fake GPU is at 100% utilization when there are jobs in the queue, but 0% when there are none. How could I accomplish that? Or better yet, where in the Ray code base are these decisions made? It would probably be best to just fix this issue, though, since, again, it causes a tutorial example to fail.
I think the right fix is to propagate the queue length for resources to the autoscaler; that way we can make the correct decision even when the current capacity is 0. Btw, there are a couple of workarounds for now:
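To illustrate the proposed fix (a toy sketch only, not Ray's actual autoscaler code; all names are made up): a purely utilization-based policy can never scale up from zero GPU capacity, because there is nothing to be "utilized", whereas a policy that also sees the queued resource demand can.

```python
# Toy illustration only; not Ray's autoscaler code. All names are made up.

def upscale_by_utilization(current_gpus: int, used_gpus: int,
                           threshold: float = 0.8) -> bool:
    # With zero GPU capacity there is no utilization to measure,
    # so this policy can never trigger a scale-up from zero.
    if current_gpus == 0:
        return False
    return used_gpus / current_gpus >= threshold

def upscale_by_demand(current_gpus: int, used_gpus: int,
                      queued_gpu_demand: int) -> bool:
    # Queued GPU requests count as demand even when capacity is zero,
    # which is exactly the situation hit by this issue.
    return used_gpus + queued_gpu_demand > current_gpus

print(upscale_by_utilization(current_gpus=0, used_gpus=0))   # False: cluster never grows
print(upscale_by_demand(current_gpus=0, used_gpus=0,
                        queued_gpu_demand=1))                 # True: launch a GPU worker
```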
System information
When using an autoscaling cluster setup on AWS (see Autoscaling GPU Cluster), running a remote function that requires GPU resources causes an indefinite hang. No worker instances are started.
Relevant bits of mycluster.yaml:
Code on the driver:
Remote functions with num_cpus=1 run fine (but in a small test like this, the driver just takes care of executing everything itself).
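For illustration, the failing pattern looks roughly like this (the address, task name, and body are assumptions, not the original snippets): a task that requests num_gpus=1 is submitted while the cluster has zero GPU capacity, and ray.get blocks forever because no GPU worker is ever launched.

```python
import ray

# Connect to the cluster started by the autoscaler; the address is a placeholder
# (older Ray versions take redis_address, newer ones take address).
ray.init(redis_address="<head-node-ip>:6379")

@ray.remote(num_gpus=1)
def gpu_task():
    return "done"

# Hangs indefinitely: the cluster currently has 0 GPUs and, as described above,
# the autoscaler never launches a GPU worker to satisfy the queued request.
print(ray.get(gpu_task.remote()))
```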