Remote calls hang when requesting GPU resources from an autoscaling EC2 cluster #2192

Open
npyoung opened this issue Jun 5, 2018 · 6 comments
Labels
P3 Issue moderate in impact or severity

Comments

@npyoung

npyoung commented Jun 5, 2018

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04 (Deep Learning AMI)
  • Ray installed from (source or binary): pip
  • Ray version: 0.3.1
  • Python version: 3.6.3

When using an autoscaling cluster setup on AWS (see Autoscaling GPU Cluster), running a remote function that requires GPU resources causes an indefinite hang. No worker instances are started.

Relevant bits of mycluster.yaml:

min_workers: 0
max_workers: 10
target_utilization_fraction: 0.8
idle_timeout_minutes: 5

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2b

head_node:
  InstanceType: m5.large
  ImageId: ami-3b6bce43

worker_nodes:
  InstanceType: p2.16xlarge
  ImageId: ami-3b6bce43  # Amazon Deep Learning AMI (Ubuntu)
  InstanceMarketOptions:
    MarketType: spot
    SpotOptions:
      MaxPrice: 10.00

Code on the driver:

import ray
ray.init(redis_address=ray.services.get_node_ip_address() + ":6379")

@ray.remote(num_gpus=1)
def gpu_test(x):
    return x**2

print(ray.get([gpu_test.remote(i) for i in range(10)]))

Remote functions with num_cpus=1 run fine (but in a small test like this, the driver just takes care of executing everything itself).

@npyoung changed the title from "remote calls hang when requesting GPU resources from an autoscaling EC2 cluster" to "Remote calls hang when requesting GPU resources from an autoscaling EC2 cluster" on Jun 5, 2018
@robertnishihara
Collaborator

Thanks for filing the issue. Is this the same as #2106?

@npyoung
Author

npyoung commented Jun 5, 2018

Yes, seems like the same issue.

In my application I am just using the autoscaler to launch GPU instances when work needs to be done, and kill them when the run ends. Monitoring utilization and periodically rescaling isn't really necessary. The idle_timeout_minutes: 5 setting should handle killing my workers [EDIT: this is not what idle_timeout_minutes does!], so I really just need a way to launch workers. Is it possible to programmatically launch a few workers explicitly every time work is submitted?

@npyoung
Author

npyoung commented Jun 5, 2018

Also it seems problematic that the autoscaling GPU cluster config provided in the docs fails in this way.

@AdamGleave
Contributor

A (nasty) workaround: does your application 100% need GPUs? If you're using TensorFlow, you can normally run code designed for a GPU on a CPU; it will just run much more slowly. In that case, you could adjust the config to make Ray believe that the head_node actually has 1 GPU.
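
For reference, one way this could be done (an untested sketch, not from this thread) is to pass --num-gpus=1 to the ray start command that the autoscaler runs on the head node; the other flags below are illustrative and may differ from your generated config:

# Sketch only: advertise one (fake) GPU on the head node so that
# num_gpus=1 tasks can be scheduled there, even though the m5.large
# has no physical GPU. Flags other than --num-gpus are illustrative.
head_start_ray_commands:
  - ray stop
  - ray start --head --redis-port=6379 --num-gpus=1 --autoscaling-config=~/ray_bootstrap_config.yaml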

@npyoung
Author

npyoung commented Jun 6, 2018

CPU is about 20-30x slower for my application (comparing a V100 GPU to an E5-1650 v4 CPU), so I really want to stick with GPU for this.

Faking a single GPU on the driver sounds promising! How would I go about doing that? To get the right behavior I would then need to convince the autoscaler that the fake GPU is at 100% utilization when there are jobs in the queue, but 0% when there are no jobs. How could I accomplish that?

Or better yet, where in the ray code base are these decisions made? Probably best to just fix this issue since, again, it causes a tutorial example to fail.

@ericl
Contributor

ericl commented Jun 7, 2018

I think the right fix is to propagate the queue length for each resource to the autoscaler; that way we can make the scaling decision correctly even when the current capacity is 0.

Btw, there are a couple workarounds for now:

  1. Use ray create_or_update cluster.yaml --min-workers=1 --no-restart to force a resize when needed.
  2. Launch a bunch of CPU tasks at the start of your job to force scale-up, e.g.
import time

@ray.remote(num_cpus=1)
def f():
    time.sleep(10)

[f.remote() for _ in range(100)]  # force scale-up to >0 workers

@edoakes added the P3 (Issue moderate in impact or severity) label on Mar 19, 2020