Remote calls hang when requesting GPU resources from an autoscaling EC2 cluster #2192

Open
npyoung opened this issue Jun 5, 2018 · 6 comments
Labels
P3 Issue moderate in impact or severity

Comments

@npyoung

npyoung commented Jun 5, 2018

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04 (Deep Learning AMI)
  • Ray installed from (source or binary): pip
  • Ray version: 0.3.1
  • Python version: 3.6.3

When using an autoscaling cluster setup on AWS (see Autoscaling GPU Cluster), running a remote function that requires GPU resources causes an indefinite hang. No worker instances are started.

Relevant bits of mycluster.yaml:

min_workers: 0
max_workers: 10
target_utilization_fraction: 0.8
idle_timeout_minutes: 5

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2b

head_node:
  InstanceType: m5.large
  ImageId: ami-3b6bce43

worker_nodes:
  InstanceType: p2.16xlarge
  ImageId: ami-3b6bce43  # Amazon Deep Learning AMI (Ubuntu)
  InstanceMarketOptions:
    MarketType: spot
    SpotOptions:
      MaxPrice: 10.00

Code on the driver:

import ray
ray.init(redis_address=ray.services.get_node_ip_address() + ":6379")

@ray.remote(num_gpus=1)
def gpu_test(x):
    return x**2

print(ray.get([gpu_test.remote(i) for i in range(10)]))

Remote functions with num_cpus=1 run fine (but in a small test like this, the driver just takes care of executing everything itself).

@npyoung changed the title from "remote calls hang when requesting GPU resources from an autoscaling EC2 cluster" to "Remote calls hang when requesting GPU resources from an autoscaling EC2 cluster" on Jun 5, 2018
@robertnishihara
Collaborator

Thanks for filing the issue. Is this the same as #2106?

@npyoung
Author

npyoung commented Jun 5, 2018

Yes, seems like the same issue.

In my application I am just using the autoscaler to launch GPU instances when work needs to be done, and kill them when the run ends. Monitoring utilization and periodically rescaling isn't really necessary. The idle_timeout_minutes: 5 setting should handle killing my workers [EDIT: this is not what idle_timeout_minutes does!], so I really just need a way to launch workers. Is it possible to programmatically launch a few workers explicitly every time work is submitted?

@npyoung
Author

npyoung commented Jun 5, 2018

Also it seems problematic that the autoscaling GPU cluster config provided in the docs fails in this way.

@AdamGleave
Contributor

A (nasty) workaround: does your application 100% need GPUs? If you're using TensorFlow, you can normally run code designed for a GPU on a CPU; it will just run much more slowly. In that case, you could adjust the config to make Ray believe that the head_node actually has 1 GPU.
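
For reference, one way this could be done (an untested sketch, not from this thread) is to pass --num-gpus=1 to the ray start command that the autoscaler runs on the head node; the other flags below are illustrative and may differ from your generated config:

# Sketch only: advertise one (fake) GPU on the head node so that
# num_gpus=1 tasks can be scheduled there, even though the m5.large
# has no physical GPU. Flags other than --num-gpus are illustrative.
head_start_ray_commands:
  - ray stop
  - ray start --head --redis-port=6379 --num-gpus=1 --autoscaling-config=~/ray_bootstrap_config.yaml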

@npyoung
Author

npyoung commented Jun 6, 2018

CPU is about 20-30x slower for my application (comparing a V100 GPU to an E5-1650 v4 CPU), so I really want to stick with GPU for this.

Faking a single GPU on the driver sounds promising! How would I go about doing that? To get the right behavior I would then need to convince the autoscaler that the fake GPU is at 100% utilization when there are jobs in the queue, but 0% when there are no jobs. How could I accomplish that?

Or better yet, where in the ray code base are these decisions made? Probably best to just fix this issue since, again, it causes a tutorial example to fail.

@ericl
Contributor

ericl commented Jun 7, 2018

I think the right fix is to propagate the queue length for each resource to the autoscaler; that way we can make the scaling decision correctly even when the current capacity is 0.

Btw, there are a couple workarounds for now:

  1. Use ray create_or_update cluster.yaml --min-workers=1 --no-restart to force a resize when needed.
  2. Launch a bunch of CPU tasks at the start of your job to force scale-up, e.g.
import time

@ray.remote(num_cpus=1)
def f():
    time.sleep(10)

[f.remote() for _ in range(100)]  # force scale-up to >0 workers

@edoakes added the P3 (Issue moderate in impact or severity) label on Mar 19, 2020