Describe the issue:
My GPU worker cannot start via the command

```python
"command": [
    "dask-cuda-worker" if self._worker_gpu else "dask-worker",
    "--nthreads",
    "{}".format(
        max(int(self._worker_cpu / 1024), 1)
        if self._worker_nthreads is None
        else self._worker_nthreads
    ),
    "--memory-limit",
    "{}MB".format(int(self._worker_mem)),
    "--death-timeout",
    "60",
]
```
that gets passed in from `ecs.py`. Dask-cuda seems to have removed the `--death-timeout` option, so upon startup of the worker, I see:
```
Usage: dask-cuda-worker [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...
Try 'dask-cuda-worker --help' for help.
Error: Got unexpected extra argument: (60)
```
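For what it's worth, one possible fix on the dask-cloudprovider side would be to only append the flag when the plain `dask-worker` CLI is used. Below is a minimal sketch, not the library's actual code: the helper name and example values are mine, and the assumption that `dask-cuda-worker` rejects the flag is just my reading of the error above.

```python
def _build_worker_command(worker_gpu, worker_cpu, worker_mem, worker_nthreads=None):
    # Mirrors the ecs.py snippet above, but only appends --death-timeout for the
    # plain dask-worker CLI (assumption: dask-cuda-worker rejects that option).
    command = [
        "dask-cuda-worker" if worker_gpu else "dask-worker",
        "--nthreads",
        str(
            worker_nthreads
            if worker_nthreads is not None
            else max(int(worker_cpu / 1024), 1)
        ),
        "--memory-limit",
        "{}MB".format(int(worker_mem)),
    ]
    if not worker_gpu:
        command += ["--death-timeout", "60"]
    return command

# Example: a GPU worker with 4 vCPUs (4096 CPU units) and 16 GB of memory
print(_build_worker_command(worker_gpu=1, worker_cpu=4096, worker_mem=16384))
# -> ['dask-cuda-worker', '--nthreads', '4', '--memory-limit', '16384MB']
```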
Unfortunately I'm running this from Prefect, so I can't pin dask-cuda and distributed to a version old enough to still accept this argument. When I do pin the scheduler/worker container to an older version, the newer distributed on the Prefect agent container doesn't play nicely with the scheduler/worker, and the agent logs this error:
```
2022-10-13 21:25:50,708 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/protocol/core.py", line 158, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 205, in msgpack._cmsgpack.unpackb
ValueError: Unpack failed: incomplete input
21:22:17.913 | INFO | prefect.task_runner.dask - Creating a new Dask cluster with `__prefect_loader__.<lambda>`
```
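A quick way to double-check that this really is a version mismatch (rather than anything ECS-specific) would be something like the sketch below, run from the Prefect agent container. The scheduler address is made up; `Client.get_versions(check=True)` is the stock distributed check that compares package versions across client, scheduler and workers.

```python
from distributed import Client

client = Client("tcp://my-scheduler:8786")  # hypothetical scheduler address
# Flags mismatched packages (distributed, msgpack, ...) between the client,
# the scheduler and the workers.
client.get_versions(check=True)
```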
Minimal Complete Verifiable Example:
Run a Docker image with the following, and you should see the error.
I would say this is an important feature. One of dask-cloudprovider's goals is to fail cheaply, so if a worker cannot connect to a scheduler within a timeout it should shut down/terminate to save money.
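For context, that fail-cheaply behaviour is exactly what `--death-timeout` provides. A rough sketch with the Python worker API, using a deliberately unreachable scheduler address:

```python
import asyncio
from distributed import Worker

async def main():
    try:
        # With death_timeout set, a worker that cannot reach its scheduler gives
        # up after ~60 seconds instead of running (and costing money) indefinitely.
        async with Worker("tcp://10.255.255.1:8786", death_timeout=60):
            pass
    except Exception as exc:
        print(f"worker gave up: {exc!r}")

asyncio.run(main())
```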
Anything else we need to know?:
Environment: