Describe the issue:
My GPU worker cannot start via the command

```python
"command": [
    "dask-cuda-worker" if self._worker_gpu else "dask-worker",
    "--nthreads",
    "{}".format(
        max(int(self._worker_cpu / 1024), 1)
        if self._worker_nthreads is None
        else self._worker_nthreads
    ),
    "--memory-limit",
    "{}MB".format(int(self._worker_mem)),
    "--death-timeout",
    "60",
]
```
that gets passed in from `ecs.py`. Dask-cuda seems to have removed the `--death-timeout` option, so upon startup of the worker, I see:
```
Usage: dask-cuda-worker [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...
Try 'dask-cuda-worker --help' for help.
Error: Got unexpected extra argument: (60)
```
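For what it's worth, one possible fix on the dask-cloudprovider side would be to only append the flag when the plain `dask-worker` CLI is used. Below is a minimal sketch, not the library's actual code: the helper name and example values are mine, and the assumption that `dask-cuda-worker` rejects the flag is just my reading of the error above.

```python
def _build_worker_command(worker_gpu, worker_cpu, worker_mem, worker_nthreads=None):
    # Mirrors the ecs.py snippet above, but only appends --death-timeout for the
    # plain dask-worker CLI (assumption: dask-cuda-worker rejects that option).
    command = [
        "dask-cuda-worker" if worker_gpu else "dask-worker",
        "--nthreads",
        str(
            worker_nthreads
            if worker_nthreads is not None
            else max(int(worker_cpu / 1024), 1)
        ),
        "--memory-limit",
        "{}MB".format(int(worker_mem)),
    ]
    if not worker_gpu:
        command += ["--death-timeout", "60"]
    return command

# Example: a GPU worker with 4 vCPUs (4096 CPU units) and 16 GB of memory
print(_build_worker_command(worker_gpu=1, worker_cpu=4096, worker_mem=16384))
# -> ['dask-cuda-worker', '--nthreads', '4', '--memory-limit', '16384MB']
```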
Unfortunately I'm running this from Prefect, so I can't pin dask-cuda and distributed to a version old enough to still accept this argument. When I do pin the scheduler/worker container to an older version, the newer distributed on the Prefect agent container doesn't play nicely with the scheduler/worker, and the agent logs this error:
```
2022-10-13 21:25:50,708 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/protocol/core.py", line 158, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 205, in msgpack._cmsgpack.unpackb
ValueError: Unpack failed: incomplete input
21:22:17.913 | INFO | prefect.task_runner.dask - Creating a new Dask cluster with `__prefect_loader__.<lambda>`
```
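A quick way to double-check that this really is a version mismatch (rather than anything ECS-specific) would be something like the sketch below, run from the Prefect agent container. The scheduler address is made up; `Client.get_versions(check=True)` is the stock distributed check that compares package versions across client, scheduler and workers.

```python
from distributed import Client

client = Client("tcp://my-scheduler:8786")  # hypothetical scheduler address
# Flags mismatched packages (distributed, msgpack, ...) between the client,
# the scheduler and the workers.
client.get_versions(check=True)
```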
Minimal Complete Verifiable Example:
Run a Docker image with the following, and you should see the error.
I would say this is an important feature. One of dask-cloudprovider's goals is to fail cheaply, so if a worker cannot connect to a scheduler within a timeout it should shut down/terminate to save money.
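For context, that fail-cheaply behaviour is exactly what `--death-timeout` provides. A rough sketch with the Python worker API, using a deliberately unreachable scheduler address:

```python
import asyncio
from distributed import Worker

async def main():
    try:
        # With death_timeout set, a worker that cannot reach its scheduler gives
        # up after ~60 seconds instead of running (and costing money) indefinitely.
        async with Worker("tcp://10.255.255.1:8786", death_timeout=60):
            pass
    except Exception as exc:
        print(f"worker gave up: {exc!r}")

asyncio.run(main())
```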
Anything else we need to know?:
Environment: