
docker run --gpus=all will fail as nvidia-smi not available after bootstrap #393

Open
rivershah opened this issue Dec 1, 2022 · 2 comments
Labels
bug Something isn't working provider/gcp/vm Cluster provider for GCP Instances

Comments

@rivershah

During cluster bootstrap, the NVIDIA drivers are installed but are not usable because the kernel modules are never loaded. It appears a reboot is required before nvidia-smi becomes available. Because the driver is not loaded, the command below fails:

docker run --net=host --gpus=all ...

Reproducer:

from dask_cloudprovider.gcp import GCPCluster


def test_dask_gcp_cluster_gpu():
    # Fails during bootstrap: the worker container cannot start because
    # nvidia-smi is unavailable until the host is rebooted.
    cluster = GCPCluster(
        machine_type="n1-standard-8",
        n_workers=1,
        filesystem_size=100,
        gpu_type="nvidia-tesla-t4",
        ngpus=1,
    )
    cluster.close()

cloud-init-output.log

Status: Downloaded newer image for daskdev/dask:latest
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.

If GPUs are requested, the default image should ship with the drivers already installed and usable; alternatively, the bootstrap should load the NVIDIA driver right after installing it, so that no reboot is required.
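One possible workaround (an untested assumption, not something from this thread): load the kernel modules explicitly at the end of the bootstrap instead of rebooting, for example as an extra bootstrap step:

```shell
# Hypothetical bootstrap fragment: load the NVIDIA kernel modules so that
# nvidia-smi works without a reboot. Assumes the driver packages installed
# during bootstrap include modules built for the running kernel.
sudo modprobe nvidia      # core driver
sudo modprobe nvidia_uvm  # unified memory module, needed by CUDA workloads
nvidia-smi                # should now list the GPUs if the modules loaded
```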

Environment:

  • Dask version: 2022.9.2
  • Python version: 3.10
  • Operating System: ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20201014
  • Install method (conda, pip, source): pip
@jacobtomlinson jacobtomlinson added bug Something isn't working provider/gcp/vm Cluster provider for GCP Instances labels Jan 13, 2023
@siddharthab

The mandatory presence of the --gpus=all flag is also a problem when using Container-Optimized OS (COS). I can run GPU examples in the Ubuntu-based CUDA Docker images by following the instructions at https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#e2e, but the --gpus=all flag is neither needed nor functional when using nvidia-container-runtime.

These are the kwargs that would make COS work, if the --gpus=all flag were not hard-coded.

cos_args = {
    # Use COS image with an LTS milestone.
    # https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#requirements
    "source_image": "projects/cos-cloud/global/images/cos-101-lts",
    # https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#installing_drivers_through_cloud-init
    # This step takes ~2 minutes.
    "extra_bootstrap": [
        "cos-extensions install gpu",
        "mount --bind /var/lib/nvidia /var/lib/nvidia",
        "mount -o remount,exec /var/lib/nvidia",
    ],
    "docker_args": " ".join(
        [
            "--volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64",
            "--volume /var/lib/nvidia/bin:/usr/local/nvidia/bin",
            "--device /dev/nvidia0:/dev/nvidia0",
            "--device /dev/nvidia-uvm:/dev/nvidia-uvm",
            "--device /dev/nvidiactl:/dev/nvidiactl",
        ]
    ),
    "bootstrap": False,
}
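For completeness, a hypothetical sketch (my assumption, not confirmed by the maintainers) of how these kwargs would be merged with the usual cluster options; a caller would then do GCPCluster(**cluster_kwargs), which only works if dask-cloudprovider stops forcing --gpus=all:

```python
# Hypothetical sketch: merge the COS-specific kwargs with typical
# GCPCluster options. GCPCluster itself is not imported here; the point
# is only the final keyword set a caller would pass.
cos_args = {
    "source_image": "projects/cos-cloud/global/images/cos-101-lts",
    "extra_bootstrap": [
        "cos-extensions install gpu",
        "mount --bind /var/lib/nvidia /var/lib/nvidia",
        "mount -o remount,exec /var/lib/nvidia",
    ],
    "docker_args": " ".join(
        [
            "--volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64",
            "--volume /var/lib/nvidia/bin:/usr/local/nvidia/bin",
            "--device /dev/nvidia0:/dev/nvidia0",
            "--device /dev/nvidia-uvm:/dev/nvidia-uvm",
            "--device /dev/nvidiactl:/dev/nvidiactl",
        ]
    ),
    "bootstrap": False,  # skip the Ubuntu driver bootstrap entirely
}

cluster_kwargs = {
    "machine_type": "n1-standard-8",
    "n_workers": 1,
    "gpu_type": "nvidia-tesla-t4",
    "ngpus": 1,
    **cos_args,
}
print(sorted(cluster_kwargs))
```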

@jacobtomlinson
Member

@siddharthab only Ubuntu is supported currently in dask-cloudprovider.
