OCI runtime error: crun: error executing hook using podman --userns keep-id #46
Comments
Thanks for the detailed report. The final error that I see:
indicates that the original hook is still being detected and injected. When using CDI it's important that this is not the case. Please remove the installed hook and repeat the run. With regards to the
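If it helps, the directories that Podman scans for OCI hooks can be listed directly to confirm whether the legacy NVIDIA hook is still installed; the paths below are the usual defaults and may differ by distribution:
# Show any OCI hook definitions Podman would pick up from the default hook directories
ls -l /usr/share/containers/oci/hooks.d /etc/containers/oci/hooks.d
# The legacy hook, if present, is typically a file named oci-nvidia-hook.json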
Thanks, looks like it was left behind during the attempts to get this working with the various package versions. It would be a nice inclusion to remove this hook (or alert the user of the duality) during an update.
Attempt 4: Pass (missing selinux modules)
cd /usr/share/containers/oci/hooks.d && sudo rm oci-nvidia-hook.json
podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
return get_device_properties(device).name
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
For /dev/dri I have various /dev/dri/cardX owned by root, within the video group, and /dev/dri/renderXXXX with root owner and render group. The /dev/dri by-path entries are all root. I am happy to assist with release testing.
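For reference, the ownership and group assignments described above can be checked with a plain directory listing (nothing specific to this issue is assumed here):
# Show owners and groups of the DRI device nodes and the by-path symlinks
ls -l /dev/dri /dev/dri/by-path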
I have looked into the issue with the
Would you be able to repeat your experiments with a CDI spec generated from the current HEAD? Also, with regards to:
What do you mean by missing selinux modules? Any information you could provide here as to how you are able to work around this would be much appreciated.
I checked out HEAD but have run into trouble getting a build to work correctly with podman and podman-docker as the runner. Currently I do not have a native docker install on my dev machine and have not dug into some of the issues of running both alongside each other as specified here. I am assuming you currently run the build script with docker as the runner?
With the latest version of Podman and the NVIDIA Container Toolkit 1.13.1 this now runs just fine on my machine:
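The exact command is not quoted in this comment, but presumably it is the same invocation used earlier in the thread, i.e. something like:
podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id \
    docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"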
I had this issue as well on Fedora Kinoite (Silverblue). @emanuelbuholzer's command did not work for me immediately; I kept getting:
I could not find the actual name of the GPU's device file, but I did figure out that I could use
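The selector referred to here is cut off; two options the NVIDIA Container Toolkit provides are listing the generated CDI device names, or requesting the catch-all device name (shown as a sketch, assuming a spec already exists under /etc/cdi):
# List the CDI device names known on this host (e.g. nvidia.com/gpu=0, nvidia.com/gpu=all)
nvidia-ctk cdi list
# Request every GPU without knowing an individual device name
podman run --rm --device nvidia.com/gpu=all --userns keep-id \
    docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"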
According to NVIDIA's documentation this is because CDI and the nvidia-ctk runtime hook are incompatible. To disable it I added
To fix this I had to disable SELinux (or lower security settings, something like that) with
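The exact command is cut off above; two common ways to achieve this, shown here only as a sketch and not a recommendation, are relaxing labelling for a single container or temporarily switching SELinux to permissive mode host-wide:
# Per-container: disable SELinux label separation for this one run
podman run --rm --security-opt label=disable --device nvidia.com/gpu=all \
    docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
# Host-wide and temporary: switch SELinux to permissive mode until the next boot
sudo setenforce 0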
Versions:
I am attempting to get containers to run with access to the GPU with rootless podman and the --userns keep-id flag. My current steps include:
Generating the CDI spec via:
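The command itself is not preserved above; the standard invocation documented for the NVIDIA Container Toolkit is along these lines:
# Generate a CDI specification describing the installed GPUs and driver files
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml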
Attempt 1: Fails
I then removed references to the following in the devices section of the /etc/cdi/nvidia.yaml:
and removed the create symlink hooks in the devices section.
Finally, I also removed the nvidia-ctk hook that changes the permissions of the /dev/dri path.
- args:
  - nvidia-ctk
  - hook
  - chmod
  - --mode
  - "755"
  - --path
  - /dev/dri
  hookName: createContainer
  path: /usr/bin/nvidia-ctk
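If this hook is removed from the spec, a rough manual stand-in (assuming the same mode is wanted on the host device directory) would be:
# Apply the mode the removed hook would have set on /dev/dri
sudo chmod 755 /dev/dri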
Attempt 2: Pass (missing selinux modules)
I am not concerned about this error; I believe I just need to amend some policy modules, as specified here.
However, if I attempt to run the above with the --userns keep-id flag:
Attempt 3: Fail
I have also tried different combinations of the load-kmods and no-cgroups flags in /etc/nvidia-container-runtime/config.toml.
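For reference, both settings live under the nvidia-container-cli section of that file; the values below are just one of the combinations that might have been tried, not a recommendation:
# /etc/nvidia-container-runtime/config.toml (excerpt)
[nvidia-container-cli]
# Whether the CLI should attempt to load the NVIDIA kernel modules itself
load-kmods = true
# Often set to true for rootless setups, where the CLI cannot manage cgroups
no-cgroups = true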
A lot of this troubleshooting has been guided by the following links.
I am unsure of the lifecycle of the permissions when running these hooks; however, it looks like the first place where the mapped permissions may not add up is here.