Recently support was added for criu to checkpoint cuda applications. I've tested this on plain old processes and it seems to work as advertised.
I wanted to try this with runc/containers as well, so I created a container using the nvidia-container-runtime shim, which seems to just modify the config.json, adding a prestart hook that does the heavy lifting. After that I can call runc start and invoke my test cuda application just fine. However, when I try to take a snapshot, runc checkpoint just hangs, regardless of whether the cuda application is even running. Looking at the dump.log, I can see that criu errored out. Here are the last few lines; the full dump is attached below:
(07.507264) Error (criu/mount.c:1088): mnt: Mount 251 ./proc/driver/nvidia/gpus/0000:00:04.0 (master_id: 5 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
(07.507282) net: Unlock network
(07.507285) Running network-unlock scripts
(07.507287) RPC
(07.519624) cuda_plugin: finished cuda_plugin stage 0 err -1
(10.996267) cuda_plugin: resuming devices on pid 404642
(10.996295) cuda_plugin: Restore thread pid 404694 found for real pid 404642
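To see which mounts the error is talking about, one option is to look at the propagation tags in the container's mountinfo. The sketch below lists every mount that carries a master:N tag, i.e. is a slave of some peer group. It assumes the mount ID printed by criu (251 above) matches the first column of /proc/<pid>/mountinfo; the function name slave_mounts is made up for this example.

```shell
# List mounts that are slaves of a peer group (they carry a "master:N" tag).
# Prints: <mount id> <mount point> <master tag>
# Usage: slave_mounts <mountinfo-file>
slave_mounts() {
    awk '{
        # Optional fields start at column 7 and end at the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++)
            if ($i ~ /^master:/) print $1, $5, $i
    }' "$1"
}
```

Run it as slave_mounts /proc/<container-init-pid>/mountinfo, where the pid is a placeholder for the container's init process. A mount that shows master:5 while no other mount in the same file shows shared:5 would be consistent with the "unreachable sharing" complaint.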
I'm happy to keep digging and see if I can find a way to try the equivalent of --enable-external-masters with criu rpc, but I wanted to file this issue in case someone more experienced has any pointers. I'm specifically wondering whether the external masters thing is just a rabbit hole or not. As far as I can see, it's not easy to just 'try' this flag over RPC, but if it solves the problem I'm happy to submit a PR.
Please advise, thanks!
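One possible way to try the flag without changing runc is a CRIU configuration file. This is a sketch under two assumptions that should be checked against the runc and criu versions in use: that runc points criu at /etc/criu/runc.conf by default, and that criu config files take one option per line without the leading dashes. Whether the flag actually fixes this dump is untested. The helper name add_criu_option is made up for this example.

```shell
# Append a CRIU option (written without the leading "--") to a config file.
# Usage: add_criu_option <config-file> <option>
add_criu_option() {
    conf=$1
    opt=$2
    mkdir -p "$(dirname "$conf")"
    # Only append if the option is not already present as a whole line.
    grep -qxF "$opt" "$conf" 2>/dev/null || printf '%s\n' "$opt" >> "$conf"
}
```

For example, as root: add_criu_option /etc/criu/runc.conf enable-external-masters, then rerun runc checkpoint and compare the dump.log.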
Steps to reproduce the issue
sudo nvidia-container-runtime create test
sudo runc run test
sudo runc checkpoint --image-path ./dump --work-path ./workdir/ --leave-running=false test
Without looking too deep into this, it looks like an issue with nvidia-container-runtime which creates a bind mount from the host instead of properly configuring device access. I see that nvidia-container-runtime has been deprecated in favor of https://github.com/NVIDIA/nvidia-container-toolkit -- maybe it does things differently?
Any progress on this? I encountered the same issue. If this is caused by nvidia-container-runtime, maybe we should raise an issue on nvidia-container-toolkit.
Steps to reproduce the issue
I've attached the config.json.
config.json
Describe the results you received and expected
runc checkpoint just hangs, but if you take a look at the dump.log you can see that criu errored out. Dump attached: dump.log
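Since runc itself gives no feedback here, a small helper to pull only the error lines out of the criu log can save some scrolling. This is a minimal sketch that assumes the log lines look like the excerpt in this issue, i.e. a timestamp in parentheses followed by "Error ("; the function name criu_errors is made up for this example.

```shell
# Print only the error lines from a CRIU log, prefixed with line numbers.
# Usage: criu_errors <path-to-log>
criu_errors() {
    grep -n -E '^\([0-9.]+\) Error \(' "$1"
}
```

With the checkpoint command from this issue the log should end up in the work path, so the call would be criu_errors ./workdir/dump.log. The function returns a non-zero status when no error lines are found.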
What version of runc are you using?
runc --version
runc version 1.2.1+dev
commit: v1.2.1-4-g2327ec22
spec: 1.2.0
go: go1.23.3
libseccomp: 2.5.5
criu --version
Version: 4.0
Host OS information
cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
Host kernel information
uname -a
Linux geoff-dev-testing 6.8.0-1015-gcp #17-Ubuntu SMP Mon Sep 2 17:57:02 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux