[BUG] PIDs are scrambled and No Such Process is printed since update to NVIDIA drivers #75
Comments
Hi @marcreichman-pfi, thanks for raising this. I have encountered the same issue before. I think this would be a bug in the upstream nvidia-ml-py bindings:

In [1]: import pynvml

In [2]: pynvml.nvmlInit()

In [3]: handle = pynvml.nvmlDeviceGetHandleByIndex(0)

In [4]: [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)]
Out[4]:
[1184, 0, 4294967295, 4294967295,
 16040, 0, 4294967295, 4294967295,
 19984, 0, 4294967295, 4294967295,
 20884, 0, 4294967295, 4294967295,
 26308, 0, 4294967295, 4294967295,
 16336, 0, 4294967295, 4294967295,
 5368, 0, 4294967295, 4294967295,
 19828, 0, 4294967295]

I haven't found a solution for this yet. This may be due to an internal API change in the NVML library. We may need to wait for the next nvidia-ml-py release. As a temporary workaround, you could downgrade your NVIDIA driver version.
Hi @XuehaiPan and thanks for your response and excellent tool! We cannot downgrade because we need newer CUDA version support, so for now we'll just have to wait for an updated version with the NVML library fix.
Hi @marcreichman-pfi, a new release of nvidia-ml-py is out. You can upgrade it with:

python3 -m pip install --upgrade nvidia-ml-py

This would resolve the unrecognized PIDs with CUDA 12 drivers. I would also make a new release of nvitop.
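As a quick check after upgrading, you can print the installed binding version and re-run the snippet from above. A minimal sketch, assuming the package was installed under the same distribution name (nvidia-ml-py) used in the pip command:

from importlib.metadata import version

import pynvml

# Distribution name taken from the pip command above.
print("nvidia-ml-py", version("nvidia-ml-py"))

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    pids = [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)]
    # After the fix, the 0 / 4294967295 placeholders should no longer appear.
    print(pids)
finally:
    pynvml.nvmlShutdown()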
Thanks @XuehaiPan - is there a way to do this in the docker version?
@marcreichman-pfi You could upgrade nvidia-ml-py inside the Docker image.
Thanks, this did the trick! Here was what I did from your suggestion:
Required prerequisites
What version of nvitop are you using?
git hash 4093334972a334e9057f5acf7661a2c1a96bd021
Operating system and version
Docker image (under a CentOS 7 host)
NVIDIA driver version
535.54.03
Python environment
This is the docker version from the latest git head (6/20/2023)
Problem description
The output shows scrambled PIDs for processes after the initial process in the list for each card, and then shows No Such Process for the wrong PIDs. This only started after the driver update, so I assume something changed in the NVIDIA drivers.

Steps to Reproduce
The Python snippets (if any):
Command lines:
Traceback
No response
Logs
Expected behavior
Prior to the driver update, the information was present for the same PIDs included in nvidia-smi, but with the full command lines and the per-process resource statistics (e.g. GPU PID USER GPU-MEM %SM %CPU %MEM TIME). Now it seems to be having an issue parsing proper PIDs from the NVIDIA libraries, and then failing downstream from there.

Additional context
I'm not much of a Python programmer, unfortunately, so I'm not clear where to dig in, but I'd assume the issue is somewhere in the area of receiving the process list for the cards and deciphering the PIDs. Presumably something changed in the driver, or in some structure or class, such that the parsing code has broken somewhere.
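For context on the No Such Process text: a minimal sketch (describe_pid is a made-up helper, not nvitop's actual code) of how a bogus PID from the NVML bindings can end up rendered that way when the host process lookup, e.g. via psutil, fails:

import psutil

def describe_pid(pid):
    # Hypothetical helper: map an NVML-reported PID to its command line,
    # falling back to a placeholder when no such process exists on the host.
    try:
        return " ".join(psutil.Process(pid).cmdline())
    except psutil.NoSuchProcess:
        return "No Such Process"

print(describe_pid(4294967295))  # misparsed sentinel PID -> "No Such Process"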