Fix GetComputeRunningProcesses on CUDA 10.x #25

XuehaiPan · 2021-08-12T15:59:58Z

The definition of struct nvmlProcessInfo_st has introduce two new fields gpuInstanceId and computeInstanceId in newer NVIDIA drivers (CUDA 11.x). And this changes will break the backward compatibility for old NVIDIA drivers (CUDA 10.x).

This PR fixes issue #21 for nvmlDeviceGetComputeRunningProcesses nvmlDeviceGetGraphicsRunningProcesses on CUDA 10.x caused by the data structure size change. It simply calls the v1 version function (feed with v1 data structure) and convert the results into v2 ones on CUDA 10.x. And for CUDA 11.x, call the v2 version directly.

There is another solution (PR #22) that calls the v1 version (but feed with v2 data structure) and correct the results with some adjustments on CUDA 10.x. As commented in #21 (comment), this implementation should cover a large number of configuration situations (different CPU bit width / CPU arch / complie options, etc.).

The results on CUDA 10.x

$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[1] 5404
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[2] 5456
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[3] 5481
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[4] 5504
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[5] 5529

$ go run go-nvml/examples/compute-processes
Found 5 processes on device 0
        [ 0] ProcessInfo: {Pid:5404 UsedGpuMemory:173015040 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
        [ 1] ProcessInfo: {Pid:5456 UsedGpuMemory:173015040 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
        [ 2] ProcessInfo: {Pid:5481 UsedGpuMemory:173015040 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
        [ 3] ProcessInfo: {Pid:5504 UsedGpuMemory:173015040 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
        [ 4] ProcessInfo: {Pid:5529 UsedGpuMemory:173015040 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 0 processes on device 1
Found 0 processes on device 2

Fixes #21
Closes #22

elezar

Thanks a lot for the submission @XuehaiPan. This is definitely more robust than what I had implemented.

I have left some comments below. There are one or two that need to be addressed before we can merge this.

gen/nvml/init.go

gen/nvml/device.go

elezar

Looks good @XuehaiPan.

Thanks for the work on this!

klueska · 2021-08-13T09:06:35Z

I think this is fine to unblock you for now, but I wouldn't spend too much time "perfecting" it as we will need tocome up with a more general solution in the near future.

I see the "real" solution as:

Generate a types.go for each new version of NVML we support and put it in its own package under e.g. pkg/types/v1.11.1.
For any type we care about making sure is usable in an array context, we need to have a helper function that:
a. Dynamically checks for the current version of NVML on the host.
b. Passes the correct (versioned) type to the underyling call that returns the array
c. Walks the array and copies each entry into the "latest" type (i.e. the one that may have extended fields for newer versions)
d. Returns an array of the latest type

elezar · 2021-08-13T09:37:40Z

Thanks for the input @klueska. It was also pointed out to me that more recent versions of the nvml.h file has an explicit _v1 type. As such the current workaround should be simpler then, but we would still have to ensure that the returned structs have the right types.

Fix GetComputeRunningProcesses on CUDA 10.x

c1aaa3d

elezar requested changes Aug 13, 2021

View reviewed changes

gen/nvml/init.go Show resolved Hide resolved

gen/nvml/init.go Show resolved Hide resolved

gen/nvml/device.go Show resolved Hide resolved

gen/nvml/device.go Outdated Show resolved Hide resolved

gen/nvml/device.go Outdated Show resolved Hide resolved

Split functions and improve readability

9eb2e80

XuehaiPan requested a review from elezar August 13, 2021 08:57

elezar approved these changes Aug 13, 2021

View reviewed changes

elezar merged commit 053ac68 into NVIDIA:master Aug 13, 2021

XuehaiPan deleted the fix-get-running-processes branch August 13, 2021 11:03

XuehaiPan mentioned this pull request Apr 16, 2023

Add GPU stats features giampaolo/psutil#526

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix GetComputeRunningProcesses on CUDA 10.x #25

Fix GetComputeRunningProcesses on CUDA 10.x #25

XuehaiPan commented Aug 12, 2021 •

edited

Loading

elezar left a comment

elezar left a comment

klueska commented Aug 13, 2021

elezar commented Aug 13, 2021

Fix GetComputeRunningProcesses on CUDA 10.x #25

Fix GetComputeRunningProcesses on CUDA 10.x #25

Conversation

XuehaiPan commented Aug 12, 2021 • edited Loading

elezar left a comment

Choose a reason for hiding this comment

elezar left a comment

Choose a reason for hiding this comment

klueska commented Aug 13, 2021

elezar commented Aug 13, 2021

XuehaiPan commented Aug 12, 2021 •

edited

Loading