Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GetComputeRunningProcesses on CUDA 10.x #25

Merged
merged 2 commits into from
Aug 13, 2021

Conversation

XuehaiPan
Copy link
Contributor

@XuehaiPan XuehaiPan commented Aug 12, 2021

The definition of struct nvmlProcessInfo_st has introduce two new fields gpuInstanceId and computeInstanceId in newer NVIDIA drivers (CUDA 11.x). And this changes will break the backward compatibility for old NVIDIA drivers (CUDA 10.x).

This PR fixes issue #21 for nvmlDeviceGetComputeRunningProcesses nvmlDeviceGetGraphicsRunningProcesses on CUDA 10.x caused by the data structure size change. It simply calls the v1 version function (feed with v1 data structure) and convert the results into v2 ones on CUDA 10.x. And for CUDA 11.x, call the v2 version directly.

There is another solution (PR #22) that calls the v1 version (but feed with v2 data structure) and correct the results with some adjustments on CUDA 10.x. As commented in #21 (comment), this implementation should cover a large number of configuration situations (different CPU bit width / CPU arch / complie options, etc.).


The results on CUDA 10.x

$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[1] 5404
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[2] 5456
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[3] 5481
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[4] 5504
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[5] 5529

$ go run go-nvml/examples/compute-processes
Found 5 processes on device 0
        [ 0] ProcessInfo: {Pid:5404 UsedGpuMemory:173015040 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
        [ 1] ProcessInfo: {Pid:5456 UsedGpuMemory:173015040 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
        [ 2] ProcessInfo: {Pid:5481 UsedGpuMemory:173015040 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
        [ 3] ProcessInfo: {Pid:5504 UsedGpuMemory:173015040 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
        [ 4] ProcessInfo: {Pid:5529 UsedGpuMemory:173015040 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 0 processes on device 1
Found 0 processes on device 2

Fixes #21
Closes #22

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the submission @XuehaiPan. This is definitely more robust than what I had implemented.

I have left some comments below. There are one or two that need to be addressed before we can merge this.

gen/nvml/init.go Show resolved Hide resolved
gen/nvml/init.go Show resolved Hide resolved
gen/nvml/device.go Show resolved Hide resolved
gen/nvml/device.go Outdated Show resolved Hide resolved
gen/nvml/device.go Outdated Show resolved Hide resolved
@XuehaiPan XuehaiPan requested a review from elezar August 13, 2021 08:57
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @XuehaiPan.

Thanks for the work on this!

@elezar elezar merged commit 053ac68 into NVIDIA:master Aug 13, 2021
@klueska
Copy link
Contributor

klueska commented Aug 13, 2021

I think this is fine to unblock you for now, but I wouldn't spend too much time "perfecting" it as we will need tocome up with a more general solution in the near future.

I see the "real" solution as:

  1. Generate a types.go for each new version of NVML we support and put it in its own package under e.g. pkg/types/v1.11.1.
  2. For any type we care about making sure is usable in an array context, we need to have a helper function that:
    a. Dynamically checks for the current version of NVML on the host.
    b. Passes the correct (versioned) type to the underyling call that returns the array
    c. Walks the array and copies each entry into the "latest" type (i.e. the one that may have extended fields for newer versions)
    d. Returns an array of the latest type

@elezar
Copy link
Member

elezar commented Aug 13, 2021

Thanks for the input @klueska. It was also pointed out to me that more recent versions of the nvml.h file has an explicit _v1 type. As such the current workaround should be simpler then, but we would still have to ensure that the returned structs have the right types.

@XuehaiPan XuehaiPan deleted the fix-get-running-processes branch August 13, 2021 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong output of device.GetComputeRunningProcesses() given multiple processes
3 participants