fix(api/libnvml): fix process info support for NVIDIA R535 driver (CUDA 12.2+) #79

Merged
merged 9 commits into from
Jul 16, 2023

Conversation

XuehaiPan
Owner

Issue Type

  • Bug fix

Description

Starting with the NVIDIA R510 driver, new version 3 APIs were added for nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses. However, the version 3 functions still used the version 2 struct type as the function argument type:

class c_nvmlProcessInfo_v2_t(pynvml._PrintableStructure):
    _fields_ = [
        ('pid', ctypes.c_uint),
        ('usedGpuMemory', ctypes.c_ulonglong),
        ('gpuInstanceId', ctypes.c_uint),
        ('computeInstanceId', ctypes.c_uint),
    ]
    _fmt_ = {
        'usedGpuMemory': '%d B',
    }

Recently, the NVIDIA R535 driver came out. With it, the version 3 APIs start to use the new version 3 struct type without a version bump in the function names. Callers that still pass a buffer of version 2 structs therefore hit invalid memory access and get wrong results.

class c_nvmlProcessInfo_v3_t(pynvml._PrintableStructure):
    _fields_ = [
        ('pid', ctypes.c_uint),
        ('usedGpuMemory', ctypes.c_ulonglong),
        ('gpuInstanceId', ctypes.c_uint),
        ('computeInstanceId', ctypes.c_uint),
        ('usedGpuCcProtectedMemory', ctypes.c_ulonglong),
    ]
    _fmt_ = {
        'usedGpuMemory': '%d B',
        'usedGpuCcProtectedMemory': '%d B',
    }

The two struct types have different sizes:

>>> ctypes.sizeof(libnvml.c_nvmlProcessInfo_v2_t)
24
>>> ctypes.sizeof(libnvml.c_nvmlProcessInfo_v3_t)
32
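To illustrate why the size difference matters, here is a minimal, self-contained sketch that mirrors the two layouts with plain `ctypes.Structure` classes (the names `ProcessInfoV2` and `ProcessInfoV3` are hypothetical stand-ins, not the classes from this PR). It fills a buffer with version 3 records, as an R535 driver would, and then reads it back through the version 2 layout:

```python
import ctypes


class ProcessInfoV2(ctypes.Structure):
    # Hypothetical stand-in mirroring the fields of c_nvmlProcessInfo_v2_t.
    _fields_ = [
        ('pid', ctypes.c_uint),
        ('usedGpuMemory', ctypes.c_ulonglong),
        ('gpuInstanceId', ctypes.c_uint),
        ('computeInstanceId', ctypes.c_uint),
    ]


class ProcessInfoV3(ctypes.Structure):
    # Hypothetical stand-in mirroring the fields of c_nvmlProcessInfo_v3_t.
    _fields_ = [
        ('pid', ctypes.c_uint),
        ('usedGpuMemory', ctypes.c_ulonglong),
        ('gpuInstanceId', ctypes.c_uint),
        ('computeInstanceId', ctypes.c_uint),
        ('usedGpuCcProtectedMemory', ctypes.c_ulonglong),
    ]


# The version 3 record is one c_ulonglong (8 bytes) larger.
size_difference = ctypes.sizeof(ProcessInfoV3) - ctypes.sizeof(ProcessInfoV2)

# Simulate a driver that writes two version 3 records into a buffer.
v3_records = (ProcessInfoV3 * 2)()
v3_records[0].pid = 1000
v3_records[1].pid = 2000

# Reinterpret the same bytes with the version 2 layout (shares memory).
v2_view = (ProcessInfoV2 * 2).from_buffer(v3_records)

# The first record lines up, but the second one starts at the wrong offset,
# so its pid is read from the middle of the first version 3 record.
first_pid = v2_view[0].pid
second_pid = v2_view[1].pid
```

The sketch only shows the misread on the caller's side. In the real failure the buffer is allocated with the smaller version 2 stride, so the driver also writes past the end of it.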

This PR adds a helper function that determines the API version and type struct version of nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses on the first API call.

Motivation and Context

Fixes #75
Fixes #76

@XuehaiPan added the following labels on Jul 14, 2023:

  • bug: Something isn't working
  • enhancement: New feature or request
  • upstream: Something upstream related
  • pynvml: Something related to the `nvidia-ml-py` package
  • api: Something related to the core APIs

@XuehaiPan self-assigned this on Jul 14, 2023
@XuehaiPan changed the title from "fix(api/libnvml): fix process info support for NVIDIA R535 driver" to "fix(api/libnvml): fix process info support for NVIDIA R535 driver (CUDA 12.2+)" on Jul 16, 2023
@XuehaiPan merged commit c3487c0 into main on Jul 16, 2023
@XuehaiPan deleted the fix-r535-driver branch on July 16, 2023 16:22