
[Enhancement] Backward compatible NVML Python bindings #29

Closed
XuehaiPan opened this issue Jul 23, 2022 · 1 comment · Fixed by #30


XuehaiPan commented Jul 23, 2022

Runtime Environment

  • Operating system and version: Ubuntu 20.04 LTS
  • Terminal emulator and version: GNOME Terminal 3.36.2
  • Python version: 3.9.13
  • NVML version (driver version): 470.129.06
  • nvitop version or commit: v0.7.1
  • nvidia-ml-py version: 11.450.51
  • Locale: en_US.UTF-8

Context

The official NVML Python bindings (PyPI package nvidia-ml-py) do not guarantee backward compatibility across different NVIDIA drivers. For example, NVML added nvmlDeviceGetComputeRunningProcesses_v2 and nvmlDeviceGetGraphicsRunningProcesses_v2 in the CUDA 11.x drivers (R450+), but the package nvidia-ml-py unconditionally calls the latest version of the function from the unversioned one:

def nvmlDeviceGetComputeRunningProcesses_v2(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
    ret = fn(handle, byref(c_count), None)

    ...

def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v2(handle)

This causes an NVMLError_FunctionNotFound error on CUDA 10.x drivers (e.g., R430).
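
Below is a minimal caller-side sketch (hypothetical code, not part of nvidia-ml-py or nvitop) showing how the incompatibility surfaces on an old driver and the kind of guard every caller currently has to write:

import pynvml  # the module shipped by the nvidia-ml-py package

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        # With nvidia-ml-py>=11.450.51, this forwards to the _v2 (or _v3) binding.
        processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    except pynvml.NVMLError_FunctionNotFound:
        # An R430 (CUDA 10.x) driver does not export
        # nvmlDeviceGetComputeRunningProcesses_v2, so the call fails even
        # though the unversioned C API is available in the driver.
        processes = []
finally:
    pynvml.nvmlShutdown()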

Now there are v3 versions of the nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses functions, which come with the R510+ drivers. E.g., in nvidia-ml-py==11.515.48:

def nvmlDeviceGetComputeRunningProcesses_v3(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
    ret = fn(handle, byref(c_count), None)

    ...

def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v3(handle)

The v2 version of the memory info struct, c_nvmlMemory_v2_t, is appearing on the horizon (not found in the R510 driver yet). This is the cause of issue #13.

class c_nvmlMemory_t(_PrintableStructure):
    _fields_ = [
        ('total', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

class c_nvmlMemory_v2_t(_PrintableStructure):
    _fields_ = [
        ('version', c_uint),
        ('total', c_ulonglong),
        ('reserved', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

nvmlMemory_v2 = 0x02000028
def nvmlDeviceGetMemoryInfo(handle, version=None):
    if not version:
        c_memory = c_nvmlMemory_t()
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
    else:
        c_memory = c_nvmlMemory_v2_t()
        c_memory.version = version
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
    ret = fn(handle, byref(c_memory))
    _nvmlCheckReturn(ret)
    return c_memory
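
As a caller-side workaround, here is a hedged sketch (a hypothetical get_memory_info helper, not an API of nvidia-ml-py or nvitop) that prefers the v2 struct when both the bindings and the driver support it and falls back to the legacy struct otherwise:

import pynvml

def get_memory_info(handle):
    # Prefer the v2 struct, which adds the 'reserved' field, when the
    # installed bindings define the v2 layout.
    if hasattr(pynvml, 'nvmlMemory_v2'):
        try:
            return pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)
        except pynvml.NVMLError_FunctionNotFound:
            # The driver does not export nvmlDeviceGetMemoryInfo_v2;
            # fall back to the legacy c_nvmlMemory_t query below.
            pass
    return pynvml.nvmlDeviceGetMemoryInfo(handle)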

Possible Solutions

  1. Determine the best dependency version of nvidia-ml-py during installation.

    This requires the user to install the NVIDIA driver first, which may not be the case on a freshly installed system. Besides, it is hard to express such a driver dependency in the package metadata.

  2. Wait for the PyPI package nvidia-ml-py to become backward compatible.

    The package NVIDIA/go-nvml offers backward compatible APIs:

    The API is designed to be backwards compatible, so the latest bindings should work with any version of libnvidia-ml.so installed on your system.

    I posted this on the NVIDIA Developer Forums ([PyPI/nvidia-ml-py] Issue Reports for nvidia-ml-py) but have not received an official response yet.

  3. Vendor nvidia-ml-py in nvitop. (Note: nvidia-ml-py is released under the BSD License.)

    This requires bumping the vendored version and making a minor release of nvitop each time a new version of nvidia-ml-py comes out.

  4. Automatically patch the pynvml module when the first call to a versioned API fails. This can be achieved by manipulating the module's __dict__ attribute or its __class__ attribute (see the sketch after this list).

    The goal of this solution is not to provide fully backward-compatible Python bindings. That may be out of the scope of nvitop, e.g., renames such as ExcludedDeviceInfo -> BlacklistDeviceInfo. Also, note that this solution may introduce a performance cost due to a deeper call stack.
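
A minimal sketch of solution 4, simplified relative to anything nvitop would actually ship (the patch_versioned_api helper below is hypothetical): on first use, probe the versioned bindings from newest to oldest and cache the first one the installed driver supports by rebinding the unversioned name in pynvml's module __dict__.

import functools

import pynvml

def patch_versioned_api(unversioned_name, versioned_names):
    # Rebind pynvml.<unversioned_name> to a lazy fallback that probes the
    # versioned bindings (newest first) and caches the first one that works
    # against the installed driver.
    @functools.wraps(pynvml.__dict__[unversioned_name])
    def fallback(handle):
        for name in versioned_names:
            fn = getattr(pynvml, name, None)
            if fn is None:
                continue  # binding not present in this nvidia-ml-py version
            try:
                result = fn(handle)
            except pynvml.NVMLError_FunctionNotFound:
                continue  # driver does not export this versioned symbol
            # Cache the working binding so later calls skip the probing.
            pynvml.__dict__[unversioned_name] = fn
            return result
        # A complete implementation would also fall back to the unversioned
        # C symbol through the raw library handle for pre-R450 drivers.
        raise pynvml.NVMLError(pynvml.NVML_ERROR_FUNCTION_NOT_FOUND)

    pynvml.__dict__[unversioned_name] = fallback

patch_versioned_api(
    'nvmlDeviceGetComputeRunningProcesses',
    ['nvmlDeviceGetComputeRunningProcesses_v3',
     'nvmlDeviceGetComputeRunningProcesses_v2'],
)

In this sketch, the probing cost is paid only once; subsequent calls go directly to the cached binding. The module.__class__ approach mentioned above would instead swap the module's class for a types.ModuleType subclass that overrides attribute lookup.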

@XuehaiPan self-assigned this Jul 23, 2022
@XuehaiPan added the enhancement, upstream, and pynvml labels Jul 23, 2022
@XuehaiPan added this to the v1.0.0 milestone Jul 23, 2022
@wookayin

This is great work. gpustat will have a conflicting dependency on nvidia-ml-py, as it is still pinned to older versions, so I will also have to catch up to make them compatible.
