Mismatch between GPU driver and CUDA toolkit version in the current environment breaks Numba GPU functionality #1281

Closed
mnc909 opened this issue Aug 8, 2023 · 4 comments
Labels
bug bug & failures with existing packages help wanted


mnc909 commented Aug 8, 2023

🐛 Bug

The current CUDA Toolkit version is 11.8, while the NVIDIA driver only supports CUDA 11.4. This is usually not a problem because of Minor Version Compatibility (MVC); however, Numba in particular doesn't support MVC yet, so all of Numba's CUDA functionality is broken in the current Docker image (it used to work before). Kernels fail with the error "[222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR", which is known to occur when a mismatch like this happens.

As Numba's CUDA wrapper is more or less the only sensible way of getting custom GPU algorithms to work together with PyTorch on Kaggle, this is quite unfortunate.
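
The version relationship described above can be sketched in a few lines. This is only an illustration of the general rule as I understand it (same major version with a newer toolkit minor version relies on MVC); `classify_cuda_versions` is a hypothetical helper, not part of Numba or the CUDA toolkit.

```python
def classify_cuda_versions(toolkit, driver):
    """Classify a (major, minor) toolkit version against the CUDA version
    the driver supports. Hypothetical helper for illustration only."""
    if toolkit <= driver:
        # The driver is at least as new as the toolkit: nothing special needed.
        return "ok"
    if toolkit[0] == driver[0]:
        # Same major version, newer toolkit minor: relies on Minor Version Compatibility.
        return "needs-mvc"
    # The toolkit's major version is newer than anything the driver supports.
    return "incompatible"


# The combination reported in this issue: toolkit 11.8, driver supporting 11.4.
print(classify_cuda_versions((11, 8), (11, 4)))
```

With the versions from this report, the helper lands in the "needs-mvc" branch, which is the case where software that JIT-links PTX, as Numba does, needs explicit MVC support.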

To Reproduce

  1. Open any notebook implementing a Numba CUDA kernel, such as this one: https://www.kaggle.com/code/harshwalia/2-custom-cuda-kernels-in-python-with-numba/notebook
  2. Change the environment in notebook settings from pinned to latest
  3. Run the kernel (in the above example, the first 3 cells)
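
For a reproduction that does not depend on the linked notebook, a minimal sketch along these lines should exercise the same code path. It assumes `numba` and `numpy` are installed; the kernel `add_one` and the wrapper `run_repro` are made-up names for illustration, and the exact exception raised on a mismatched environment may differ.

```python
def run_repro():
    """Compile and launch a trivial Numba CUDA kernel; return a status string."""
    try:
        import numpy as np
        from numba import cuda
    except ImportError:
        return "numba-missing"
    if not cuda.is_available():
        return "no-gpu"

    @cuda.jit
    def add_one(arr):
        i = cuda.grid(1)   # absolute index of this thread in a 1D grid
        if i < arr.size:   # guard against threads past the end of the array
            arr[i] += 1.0

    data = cuda.to_device(np.zeros(256, dtype=np.float32))
    try:
        # The first launch triggers compilation and linking, which is where
        # a driver/toolkit mismatch would be expected to surface.
        add_one[2, 128](data)
        cuda.synchronize()
    except Exception as exc:
        return f"failed: {exc}"
    ok = bool((data.copy_to_host() == 1.0).all())
    return "ok" if ok else "wrong-result"


print(run_repro())
```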

Expected behavior

Numba kernels working

Contributor

djherbis commented Aug 8, 2023

Hey @mnc909, I dug into this a little to understand it better.

Kaggle's CUDA toolkit version comes from our base Docker image, which we upgrade periodically:

GPU_BASE_IMAGE_NAME=tf2-gpu.2-12.py310

Our NVIDIA driver version on the other hand comes from our VM image.

We'll have to look into how to upgrade the driver in the boot image, since we don't directly manage it today. It seems that while the boot image does get updates, they haven't upgraded the NVIDIA driver.
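
Since the two versions come from different places, it can help to read both from inside a running session. The sketch below assumes `nvidia-smi` and `nvcc` are on the `PATH` and that their output contains a `CUDA Version: X.Y` header and a `release X.Y` string respectively; either assumption may not hold in a given image, in which case the helpers return `None`. All function names are made up for illustration.

```python
import re
import shutil
import subprocess


def parse_version(text, pattern):
    """Extract a (major, minor) tuple from text, or None if the pattern is absent."""
    match = re.search(pattern, text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def tool_output(cmd):
    """Return a command's stdout, or an empty string if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return ""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=30).stdout
    except (OSError, subprocess.SubprocessError):
        return ""


def driver_cuda_version():
    # Highest CUDA version the installed driver supports (comes from the VM image).
    return parse_version(tool_output(["nvidia-smi"]), r"CUDA Version:\s*(\d+)\.(\d+)")


def toolkit_cuda_version():
    # CUDA toolkit version installed in the container (comes from the Docker image).
    return parse_version(tool_output(["nvcc", "--version"]), r"release\s+(\d+)\.(\d+)")


print("driver supports CUDA:", driver_cuda_version())
print("toolkit CUDA version:", toolkit_cuda_version())
```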


gmarkall commented Aug 9, 2023

> Numba in particular doesn't support MVC yet

The Numba MVC support is documented at https://numba.readthedocs.io/en/stable/cuda/minor_version_compatibility.html - can you make use of this in your environment?
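
As a rough sketch of what opting in might look like: to my understanding of the linked page, MVC is switched on with the `NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY` environment variable, set before Numba is imported, and on CUDA 11 it relies on the `ptxcompiler` and `cubinlinker` packages. Treat those names as assumptions to check against the documentation; `enable_numba_mvc` is a made-up helper.

```python
import importlib.util
import os

MVC_ENV_VAR = "NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY"
# Packages the MVC path is assumed to need on CUDA 11 (verify against the docs).
MVC_PACKAGES = ("ptxcompiler", "cubinlinker")


def enable_numba_mvc():
    """Request MVC for Numba and return the names of any missing helper packages.

    Call this before the first `import numba`, since Numba reads its
    configuration from the environment when it is imported.
    """
    os.environ[MVC_ENV_VAR] = "1"
    return [name for name in MVC_PACKAGES if importlib.util.find_spec(name) is None]


missing = enable_numba_mvc()
if missing:
    print("MVC requested, but these packages are not installed:", ", ".join(missing))
```

In a notebook, this would go in the first cell, ahead of any Numba import.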

djherbis added a commit that referenced this issue Aug 9, 2023

djherbis commented Aug 9, 2023

Thanks @gmarkall trying it 🤞

djherbis added a commit that referenced this issue Aug 14, 2023
@jakirkham

In addition to PR #1282, which added the missing MVC bits for Numba, it looks like the base image was updated recently (#1305), and the new image appears to contain a newer driver version as well.
