[BUG] Using NVIDIA driver 570 for Volta leads to UVM not being used #853
Are you running the spark-rapids ETL plugin with a pooling allocator (default) enabled?
No, here are the Spark properties retrieved from the Web UI for one of the failed applications:
Ok. Can you validate that UVM works as expected on V100 + CUDA 12.8 in a Python shell using the RMM Python package? It might also be worth trying the 25.02 releases, which are now available for RAPIDS and the spark-rapids ETL plugin. If spark-rapids-ml 24.12 is not compatible with cuML 25.02, you can `pip install -e` from the 25.02 branch of the git repo.
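A minimal sketch of the UVM check suggested above, using the RMM Python package. This requires an NVIDIA GPU with the driver under test, so it must be run on the affected node; the buffer size is only illustrative.

```python
# Hedged sketch: verify that CUDA managed memory (UVM) works on this
# driver/toolkit combination, using the RMM Python package.
import rmm

# Route RMM allocations through cudaMallocManaged (managed/unified memory),
# with the pool allocator disabled so each allocation hits the driver directly.
rmm.reinitialize(managed_memory=True, pool_allocator=False)

# Allocate a buffer; if UVM is broken at the driver level, this (or a
# larger, oversubscribing allocation) is where a failure should surface.
buf = rmm.DeviceBuffer(size=1 << 20)  # 1 MiB
print(buf.size)
```

To exercise oversubscription specifically, the same call can be repeated with a size larger than the physical device memory, which should still succeed under working UVM.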
After upgrading to CUDA 12.8, I started getting `com.nvidia.spark.rapids.jni.GpuRetryOOM: GPU OutOfMemory` when running the benchmarks in the spark-rapids-ml repo, even with UVM enabled. I have confirmed that the Spark RAPIDS plugin recognized the UVM flags, since `spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.ml.uvm.enabled` is set to true and `INFO:[workload]:CUDA managed memory enabled.` is printed to `stdout.log`. I also tried enabling the internal flags `spark.rapids.python.memory.uvm.enabled` and `spark.rapids.memory.uvm.enabled`, since the problem seems to occur in cuDF; however, the problem persists.

After downgrading to NVIDIA driver 565, with the CUDA version still set to 12.8 using `update-alternatives`, the application works again.
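For reference, a sketch of the downgrade path described above. The package names assume an Ubuntu-style install from NVIDIA's apt repositories, and the exact `update-alternatives` target path is an assumption; adjust for the actual distribution.

```shell
# Hedged sketch (Ubuntu, NVIDIA apt repo assumed): swap the driver package
# while keeping the CUDA 12.8 toolkit installed.
sudo apt-get remove --purge 'nvidia-driver-570*'
sudo apt-get install nvidia-driver-565

# Keep /usr/local/cuda pointing at the 12.8 toolkit (path is an assumption):
sudo update-alternatives --set cuda /usr/local/cuda-12.8
```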
The nodes use V100, which is considered deprecated since CUDA 12.8, and the NVIDIA drivers were installed using `nvidia-driver-570` instead of `cuda-drivers`, as installing `cuda-drivers` now installs the open-source drivers.

Here is the error when running RFC:
Since the problem seems to occur in Spark RAPIDS JNI, I'm not sure if this repo is the appropriate place to raise this issue, but I decided to raise the issue here since it occurred when running Spark RAPIDS ML. I can cross-post this issue to a more appropriate repo if needed.