
[BUG] Using NVIDIA driver 570 for Volta leads to UVM not being used #853

Open

an-ys opened this issue Feb 25, 2025 · 3 comments

@an-ys commented Feb 25, 2025

After upgrading to CUDA 12.8, I started getting `com.nvidia.spark.rapids.jni.GpuRetryOOM: GPU OutOfMemory` when running the benchmarks in the spark-rapids-ml repo, even with UVM enabled. I have confirmed that the Spark RAPIDS plugin recognized the UVM flags: `spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.ml.uvm.enabled` is set to `true`, and `INFO:[workload]:CUDA managed memory enabled.` is printed to stdout.log. Since the problem seems to occur in cuDF, I also tried enabling the internal flags `spark.rapids.python.memory.uvm.enabled` and `spark.rapids.memory.uvm.enabled`, but the problem persists.
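For reference, a minimal sketch of how the UVM-related settings are applied (hypothetical standalone reproduction; the actual runs go through spark-rapids-ml's benchmark_runner.py, and the full property list is further down in this thread):

```python
# Hypothetical minimal SparkSession carrying the UVM-related settings from
# this report; the keys and values mirror the properties dump posted below.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.memory.gpu.pool", "NONE")
    .config("spark.rapids.memory.uvm.enabled", "true")
    .config("spark.rapids.python.memory.uvm.enabled", "true")
    .config("spark.rapids.ml.uvm.enabled", "true")
    .getOrCreate()
)
```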

After downgrading to NVIDIA driver 565, with the CUDA version still set to 12.8 via update-alternatives, the application worked again.

The nodes use V100 GPUs, which are deprecated as of CUDA 12.8, and the NVIDIA drivers were installed using `nvidia-driver-570` rather than `cuda-drivers`, since installing `cuda-drivers` now pulls in the open-source drivers.
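As a quick per-node sanity check of the driver/GPU combination described above (a hedged sketch, assuming the pynvml package is installed; device index 0 is illustrative):

```python
# Print the loaded NVIDIA driver version and GPU model, to confirm the
# 570-vs-565 driver and V100 combination on a given node.
import pynvml

pynvml.nvmlInit()
print("driver:", pynvml.nvmlSystemGetDriverVersion())
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("gpu:", pynvml.nvmlDeviceGetName(handle))
pynvml.nvmlShutdown()
```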

Here is the error when running the RFC benchmark:

```
Stage failed because barrier task ResultTask(4, 0) finished unsuccessfully.
com.nvidia.spark.rapids.jni.GpuRetryOOM: GPU OutOfMemory
	at ai.rapids.cudf.Table.concatenate(Native Method)
	at ai.rapids.cudf.Table.concatenate(Table.java:2094)
	at com.nvidia.spark.rapids.ConcatAndConsumeAll$.buildNonEmptyBatchFromTypes(GpuCoalesceBatches.scala:72)
	at com.nvidia.spark.rapids.ConcatAndConsumeAll$.buildNonEmptyBatch(GpuCoalesceBatches.scala:55)
	at org.apache.spark.sql.rapids.execution.python.RebatchingRoundoffIterator.fillAndConcat(GpuArrowEvalPythonExec.scala:115)
	at org.apache.spark.sql.rapids.execution.python.RebatchingRoundoffIterator.next(GpuArrowEvalPythonExec.scala:151)
	at org.apache.spark.sql.rapids.execution.python.RebatchingRoundoffIterator.next(GpuArrowEvalPythonExec.scala:51)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.rapids.execution.python.shims.GpuArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(GpuArrowPythonRunner.scala:99)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.sql.rapids.execution.python.shims.GpuArrowPythonRunner$$anon$1.writeIteratorToStream(GpuArrowPythonRunner.scala:101)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)
```

Since the problem seems to occur in spark-rapids-jni, I'm not sure this repo is the appropriate place for this issue, but I'm raising it here because it occurred while running Spark RAPIDS ML. I can cross-post it to a more appropriate repo if needed.

@eordentlich (Collaborator) commented

Are you running the spark-rapids ETL plugin with the pooling allocator (the default) enabled?

@an-ys (Author) commented Feb 25, 2025

No, `spark.rapids.memory.gpu.pool` is set to `NONE`.

Here are the Spark properties retrieved from the Web UI for one of the failed applications:

spark.app.id	app-20250224172754-0586
spark.app.initial.file.urls	spark://master:34825/files/get_gpus_resources.rb
spark.app.initial.jar.urls	spark://master:34825/jars/rapids-4-spark_2.12-24.12.1.jar
spark.app.name	benchmark_runner.py
spark.app.startTime	1740418072901
spark.app.submitTime	1740418070807
spark.driver.extraJavaOptions	-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false "-Duser.timezone=UTC"
spark.driver.host	master
spark.driver.memory	100g
spark.driver.port	34825
spark.driverEnv.NCCL_DEBUG	WARN
spark.dynamicAllocation.enabled	false
spark.eventLog.dir	file:///var/tmp/spark-events
spark.eventLog.enabled	true
spark.executor.cores	4
spark.executor.extraJavaOptions	-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false "-Duser.timezone=UTC"
spark.executor.heartbeatInterval	1000000s
spark.executor.id	driver
spark.executor.memory	50g
spark.executor.resource.gpu.amount	1
spark.executor.resource.gpu.discoveryScript	./get_gpus_resources.rb
spark.executorEnv.CUPY_CACHE_DIR	/tmp/cupy_cache
spark.executorEnv.NCCL_DEBUG	WARN
spark.executorEnv.PYTHONPATH	[PATH]/rapids-4-spark_2.12-24.12.1.jar
spark.executorEnv.UCX_ERROR_SIGNALS	""
spark.executorEnv.UCX_IB_RX_QUEUE_LEN	1024
spark.executorEnv.UCX_MAX_RNDV_RAILS	1
spark.executorEnv.UCX_MEMTYPE_CACHE	n
spark.executorEnv.UCX_RNDV_SCHEME	put_zcopy
spark.executorEnv.UCX_TLS	cuda_copy,cuda_ipc,tcp,rc
spark.files	file:///usr/lib/spark/scripts/gpu/get_gpus_resources.rb
spark.hadoop.fs.s3a.access.key	*********(redacted)
spark.hadoop.fs.s3a.attempts.maximum	1
spark.hadoop.fs.s3a.connection.establish.timeout	1000000
spark.hadoop.fs.s3a.connection.request.timeout	0
spark.hadoop.fs.s3a.connection.ssl.enabled	true
spark.hadoop.fs.s3a.connection.timeout	1000000
spark.hadoop.fs.s3a.endpoint	http://[MINIO IP]:19000
spark.hadoop.fs.s3a.path.style.access	true
spark.hadoop.fs.s3a.secret.key	*********(redacted)
spark.history.fs.logDirectory	file:///var/tmp/spark-events
spark.history.fs.update.interval	10s
spark.history.provider	org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port	18080
spark.jars	file:///home/ysan/fr/jars/rapids-4-spark_2.12-24.12.1.jar
spark.logConf	true
spark.master	spark://master:7077
spark.network.timeout	10000001s
spark.plugins	com.nvidia.spark.SQLPlugin
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.driver.user.timezone	Z
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.gpu.pool	NONE
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.pinnedPool.size	50g
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.uvm.enabled	true
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.ml.uvm.enabled	true
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.python.memory.gpu.pooling.enabled	false
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.python.memory.uvm.enabled	true
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.shuffle.manager	com.nvidia.spark.rapids.spark353.RapidsShuffleManager
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.shuffle.mode	UCX
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.batchSizeBytes	128m
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.concurrentGpuTasks	2
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.multiThreadedRead.numThreads	20
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.python.gpu.enabled	true
spark.pyspark.driver.python	[PATH]/.venv/bin/python
spark.pyspark.python	[PATH]/.venv/bin/python
spark.rapids.driver.user.timezone	Z
spark.rapids.memory.gpu.pool	NONE
spark.rapids.memory.pinnedPool.size	50g
spark.rapids.memory.uvm.enabled	true
spark.rapids.ml.uvm.enabled	true
spark.rapids.python.memory.gpu.pooling.enabled	false
spark.rapids.python.memory.uvm.enabled	true
spark.rapids.shuffle.manager	com.nvidia.spark.rapids.spark353.RapidsShuffleManager
spark.rapids.shuffle.mode	UCX
spark.rapids.sql.batchSizeBytes	128m
spark.rapids.sql.concurrentGpuTasks	2
spark.rapids.sql.multiThreadedRead.numThreads	20
spark.rapids.sql.python.gpu.enabled	true
spark.rdd.compress	True
spark.repl.local.jars	file:///[PATH]/rapids-4-spark_2.12-24.12.1.jar
spark.scheduler.mode	FIFO
spark.serializer.objectStreamReset	100
spark.shuffle.service.enabled	false
spark.sql.adaptive.enabled	false
spark.sql.broadcastTimeout	1000000
spark.sql.cache.serializer	com.nvidia.spark.ParquetCachedBatchSerializer
spark.sql.execution.arrow.maxRecordsPerBatch	39993
spark.sql.execution.sortBeforeRepartition	false
spark.sql.extensions	com.nvidia.spark.rapids.SQLExecPlugin,com.nvidia.spark.udf.Plugin,com.nvidia.spark.DFUDFPlugin,com.nvidia.spark.rapids.optimizer.SQLOptimizerPlugin
spark.sql.files.maxPartitionBytes	512m
spark.submit.deployMode	client
spark.submit.pyFiles	
spark.task.cpus	1
spark.task.resource.gpu.amount	0.25

@eordentlich (Collaborator) commented

Ok. Can you validate that UVM works as expected on V100 + CUDA 12.8 in a Python shell using the RMM Python package?
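For example, something along these lines (a minimal sketch, assuming the rmm package is installed; the 64 GiB size is illustrative oversubscription for a 16/32 GB V100):

```python
# Route all RMM allocations through cudaMallocManaged (UVM), with no pooling.
import rmm

rmm.reinitialize(pool_allocator=False, managed_memory=True)

# Oversubscribe: ask for more than the physical memory of a V100. With
# working UVM this allocation should succeed; without it, it should raise
# an out-of-memory error.
buf = rmm.DeviceBuffer(size=64 * 1024**3)  # 64 GiB
print("allocated", buf.size, "bytes of managed memory")
```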

It might also be worth trying the 25.02 releases, which are now available for RAPIDS and the spark-rapids ETL plugin. If spark-rapids-ml 24.12 is not compatible with cuML 25.02, you can `pip install -e` from the 25.02 branch of the git repo.
