
[BUG] Using NVIDIA driver 570 for Volta leads to UVM not being used #853

Open

an-ys opened this issue Feb 25, 2025 · 3 comments

@an-ys commented Feb 25, 2025

After upgrading to CUDA 12.8, I started getting `com.nvidia.spark.rapids.jni.GpuRetryOOM: GPU OutOfMemory` when running the benchmarks in the spark-rapids-ml repo, even with UVM enabled. I have confirmed that the Spark RAPIDS plugin recognized the UVM flags: `spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.ml.uvm.enabled` is set to `true`, and `INFO:[workload]:CUDA managed memory enabled.` is printed to stdout.log. Since the problem seems to occur in cuDF, I also tried enabling the internal flags `spark.rapids.python.memory.uvm.enabled` and `spark.rapids.memory.uvm.enabled`, but the problem persists.
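For reference, a minimal sketch of how the UVM-related settings are applied (hypothetical standalone reproduction; the actual runs go through spark-rapids-ml's benchmark_runner.py, and the full property list is further down in this thread):

```python
# Hypothetical minimal SparkSession carrying the UVM-related settings from
# this report; the keys and values mirror the properties dump posted below.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.memory.gpu.pool", "NONE")
    .config("spark.rapids.memory.uvm.enabled", "true")
    .config("spark.rapids.python.memory.uvm.enabled", "true")
    .config("spark.rapids.ml.uvm.enabled", "true")
    .getOrCreate()
)
```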

After downgrading to NVIDIA driver 565, with the CUDA version still set to 12.8 via update-alternatives, the application worked again.

The nodes use V100 GPUs, which are deprecated as of CUDA 12.8, and the NVIDIA drivers were installed using `nvidia-driver-570` rather than `cuda-drivers`, since installing `cuda-drivers` now pulls in the open-source drivers.
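As a quick per-node sanity check of the driver/GPU combination described above (a hedged sketch, assuming the pynvml package is installed; device index 0 is illustrative):

```python
# Print the loaded NVIDIA driver version and GPU model, to confirm the
# 570-vs-565 driver and V100 combination on a given node.
import pynvml

pynvml.nvmlInit()
print("driver:", pynvml.nvmlSystemGetDriverVersion())
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("gpu:", pynvml.nvmlDeviceGetName(handle))
pynvml.nvmlShutdown()
```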

Here is the error when running the RFC benchmark:

```
Stage failed because barrier task ResultTask(4, 0) finished unsuccessfully.
com.nvidia.spark.rapids.jni.GpuRetryOOM: GPU OutOfMemory
	at ai.rapids.cudf.Table.concatenate(Native Method)
	at ai.rapids.cudf.Table.concatenate(Table.java:2094)
	at com.nvidia.spark.rapids.ConcatAndConsumeAll$.buildNonEmptyBatchFromTypes(GpuCoalesceBatches.scala:72)
	at com.nvidia.spark.rapids.ConcatAndConsumeAll$.buildNonEmptyBatch(GpuCoalesceBatches.scala:55)
	at org.apache.spark.sql.rapids.execution.python.RebatchingRoundoffIterator.fillAndConcat(GpuArrowEvalPythonExec.scala:115)
	at org.apache.spark.sql.rapids.execution.python.RebatchingRoundoffIterator.next(GpuArrowEvalPythonExec.scala:151)
	at org.apache.spark.sql.rapids.execution.python.RebatchingRoundoffIterator.next(GpuArrowEvalPythonExec.scala:51)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.rapids.execution.python.shims.GpuArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(GpuArrowPythonRunner.scala:99)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.sql.rapids.execution.python.shims.GpuArrowPythonRunner$$anon$1.writeIteratorToStream(GpuArrowPythonRunner.scala:101)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)
```

Since the problem seems to occur in spark-rapids-jni, I'm not sure this repo is the appropriate place for this issue, but I'm raising it here because it occurred while running Spark RAPIDS ML. I can cross-post it to a more appropriate repo if needed.

@eordentlich (Collaborator) commented

Are you running the spark-rapids ETL plugin with the pooling allocator (the default) enabled?

@an-ys (Author) commented Feb 25, 2025

No, `spark.rapids.memory.gpu.pool` is set to `NONE`.

Here are the Spark properties retrieved from the Web UI for one of the failed applications:

spark.app.id	app-20250224172754-0586
spark.app.initial.file.urls	spark://master:34825/files/get_gpus_resources.rb
spark.app.initial.jar.urls	spark://master:34825/jars/rapids-4-spark_2.12-24.12.1.jar
spark.app.name	benchmark_runner.py
spark.app.startTime	1740418072901
spark.app.submitTime	1740418070807
spark.driver.extraJavaOptions	-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false "-Duser.timezone=UTC"
spark.driver.host	master
spark.driver.memory	100g
spark.driver.port	34825
spark.driverEnv.NCCL_DEBUG	WARN
spark.dynamicAllocation.enabled	false
spark.eventLog.dir	file:///var/tmp/spark-events
spark.eventLog.enabled	true
spark.executor.cores	4
spark.executor.extraJavaOptions	-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false "-Duser.timezone=UTC"
spark.executor.heartbeatInterval	1000000s
spark.executor.id	driver
spark.executor.memory	50g
spark.executor.resource.gpu.amount	1
spark.executor.resource.gpu.discoveryScript	./get_gpus_resources.rb
spark.executorEnv.CUPY_CACHE_DIR	/tmp/cupy_cache
spark.executorEnv.NCCL_DEBUG	WARN
spark.executorEnv.PYTHONPATH	[PATH]/rapids-4-spark_2.12-24.12.1.jar
spark.executorEnv.UCX_ERROR_SIGNALS	""
spark.executorEnv.UCX_IB_RX_QUEUE_LEN	1024
spark.executorEnv.UCX_MAX_RNDV_RAILS	1
spark.executorEnv.UCX_MEMTYPE_CACHE	n
spark.executorEnv.UCX_RNDV_SCHEME	put_zcopy
spark.executorEnv.UCX_TLS	cuda_copy,cuda_ipc,tcp,rc
spark.files	file:///usr/lib/spark/scripts/gpu/get_gpus_resources.rb
spark.hadoop.fs.s3a.access.key	*********(redacted)
spark.hadoop.fs.s3a.attempts.maximum	1
spark.hadoop.fs.s3a.connection.establish.timeout	1000000
spark.hadoop.fs.s3a.connection.request.timeout	0
spark.hadoop.fs.s3a.connection.ssl.enabled	true
spark.hadoop.fs.s3a.connection.timeout	1000000
spark.hadoop.fs.s3a.endpoint	http://[MINIO IP]:19000
spark.hadoop.fs.s3a.path.style.access	true
spark.hadoop.fs.s3a.secret.key	*********(redacted)
spark.history.fs.logDirectory	file:///var/tmp/spark-events
spark.history.fs.update.interval	10s
spark.history.provider	org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port	18080
spark.jars	file:///home/ysan/fr/jars/rapids-4-spark_2.12-24.12.1.jar
spark.logConf	true
spark.master	spark://master:7077
spark.network.timeout	10000001s
spark.plugins	com.nvidia.spark.SQLPlugin
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.driver.user.timezone	Z
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.gpu.pool	NONE
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.pinnedPool.size	50g
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.uvm.enabled	true
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.ml.uvm.enabled	true
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.python.memory.gpu.pooling.enabled	false
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.python.memory.uvm.enabled	true
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.shuffle.manager	com.nvidia.spark.rapids.spark353.RapidsShuffleManager
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.shuffle.mode	UCX
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.batchSizeBytes	128m
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.concurrentGpuTasks	2
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.multiThreadedRead.numThreads	20
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.python.gpu.enabled	true
spark.pyspark.driver.python	[PATH]/.venv/bin/python
spark.pyspark.python	[PATH]/.venv/bin/python
spark.rapids.driver.user.timezone	Z
spark.rapids.memory.gpu.pool	NONE
spark.rapids.memory.pinnedPool.size	50g
spark.rapids.memory.uvm.enabled	true
spark.rapids.ml.uvm.enabled	true
spark.rapids.python.memory.gpu.pooling.enabled	false
spark.rapids.python.memory.uvm.enabled	true
spark.rapids.shuffle.manager	com.nvidia.spark.rapids.spark353.RapidsShuffleManager
spark.rapids.shuffle.mode	UCX
spark.rapids.sql.batchSizeBytes	128m
spark.rapids.sql.concurrentGpuTasks	2
spark.rapids.sql.multiThreadedRead.numThreads	20
spark.rapids.sql.python.gpu.enabled	true
spark.rdd.compress	True
spark.repl.local.jars	file:///[PATH]/rapids-4-spark_2.12-24.12.1.jar
spark.scheduler.mode	FIFO
spark.serializer.objectStreamReset	100
spark.shuffle.service.enabled	false
spark.sql.adaptive.enabled	false
spark.sql.broadcastTimeout	1000000
spark.sql.cache.serializer	com.nvidia.spark.ParquetCachedBatchSerializer
spark.sql.execution.arrow.maxRecordsPerBatch	39993
spark.sql.execution.sortBeforeRepartition	false
spark.sql.extensions	com.nvidia.spark.rapids.SQLExecPlugin,com.nvidia.spark.udf.Plugin,com.nvidia.spark.DFUDFPlugin,com.nvidia.spark.rapids.optimizer.SQLOptimizerPlugin
spark.sql.files.maxPartitionBytes	512m
spark.submit.deployMode	client
spark.submit.pyFiles	
spark.task.cpus	1
spark.task.resource.gpu.amount	0.25

@eordentlich (Collaborator) commented

Ok. Can you validate that UVM works as expected on V100 + CUDA 12.8 in a Python shell using the RMM Python package?
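For example, something along these lines (a minimal sketch, assuming the rmm package is installed; the 64 GiB size is illustrative oversubscription for a 16/32 GB V100):

```python
# Route all RMM allocations through cudaMallocManaged (UVM), with no pooling.
import rmm

rmm.reinitialize(pool_allocator=False, managed_memory=True)

# Oversubscribe: ask for more than the physical memory of a V100. With
# working UVM this allocation should succeed; without it, it should raise
# an out-of-memory error.
buf = rmm.DeviceBuffer(size=64 * 1024**3)  # 64 GiB
print("allocated", buf.size, "bytes of managed memory")
```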

It might also be worth trying the 25.02 releases, which are now available for RAPIDS and the spark-rapids ETL plugin. If spark-rapids-ml 24.12 is not compatible with cuML 25.02, you can `pip install -e` from the 25.02 branch of the git repo.
