
[BUG] Host memory leak in SerializedBatchIterator #8043

Closed
jbrennan333 opened this issue Apr 5, 2023 · 2 comments · Fixed by #8203
Labels
bug Something isn't working
reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Comments

@jbrennan333
Contributor

Describe the bug
While testing #7581 with NDS at 3TB and GPU memory restricted to 6GB, I am seeing some leaked host memory buffers. They show up with and without the fix from #8040.

Executor task launch worker for task 127.0 in stage 1653.0 (TID 71217) 23/04/05 18:34:02:230 INFO GpuParquetMultiFilePartitionReaderFactory: Using the coalesce multi-file Parquet reader, files: hdfs://rl-r7525-d32-u38.raplab.nvidia.com:9000/data/nds2.0/parquet_sf3k_decimal/store_sales/ss_sold_date_sk=2452221/part-00179-9dcfb50c-76b0-4dbf-882b-b60e7ad5b925.c000.snappy.parquet,hdfs://rl-r7525-d32-u38.raplab.nvidia.com:9000/data/nds2.0/parquet_sf3k_decimal/store_sales/ss_sold_date_sk=2452220/part-00073-9dcfb50c-76b0-4dbf-882b-b60e7ad5b925.c000.snappy.parquet,hdfs://rl-r7525-d32-u38.raplab.nvidia.com:9000/data/nds2.0/parquet_sf3k_decimal/store_sales/ss_sold_date_sk=2452244/part-00154-9dcfb50c-76b0-4dbf-882b-b60e7ad5b925.c000.snappy.parquet task attemptid: 71217
Cleaner Thread 23/04/05 18:34:02:478 ERROR HostMemoryBuffer: A HOST BUFFER WAS LEAKED (ID: 1897514 7f6758548f10)
Executor task launch worker for task 121.0 in stage 1653.0 (TID 71211) 23/04/05 18:34:02:480 INFO Executor: Finished task 121.0 in stage 1653.0 (TID 71211). 4570 bytes result sent to driver
dispatcher-Executor 23/04/05 18:34:02:481 INFO CoarseGrainedExecutorBackend: Got assigned task 71255
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:481 INFO Executor: Running task 15.0 in stage 1656.0 (TID 71255)
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:482 INFO TorrentBroadcast: Started reading broadcast variable 1541 with 1 pieces (estimated total size 4.0 MiB)
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:483 INFO MemoryStore: Block broadcast_1541_piece0 stored as bytes in memory (estimated size 10.3 KiB, free 8.2 GiB)
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:483 INFO TorrentBroadcast: Reading broadcast variable 1541 took 1 ms
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:484 INFO MemoryStore: Block broadcast_1541 stored as values in memory (estimated size 22.8 KiB, free 8.2 GiB)
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:485 INFO TorrentBroadcast: Started reading broadcast variable 1540 with 1 pieces (estimated total size 4.0 MiB)
Cleaner Thread 23/04/05 18:34:02:485 ERROR MemoryCleaner: Leaked host buffer (ID: 1897514): 2023-04-05 18:33:59.0349 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1559)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:333)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:91)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:275)
ai.rapids.cudf.MemoryBuffer.<init>(MemoryBuffer.java:117)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:196)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:192)
ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:144)
com.nvidia.spark.rapids.SerializedBatchIterator.$anonfun$tryReadNext$1(GpuColumnarBatchSerializer.scala:78)
com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
com.nvidia.spark.rapids.SerializedBatchIterator.withResource(GpuColumnarBatchSerializer.scala:34)
com.nvidia.spark.rapids.SerializedBatchIterator.tryReadNext(GpuColumnarBatchSerializer.scala:72)
com.nvidia.spark.rapids.SerializedBatchIterator.next(GpuColumnarBatchSerializer.scala:97)
org.apache.spark.sql.rapids.RapidsShuffleThreadedReaderBase$RapidsShuffleThreadedBlockIterator$BlockState.next(RapidsShuffleInternalManagerBase.scala:657)
org.apache.spark.sql.rapids.RapidsShuffleThreadedReaderBase$RapidsShuffleThreadedBlockIterator.$anonfun$deserializeTask$1(RapidsShuffleInternalManagerBase.scala:733)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

Steps/Code to reproduce bug
Run NDS at 3TB on an 8-node A100 cluster with GPU memory restricted to 6GB.

jbrennan333 added the bug, ? - Needs Triage, and reliability labels on Apr 5, 2023
mattahrens removed the ? - Needs Triage label on Apr 12, 2023
@jbrennan333
Contributor Author

I have been trying to repro this with a current 23.06 build, and so far I have not been able to. Even with GPU memory set to only 4G (with a lot of OOMs), I did not see it. I'm not sure if anything changed that might have fixed it, though.

From looking at the stack traces, I think what might have happened here is that in RapidsShuffleThreadedBlockIterator.deserializeTask, we put a batch into queued and then the task was killed, so the batch never got cleaned up. I'm not familiar with how cleanup is done for the shuffle manager, but I don't see any code that closes the contents of queued. @abellina, any ideas?
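
For illustration only, here is a stripped-down Scala sketch of the pattern described above. The class, method, and field names are hypothetical and this is not the plugin's actual RapidsShuffleThreadedBlockIterator code; it just shows how a pool thread that enqueues decoded batches can leak their host memory if the task dies before the consumer drains the queue.

```scala
// Hypothetical sketch of the suspected leak pattern; names are
// illustrative, not the actual spark-rapids code.
import java.util.concurrent.{ConcurrentLinkedQueue, Executors}
import org.apache.spark.sql.vectorized.ColumnarBatch

class ThreadedDeserializeSketch(batches: Iterator[ColumnarBatch]) {
  // Batches decoded ahead of the consumer; each one owns host memory.
  private val queued = new ConcurrentLinkedQueue[ColumnarBatch]()
  private val pool = Executors.newSingleThreadExecutor()

  // Roughly analogous to deserializeTask: decode on a pool thread and
  // hand results to the consumer through `queued`.
  def start(): Unit = pool.submit(new Runnable {
    override def run(): Unit = batches.foreach(queued.offer(_))
  })

  // The consumer closes each batch after use, but if the Spark task is
  // killed at this point, anything still sitting in `queued` is never
  // closed and its host buffer leaks.
  def nextOrNull(): ColumnarBatch = queued.poll()
}
```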

@abellina
Collaborator

From looking at the stack traces, I think what might have happened here is that in RapidsShuffleThreadedBlockIterator.deserializeTask, we put a batch into queued, and then the task was killed and this never got cleaned up?

I think that is absolutely plausible, and it explains why you can't repro it without task failures. The queue needs to be drained and closed using a task completion handler. This certainly looks like a bug.
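
A minimal sketch of that kind of cleanup, assuming a `queued` collection of ColumnarBatch and Spark's TaskContext.addTaskCompletionListener API; the actual change that landed in #8203 may look different.

```scala
// Sketch only: drain and close any undelivered batches when the task
// completes (success, failure, or kill). Not the actual fix in #8203.
import java.util.concurrent.ConcurrentLinkedQueue
import org.apache.spark.TaskContext
import org.apache.spark.sql.vectorized.ColumnarBatch

object QueuedBatchCleanup {
  // Call once when the iterator is created inside a running task.
  def register(queued: ConcurrentLinkedQueue[ColumnarBatch]): Unit = {
    Option(TaskContext.get()).foreach { tc =>
      tc.addTaskCompletionListener[Unit] { _ =>
        // Release the host memory behind every batch the task never consumed.
        var batch = queued.poll()
        while (batch != null) {
          batch.close()
          batch = queued.poll()
        }
      }
    }
  }
}
```

Registering the listener at iterator construction time means it runs even when the task is killed mid-shuffle, which is the failure mode suspected above.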
