
[BUG] Host memory leak in SerializedBatchIterator #8043

Closed
jbrennan333 opened this issue Apr 5, 2023 · 2 comments · Fixed by #8203
Labels
bug Something isn't working
reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Comments

@jbrennan333
Contributor

Describe the bug
While testing #7581 with NDS at 3TB and GPU memory restricted to 6GB, I am seeing some leaked host memory buffers. They show up with and without the fix from #8040.

Executor task launch worker for task 127.0 in stage 1653.0 (TID 71217) 23/04/05 18:34:02:230 INFO GpuParquetMultiFilePartitionReaderFactory: Using the coalesce multi-file Parquet reader, files: hdfs://rl-r7525-d32-u38.raplab.nvidia.com:9000/data/nds2.0/parquet_sf3k_decimal/store_sales/ss_sold_date_sk=2452221/part-00179-9dcfb50c-76b0-4dbf-882b-b60e7ad5b925.c000.snappy.parquet,hdfs://rl-r7525-d32-u38.raplab.nvidia.com:9000/data/nds2.0/parquet_sf3k_decimal/store_sales/ss_sold_date_sk=2452220/part-00073-9dcfb50c-76b0-4dbf-882b-b60e7ad5b925.c000.snappy.parquet,hdfs://rl-r7525-d32-u38.raplab.nvidia.com:9000/data/nds2.0/parquet_sf3k_decimal/store_sales/ss_sold_date_sk=2452244/part-00154-9dcfb50c-76b0-4dbf-882b-b60e7ad5b925.c000.snappy.parquet task attemptid: 71217
Cleaner Thread 23/04/05 18:34:02:478 ERROR HostMemoryBuffer: A HOST BUFFER WAS LEAKED (ID: 1897514 7f6758548f10)
Executor task launch worker for task 121.0 in stage 1653.0 (TID 71211) 23/04/05 18:34:02:480 INFO Executor: Finished task 121.0 in stage 1653.0 (TID 71211). 4570 bytes result sent to driver
dispatcher-Executor 23/04/05 18:34:02:481 INFO CoarseGrainedExecutorBackend: Got assigned task 71255
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:481 INFO Executor: Running task 15.0 in stage 1656.0 (TID 71255)
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:482 INFO TorrentBroadcast: Started reading broadcast variable 1541 with 1 pieces (estimated total size 4.0 MiB)
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:483 INFO MemoryStore: Block broadcast_1541_piece0 stored as bytes in memory (estimated size 10.3 KiB, free 8.2 GiB)
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:483 INFO TorrentBroadcast: Reading broadcast variable 1541 took 1 ms
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:484 INFO MemoryStore: Block broadcast_1541 stored as values in memory (estimated size 22.8 KiB, free 8.2 GiB)
Executor task launch worker for task 15.0 in stage 1656.0 (TID 71255) 23/04/05 18:34:02:485 INFO TorrentBroadcast: Started reading broadcast variable 1540 with 1 pieces (estimated total size 4.0 MiB)
Cleaner Thread 23/04/05 18:34:02:485 ERROR MemoryCleaner: Leaked host buffer (ID: 1897514): 2023-04-05 18:33:59.0349 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1559)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:333)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:91)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:275)
ai.rapids.cudf.MemoryBuffer.<init>(MemoryBuffer.java:117)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:196)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:192)
ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:144)
com.nvidia.spark.rapids.SerializedBatchIterator.$anonfun$tryReadNext$1(GpuColumnarBatchSerializer.scala:78)
com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
com.nvidia.spark.rapids.SerializedBatchIterator.withResource(GpuColumnarBatchSerializer.scala:34)
com.nvidia.spark.rapids.SerializedBatchIterator.tryReadNext(GpuColumnarBatchSerializer.scala:72)
com.nvidia.spark.rapids.SerializedBatchIterator.next(GpuColumnarBatchSerializer.scala:97)
org.apache.spark.sql.rapids.RapidsShuffleThreadedReaderBase$RapidsShuffleThreadedBlockIterator$BlockState.next(RapidsShuffleInternalManagerBase.scala:657)
org.apache.spark.sql.rapids.RapidsShuffleThreadedReaderBase$RapidsShuffleThreadedBlockIterator.$anonfun$deserializeTask$1(RapidsShuffleInternalManagerBase.scala:733)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

Steps/Code to reproduce bug
Run NDS at 3TB on an 8-node A100 cluster with GPU memory restricted to 6GB.

jbrennan333 added the bug, ? - Needs Triage, and reliability labels on Apr 5, 2023
mattahrens removed the ? - Needs Triage label on Apr 12, 2023
@jbrennan333
Contributor Author

I have been trying to repro this with a current 23.06 build, and so far I have not been able to. Even with GPU memory set to only 4G (with a lot of OOMs), I did not see it. I'm not sure if anything changed that might have fixed it, though.

From looking at the stack traces, I think what might have happened here is that in RapidsShuffleThreadedBlockIterator.deserializeTask, we put a batch into queued and then the task was killed, so the batch never got cleaned up. I'm not familiar with how cleanup is done for the shuffle manager, but I don't see any code that closes the contents of queued. @abellina, any ideas?
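
For illustration only, here is a stripped-down Scala sketch of the pattern described above. The class, method, and field names are hypothetical and this is not the plugin's actual RapidsShuffleThreadedBlockIterator code; it just shows how a pool thread that enqueues decoded batches can leak their host memory if the task dies before the consumer drains the queue.

```scala
// Hypothetical sketch of the suspected leak pattern; names are
// illustrative, not the actual spark-rapids code.
import java.util.concurrent.{ConcurrentLinkedQueue, Executors}
import org.apache.spark.sql.vectorized.ColumnarBatch

class ThreadedDeserializeSketch(batches: Iterator[ColumnarBatch]) {
  // Batches decoded ahead of the consumer; each one owns host memory.
  private val queued = new ConcurrentLinkedQueue[ColumnarBatch]()
  private val pool = Executors.newSingleThreadExecutor()

  // Roughly analogous to deserializeTask: decode on a pool thread and
  // hand results to the consumer through `queued`.
  def start(): Unit = pool.submit(new Runnable {
    override def run(): Unit = batches.foreach(queued.offer(_))
  })

  // The consumer closes each batch after use, but if the Spark task is
  // killed at this point, anything still sitting in `queued` is never
  // closed and its host buffer leaks.
  def nextOrNull(): ColumnarBatch = queued.poll()
}
```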

@abellina
Collaborator

From looking at the stack traces, I think what might have happened here is that in RapidsShuffleThreadedBlockIterator.deserializeTask, we put a batch into queued, and then the task was killed and this never got cleaned up?

I think that is absolutely plausible, and it explains why you can't repro it without task failures. The queue needs to be drained and closed using a task completion handler. This certainly looks like a bug.
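
A minimal sketch of that kind of cleanup, assuming a `queued` collection of ColumnarBatch and Spark's TaskContext.addTaskCompletionListener API; the actual change that landed in #8203 may look different.

```scala
// Sketch only: drain and close any undelivered batches when the task
// completes (success, failure, or kill). Not the actual fix in #8203.
import java.util.concurrent.ConcurrentLinkedQueue
import org.apache.spark.TaskContext
import org.apache.spark.sql.vectorized.ColumnarBatch

object QueuedBatchCleanup {
  // Call once when the iterator is created inside a running task.
  def register(queued: ConcurrentLinkedQueue[ColumnarBatch]): Unit = {
    Option(TaskContext.get()).foreach { tc =>
      tc.addTaskCompletionListener[Unit] { _ =>
        // Release the host memory behind every batch the task never consumed.
        var batch = queued.poll()
        while (batch != null) {
          batch.close()
          batch = queued.poll()
        }
      }
    }
  }
}
```

Registering the listener at iterator construction time means it runs even when the task is killed mid-shuffle, which is the failure mode suspected above.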
