
Avoid processing empty batch in ParquetCachedBatchSerializer #8374

Merged (3 commits) on May 25, 2023

Conversation

razajafri (Collaborator):

This PR returns an empty ParquetCachedBatch if the incoming batch is empty.
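The change can be sketched as an early-return guard in the batch-to-cache conversion path. This is a minimal, self-contained sketch using hypothetical stand-in types; the real serializer operates on Spark's ColumnarBatch and the plugin's ParquetCachedBatch, and its encoding step writes actual Parquet bytes.

```scala
// Hypothetical stand-in types for illustration only; the real code uses
// Spark's ColumnarBatch and the plugin's ParquetCachedBatch.
final case class Batch(numRows: Int)
final case class CachedBatch(bytes: Array[Byte]) {
  def isEmpty: Boolean = bytes.isEmpty
}

object CacheSketch {
  // The guard this PR adds: if the incoming batch has no rows, skip the
  // Parquet encoding step entirely and return an empty cached batch.
  def convert(batch: Batch): CachedBatch =
    if (batch.numRows == 0) CachedBatch(Array.emptyByteArray)
    else CachedBatch(encode(batch))

  // Placeholder for the actual Parquet encoding, which previously ran
  // even for zero-row batches.
  private def encode(batch: Batch): Array[Byte] =
    Array.fill(batch.numRows)(1.toByte)
}
```

The point of the guard is that downstream encoding code need not handle the zero-row case at all.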

@razajafri changed the title from "Avoid processing empty batch" to "Avoid processing empty batch in ParquetCachedBatchSerializer" on May 24, 2023
@@ -50,6 +50,23 @@ class CachedBatchWriterSuite extends SparkQueryCompareTestSuite {
}
}

  test("convert columnar batch to cached batch on single col table with 0 rows in a batch") {
    if (!withCpuSparkSession(s => s.version < "3.1.0")) {
      withResource(new TestResources()) { resources =>
Contributor:

Rather than a Scala unit test, is this reproducible by caching an empty dataframe or some other higher-level operation we could add to the cache PySpark tests? I'm curious how this happens in practice for a user, and I'd like to see a test that mirrors not only the creation of that cache but also its subsequent use, the way this situation would actually occur for users.

razajafri (Collaborator, Author):

Ideally, that would be my preferred test as well, but I haven't been able to reproduce it so far. This is a customer-reported issue, and I wanted to get this fix in so they can test it.
Do you think we can address this in a follow-on?

Contributor:

Yes, please file the issue. I assume you've tried the simple case of loading a table, applying a filter that removes all rows, caching the resulting dataframe, and then operating on that cached dataframe.

@sameerz added the bug label ("Something isn't working") on May 24, 2023
Signed-off-by: Raza Jafri <[email protected]>
@razajafri (Collaborator, Author):

build

@razajafri (Collaborator, Author):

build

@razajafri merged commit 8ebd3b1 into NVIDIA:branch-23.06 on May 25, 2023