Avoid processing empty batch in ParquetCachedBatchSerializer #8374
Conversation
Signed-off-by: Raza Jafri <[email protected]>
Resolved review threads:
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/ParquetCachedBatchSerializer.scala
- tests/src/test/scala/com/nvidia/spark/rapids/CachedBatchWriterSuite.scala (outdated)
```diff
@@ -50,6 +50,23 @@ class CachedBatchWriterSuite extends SparkQueryCompareTestSuite {
     }
   }
 
+  test("convert columnar batch to cached batch on single col table with 0 rows in a batch") {
+    if (!withCpuSparkSession(s => s.version < "3.1.0")) {
+      withResource(new TestResources()) { resources =>
```
Reviewer: Rather than a Scala unit test, is this reproducible by caching an empty DataFrame, or through some other higher-level operation we could add to the cache PySpark tests? I'm curious how this happens in practice for a user, and I'd like to see a test that mirrors not only the creation of that cache but also its subsequent use, the way this situation would arise for users.
Author: Ideally that would be my preferred test as well, but I haven't been able to reproduce it that way so far. This is a customer-reported issue, and I wanted to get this fix in so they can test it. Do you think we can address this in a follow-on?
Reviewer: Yes, please file the issue. I assume you've tried the simple case of loading a table, performing a filter on it that removes all rows, caching the resulting DataFrame, and subsequently operating on that.
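For reference, a minimal sketch of the repro the reviewer describes, using a small in-memory table as a stand-in for "loading a table" (the column name and predicate are made up for illustration; per the author's comment above, this path has not reproduced the issue so far):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object EmptyCacheRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-cache-repro")
      .getOrCreate()
    import spark.implicits._

    // Create a small table, then filter out every row before caching.
    val df = Seq(1, 2, 3).toDF("id")
    val empty = df.filter(col("id") < 0) // predicate removes all rows

    empty.cache()
    empty.count()                   // materializes the (now empty) cache
    println(empty.collect().length) // subsequent use of the cached data

    spark.stop()
  }
}
```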
Signed-off-by: Raza Jafri <[email protected]>
build
Signed-off-by: Raza Jafri <[email protected]>
build
This PR returns an empty ParquetCachedBatch if the incoming batch is empty, instead of processing the zero-row batch through the Parquet writer.
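For context, a minimal sketch of the kind of guard this implies. The class shape, method name, and helper below are simplified assumptions for illustration, not the plugin's actual code:

```scala
import org.apache.spark.sql.vectorized.ColumnarBatch

// Simplified stand-in for the plugin's cached-batch container; the real
// class lives alongside ParquetCachedBatchSerializer in the plugin.
case class ParquetCachedBatch(numRows: Int, buffer: Array[Byte])

object EmptyBatchGuard {
  // Assumed helper standing in for the real Parquet encoding path.
  private def writeBatchAsParquet(batch: ColumnarBatch): ParquetCachedBatch =
    ParquetCachedBatch(batch.numRows(), Array.emptyByteArray) // placeholder body

  // Hypothetical conversion entry point: short-circuit on an empty batch
  // so the Parquet writer never sees zero rows.
  def convertBatch(batch: ColumnarBatch): ParquetCachedBatch =
    if (batch.numRows() == 0) {
      ParquetCachedBatch(numRows = 0, buffer = Array.emptyByteArray)
    } else {
      writeBatchAsParquet(batch)
    }
}
```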