
Avoid processing empty batch in ParquetCachedBatchSerializer #8374

Merged (3 commits) on May 25, 2023

Conversation

razajafri (Collaborator):

This PR returns an empty ParquetCachedBatch if the incoming batch is empty.
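The change can be sketched as an early-return guard in the batch-to-cache conversion path. This is a minimal, self-contained sketch using hypothetical stand-in types; the real serializer operates on Spark's ColumnarBatch and the plugin's ParquetCachedBatch, and its encoding step writes actual Parquet bytes.

```scala
// Hypothetical stand-in types for illustration only; the real code uses
// Spark's ColumnarBatch and the plugin's ParquetCachedBatch.
final case class Batch(numRows: Int)
final case class CachedBatch(bytes: Array[Byte]) {
  def isEmpty: Boolean = bytes.isEmpty
}

object CacheSketch {
  // The guard this PR adds: if the incoming batch has no rows, skip the
  // Parquet encoding step entirely and return an empty cached batch.
  def convert(batch: Batch): CachedBatch =
    if (batch.numRows == 0) CachedBatch(Array.emptyByteArray)
    else CachedBatch(encode(batch))

  // Placeholder for the actual Parquet encoding, which previously ran
  // even for zero-row batches.
  private def encode(batch: Batch): Array[Byte] =
    Array.fill(batch.numRows)(1.toByte)
}
```

The point of the guard is that downstream encoding code need not handle the zero-row case at all.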

@razajafri changed the title from "Avoid processing empty batch" to "Avoid processing empty batch in ParquetCachedBatchSerializer" on May 24, 2023
@@ -50,6 +50,23 @@ class CachedBatchWriterSuite extends SparkQueryCompareTestSuite {
}
}

  test("convert columnar batch to cached batch on single col table with 0 rows in a batch") {
    if (!withCpuSparkSession(s => s.version < "3.1.0")) {
      withResource(new TestResources()) { resources =>
Contributor:

Rather than a Scala unit test, is this reproducible by caching an empty dataframe or some other higher-level operation we could add to the cache PySpark tests? I'm curious how this happens in practice for a user, and I'd like to see a test that mirrors not only the creation of that cache but also its subsequent use, the way this situation would actually occur for users.

razajafri (Collaborator, Author):

Ideally, that would be my preferred test as well, but I haven't been able to reproduce it so far. This is a customer-reported issue, and I wanted to get this fix in so they can test it.
Do you think we can address this in a follow-on?

Contributor:

Yes, please file the issue. I assume you've tried the simple case of loading a table, applying a filter that removes all rows, caching the resulting dataframe, and then operating on that cached dataframe.

@sameerz added the bug label ("Something isn't working") on May 24, 2023
Signed-off-by: Raza Jafri <[email protected]>
@razajafri (Collaborator, Author):

build

@razajafri (Collaborator, Author):

build

@razajafri merged commit 8ebd3b1 into NVIDIA:branch-23.06 on May 25, 2023