-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions #8007
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions #8007
Conversation
|
|
||
| // NB can be 0 if at end of page | ||
| let records_read_once = self.record_reader.read_records(records_to_read)?; | ||
| if records_read_once == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case of 0 rows being read is handled in the if records_read_once < records_to-read clause below -- namely in this case the code needs to try and get the next page of data from the page reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| >(2, 100, 2, message_type, 15, 50, converter); | ||
| } | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both these tests fail without the changes in this PR.
I don't like the copy/paste nature of these tests and I plan a minor PR building on this one proposing how to remove the duplication and make the tests easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
As the code that I removed in this PR added to support empty pages in 8a61570 (but it didn't seem to have any tests), I was worried that this deletion would cause some sort of reversion in behavior. To ensure it didn't, I added a test for empty pages in commit cfdea83 To try and ensure that the test covers the empty pages case, I and ran the new test on commit ed1f771 (the one right before 8a61570) and the test fails, thus leading me to conclude that the code removed in this PR is unnecessary and zero page parquet files are still supported. Log of what I did: Get commit right before support for empty pages: Apply new test for no pages: Then I had to edit the test a little to use And then I ran the test, and it fails: FYI @houqp and @paddyhoran |
|
@sunchao Could you review as well ? |
| } | ||
|
|
||
| #[test] | ||
| fn test_primitive_array_reader_empty_pages() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pinging! will do today. |
sunchao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @alamb .
|
|
||
| // NB can be 0 if at end of page | ||
| let records_read_once = self.record_reader.read_records(records_to_read)?; | ||
| if records_read_once == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I added two test cases in #8007, which increased coverage. However, upon further review, I noticed the choice of parameters to hit edge conditions didn't cover the string data types. Rather than adding a bunch more copies of basically the same test to add new parameters for different tests, I instead propose using the same set of parameters for all data types and drive the tests using a table in this PR. It makes the test logic slightly harder to follow, in my opinion, but it does increase coverage Closes #8009 from alamb/alamb/ARROW-9790-test-consolidation Authored-by: alamb <[email protected]> Signed-off-by: Andy Grove <[email protected]>
I added two test cases in apache/arrow#8007, which increased coverage. However, upon further review, I noticed the choice of parameters to hit edge conditions didn't cover the string data types. Rather than adding a bunch more copies of basically the same test to add new parameters for different tests, I instead propose using the same set of parameters for all data types and drive the tests using a table in this PR. It makes the test logic slightly harder to follow, in my opinion, but it does increase coverage Closes #8009 from alamb/alamb/ARROW-9790-test-consolidation Authored-by: alamb <[email protected]> Signed-off-by: Andy Grove <[email protected]>
When I was reading a parquet file into
RecordBatchesusingParquetFileArrowReaderthat had row groups that were 100,000 rows in length with a batch size of 60,000, after reading 300,000 rows successfully, I started seeing this errorUpon investigation, I found that when reading with
ParquetFileArrowReader, if the parquet input file has multiple row groups, and if a batch happens to end at the end of a row group for Int or Float, no subsequent row groups are readVisually:
I traced the issue down to a bug in
PrimitiveArrayReaderwhere it mistakenly interprets reading0rows from a page reader as being at the end of the column.This bug appears not to be present in the initial implementation #5378 -- FYI @andygrove and @liurenjie1024 (the test harness in this file is awesome, btw), but was introduced in #7140. I will do some more investigating to ensure the test case described in that ticket is handled