-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-4218: [Rust][Parquet] Initial support for array reader. #5378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@sunchao @andygrove @nevi-me @paddyhoran Please help to review. |
|
@sunchao @nevi-me @paddyhoran @andygrove Can you take a look at this when you are available? We are close to initial support of parquet reader. This is the last but one PR to finish this goal. |
|
I noticed that this had not gotten a review last night. I'm not too familiar with Parquet in general (although I do want to learn) so I was deferring to the others. I'll try and take a look today but it will need additional review from someone else also I think. |
|
@liurenjie1024 I have not forgotten about this, sorry it is taking so long. It's doubtful this will make it into 0.15 though so I will take a look once 0.15 is released. |
|
@liurenjie1024 Sorry for the delay. I will start reviewing this tomorrow. |
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| use std::cmp::min; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we get a rustdoc comment explaining what is in in this module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used by another component of arrow reader and is not a public api. I'll add doc after the last component is ready.
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good to me. There a quite a few unwrap calls ... can these be handled in a safer way?
|
@paddyhoran I don't know the schedule of 0.15 release, but we have only one PR(exclude this one) left before initial support of arrow reader getting ready. And that PR will not be a large one. |
|
I think 0.15 is being cut this morning. We should keep moving though as it might be delayed. I don't have time now, but might have some later today. |
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @liurenjie1024 .. LGTM!
|
@paddyhoran OK if I merge this one? |
paddyhoran
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I thought I had already approved earlier today. Thanks @liurenjie1024
When I was reading a parquet file into `RecordBatches` using `ParquetFileArrowReader` that had row groups that were 100,000 rows in length with a batch size of 60,000, after reading 300,000 rows successfully, I started seeing this error
```
ParquetError("Parquet error: Not all children array length are the same!")
```
Upon investigation, I found that when reading with `ParquetFileArrowReader`, if the parquet input file has multiple row groups, and if a batch happens to end at the end of a row group for Int or Float, no subsequent row groups are read
Visually:
```
+-----+
| RG1 |
| |
+-----+ <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
+-----+
| RG2 |
| |
+-----+
```
I traced the issue down to a bug in `PrimitiveArrayReader` where it mistakenly interprets reading `0` rows from a page reader as being at the end of the column.
This bug appears *not* to be present in the initial implementation #5378 -- FYI @andygrove and @liurenjie1024 (the test harness in this file is awesome, btw), but was introduced in #7140. I will do some more investigating to ensure the test case described in that ticket is handled
Closes #8007 from alamb/alamb/ARROW-9790-record-batch-boundaries
Authored-by: alamb <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
Initial support for array reader. List and map support will come later.