Skip to content

Conversation

@liurenjie1024
Copy link
Contributor

Initial support for array reader. List and map support will come later.

@liurenjie1024
Copy link
Contributor Author

@sunchao @andygrove @nevi-me @paddyhoran Please help to review.

@liurenjie1024
Copy link
Contributor Author

@sunchao @nevi-me @paddyhoran @andygrove Can you take a look at this when you are available? We are close to initial support of parquet reader. This is the last but one PR to finish this goal.

@paddyhoran
Copy link
Contributor

I noticed that this had not gotten a review last night. I'm not too familiar with Parquet in general (although I do want to learn) so I was deferring to the others. I'll try and take a look today but it will need additional review from someone else also I think.

@paddyhoran
Copy link
Contributor

@liurenjie1024 I have not forgotten about this, sorry it is taking so long. It's doubtful this will make it into 0.15 though so I will take a look once 0.15 is released.

@andygrove
Copy link
Member

@liurenjie1024 Sorry for the delay. I will start reviewing this tomorrow.

// specific language governing permissions and limitations
// under the License.

use std::cmp::min;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get a rustdoc comment explaining what is in in this module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used by another component of arrow reader and is not a public api. I'll add doc after the last component is ready.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good to me. There a quite a few unwrap calls ... can these be handled in a safer way?

@liurenjie1024
Copy link
Contributor Author

liurenjie1024 commented Sep 25, 2019

@paddyhoran I don't know the schedule of 0.15 release, but we have only one PR(exclude this one) left before initial support of arrow reader getting ready. And that PR will not be a large one.

@paddyhoran
Copy link
Contributor

I think 0.15 is being cut this morning. We should keep moving though as it might be delayed. I don't have time now, but might have some later today.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liurenjie1024 .. LGTM!

@andygrove
Copy link
Member

@paddyhoran OK if I merge this one?

Copy link
Contributor

@paddyhoran paddyhoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I thought I had already approved earlier today. Thanks @liurenjie1024

@andygrove andygrove closed this in 07ab508 Sep 26, 2019
sunchao pushed a commit that referenced this pull request Aug 20, 2020
When I was reading a parquet file into `RecordBatches` using `ParquetFileArrowReader` that had row groups that were 100,000 rows in length with a batch size of 60,000, after reading 300,000 rows successfully, I started seeing this error

```
 ParquetError("Parquet error: Not all children array length are the same!")
```

Upon investigation, I found that when reading with `ParquetFileArrowReader`, if the parquet input file has multiple row groups, and if a batch happens to end at the end of a row group for Int or Float, no subsequent row groups are read

Visually:

```
+-----+
| RG1 |
|     |
+-----+  <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
+-----+
| RG2 |
|     |
+-----+
```

I traced the issue down to a bug in `PrimitiveArrayReader` where it mistakenly interprets reading `0` rows from a page reader as being at the end of the column.

This bug appears *not* to be present in the initial implementation #5378 -- FYI @andygrove  and @liurenjie1024 (the test harness in this file is awesome, btw), but was introduced in #7140. I will do some more investigating to ensure the test case described in that ticket is handled

Closes #8007 from alamb/alamb/ARROW-9790-record-batch-boundaries

Authored-by: alamb <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants