
Conversation

@houqp (Member) commented May 10, 2020

Sometimes Spark will write out Parquet files with zero row groups, which results in an error when such a file is read using `ParquetFileArrowReader`.

It would be more convenient if `ParquetFileArrowReader` supported this edge case out of the box.
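
For concreteness, a minimal sketch of the failing case, assuming the crate's arrow reader API of that era (`SerializedFileReader`, `ParquetFileArrowReader`, `RecordBatchReader::next_batch`); exact names and signatures varied across versions:

```rust
// Sketch only: names approximate the Rust parquet crate as of mid-2020
// (`Rc` vs `Arc` and the reader traits changed across versions).
use std::fs::File;
use std::rc::Rc;

use arrow::record_batch::RecordBatchReader; // brings `next_batch` into scope
use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() {
    // A file Spark wrote with a valid schema but zero row groups.
    let file = File::open("empty.parquet").expect("file should exist");
    let reader = SerializedFileReader::new(file).expect("valid parquet footer");
    let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(reader));

    // Previously a zero-row-group file made reading fail with an error;
    // with this change the loop below simply yields no batches.
    let mut batches = arrow_reader.get_record_reader(1024).expect("record reader");
    while let Some(batch) = batches.next_batch().expect("batch") {
        println!("read {} rows", batch.num_rows());
    }
}
```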

@paddyhoran (Contributor) left a comment

LGTM

@houqp deleted the empty_parquet branch May 15, 2020 04:29
sunchao pushed a commit that referenced this pull request Aug 20, 2020
While reading a Parquet file whose row groups were 100,000 rows long into `RecordBatches` using `ParquetFileArrowReader` with a batch size of 60,000, I started seeing this error after 300,000 rows had been read successfully:

```
 ParquetError("Parquet error: Not all children array length are the same!")
```

Upon investigation, I found that when reading with `ParquetFileArrowReader`, if the Parquet input file has multiple row groups and a batch for an Int or Float column happens to end exactly at the end of a row group, no subsequent row groups are read. With these sizes, that first happens at 300,000 rows, the least common multiple of 100,000 and 60,000.

Visually:

```
+-----+
| RG1 |
|     |
+-----+  <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
+-----+
| RG2 |
|     |
+-----+
```

I traced the issue down to a bug in `PrimitiveArrayReader` where it mistakenly interprets reading `0` rows from a page reader as being at the end of the column.

This bug appears *not* to be present in the initial implementation #5378 -- FYI @andygrove and @liurenjie1024 (the test harness in this file is awesome, btw) -- but was introduced in #7140. I will do some more investigating to ensure the test case described in that ticket is handled.
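
To make the failure mode concrete, here is a runnable toy model of the control flow (illustrative names only, not the crate's actual internals). With four 100,000-row row groups and a 60,000-row batch size it reproduces the report: the buggy path stops at 300,000 rows, the first batch boundary that coincides with a row-group boundary, while the fixed path tries the next row group on a zero-row read and only concludes end of column when none remain:

```rust
// Toy model of the bug: a zero-row read from the current row group's pages
// is mistaken for end of column, so later row groups are never visited.
struct MockColumn {
    row_groups: Vec<usize>, // rows remaining in each row group
    current: usize,
}

impl MockColumn {
    // Reads up to `max` rows from the current row group; returns 0 once its
    // pages are exhausted, like a spent page reader.
    fn read_records(&mut self, max: usize) -> usize {
        let remaining = &mut self.row_groups[self.current];
        let n = max.min(*remaining);
        *remaining -= n;
        n
    }

    fn advance_row_group(&mut self) -> bool {
        if self.current + 1 < self.row_groups.len() {
            self.current += 1;
            true
        } else {
            false
        }
    }
}

fn read_all(col: &mut MockColumn, batch_size: usize, buggy: bool) -> usize {
    let mut total = 0;
    loop {
        let mut records_read = 0;
        while records_read < batch_size {
            let want = batch_size - records_read;
            let read = col.read_records(want);
            records_read += read;
            if read == 0 {
                // Buggy version: treat a zero-row read as end of column.
                // Fixed version: first try the next row group's page reader.
                if buggy || !col.advance_row_group() {
                    break;
                }
            } else if read < want {
                // Partial read mid-batch: both versions advance row groups
                // here, which is why non-aligned boundaries worked fine.
                if !col.advance_row_group() {
                    break;
                }
            }
        }
        if records_read == 0 {
            return total; // end of column
        }
        total += records_read;
    }
}

fn main() {
    let groups = vec![100_000usize; 4];
    let mut buggy = MockColumn { row_groups: groups.clone(), current: 0 };
    let mut fixed = MockColumn { row_groups: groups, current: 0 };
    assert_eq!(read_all(&mut buggy, 60_000, true), 300_000); // RG4 never read
    assert_eq!(read_all(&mut fixed, 60_000, false), 400_000); // all rows read
}
```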

Closes #8007 from alamb/alamb/ARROW-9790-record-batch-boundaries

Authored-by: alamb <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021