
Conversation

@houqp (Member) commented May 10, 2020

Sometimes Spark will write out Parquet files with zero row groups, which results in an error when such a file is read using `ParquetFileArrowReader`.

It would be more convenient if `ParquetFileArrowReader` supported this edge case out of the box.
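
For concreteness, a minimal sketch of the failing case, assuming the crate's arrow reader API of that era (`SerializedFileReader`, `ParquetFileArrowReader`, `RecordBatchReader::next_batch`); exact names and signatures varied across versions:

```rust
// Sketch only: names approximate the Rust parquet crate as of mid-2020
// (`Rc` vs `Arc` and the reader traits changed across versions).
use std::fs::File;
use std::rc::Rc;

use arrow::record_batch::RecordBatchReader; // brings `next_batch` into scope
use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() {
    // A file Spark wrote with a valid schema but zero row groups.
    let file = File::open("empty.parquet").expect("file should exist");
    let reader = SerializedFileReader::new(file).expect("valid parquet footer");
    let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(reader));

    // Previously a zero-row-group file made reading fail with an error;
    // with this change the loop below simply yields no batches.
    let mut batches = arrow_reader.get_record_reader(1024).expect("record reader");
    while let Some(batch) = batches.next_batch().expect("batch") {
        println!("read {} rows", batch.num_rows());
    }
}
```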

@paddyhoran (Contributor) left a comment

LGTM

@houqp deleted the empty_parquet branch May 15, 2020 04:29
sunchao pushed a commit that referenced this pull request Aug 20, 2020
While reading a Parquet file whose row groups were 100,000 rows long into `RecordBatches` using `ParquetFileArrowReader` with a batch size of 60,000, I started seeing this error after 300,000 rows had been read successfully:

```
 ParquetError("Parquet error: Not all children array length are the same!")
```

Upon investigation, I found that when reading with `ParquetFileArrowReader`, if the Parquet input file has multiple row groups and a batch for an Int or Float column happens to end exactly at the end of a row group, no subsequent row groups are read. With these sizes, that first happens at 300,000 rows, the least common multiple of 100,000 and 60,000.

Visually:

```
+-----+
| RG1 |
|     |
+-----+  <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
+-----+
| RG2 |
|     |
+-----+
```

I traced the issue down to a bug in `PrimitiveArrayReader` where it mistakenly interprets reading `0` rows from a page reader as being at the end of the column.

This bug appears *not* to be present in the initial implementation #5378 -- FYI @andygrove and @liurenjie1024 (the test harness in this file is awesome, btw) -- but was introduced in #7140. I will do some more investigating to ensure the test case described in that ticket is handled.
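
To make the failure mode concrete, here is a runnable toy model of the control flow (illustrative names only, not the crate's actual internals). With four 100,000-row row groups and a 60,000-row batch size it reproduces the report: the buggy path stops at 300,000 rows, the first batch boundary that coincides with a row-group boundary, while the fixed path tries the next row group on a zero-row read and only concludes end of column when none remain:

```rust
// Toy model of the bug: a zero-row read from the current row group's pages
// is mistaken for end of column, so later row groups are never visited.
struct MockColumn {
    row_groups: Vec<usize>, // rows remaining in each row group
    current: usize,
}

impl MockColumn {
    // Reads up to `max` rows from the current row group; returns 0 once its
    // pages are exhausted, like a spent page reader.
    fn read_records(&mut self, max: usize) -> usize {
        let remaining = &mut self.row_groups[self.current];
        let n = max.min(*remaining);
        *remaining -= n;
        n
    }

    fn advance_row_group(&mut self) -> bool {
        if self.current + 1 < self.row_groups.len() {
            self.current += 1;
            true
        } else {
            false
        }
    }
}

fn read_all(col: &mut MockColumn, batch_size: usize, buggy: bool) -> usize {
    let mut total = 0;
    loop {
        let mut records_read = 0;
        while records_read < batch_size {
            let want = batch_size - records_read;
            let read = col.read_records(want);
            records_read += read;
            if read == 0 {
                // Buggy version: treat a zero-row read as end of column.
                // Fixed version: first try the next row group's page reader.
                if buggy || !col.advance_row_group() {
                    break;
                }
            } else if read < want {
                // Partial read mid-batch: both versions advance row groups
                // here, which is why non-aligned boundaries worked fine.
                if !col.advance_row_group() {
                    break;
                }
            }
        }
        if records_read == 0 {
            return total; // end of column
        }
        total += records_read;
    }
}

fn main() {
    let groups = vec![100_000usize; 4];
    let mut buggy = MockColumn { row_groups: groups.clone(), current: 0 };
    let mut fixed = MockColumn { row_groups: groups, current: 0 };
    assert_eq!(read_all(&mut buggy, 60_000, true), 300_000); // RG4 never read
    assert_eq!(read_all(&mut fixed, 60_000, false), 400_000); // all rows read
}
```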

Closes #8007 from alamb/alamb/ARROW-9790-record-batch-boundaries

Authored-by: alamb <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021