Incorrect parsing for 'repeated group' in parquet format #5316

cleaton · 2016-05-19T03:37:51Z

We are using parquet-protobuf to generate Parquet files. parquet-protobuf uses 'repeated group TypeName' to represent a list of structs.

Reading such a file in Presto fails with the error:
'Expected LIST column 'xxx.yyy' to only have one field, but has 5 fields'

https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetHiveRecordCursor.java#L915

Commit 5994e35 added a comment about how the Parquet spec defines the list type, but I think its interpretation is not entirely correct. The spec says:

A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field.
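To make that rule concrete, here is a minimal Python sketch of how a reader could apply it. The `Field` class and the `logical_type` and `element_type` helpers are hypothetical names made up for illustration; they are not Presto or parquet-java APIs. The sketch models only the one rule quoted above, not the full LIST/MAP resolution logic.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical, simplified stand-in for a Parquet schema node."""
    name: str
    repetition: str                    # "required", "optional" or "repeated"
    type: str                          # primitive type name, or "group"
    annotation: Optional[str] = None   # e.g. "LIST" or "MAP"
    children: List["Field"] = field(default_factory=list)


def element_type(f: Field) -> str:
    """Type of the field itself, ignoring its repetition."""
    if f.type != "group":
        return f.type
    inner = ", ".join(
        f"{c.name}: {logical_type(c, f.annotation)}" for c in f.children
    )
    return f"struct<{inner}>"


def logical_type(f: Field, parent_annotation: Optional[str] = None) -> str:
    """Apply the quoted rule: a bare repeated field is a list of its own type."""
    wrapped = (parent_annotation in ("LIST", "MAP")
               or f.annotation in ("LIST", "MAP"))
    if f.repetition == "repeated" and not wrapped:
        return f"array<{element_type(f)}>"
    # Annotated LIST/MAP groups need their own handling, omitted in this sketch.
    return element_type(f)


foo = Field("Foo", "repeated", "group", children=[
    Field("bar_a", "optional", "double"),
    Field("bar_b", "optional", "int64"),
    Field("bar_c", "optional", "int64"),
])
print(logical_type(foo))
```

Under this reading, a bare `repeated group` with several fields is a list of structs, so the number of fields in the group should not matter.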

There was a pull request in the parquet project to unify the style because of an incompatibility with Spark, but the suggestion there was to fix it upstream in Spark instead.
apache/parquet-java#253

Both Hive and Spark have no problem reading Parquet files created by parquet-protobuf.
I used Presto 0.143 (EMR) to do my testing.

Example schema of a Parquet file written by parquet-protobuf:

repeated group Foo {
  optional double bar_a;
  optional int64 bar_b;
  optional int64 bar_c;
}

The same schema rewritten to Parquet through Spark:

required group Foo (LIST) {
  repeated group list {
    required group element {
      optional double bar_a;
      optional int64 bar_b;
      optional int64 bar_c;
    }
  }
}
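The two schemas differ only in the wrapper groups around the element fields. As a rough illustration, the Python sketch below builds the three-level LIST form from a list name and its element fields. `wrap_as_list` is a hypothetical helper written for this example, not part of parquet-protobuf or Spark, and it only produces schema text.

```python
from typing import List


def wrap_as_list(name: str, element_fields: List[str], indent: str = "  ") -> str:
    """Render the three-level LIST layout for a list of structs.

    element_fields holds field declarations without the trailing
    semicolon, e.g. "optional double bar_a".
    """
    lines = [
        f"required group {name} (LIST) {{",
        f"{indent}repeated group list {{",
        f"{indent * 2}required group element {{",
    ]
    lines += [f"{indent * 3}{decl};" for decl in element_fields]
    lines += [f"{indent * 2}}}", f"{indent}}}", "}"]
    return "\n".join(lines)


schema = wrap_as_list("Foo", [
    "optional double bar_a",
    "optional int64 bar_b",
    "optional int64 bar_c",
])
print(schema)
```

The printed text has the same structure as the Spark-written schema: an annotated outer group, a repeated `list` group, and an `element` group holding the three fields.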


cberner · 2016-05-19T03:57:05Z

CC: @nezihyigitbasi

usmanm · 2017-08-20T18:01:16Z

Hey guys, any update on this? We're facing the same issue as described here: https://medium.com/hadoop-noob/presto-parquet-reader-fc7c333fc0a4

lumost · 2017-08-30T21:07:25Z

It looks like apache/parquet-java#411 will resolve this issue in parquet-protobuf

gaohao · 2017-08-31T00:53:25Z

Maybe we should add compatibility with this parquet-protobuf layout, since Spark supports it.

costimuraru · 2017-09-05T21:45:41Z

We've faced this exact issue in Presto (using AWS Athena). apache/parquet-java#411 fixes it by generating a Parquet schema that is compatible with the spec (it has the extra LIST/MAP wrappers). We got good feedback and have seen great results in Presto using the patch. We'll work on merging the PR.

stale · 2019-09-05T22:34:29Z

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.

stale bot added the stale label Sep 5, 2019

stale bot closed this as completed Sep 12, 2019
