Incorrect parsing for 'repeated group' in parquet format #5316

Closed
cleaton opened this issue May 19, 2016 · 6 comments

cleaton commented May 19, 2016

We are using parquet-protobuf to generate parquet files. Parquet-protobuf will use 'repeated group TypeName' to represent a list of structs.

Reading such a file in Presto fails with the error:
'Expected LIST column 'xxx.yyy' to only have one field, but has 5 fields'

https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetHiveRecordCursor.java#L915

Commit 5994e35 added a comment about how the Parquet spec defines the list type, but I think its interpretation is not entirely correct. The spec says:

A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field.

There was a pull request to unify the style in the Parquet project because of an incompatibility with Spark, but the suggestion there was to fix it in Spark instead:
apache/parquet-java#253

Both Hive and Spark have no problem reading Parquet files created by parquet-protobuf.
I used Presto 0.143 (EMR) to do my testing.

Example schema from a Parquet file written by parquet-protobuf:

repeated group Foo {
  optional double bar_a;
  optional int64 bar_b;
  optional int64 bar_c;
}

The same schema after rewriting the file to Parquet through Spark:

required group Foo (LIST) {
  repeated group list {
    required group element {
      optional double bar_a;
      optional int64 bar_b;
      optional int64 bar_c;
    }
  }
}
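
To make the reading of the spec above concrete, here is a minimal Python sketch of how a reader could resolve the list element type for both layouts. The `Node` class and `list_element` function are hypothetical names invented for this illustration; they are not part of Presto or parquet-java. The sketch assumes the caller only passes a bare repeated field when it is not nested inside a LIST- or MAP-annotated group.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, minimal stand-in for a Parquet schema node."""
    name: str
    repetition: str                    # "required", "optional" or "repeated"
    primitive: Optional[str] = None    # e.g. "int64"; None means a group
    annotation: Optional[str] = None   # e.g. "LIST" or "MAP"
    children: List["Node"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return self.primitive is None


def list_element(node: Node) -> Node:
    """Return the node that describes a single list element."""
    # Bare repeated field with no LIST/MAP annotation (the parquet-protobuf
    # layout): a required list of required elements whose type is the field.
    if node.repetition == "repeated" and node.annotation is None:
        return node

    if node.annotation != "LIST" or len(node.children) != 1:
        raise ValueError(f"{node.name}: not a list layout this sketch handles")
    repeated = node.children[0]
    if repeated.repetition != "repeated":
        raise ValueError(f"{node.name}: LIST group must contain a repeated field")

    # Backward-compatibility rules from the Parquet spec for LIST groups.
    if not repeated.is_group:
        return repeated                      # repeated primitive is the element
    if len(repeated.children) != 1:
        return repeated                      # multi-field group is the element
    if repeated.name in ("array", node.name + "_tuple"):
        return repeated                      # legacy two-level naming
    return repeated.children[0]              # standard three-level layout


def bar_fields() -> List[Node]:
    return [
        Node("bar_a", "optional", primitive="double"),
        Node("bar_b", "optional", primitive="int64"),
        Node("bar_c", "optional", primitive="int64"),
    ]


# The two schemas from this issue, expressed with the hypothetical Node class.
proto_style = Node("Foo", "repeated", children=bar_fields())
spark_style = Node(
    "Foo", "required", annotation="LIST",
    children=[Node("list", "repeated",
                   children=[Node("element", "required", children=bar_fields())])],
)

proto_names = [c.name for c in list_element(proto_style).children]
spark_names = [c.name for c in list_element(spark_style).children]
```

Under this reading both layouts resolve to an element struct with the same three fields, which is the behaviour I would expect from Presto as well.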

cberner (Contributor) commented May 19, 2016

CC: @nezihyigitbasi

usmanm commented Aug 20, 2017

Hey guys, any update on this? We're facing the same issue as described here: https://medium.com/hadoop-noob/presto-parquet-reader-fc7c333fc0a4

lumost commented Aug 30, 2017

It looks like apache/parquet-java#411 will resolve this issue in parquet-protobuf

gaohao commented Aug 31, 2017

Maybe we should add compatibility with this parquet-protobuf layout, since Spark supports it.

@costimuraru

We've hit exactly this issue in Presto (using AWS Athena). apache/parquet-java#411 fixes it by generating a Parquet schema that is compatible with the spec (it adds the extra LIST/MAP wrappers). We got some good feedback and have seen great results in Presto using the patch. We'll work on getting the PR merged.
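
As a rough illustration of what "the extra LIST wrapper" means, the Python sketch below renders the three-level form for a repeated group. This is not the code from the PR; `wrap_as_list` is a hypothetical helper, and the output shape simply mirrors the Spark-written schema shown earlier in this thread.

```python
from typing import List


def wrap_as_list(name: str, element_fields: List[str], indent: str = "  ") -> str:
    """Render a three-level LIST schema for a repeated group called `name`.

    `element_fields` are already-formatted field declarations such as
    "optional int64 bar_b;". Hypothetical helper for illustration only.
    """
    lines = [
        f"required group {name} (LIST) {{",
        f"{indent}repeated group list {{",
        f"{indent * 2}required group element {{",
    ]
    lines += [f"{indent * 3}{decl}" for decl in element_fields]
    lines += [f"{indent * 2}}}", f"{indent}}}", "}"]
    return "\n".join(lines)


wrapped = wrap_as_list(
    "Foo",
    ["optional double bar_a;", "optional int64 bar_b;", "optional int64 bar_c;"],
)
print(wrapped)
```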

stale bot commented Sep 5, 2019

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.

stale bot added the stale label Sep 5, 2019
stale bot closed this as completed Sep 12, 2019