Incorrect parsing for 'repeated group' in parquet format #5316
Comments
CC: @nezihyigitbasi
Hey guys, any update on this? We're facing the same issue as described here: https://medium.com/hadoop-noob/presto-parquet-reader-fc7c333fc0a4
It looks like apache/parquet-java#411 will resolve this issue in parquet-protobuf.
Maybe we should add compatibility with this parquet-protobuf output, since Spark supports it.
We've faced the exact same issue in Presto (using AWS Athena). apache/parquet-java#411 fixes it by generating a Parquet schema that is compatible with the spec (it has the extra LIST/MAP wrappers). We got some good feedback and have seen some great results in Presto using the patch. We'll work on merging the PR.
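To make "the extra LIST/MAP wrappers" concrete, here is a rough Python sketch of that rewrite for the list case. It is not the code from apache/parquet-java#411: the `Field` class and `wrap_as_list` are hypothetical names, and the choice of `required` for the outer group and the element is an assumption based on the spec's reading of an un-annotated repeated field as a required list of required elements. The inner names `list` and `element` are the ones the Parquet spec uses for the three-level layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical, simplified stand-in for a Parquet schema node."""
    name: str
    repetition: str                      # "required", "optional" or "repeated"
    annotation: Optional[str] = None     # e.g. "LIST"; None when un-annotated
    children: List["Field"] = field(default_factory=list)  # empty for primitives


def wrap_as_list(f: Field) -> Field:
    """Rewrite a bare `repeated group X {...}` into the three-level LIST layout.

    Result shape (assumed repetitions, see the note above):
        required group X (LIST) {
          repeated group list {
            required group element {...}
          }
        }
    """
    if f.repetition != "repeated":
        raise ValueError(f"'{f.name}' is not a repeated field")
    element = Field("element", "required", children=list(f.children))
    repeated = Field("list", "repeated", children=[element])
    return Field(f.name, "required", annotation="LIST", children=[repeated])


# A list of structs the way parquet-protobuf writes it: one repeated group.
bare = Field("yyy", "repeated",
             children=[Field("a", "optional"), Field("b", "optional")])
wrapped = wrap_as_list(bare)
```

After the rewrite, the LIST-annotated group has exactly one child (the repeated `list` group), which is the shape a strict reader expects.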
This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.
We are using parquet-protobuf to generate Parquet files. parquet-protobuf uses 'repeated group TypeName' to represent a list of structs.
Reading such a file in Presto causes this error:
'Expected LIST column 'xxx.yyy' to only have one field, but has 5 fields'
https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetHiveRecordCursor.java#L915
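For readers following along, here is a minimal Python sketch of the disagreement. It is not Presto's code: the `Field` class, `strict_list_element`, and `lenient_list_element` are hypothetical names, and the strict check only mimics the shape of the error above. The lenient rule follows the Parquet spec's backward-compatibility note that a repeated field which is neither annotated with LIST or MAP nor contained in such a group should be read as a required list of required elements, with the field's own type as the element type.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical, simplified stand-in for a Parquet schema node."""
    name: str
    repetition: str                      # "required", "optional" or "repeated"
    annotation: Optional[str] = None     # e.g. "LIST"; None when un-annotated
    children: List["Field"] = field(default_factory=list)  # empty for primitives


def strict_list_element(list_group: Field) -> Field:
    """Strict reader: a list group must wrap exactly one (repeated) child."""
    if len(list_group.children) != 1:
        raise ValueError(
            f"Expected LIST column '{list_group.name}' to only have one field, "
            f"but has {len(list_group.children)} fields"
        )
    return list_group.children[0]


def lenient_list_element(f: Field) -> Field:
    """Lenient reader: an un-annotated repeated field is its own element type."""
    if f.repetition == "repeated" and f.annotation is None:
        return f  # list of structs: each repetition is one struct
    return strict_list_element(f)


# parquet-protobuf style: `repeated group yyy { ...5 fields... }`
yyy = Field("yyy", "repeated",
            children=[Field(f"f{i}", "optional") for i in range(5)])

# The lenient rule yields the 5-field struct itself as the element type,
# while strict_list_element(yyy) would raise ValueError.
element = lenient_list_element(yyy)
```

Under the strict rule the five struct fields are mistaken for the contents of a list wrapper, which is where the "one field, but has 5 fields" complaint comes from.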
Commit 5994e35 added a comment about how the Parquet spec defines the list type, but I think its interpretation is not entirely correct. The spec mentions:
There was a pull request to unify the style in the parquet project because of an incompatibility with Spark, but they suggested fixing it upstream in Spark instead:
apache/parquet-java#253
Both Hive and Spark have no problem reading Parquet files created by parquet-protobuf.
I used Presto 0.143 (EMR) to do my testing.
Example Parquet file written by parquet-protobuf:
The same schema rewritten to Parquet through Spark: