forked from apache/parquet-java
Parquet packaging problem; upgrade Parquet to 1.12 #2
hn5092 added a commit that referenced this issue Apr 25, 2019
yabola pushed a commit that referenced this issue May 18, 2023
This PR adds Hive (https://github.com/apache/hive) and Presto (https://github.com/prestodb/presto) support for parquet messages written with ProtoParquetWriter. Hive and other tools, such as Presto (used by AWS Athena), rely on specific LIST/MAP wrappers (as defined in the parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). These wrappers are currently missing from the ProtoParquet schema. AvroParquet works just fine, because it adds these wrappers when it deals with arrays and maps. This PR brings these wrappers to parquet-proto, providing the same functionality that already exists in parquet-avro.

This is backward compatible. Messages written without the extra LIST/MAP wrappers can still be read successfully with the updated ProtoParquetReader.

Regarding the change, given the following protobuf schema:

```
message ListOfPrimitives {
  repeated int64 my_repeated_id = 1;
}
```

Old parquet schema was:

```
message ListOfPrimitives {
  repeated int64 my_repeated_id = 1;
}
```

New parquet schema is:

```
message ListOfPrimitives {
  required group my_repeated_id (LIST) = 1 {
    repeated group list {
      required int64 element;
    }
  }
}
```

---

For a list of messages, the changes look like this.

Protobuf schema:

```
message ListOfMessages {
  string top_field = 1;
  repeated MyInnerMessage first_array = 2;
}

message MyInnerMessage {
  int32 inner_field = 1;
}
```

Old parquet schema was:

```
message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  repeated group first_array = 2 {
    optional int32 inner_field = 1;
  }
}
```

The expected parquet schema, compatible with Hive (and similar to parquet-avro), is the following (notice the `LIST` wrapper):

```
message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  required group first_array (LIST) = 2 {
    repeated group list {
      optional group element {
        optional int32 inner_field = 1;
      }
    }
  }
}
```

---

Maps are handled similarly.

Protobuf schema:

```
message TopMessage {
  map<int64, MyInnerMessage> myMap = 1;
}

message MyInnerMessage {
  int32 inner_field = 1;
}
```

Old parquet schema:

```
message TestProto3.TopMessage {
  repeated group myMap = 1 {
    optional int64 key = 1;
    optional group value = 2 {
      optional int32 inner_field = 1;
    }
  }
}
```

New parquet schema (notice the `MAP` wrapper):

```
message TestProto3.TopMessage {
  required group myMap (MAP) = 1 {
    repeated group key_value {
      required int64 key;
      optional group value {
        optional int32 inner_field = 1;
      }
    }
  }
}
```

Jira: https://issues.apache.org/jira/browse/PARQUET-968

Author: Constantin Muraru <[email protected]>
Author: Benoît Hanotte <[email protected]>

Closes apache#411 from costimuraru/PARQUET-968 and squashes the following commits:

16eafcb [Benoît Hanotte] PARQUET-968 add proto flag to enable writing using specs-compliant schemas (#2)
a8bd704 [Constantin Muraru] Pick up commit from @andredasilvapinto
5cf9248 [Constantin Muraru] PARQUET-968 Add Hive support in ProtoParquet
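The wrapping rule described in the commit message can be illustrated with a small, standalone sketch. This is not parquet-proto code: the helper names (`list_of_primitives`, `list_of_messages`, `map_of_messages`) are hypothetical, and the functions only render schema text in the shape shown in the examples above, assuming the same `required`/`optional` choices those examples use.

```python
import textwrap


def _block(header, body_lines):
    """Render `header { ... }` with the body indented by two spaces."""
    body = textwrap.indent("\n".join(body_lines), "  ")
    return f"{header} {{\n{body}\n}}"


def list_of_primitives(name, primitive, field_id):
    """LIST wrapper for a repeated primitive field (hypothetical helper)."""
    inner = _block("repeated group list", [f"required {primitive} element;"])
    return _block(f"required group {name} (LIST) = {field_id}", [inner])


def list_of_messages(name, inner_fields, field_id):
    """LIST wrapper for a repeated message field (hypothetical helper)."""
    element = _block("optional group element", inner_fields)
    inner = _block("repeated group list", [element])
    return _block(f"required group {name} (LIST) = {field_id}", [inner])


def map_of_messages(name, key_type, inner_fields, field_id):
    """MAP wrapper for a map whose values are messages (hypothetical helper)."""
    value = _block("optional group value", inner_fields)
    key_value = _block(
        "repeated group key_value", [f"required {key_type} key;", value]
    )
    return _block(f"required group {name} (MAP) = {field_id}", [key_value])


if __name__ == "__main__":
    print(list_of_primitives("my_repeated_id", "int64", 1))
    print(list_of_messages("first_array", ["optional int32 inner_field = 1;"], 2))
    print(map_of_messages("myMap", "int64", ["optional int32 inner_field = 1;"], 1))
```

Each helper nests the original field inside one extra `repeated group` (`list` or `key_value`), which is the structural difference between the old and new schemas.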
Parquet shades it.unimi.dsi:fastutil under different prefixes, which causes a class-not-found error at query time.
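One way to confirm this kind of mismatch is to list the package prefixes under which fastutil classes actually appear in each jar on the classpath. The script below is a minimal diagnostic sketch, not part of Parquet. It assumes the jars are passed on the command line, and `shaded_prefixes` and `report` are hypothetical helper names.

```python
import sys
import zipfile
from collections import defaultdict

# Path form of the it.unimi.dsi.fastutil package inside a jar.
MARKER = "it/unimi/dsi/fastutil/"


def shaded_prefixes(jar_path, marker=MARKER):
    """Return the package prefixes under which fastutil classes appear in a jar."""
    prefixes = set()
    with zipfile.ZipFile(jar_path) as jar:
        for entry in jar.namelist():
            idx = entry.find(marker)
            if idx >= 0 and entry.endswith(".class"):
                prefix = entry[:idx].rstrip("/").replace("/", ".")
                prefixes.add(prefix or "<unshaded>")
    return prefixes


def report(jar_paths):
    """Map each prefix to the jars that contain fastutil under that prefix."""
    by_prefix = defaultdict(list)
    for path in jar_paths:
        for prefix in shaded_prefixes(path):
            by_prefix[prefix].append(path)
    return dict(by_prefix)


if __name__ == "__main__":
    for prefix, jars in sorted(report(sys.argv[1:]).items()):
        print(prefix, "->", ", ".join(jars))
```

If the output shows more than one prefix across the Parquet jars, code compiled against one relocated package name will not find the classes relocated under the other, which matches the symptom described here.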