PARQUET-968 Add Hive/Presto support in ProtoParquet #1
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR adds Hive (https://github.com/apache/hive) and Presto (https://github.com/prestodb/presto) support for parquet messages written with ProtoParquetWriter. Hive and other tools, such as Presto (used by AWS Athena), rely on specific LIST/MAP wrappers (as defined in the parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). These wrappers are currently missing from the ProtoParquet schema. AvroParquet works just fine, because it adds these wrappers when it deals with arrays and maps. This PR brings these wrappers in parquet-proto, providing the same functionality that already exists in parquet-avro.
This is backward compatible. Messages written without the extra LIST/MAP wrappers are still being read successfully using the updated ProtoParquetReader.
Regarding the change.
Given the following protobuf schema:
Old parquet schema was:
New parquet schema is:
For list of messages, the changes look like this:
Protobuf schema:
Old parquet schema was:
The expected parquet schema, compatible with Hive (and similar to parquet-avro) is the following (notice the LIST wrapper):
Similar for maps. Protobuf schema:
Old parquet schema:
New parquet schema (notice the
MAP
wrapper):Jira: https://issues.apache.org/jira/browse/PARQUET-968
Author: Constantin Muraru [email protected]
Author: Benoît Hanotte [email protected]
Closes apache#411 from costimuraru/PARQUET-968 and squashes the following commits:
16eafcb [Benoît Hanotte] PARQUET-968 add proto flag to enable writing using specs-compliant schemas (#2)
a8bd704 [Constantin Muraru] Pick up commit from @andredasilvapinto
5cf9248 [Constantin Muraru] PARQUET-968 Add Hive support in ProtoParquet