-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-968 Add Hive/Presto support in ProtoParquet #411
Conversation
Hello @costimuraru
with
not with
as described in documentation https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists |
Hi @kgalieva, You raise a good point. What I had in mind was to make it similar to what parquet-avro is doing, like you can see here: https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java#L82 Cheers, |
Here are examples of how Spark and Hive handle repeated fields.
Hive
Both are compliant with the specification. |
Hi @kgalieva, I've spent the last couple of hours trying to add the inner layer for primitive values, but the changes needed to support this are quite involved. However, looking again over the spec, it says this:
This is exactly the same as what avro and this PR is producing. (By the way, this format is working perfect with Hive and Presto - tested on our own data set, with a massive protobuf schema (40+ fields)). Also, the spec does not mention a best practice for this use case: List<Tuple<String, Integer>>.
Clearly we can't just use |
Hi @costimuraru and @kgalieva: great to see design discussions happening :)
CC: @rdblue |
@julienledem, @kgalieva, I've made the changes so that the resulting parquet schema now follows the spec. @julienledem, I've also made the LIST See the changes in ProtoSchemaConverterTest.java Updated the pull request description to reflect the schema changes. |
@julienledem @costimuraru |
@matt-martin @lukasnalezenec @costimuraru @qinghui-xu @kgalieva this looks great. What is left to resolve before we can merge this? I'm not using parquet-proto myself at the moment but I'm happy to organize a google hangout if that helps getting to a resolution. |
@julienledem sounds good. I think this is ready for merge. |
8e6f8d1
to
28837b3
Compare
I've also encountered this issue with ProtoParquet formatted parquet files. Is it possible for this to be merged in the near future? Also happy to pitch in on any outstanding items @costimuraru @julienledem |
I had problems with this when using Proto3. I've done a few changes to get it to work. I also noticed the field names and structures for lists and maps don't conform with the official representation here ( https://github.com/apache/parquet-format/blob/master/LogicalTypes.md ), but with one of the backward compatible ones. This was not enough to get us to read the Parquet files in all our tech stack (Spark, Hive, Athena/Presto), so I also changed that. I might be able to push my changes sometime soon. |
Hey Andre. There are multiple commits on this PR, including ones that make the schema compatible with the spec defined at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md including for lists and maps. We've been using this patch to produce proto3-parquet files that we're feeding to Athena for quite some time, with a schema containing 40 fields including maps, lists and inner groups and with 10s of TB of data. |
Ah no. I did this ~ 3 weeks ago. Nice that you fixed it ;) |
You didn't fix the lists representation though. The documentation says:
Your single field is not named element. Also why did you pick In Protocol Buffers 3 everything is Optional. |
Thanks for the input, @andredasilvapinto.
Again, with one of the latest commits, the proto-parquet list should look something like this:
Are you not seeing this with the latest version on this patch?
It's a good question! In the spec it says:
AFAIK, protobuf does not have lists/maps that are null. In fact "it makes no distinction between an empty list and a null list." - so I think it doesn't matter what the repetion is here. I tried it with Repetion.REQUIRED and it worked fine even without adding any values to the protobuf map. If you know otherwise, feedback is appreciated. |
You are only doing it for primitive types: https://github.com/apache/parquet-mr/pull/411/files#diff-3b093ba1a3c729ad39bd47b0c148a586R298 even your tests show it: I went with OPTIONAL because in Proto3 there is no REQUIRED, so I thought that an optional parquet field was a more adequate type to represent the equivalent Protobuf 3 optional field. |
If you find it useful, these were the changes I did on top of your last "Implement review" commit: Some of the changes are already present on your latest commit. We have been running this for dozens of different data sets (Protobuf 3) for a few weeks already without any known problems. |
You raise an interesting point, @andredasilvapinto! Case 1.
Ideally, I would like to be able to query each individual sub-field (someId/otherId) individually, in Athena or Hive. Something like this:
Where the table (Presto) would look something like this (notice the
This works fine with the current version. The current parquet schema looks like this:
My question here would be: where should the Case 2.
Again, ideally I would like to be able to select that field specifically:
And have a CREATE table with a struct containing one field:
However... this does not work! I'm getting a parquet parsing error in Presto ( It works however when I change the CREATE table to remove the
Though I'm left with no way of querying the
Which will return an If this is the desired behavior, then indeed @andredasilvapinto, we can have message TestProto3.MyTopMessage { What do you think? Later Edit: Ah, I see in your commit (andredasilvapinto@dfa9701) what should be done here. It should actually be:
The same goes for above. Nice catch! I'll give this a try and will reply. |
@andredasilvapinto, you were right! After adding the extra |
Nice @costimuraru. No problem with the "copyright". Just as long as this gets merged I'm happy (one less reason to keep our internal parquet-mr fork!). cheers! |
Are there any efforts currently being made in order to merge this to master? |
Just noticed that this doesn't write the values of Protobuf fields that are equal to their default values. This happens because in Protobuf3, setting a field to its default value is equivalent to clearing the field. Therefore the conversion to Parquet needs to take that in consideration. |
Hi when I use this patch, this need protoc3(install Protobuf 3.4.0 ), if the same case on protoc2(Protobuf 2.5.0) , have other solution? |
@costimuraru @julienledem |
seconding @qinghui-xu's comment, we've had a version of this Patch in production for nearly 6 months now. |
Hello @lukasnalezenec, have you had time to have a look? Thanks |
Hi, I already did. |
This looks good. |
When can this be expected to be merged to master and released? |
@BenoitHanotte @costimuraru @julienledem There is no way to instantiate ProtoParquetWriter with parquet.proto.writeSpecsCompliant flag enabled. Am I missing something or is this intentional? It would have been great if a constructor to enable the flag was provided.
|
@chawlakunal you can manually create your ParquetWriter by providing the ProtoWriteSupport as following:
(or any variation of this as ParquetWriter has multiple constructors that accept a configuration in which we can set the flag) |
@BenoitHanotte That's exactly how I am using it right now but it kind of defeats the purpose of having ProtoParquetWriter class. |
@BenoitHanotte Is there a timeline of when this will be released? |
I believe the 1.10.0 has just been released, so this will liekly land in the next "major" release, unfortunately I am not aware of any plan to have a new release in the near future. For the |
@BenoitHanotte Here's the PR for constructor with flag #473 Also, if a minor release can be done for this fix it will be greatly appreciated. |
@chawlakunal I had a look at your PR (#473) , it looks good, there is just a comment that I believe needs to be changed (regarding the block size). |
This PR adds Hive (https://github.com/apache/hive) and Presto (https://github.com/prestodb/presto) support for parquet messages written with ProtoParquetWriter. Hive and other tools, such as Presto (used by AWS Athena), rely on specific LIST/MAP wrappers (as defined in the parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). These wrappers are currently missing from the ProtoParquet schema. AvroParquet works just fine, because it adds these wrappers when it deals with arrays and maps. This PR brings these wrappers in parquet-proto, providing the same functionality that already exists in parquet-avro. This is backward compatible. Messages written without the extra LIST/MAP wrappers are still being read successfully using the updated ProtoParquetReader. Regarding the change. Given the following protobuf schema: ``` message ListOfPrimitives { repeated int64 my_repeated_id = 1; } ``` Old parquet schema was: ``` message ListOfPrimitives { repeated int64 my_repeated_id = 1; } ``` New parquet schema is: ``` message ListOfPrimitives { required group my_repeated_id (LIST) = 1 { repeated group list { required int64 element; } } } ``` --- For list of messages, the changes look like this: Protobuf schema: ``` message ListOfMessages { string top_field = 1; repeated MyInnerMessage first_array = 2; } message MyInnerMessage { int32 inner_field = 1; } ``` Old parquet schema was: ``` message TestProto3.ListOfMessages { optional binary top_field (UTF8) = 1; repeated group first_array = 2 { optional int32 inner_field = 1; } } ``` The expected parquet schema, compatible with Hive (and similar to parquet-avro) is the following (notice the LIST wrapper): ``` message TestProto3.ListOfMessages { optional binary top_field (UTF8) = 1; required group first_array (LIST) = 2 { repeated group list { optional group element { optional int32 inner_field = 1; } } } } ``` --- Similar for maps. Protobuf schema: ``` message TopMessage { map<int64, MyInnerMessage> myMap = 1; } message MyInnerMessage { int32 inner_field = 1; } ``` Old parquet schema: ``` message TestProto3.TopMessage { repeated group myMap = 1 { optional int64 key = 1; optional group value = 2 { optional int32 inner_field = 1; } } } ``` New parquet schema (notice the `MAP` wrapper): ``` message TestProto3.TopMessage { required group myMap (MAP) = 1 { repeated group key_value { required int64 key; optional group value { optional int32 inner_field = 1; } } } } ``` Jira: https://issues.apache.org/jira/browse/PARQUET-968 Author: Constantin Muraru <[email protected]> Author: Benoît Hanotte <[email protected]> Closes apache#411 from costimuraru/PARQUET-968 and squashes the following commits: 16eafcb [Benoît Hanotte] PARQUET-968 add proto flag to enable writing using specs-compliant schemas (#2) a8bd704 [Constantin Muraru] Pick up commit from @andredasilvapinto 5cf9248 [Constantin Muraru] PARQUET-968 Add Hive support in ProtoParquet
Hi @andredasilvapinto , Do you have any progress regarding the default value is not persisted in parquet? It is quite an annoying bug when read enum value as ['null', 'Type1', 'Type2'], not ['Type0', 'Type1', 'Type2'] |
Yes, if you look at the commit I linked to several months ago
costimuraru@9a4c016
it contains that flag to decide whether to print the default values or not.
…On Tue, Oct 30, 2018, 08:21 CHuAn ***@***.***> wrote:
Yes. The way I solved it was to add a flag to ProtoWriteSupport to define
whether to include default values or not. If set to true I always set empty
fields to their default protobuf values (except one of fields).
I can share the commit if people are interested.
Hi @andredasilvapinto <https://github.com/andredasilvapinto> , Do you
have any progress regarding the default value is not persisted in parquet?
It is quite an annoying bug when read enum value as ['null', 'Type1',
'Type2'], not ['Type0', 'Type1', 'Type2']
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#411 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ACjD6U3beSyBsjmTkF6EGtxHV8jdpEPCks5uqAwHgaJpZM4NMYWi>
.
|
ok, it is on your personal branch, great patch! when will it merge to master? any plan? |
I have no idea. I think a few changes were already made to the base version during this approval process. |
Although it's closed but I'm a bit confused... why I always get the old schema version? I'd highly appreciate if someone could point out something stupid in my code! Or it's the same issue you are experiencing? My goal is to be able query data via Athena/Presto, or Hive Metastore, so need the the new parquet schema version. Method 1:
Method 2:
Parquet output Metadata: Protobuf Messasge: ` package AddressBookProtos; option java_multiple_files = true; message Person { enum PhoneType { message PhoneNumber { repeated PhoneNumber phones = 4; message AddressBook { |
Was it merged ? facing same issue with 1.13.0(latest) |
I believe it was merged: f849384 |
@wgtmac thanks for reply. It means version 1.13.0 with proto3 should work, right ? |
It was merged long ago and I don't have any context about it. If this is the fix to the issue that you have seen, then yes it should not appear in 1.13.0. |
i am using proto2, could it be a reason ? |
Probably that's the reason. parquet-proto requires proto 3 as dependency. |
@qinghui-xu i am using for parquet-proto with proto2 for few years, but question is to take the changes of this pr in consideration do i need to upgrade to proto3 or this solution should work with proto2 as well ? |
@ccpstephanie was you able to resolve ? |
This PR adds Hive (https://github.com/apache/hive) and Presto (https://github.com/prestodb/presto) support for parquet messages written with ProtoParquetWriter. Hive and other tools, such as Presto (used by AWS Athena), rely on specific LIST/MAP wrappers (as defined in the parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). These wrappers are currently missing from the ProtoParquet schema. AvroParquet works just fine, because it adds these wrappers when it deals with arrays and maps. This PR brings these wrappers in parquet-proto, providing the same functionality that already exists in parquet-avro.
This is backward compatible. Messages written without the extra LIST/MAP wrappers are still being read successfully using the updated ProtoParquetReader.
Regarding the change.
Given the following protobuf schema:
Old parquet schema was:
New parquet schema is:
For list of messages, the changes look like this:
Protobuf schema:
Old parquet schema was:
The expected parquet schema, compatible with Hive (and similar to parquet-avro) is the following (notice the LIST wrapper):
Similar for maps. Protobuf schema:
Old parquet schema:
New parquet schema (notice the
MAP
wrapper):Jira: https://issues.apache.org/jira/browse/PARQUET-968