PARQUET-968 Add Hive/Presto support in ProtoParquet #411

Closed
wants to merge 3 commits into from

@costimuraru costimuraru commented Apr 29, 2017

This PR adds Hive (https://github.com/apache/hive) and Presto (https://github.com/prestodb/presto) support for Parquet messages written with ProtoParquetWriter. Hive and other tools, such as Presto (used by AWS Athena), rely on specific LIST/MAP wrappers (as defined in the Parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). These wrappers are currently missing from the ProtoParquet schema. AvroParquet works fine because it adds these wrappers when it deals with arrays and maps. This PR brings the same wrappers to parquet-proto, matching the functionality that already exists in parquet-avro.

This is backward compatible. Messages written without the extra LIST/MAP wrappers can still be read successfully with the updated ProtoParquetReader.

Regarding the change.
Given the following protobuf schema:

message ListOfPrimitives {
    repeated int64 my_repeated_id = 1;
}

Old parquet schema was:

message ListOfPrimitives {
  repeated int64 my_repeated_id = 1;
}

New parquet schema is:

message ListOfPrimitives {
  required group my_repeated_id (LIST) = 1 {
    repeated group list {
      required int64 element;
    }
  }
}
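
To make the wrapping rule concrete, here is a minimal Java sketch that renders the three-level LIST structure shown above for a repeated primitive field. It is illustrative only: `ListWrapperSketch` and `wrapRepeatedPrimitive` are hypothetical names, the sketch works on plain strings, and it is not the schema converter code from this PR.

```java
// Illustrative only: hypothetical helper, not the PR's ProtoSchemaConverter.
final class ListWrapperSketch {

    /** Renders the three-level LIST wrapper for a repeated primitive field. */
    static String wrapRepeatedPrimitive(String fieldName, String primitiveType, int fieldId) {
        StringBuilder sb = new StringBuilder();
        // Outer level: a required group annotated with LIST, keeping the proto field id.
        sb.append("required group ").append(fieldName)
          .append(" (LIST) = ").append(fieldId).append(" {\n");
        // Middle level: the repeated group named "list".
        sb.append("  repeated group list {\n");
        // Inner level: the single field named "element".
        sb.append("    required ").append(primitiveType).append(" element;\n");
        sb.append("  }\n");
        sb.append("}");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrapRepeatedPrimitive("my_repeated_id", "int64", 1));
    }
}
```

Calling `wrapRepeatedPrimitive("my_repeated_id", "int64", 1)` reproduces the `my_repeated_id` group from the new schema above.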

For a list of messages, the changes look like this:

Protobuf schema:

message ListOfMessages {
    string top_field = 1;
    repeated MyInnerMessage first_array = 2;
}

message MyInnerMessage {
    int32 inner_field = 1;
}

Old parquet schema was:

message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  repeated group first_array = 2 {
    optional int32 inner_field = 1;
  }
}

The expected parquet schema, compatible with Hive (and similar to parquet-avro) is the following (notice the LIST wrapper):

message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  required group first_array (LIST) = 2 {
    repeated group list {
      optional group element {
        optional int32 inner_field = 1;
      }
    }
  }
}

Similar for maps. Protobuf schema:

message TopMessage {
    map<int64, MyInnerMessage> myMap = 1;
}

message MyInnerMessage {
    int32 inner_field = 1;
}

Old parquet schema:

message TestProto3.TopMessage {
  repeated group myMap = 1 {
    optional int64 key = 1;
    optional group value = 2 {
      optional int32 inner_field = 1;
    }
  }
}

New parquet schema (notice the MAP wrapper):

message TestProto3.TopMessage {
  required group myMap (MAP) = 1 {
    repeated group key_value {
      required int64 key;
      optional group value {
        optional int32 inner_field = 1;
      }
    }
  }
}

Jira: https://issues.apache.org/jira/browse/PARQUET-968

@kgalieva
Contributor

kgalieva commented May 5, 2017

Hello @costimuraru
Could you please clarify why you decided to replace

repeated int32 repeatedPrimitive = 3;

with

required group repeatedPrimitive (LIST) = 3 {
    repeated int32 array;
 }

not with

optional group repeatedPrimitive (LIST) {
 repeated group list {
   optional int32 element;
 }
}

as described in documentation https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

@costimuraru
Author

Hi @kgalieva,

You raise a good point. What I had in mind was to make it similar to what parquet-avro is doing, like you can see here: https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java#L82
But indeed, it seems the right approach to always add the list wrapper.

Cheers,
Costi

@kgalieva
Contributor

kgalieva commented May 5, 2017

Here are examples of how Spark and Hive handle repeated fields.
Spark:

 optional group repeatedPrimitive (LIST) {
    repeated group list {
      optional int32 element;
    }
  }

Hive

 optional group repeatedPrimitive (LIST) {
    repeated group bag {
      optional int32 array_element;
    }
  }

Both are compliant with the specification.
Would you consider implementing it the Spark/Hive way?

@costimuraru
Author

costimuraru commented May 5, 2017

Hi @kgalieva,

I've spent the last couple of hours trying to add the inner layer for primitive values, but the changes needed to support this are quite involved.

However, looking again over the spec, it says this:

Backward-compatibility rules
[...] Some existing data does not include the inner element layer. [...] 
Examples that can be interpreted using these rules:

// List<Integer> (nullable list, non-null elements)
optional group my_list (LIST) {
  repeated int32 element;
}

This is exactly the same as what parquet-avro and this PR are producing. (By the way, this format works perfectly with Hive and Presto; it was tested on our own data set, with a massive protobuf schema (40+ fields).)

Also, the spec does not mention a best practice for this use case: List<Tuple<String, Integer>>.
Specifically in protobuf:

message ListOfMessages {
    repeated MyInnerMessage my_array = 1;
}

message MyInnerMessage {
    string field1 = 1;
    int32 field2 = 2;
}

Clearly we can't just use element here, since we have two of them (field1/field2).

@julienledem
Member

julienledem commented May 12, 2017

Hi @costimuraru and @kgalieva: great to see design discussions happening :)
If you want to systematically have a 3-level Parquet list, here are some hints:

  • example 1:
    Proto: (note that this list cannot be null and does not contain nulls)
    repeated int32 repeatedPrimitive = 3;
    should map to (almost what @kgalieva was saying, just with required added where it cannot be null)
    required group repeatedPrimitive (LIST) {
      repeated group list {
        required int32 element;
      }
    }
  • example 2:
    repeated MyInnerMessage my_array = 1;
    In this case element is just of type MyInnerMessage
    required group my_array (LIST) {
      repeated group list {
        required MyInnerMessage element;
      }
    }

CC: @rdblue

@costimuraru
Author

costimuraru commented May 18, 2017

@julienledem, @kgalieva, I've made the changes so that the resulting parquet schema now follows the spec.

@julienledem, I've also made the LIST required and the element is also required now.

See the changes in ProtoSchemaConverterTest.java

Updated the pull request description to reflect the schema changes.

@qinghui-xu
Contributor

qinghui-xu commented May 30, 2017

@julienledem @costimuraru
This PR would be quite interesting for our use cases if the wrapper could be defined as optional in the Parquet schema, because we need to distinguish whether a list is null or empty.
If optional is used on the first level, the list is nullable (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists). Taking the first example:

optional group repeatedPrimitive (LIST) {
  repeated group list {
    required int32 element;
  }
}

This way, we have an optional list containing non-null integers.
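
A rough way to see why the outer repetition matters is to look at definition levels. The sketch below assumes the usual Dremel-style encoding, where each optional or repeated field on the column path adds one definition level; `ListNullabilitySketch` and `definitionLevels` are hypothetical names, and this is a simplified model for a single record, not parquet-mr's writer.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the column <outer>.list.element for one record.
final class ListNullabilitySketch {

    /**
     * Returns the definition levels emitted for one list value.
     * outerOptional = true models "optional group ... (LIST)",
     * false models "required group ... (LIST)". The element is required.
     */
    static List<Integer> definitionLevels(List<Integer> list, boolean outerOptional) {
        int outer = outerOptional ? 1 : 0; // an optional outer group adds one level
        List<Integer> levels = new ArrayList<>();
        if (list == null) {
            if (!outerOptional) {
                throw new IllegalArgumentException("a required list cannot be null");
            }
            levels.add(0);              // outer group itself is undefined
        } else if (list.isEmpty()) {
            levels.add(outer);          // outer group defined, no repeated entry
        } else {
            for (int i = 0; i < list.size(); i++) {
                levels.add(outer + 1);  // repeated "list" group defined, element present
            }
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(definitionLevels(null, true));
        System.out.println(definitionLevels(new ArrayList<>(), true));
        System.out.println(definitionLevels(new ArrayList<>(), false));
    }
}
```

With an optional outer group, a null list and an empty list get different levels (0 and 1). With a required outer group, level 0 can only mean "empty", so null cannot be represented.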

@julienledem
Member

@matt-martin @lukasnalezenec @costimuraru @qinghui-xu @kgalieva this looks great. What is left to resolve before we can merge this? I'm not using parquet-proto myself at the moment but I'm happy to organize a google hangout if that helps getting to a resolution.

@costimuraru
Author

@julienledem sounds good. I think this is ready for merge.

@lumost

lumost commented Aug 30, 2017

I've also encountered this issue with ProtoParquet formatted parquet files. Is it possible for this to be merged in the near future? Also happy to pitch in on any outstanding items @costimuraru @julienledem

@andredasilvapinto

andredasilvapinto commented Sep 1, 2017

I had problems with this when using Proto3. I made a few changes to get it to work. I also noticed that the field names and structures for lists and maps don't conform to the official representation here ( https://github.com/apache/parquet-format/blob/master/LogicalTypes.md ), but to one of the backward-compatible ones. This was not enough to let us read the Parquet files across our whole tech stack (Spark, Hive, Athena/Presto), so I also changed that.

I might be able to push my changes sometime soon.

@costimuraru
Author

costimuraru commented Sep 1, 2017

Hey Andre. There are multiple commits on this PR, including ones that make the schema compatible with the spec defined at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md including for lists and maps.
For instance the last commit 28837b3 makes it compatible with Spark (which was tested and validated by AWS).
Are you using the latest version here?

We've been using this patch to produce proto3-parquet files that we're feeding to Athena for quite some time, with a schema containing 40 fields including maps, lists and inner groups and with 10s of TB of data.

@andredasilvapinto

Ah no. I did this ~ 3 weeks ago. Nice that you fixed it ;)

@andredasilvapinto

You didn't fix the lists representation though. The documentation says:

The middle level, named list, must be a repeated group with a single field named element.

Your single field is not named element.

Also why did you pick Repetition.REQUIRED here?
28837b3#diff-e5fd77f88cc2bb6c2ff8fa3b53f4d56bR140

In Protocol Buffers 3 everything is Optional.

@costimuraru
Author

costimuraru commented Sep 4, 2017

Thanks for the input, @andredasilvapinto.

Your single field is not named element.

Again, with one of the latest commits, the proto-parquet list should look something like this:

required group my_repeated_id (LIST) = 1 {
  repeated group list {
    required int64 element;
  }
}

Are you not seeing this with the latest version on this patch?

Also why did you pick Repetition.REQUIRED here?
28837b3#diff-e5fd77f88cc2bb6c2ff8fa3b53f4d56bR140

It's a good question! In the spec it says:

The outer-most level must be a group annotated with MAP that contains a single field named key_value. The repetition of this level must be either optional or required and determines whether the list is nullable.

AFAIK, protobuf does not have lists/maps that are null. In fact "it makes no distinction between an empty list and a null list", so I think it doesn't matter what the repetition is here. I tried it with Repetition.REQUIRED and it worked fine even without adding any values to the protobuf map. If you know otherwise, feedback is appreciated.

@andredasilvapinto

You are only doing it for primitive types: https://github.com/apache/parquet-mr/pull/411/files#diff-3b093ba1a3c729ad39bd47b0c148a586R298

even your tests show it:
https://github.com/apache/parquet-mr/pull/411/files#diff-ae1342df26f3212198daf98364cde51dR161

I went with OPTIONAL because in Proto3 there is no REQUIRED, so I thought that an optional parquet field was a more adequate type to represent the equivalent Protobuf 3 optional field.

@andredasilvapinto

andredasilvapinto commented Sep 5, 2017

If you find it useful, these are the changes I made on top of your last "Implement review" commit: d694f20:

andredasilvapinto@dfa9701

Some of the changes are already present on your latest commit.

We have been running this for dozens of different data sets (Protobuf 3) for a few weeks already without any known problems.

@costimuraru
Author

costimuraru commented Sep 5, 2017

You are only doing it for primitive types: https://github.com/apache/parquet-mr/pull/411/files#diff-3b093ba1a3c729ad39bd47b0c148a586R298
even your tests show it:
https://github.com/apache/parquet-mr/pull/411/files#diff-ae1342df26f3212198daf98364cde51dR161

You raise an interesting point, @andredasilvapinto!

Case 1.
Suppose we have the following protobuf schema, containing a list of messages, where the inner message has two fields:

message MyTopMessage {
     repeated MyInnerMessage repeatedMessage = 1;
}

message MyInnerMessage {
    int32 someId = 1;
    int32 otherId = 2;   
}

Ideally, I would like to be able to query each sub-field (someId/otherId) individually, in Athena or Hive. Something like this:

SELECT repeatedMessage[1].someId FROM athenalist limit 10;
SELECT repeatedMessage[1].otherId FROM athenalist limit 10;

Where the table (Presto) would look something like this (notice the array of struct):

CREATE EXTERNAL TABLE IF NOT EXISTS athenalist (
`repeatedComplexMessage` array<struct<`someId`:int,`otherId`:int>>)
STORED AS PARQUET

This works fine with the current version. The current parquet schema looks like this:

message TestProto3.MyTopMessage {
  required group repeatedComplexMessage (LIST) {
    repeated group list {
      optional int32 someId;
      optional int32 otherId;
    }
  }
}

My question here would be: where should the element be? The spec does not specify what to do when we're dealing with a list of messages with multiple fields. Thoughts?


Case 2.
The second case is the one present in the unit test, where the inner message has just one field:

message MyTopMessage {
     repeated MyInnerMessage repeatedMessage = 1;
}

message MyInnerMessage {
    int32 someId = 1;
}

Again, ideally I would like to be able to select that field specifically:

SELECT repeatedMessage[1].someId FROM athenalist2 limit 10;

And have a CREATE table with a struct containing one field:

CREATE EXTERNAL TABLE IF NOT EXISTS athenalist (
`repeatedComplexMessage` array<struct<`someId`:int>>)
STORED AS PARQUET

However... this does not work! I'm getting a parquet parsing error in Presto (HIVE_CURSOR_ERROR: Can not read value at 0 in block 0).

It works however when I change the CREATE table to remove the struct:

CREATE EXTERNAL TABLE IF NOT EXISTS athenalist (
`repeatedComplexMessage` array<int>)
STORED AS PARQUET

Though I'm left with no way of querying the someId field directly. I can only do:

SELECT repeatedMessage[1] FROM athenalist2 limit 10;

Which will return an int.

If this is the desired behavior, then indeed @andredasilvapinto, we can have element as the inner field name instead of someId. Something like:

message TestProto3.MyTopMessage {
  required group repeatedMessage (LIST) {
    repeated group list {
      optional int32 element;
    }
  }
}

But this will work only when the inner message has just one field. And again, this seems to prevent the ability to SELECT that specific field (someId).

What do you think?

Later Edit: Ah, I see in your commit (andredasilvapinto@dfa9701) what should be done here. It should actually be:

message TestProto3.MyTopMessage {
  required group repeatedMessage (LIST) {
    repeated group list {
      optional group element {
        optional int32 someId;
      }
    }
  }
}

The same goes for above. Nice catch! I'll give this a try and will reply.
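
The difference between Case 1 and Case 2 is consistent with the backward-compatibility rules for lists in LogicalTypes.md, as I read them. The Java sketch below encodes those rules in simplified form; `ListElementRuleSketch` and `resolve` are hypothetical names, and whether Presto or Hive implement exactly this logic is an assumption on my part.

```java
// Simplified reading of the LIST backward-compatibility rules; not reader code from any engine.
final class ListElementRuleSketch {

    /** What a reader treats as the list element. */
    enum Element { REPEATED_FIELD_ITSELF, SINGLE_CHILD_OF_REPEATED_GROUP }

    static Element resolve(String listName, String repeatedName,
                           boolean repeatedIsGroup, int childCount) {
        if (!repeatedIsGroup) {
            return Element.REPEATED_FIELD_ITSELF;      // rule 1: repeated primitive
        }
        if (childCount > 1) {
            return Element.REPEATED_FIELD_ITSELF;      // rule 2: group with several fields
        }
        if (repeatedName.equals("array") || repeatedName.equals(listName + "_tuple")) {
            return Element.REPEATED_FIELD_ITSELF;      // rule 3: legacy naming
        }
        return Element.SINGLE_CHILD_OF_REPEATED_GROUP; // rule 4: standard middle layer
    }

    public static void main(String[] args) {
        // Case 1: repeated group list { someId; otherId } is read as the struct element.
        System.out.println(resolve("repeatedComplexMessage", "list", true, 2));
        // Case 2: repeated group list { someId } makes someId itself the element.
        System.out.println(resolve("repeatedMessage", "list", true, 1));
    }
}
```

Under this reading, Case 2 without the extra wrapper resolves to a plain int element, which matches `array<int>` being the only table definition that worked. Once the single child of `list` is a group named element, that group becomes the element and `array<struct<someId:int>>` lines up with the file schema.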

@costimuraru
Author

@andredasilvapinto, you were right! After adding the extra element wrapper (like you suggested, even for non-primitive types) it started working also for Case 2. Great job, man!
I picked your commit, which contains also the fixes for the MAP fields.
If you wish to preserve the "copyright", I'd be more than happy to do a cherry pick from your fork after you rebase on top of the PARQUET-968 Implement feedback 28837b3

@andredasilvapinto

Nice @costimuraru. No problem with the "copyright". Just as long as this gets merged I'm happy (one less reason to keep our internal parquet-mr fork!). cheers!

@costimuraru costimuraru changed the title PARQUET-968 Add Hive support in ProtoParquet PARQUET-968 Add Hive/Presto support in ProtoParquet Sep 5, 2017
@andredasilvapinto

Are there any efforts currently being made in order to merge this to master?

@andredasilvapinto

Just noticed that this doesn't write the values of Protobuf fields that are equal to their default values. This happens because in Protobuf3, setting a field to its default value is equivalent to clearing the field. Therefore the conversion to Parquet needs to take that into consideration.
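
A toy Java model of the effect described here, under stated assumptions: the message is represented as a map holding only "present" fields, and `Proto3DefaultsSketch`, `set`, `toRow`, and the `writeDefaults` flag are hypothetical names for illustration. It does not use the protobuf library and is not the ProtoWriteSupport implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for a proto3 message with scalar fields "id" (int32) and "name" (string).
final class Proto3DefaultsSketch {

    static final Map<String, Object> DEFAULTS = new LinkedHashMap<>();
    static {
        DEFAULTS.put("id", 0);
        DEFAULTS.put("name", "");
    }

    /** Mimics proto3 scalar semantics: setting a field to its default leaves it absent. */
    static void set(Map<String, Object> message, String field, Object value) {
        if (value.equals(DEFAULTS.get(field))) {
            message.remove(field);
        } else {
            message.put(field, value);
        }
    }

    /** Converts to a row; absent fields become null unless writeDefaults is true. */
    static Map<String, Object> toRow(Map<String, Object> message, boolean writeDefaults) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Map.Entry<String, Object> def : DEFAULTS.entrySet()) {
            Object value = message.get(def.getKey());
            if (value == null && writeDefaults) {
                value = def.getValue(); // fill in the protobuf default explicitly
            }
            row.put(def.getKey(), value);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> message = new LinkedHashMap<>();
        set(message, "id", 0);       // equal to the default, so the field stays absent
        set(message, "name", "bob");
        System.out.println(toRow(message, false));
        System.out.println(toRow(message, true));
    }
}
```

A writer that only emits present fields stores `id` as null in the first row, while a writer that fills defaults stores 0.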

@abelke

abelke commented Nov 13, 2017

Hi, when I use this patch it needs protoc 3 (I installed Protobuf 3.4.0). Is there another solution for the same case on protoc 2 (Protobuf 2.5.0)?

@qinghui-xu
Contributor

@costimuraru @julienledem
Hey, it seems this patch has been sitting here for a while, and it is indeed important for us to have this fix. Could somebody merge it?

@lumost

lumost commented Feb 13, 2018

Seconding @qinghui-xu's comment: we've had a version of this patch in production for nearly 6 months now.

@BenoitHanotte

Hello @lukasnalezenec, have you had time to have a look? Thanks

@lukasnalezenec
Contributor

Hi, I already did.
There is one typo in a comment and the code is a little bit harder to read, so I wanted to check the flow once more. I think we can commit it as it is.

@julienledem
Member

This looks good.
Thank you for this collaborative effort!

@chawlakunal

When can this be expected to be merged to master and released?

@chawlakunal

chawlakunal commented Apr 27, 2018

@BenoitHanotte @costimuraru @julienledem There is no way to instantiate ProtoParquetWriter with parquet.proto.writeSpecsCompliant flag enabled. Am I missing something or is this intentional? It would have been great if a constructor to enable the flag was provided.

public ProtoParquetWriter(Path file, Class<? extends Message> protoMessage,
          CompressionCodecName compressionCodecName, int blockSize, int pageSize, boolean enableDictionary,
          boolean validating, boolean writeSpecsCompliant) throws IOException {
      super(file, new ProtoWriteSupport(protoMessage), compressionCodecName, blockSize, pageSize, pageSize,
              enableDictionary, validating, DEFAULT_WRITER_VERSION,
              getConfigWithWriteSpecsCompliant(writeSpecsCompliant));
  }
  
  private static Configuration getConfigWithWriteSpecsCompliant(boolean writeSpecsCompliant) {
      Configuration config = new Configuration();
      ProtoWriteSupport.setWriteSpecsCompliant(config, writeSpecsCompliant);
      return config;
  }

@BenoitHanotte

@chawlakunal you can manually create your ParquetWriter by providing the ProtoWriteSupport as follows:

Configuration conf = new Configuration();
ProtoWriteSupport.setWriteSpecsCompliant(conf, true); // set the flag in the configuration (the Configuration is the first argument)
new ParquetWriter(file, conf, new ProtoWriteSupport(protoClass));

(or any variation of this as ParquetWriter has multiple constructors that accept a configuration in which we can set the flag)

@chawlakunal

@BenoitHanotte That's exactly how I am using it right now but it kind of defeats the purpose of having ProtoParquetWriter class.

@chawlakunal

@BenoitHanotte Is there a timeline of when this will be released?

@BenoitHanotte

I believe 1.10.0 has just been released, so this will likely land in the next "major" release. Unfortunately, I am not aware of any plan to have a new release in the near future.

For the ProtoParquetWriter class, we have discussed it with @costimuraru and we will add a constructor with the flag in a future PR, but I can't commit to a timeframe.

@chawlakunal

@BenoitHanotte Here's the PR for constructor with flag #473

Also, if a minor release can be done for this fix it will be greatly appreciated.

@BenoitHanotte

@chawlakunal I had a look at your PR (#473) , it looks good, there is just a comment that I believe needs to be changed (regarding the block size).
For the release, that's a decision that will need to be taken by the maintainers, I will ask them about their plans for releases the next time I have the chance to talk with them.

ghost pushed a commit to RMS/parquet-mr that referenced this pull request Aug 18, 2018
Jira: https://issues.apache.org/jira/browse/PARQUET-968

Author: Constantin Muraru
Author: Benoît Hanotte

Closes apache#411 from costimuraru/PARQUET-968 and squashes the following commits:

16eafcb [Benoît Hanotte] PARQUET-968 add proto flag to enable writing using specs-compliant schemas (#2)
a8bd704 [Constantin Muraru] Pick up commit from @andredasilvapinto
5cf9248 [Constantin Muraru] PARQUET-968 Add Hive support in ProtoParquet
@CCv5

CCv5 commented Oct 30, 2018

Yes. The way I solved it was to add a flag to ProtoWriteSupport to define whether to include default values or not. If set to true I always set empty fields to their default protobuf values (except oneof fields).

I can share the commit if people are interested.

Hi @andredasilvapinto, has there been any progress on the default values not being persisted in Parquet? It is quite an annoying bug when enum values are read back as ['null', 'Type1', 'Type2'] instead of ['Type0', 'Type1', 'Type2'].

@andredasilvapinto

andredasilvapinto commented Oct 30, 2018 via email

@CCv5

CCv5 commented Nov 5, 2018

Yes, if you look at the commit I linked to several months ago costimuraru@9a4c016 it contains that flag to decide whether to print the default values or not.

OK, it is on your personal branch. Great patch! When will it be merged to master? Any plans?

@andredasilvapinto

I have no idea. I think a few changes were already made to the base version during this approval process.

@ccpstephanie

Although this is closed, I'm a bit confused: why do I always get the old schema version? I'm setting parquet.proto.writeSpecsCompliant=false and using ParquetWriter directly. I'm using the latest version, currently 1.12.0.

I'd highly appreciate it if someone could point out something stupid in my code! Or is it the same issue you are experiencing?

My goal is to be able to query data via Athena/Presto or the Hive Metastore, so I need the new Parquet schema version.

Method 1:

  // Doesn't work!
    Configuration conf = new Configuration();
    ProtoWriteSupport.setWriteSpecsCompliant(conf, false); // If set to true, the old schema style will be used (without wrappers).

    ParquetWriter<MessageOrBuilder> writer =
    ProtoParquetWriter.<MessageOrBuilder>builder(file).withMessage(cls).withConf(conf).build();

    for (MessageOrBuilder record : records) {
        writer.write(record);
    }

    writer.close();
    System.err.println(writer.getFooter());

Method 2:

  // Doesn't work!
    Configuration conf = new Configuration();
    ProtoWriteSupport.setWriteSpecsCompliant(conf, false); // If set to true, the old schema style will be used (without wrappers).

    try (ParquetWriter writer = new ParquetWriter(
                                            file,
                                            new ProtoWriteSupport<AddressBook>(AddressBook.class),
                                            CompressionCodecName.GZIP,
                                            128 * 1024 * 1024,//PARQUET_BLOCK_SIZE,
                                            ParquetProperties.DEFAULT_PAGE_SIZE,
                                            ParquetProperties.DEFAULT_PAGE_SIZE, 
                                            true,
                                            false,
                                            ParquetProperties.DEFAULT_WRITER_VERSION,
                                            conf)) {
        for (Object record : messages) {
            writer.write(record);
        }
        writer.close();
        System.err.println(writer.getFooter());
    }

Parquet output Metadata:
_ParquetMetaData{FileMetaData{schema: message AddressBookProtos.AddressBook { repeated group people = 1 { optional binary name (STRING) = 1; optional int32 id = 2; optional binary email (STRING) = 3; repeated group phones = 4 { optional binary number (STRING) = 1; optional binary type (ENUM) = 2; } }} , metadata: {parquet.proto.descriptor=name: "AddressBook" field { name: "people" number: 1 label: LABEL_REPEATED type: TYPE_MESSAGE type_name: ".AddressBookProtos.Person"} , parquet.proto.writeSpecsCompliant=false, ...}

Protobuf message:

syntax = "proto3";

package AddressBookProtos;

option java_multiple_files = true;
option java_package = "com.mycompany.app";
option java_outer_classname = "AddressBookProtos";

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    string number = 1;
    PhoneType type = 2;
  }

  repeated PhoneNumber phones = 4;
}

message AddressBook {
  repeated Person people = 1;
}

@Srb1996

Srb1996 commented May 8, 2024

Was it merged? I'm facing the same issue with 1.13.0 (latest).

@wgtmac
Member

wgtmac commented May 8, 2024

I believe it was merged: f849384

@Srb1996

Srb1996 commented May 8, 2024

@wgtmac thanks for the reply. Does that mean version 1.13.0 with proto3 should work?

@wgtmac
Member

wgtmac commented May 8, 2024

It was merged long ago and I don't have any context about it. If this is the fix to the issue that you have seen, then yes it should not appear in 1.13.0.

@Srb1996

Srb1996 commented May 8, 2024

I am using proto2, could that be the reason?

@qinghui-xu
Contributor

I am using proto2, could that be the reason?

That's probably the reason. parquet-proto requires proto 3 as a dependency.

@Srb1996

Srb1996 commented May 8, 2024

@qinghui-xu I have been using parquet-proto with proto2 for a few years. My question is: to pick up the changes from this PR, do I need to upgrade to proto3, or should this solution work with proto2 as well?
Currently, I am using proto2 with parquet-protobuf (1.13.0), and querying with Presto is breaking.

@Srb1996

Srb1996 commented May 8, 2024

@ccpstephanie were you able to resolve this?
