
Change parquet-protobuf to use same list structure as avro etc. Also … #253

Closed

Conversation

@dguy dguy commented Aug 4, 2015

…backward compatible with existing structure.
There are issues with Spark SQL not being able to read Parquet files that have been written using parquet-protobuf. This is because it expects the Parquet schema for a list to look something like:

optional group repeatedPrimitive (LIST) {
  repeated int32 array = 3;
}

but instead it does a 1-1 mapping from the protobuf representation.
This change means the representation of a list converted from protobuf is the same as one converted from avro. It also means that SparkSQL can read files encoded with this schema.
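The two schema shapes can be sketched side by side (a sketch only; field and message names here are illustrative, and the LIST form follows the avro-style layout shown above):

```
// 1-1 protobuf mapping currently produced by parquet-protobuf:
// the repeated field appears unannotated at the message level.
message ExampleMessage {
  repeated int32 repeatedPrimitive = 3;
}

// avro-style LIST-annotated group proposed by this change:
// the repeated field is wrapped in a group carrying the LIST annotation.
message ExampleMessage {
  optional group repeatedPrimitive (LIST) {
    repeated int32 array = 3;
  }
}
```

Readers that only recognize the LIST-annotated shape (such as Spark SQL at the time) can consume the second form but not the first.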

@julienledem
Member

Thanks for looking into this.

I think Spark SQL should support those repeated fields. Let's fix it there instead. Could you open a JIRA on Spark?
See the spec here:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types
In particular:
"This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field."

As Parquet Schemas are really using the protobuf schema model, this should be a 1-1 mapping.
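Per the spec wording quoted above, a reader encountering an unannotated repeated field should infer the list type itself. As a sketch (field name illustrative):

```
// unannotated repeated field, as written by parquet-protobuf:
repeated int32 num = 1;

// per the LogicalTypes rule, readers should interpret this as a
// required list of required elements, i.e. LIST<required int32>
```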

@dguy
Author

dguy commented Aug 7, 2015

Hi Julien,
Thanks for getting back to me. I already opened a JIRA against Spark: https://issues.apache.org/jira/browse/SPARK-9340
I appreciate what the spec says, but wouldn't it be nice to have a consistent representation for all types? Everyone could then handle all Parquet files the same way, irrespective of their original format. It would certainly make client code, like Spark SQL and Hive, much simpler.

@rdblue
Contributor

rdblue commented Aug 11, 2015

@dguy, if we implement the repeated rules as defined, then compatibility irrespective of the original format is addressed. The difference between that and what you suggest is the representation you get. This was something we always thought was reasonable, but I'm open to changing it. If you haven't already, could you start a thread on the dev list to discuss it?
