
Change parquet-protobuf to use same list structure as avro etc. Also … #253

Closed

Conversation

@dguy dguy commented Aug 4, 2015

…backward compatible with existing structure.
There are issues with Spark SQL not being able to read Parquet files that have been written using parquet-protobuf. This is because it expects the Parquet schema for a list to look something like:

optional group repeatedPrimitive (LIST) {
  repeated int32 array = 3;
}

but instead it does a 1-1 mapping from the protobuf representation.
This change means the representation of a list converted from protobuf is the same as one converted from avro. It also means that SparkSQL can read files encoded with this schema.
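The two schema shapes can be sketched side by side (a sketch only; field and message names here are illustrative, and the LIST form follows the avro-style layout shown above):

```
// 1-1 protobuf mapping currently produced by parquet-protobuf:
// the repeated field appears unannotated at the message level.
message ExampleMessage {
  repeated int32 repeatedPrimitive = 3;
}

// avro-style LIST-annotated group proposed by this change:
// the repeated field is wrapped in a group carrying the LIST annotation.
message ExampleMessage {
  optional group repeatedPrimitive (LIST) {
    repeated int32 array = 3;
  }
}
```

Readers that only recognize the LIST-annotated shape (such as Spark SQL at the time) can consume the second form but not the first.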

@julienledem
Member

Thanks for looking into this.

I think Spark SQL should support those repeated fields. Let's fix it there instead. Could you open a JIRA on Spark?
See the spec here:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types
In particular:
"This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field."

As Parquet Schemas are really using the protobuf schema model, this should be a 1-1 mapping.
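Per the spec wording quoted above, a reader encountering an unannotated repeated field should infer the list type itself. As a sketch (field name illustrative):

```
// unannotated repeated field, as written by parquet-protobuf:
repeated int32 num = 1;

// per the LogicalTypes rule, readers should interpret this as a
// required list of required elements, i.e. LIST<required int32>
```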

@dguy
Author

dguy commented Aug 7, 2015

Hi Julien,
Thanks for getting back to me. I already opened a JIRA against Spark: https://issues.apache.org/jira/browse/SPARK-9340
I appreciate what the spec says, but wouldn't it be nice to have a consistent representation for all types? Everyone could then handle all Parquet files the same way, irrespective of their original format. It would certainly make client code, like Spark SQL and Hive, much simpler.

@rdblue
Contributor

rdblue commented Aug 11, 2015

@dguy, if we implement the repeated rules as defined, then compatibility irrespective of the original format is addressed. The difference between that and what you suggest is the representation you get. This was something we always thought was reasonable, but I'm open to changing it. If you haven't already, could you start a thread on the dev list to discuss it?
