[SparkSPARK-9340] - make SparkSQL work with nested types in parquet-protobuf #8063

dguy · 2015-08-09T21:13:08Z

…rotobuf

liancheng · 2015-08-10T07:42:36Z

The issue this PR tries to fix should be a special case of the Parquet interoperability issue which has already been fixed in Spark 1.5. Please see my comments in SPARK-9340 for details.

liancheng · 2015-08-10T15:11:08Z

ok to test

liancheng · 2015-08-10T16:02:58Z

sql/core/src/test/scala/org/apache/spark/sql/parquet/ProtoParquetTypesConverterTest.scala

The first null here should be an empty Seq. The schema of the testing Parquet data file is:

message TestProtobuf.SchemaConverterRepetition {
  optional int32 optionalPrimitive;
  required int32 requiredPrimitive;
  repeated int32 repeatedPrimitive;
  optional group optionalMessage {
    optional int32 someId;
  }
  required group requiredMessage {
    optional int32 someId;
  }
  repeated group repeatedMessage {
    optional int32 someId;
  }
}

As stated in the parquet-format spec, repeatedPrimitive is an unannotated repeated field and should be interpreted as a required list of required elements, so it should never be null. A record with no values for it yields an empty Seq.
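To make the rule concrete, here is a minimal sketch of how field repetition could map to nullability. All types below (`ParquetField`, `SqlField`, `ArrayOf`, and so on) are hypothetical stand-ins written for this illustration; they are not the parquet-mr or Catalyst classes, and the sketch ignores `LIST`/`MAP` annotations entirely.

```scala
// Toy model of a Parquet schema; illustrative only, not parquet-mr classes.
sealed trait Repetition
case object Required extends Repetition
case object Optional extends Repetition
case object Repeated extends Repetition

sealed trait ParquetField { def name: String; def repetition: Repetition }
case class Primitive(name: String, repetition: Repetition, typeName: String) extends ParquetField
case class Group(name: String, repetition: Repetition, fields: Seq[ParquetField]) extends ParquetField

// Toy model of the resulting SQL types, loosely mirroring Catalyst's naming.
sealed trait SqlType
case object IntType extends SqlType
case class ArrayOf(element: SqlType, containsNull: Boolean) extends SqlType
case class StructOf(fields: Seq[SqlField]) extends SqlType
case class SqlField(name: String, dataType: SqlType, nullable: Boolean)

object UnannotatedRepeated {
  // Type of a single value of the field, ignoring the field's own repetition.
  private def elementType(field: ParquetField): SqlType = field match {
    case Primitive(_, _, "int32") => IntType
    case Primitive(_, _, other)   => sys.error(s"unsupported primitive in this sketch: $other")
    case Group(_, _, fields)      => StructOf(fields.map(convert))
  }

  def convert(field: ParquetField): SqlField = field.repetition match {
    case Required => SqlField(field.name, elementType(field), nullable = false)
    case Optional => SqlField(field.name, elementType(field), nullable = true)
    // Unannotated repeated field: a required list of required elements.
    case Repeated =>
      SqlField(field.name, ArrayOf(elementType(field), containsNull = false), nullable = false)
  }
}
```

Under this model, `repeatedPrimitive` becomes a non-nullable array of non-null integers, which is why the expected value in the test should be an empty Seq rather than null.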

SparkQA · 2015-08-10T17:03:00Z

Test build #40294 has finished for PR 8063 at commit fb23f28.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait NeedsResetArray
- trait SchemaConverter

liancheng · 2015-08-10T18:36:06Z

We can close this one now since #8070 supersedes it.

This PR is inspired by #8063 authored by dguy. In particular, the test Parquet files added here are all taken from that PR.

**Committer who merges this PR should attribute it to "Damian Guy <damian.guygmail.com>".**

----

SPARK-6776 and SPARK-6777 followed `parquet-avro` to implement the backwards-compatibility rules defined in the `parquet-format` spec. However, both Spark SQL and `parquet-avro` neglected the following statement in `parquet-format`:

> This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field.

One consequence is that Parquet files generated by `parquet-protobuf` that contain unannotated repeated fields are not correctly converted to Catalyst arrays.

This PR fixes the issue by:

1. Handling unannotated repeated fields in `CatalystSchemaConverter`.
2. Converting these repeated fields to Catalyst arrays in `CatalystRowConverter`.

   Two special converters, `RepeatedPrimitiveConverter` and `RepeatedGroupConverter`, are added. They delegate the actual conversion work to a child `elementConverter` and accumulate elements in an `ArrayBuffer`.

   Two extra methods, `start()` and `end()`, are added to `ParentContainerUpdater` so that they can be used to initialize new `ArrayBuffer`s for unannotated repeated fields and to propagate converted array values upstream.

Author: Cheng Lian <[email protected]>

Closes #8070 from liancheng/spark-9340/unannotated-parquet-list and squashes the following commits:

ace6df7 [Cheng Lian] Moves ParquetProtobufCompatibilitySuite
f1c7bfd [Cheng Lian] Updates .rat-excludes
420ad2b [Cheng Lian] Fixes converting unannotated Parquet lists

(cherry picked from commit 071bbad)
Signed-off-by: Cheng Lian <[email protected]>
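The `start()`/`end()` mechanism described in the commit message can be sketched roughly as follows. This is a simplified model, not the code from #8070: the trait here only borrows the name `ParentContainerUpdater`, while `RowSlotUpdater`, `RepeatedFieldUpdater`, and `RepeatedFieldDemo` are hypothetical names introduced for illustration, and no Parquet classes are involved.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for the updater described above; not the Spark trait itself.
trait ParentContainerUpdater {
  def start(): Unit = ()          // called when a record begins
  def end(): Unit = ()            // called when a record ends
  def set(value: Any): Unit = ()  // called once per converted value
}

// Hypothetical parent that stores a finished value in one slot of a row.
final class RowSlotUpdater(row: Array[Any], ordinal: Int) extends ParentContainerUpdater {
  override def set(value: Any): Unit = { row(ordinal) = value }
}

// Collects every value of one unannotated repeated field within a record,
// then hands the whole collection to the parent when the record ends.
final class RepeatedFieldUpdater(parent: ParentContainerUpdater) extends ParentContainerUpdater {
  private var buffer: ArrayBuffer[Any] = ArrayBuffer.empty[Any]

  // A fresh buffer per record, so values never leak between records.
  override def start(): Unit = { buffer = ArrayBuffer.empty[Any] }
  override def set(value: Any): Unit = { buffer += value }
  // Propagate the accumulated values upstream as a single array-like value.
  override def end(): Unit = { parent.set(buffer.toList) }
}

object RepeatedFieldDemo {
  // Simulates reading one record whose repeated field holds `values`.
  def readRecord(values: Seq[Int]): Any = {
    val row = new Array[Any](1)
    val updater = new RepeatedFieldUpdater(new RowSlotUpdater(row, 0))
    updater.start()
    values.foreach(v => updater.set(v))
    updater.end()
    row(0)
  }
}
```

Because the buffer is created in `start()` and flushed in `end()`, a record with no values for the repeated field produces an empty list in this model rather than null, which matches the reading of the spec quoted above.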


[SparkSPARK-9340] - make SparkSQL work with nested types in parquet-p…

fb23f28

…rotobuf

nssalian mentioned this pull request Aug 9, 2015

[SparkSPARK-9340] - make SparkSQL work with nested types in parquet-p… #8032

Closed

liancheng reviewed Aug 10, 2015

liancheng mentioned this pull request Aug 10, 2015

[SPARK-9340] [SQL] Fixes converting unannotated Parquet lists #8070

Closed

dguy closed this Aug 10, 2015
