[SPARK-25407][SQL] Allow nested access for non-existent field for Parquet file when nested pruning is enabled #24307
Conversation
cc @mallman, @dbtsai, @viirya, @maropu, @cloud-fan, @gatorsmile.
-  ignore("partial schema intersection - select missing subfield") {
+  testSchemaPruning("partial schema intersection - select missing subfield") {
Note that this test case has recently been tested in both Parquet and ORC. In addition, before this PR, it fails only in the Parquet path.
@mallman If you don't want to rebase your PR and do the diff, please see the commit log in this PR. Again, I prefer your contribution. It's not a good way to take over like this when you are active! Thanks!
Also, @HyukjinKwon, could you review this please?
It doesn't quite matter which PR is merged as of 51bee7a. Multiple people will be credited as authors together, and you will be the main author if this one gets merged.
@dongjoon-hyun, can we go ahead with this one since you already opened this? I will make @mallman its main author. Looks pretty good, as I reviewed before. Let me leave a sign-off today or tomorrow.
Thank you for the review, @HyukjinKwon. Sure, I guess it's okay to proceed because
LGTM otherwise.
cc @gatorsmile, @cloud-fan and @liancheng. Let me get this in if there are no more comments.
Thank you for the review, @HyukjinKwon.
Thank you @dongjoon-hyun for picking this up, and thank you @HyukjinKwon for your review.
Retest this please.
Merged to master.
Thank you, @HyukjinKwon!
What changes were proposed in this pull request?
As part of schema clipping in ParquetReadSupport.scala, we add fields that appear in the Catalyst requested schema but are missing from the Parquet file schema to the Parquet clipped schema. However, nested schema pruning requires that we ignore unrequested field data when reading from a Parquet file. Therefore we pass two schemas to ParquetRecordMaterializer: the schema of the file data we want to read and the schema of the rows we want to return. The reader is responsible for reconciling the differences between the two.

Aside from checking whether schema pruning is enabled, there is an additional complication to constructing the Parquet requested schema. The manner in which Spark's two Parquet readers reconcile the differences between the Parquet requested schema and the Catalyst requested schema differs. Spark's vectorized reader does not (currently) support reading Parquet files with complex types in their schema. Further, it assumes that the Parquet requested schema includes all fields requested in the Catalyst requested schema, and it includes logic in its read path to skip fields in the Parquet requested schema which are not present in the file.
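As a rough, conceptual sketch of that reconciliation (not Spark's actual ParquetRowConverter code; the names below are made up for illustration): any field declared by the Catalyst requested schema but absent from the Parquet requested schema is simply emitted as null.

```scala
import org.apache.spark.sql.types.StructType

// Conceptual only: emit a value for every field in the Catalyst requested
// schema, reading from the file when the Parquet requested schema contains
// the field and falling back to null when it does not.
def reconcile(
    parquetFields: Set[String],
    catalystSchema: StructType,
    readFromFile: String => Any): Seq[Any] = {
  catalystSchema.fieldNames.toSeq.map { name =>
    if (parquetFields.contains(name)) readFromFile(name) // present in the file: use the read value
    else null                                            // missing from the file: produce null
  }
}
```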
Spark's parquet-mr based reader supports reading Parquet files with any kind of complex schema, and it supports nested schema pruning as well. Unlike the vectorized reader, the parquet-mr reader requires that the Parquet requested schema include only those fields present in the underlying Parquet file's schema. Therefore, in the case where we use the parquet-mr reader, we intersect the Parquet clipped schema with the Parquet file's schema to construct the Parquet requested schema that's set in the ReadContext.
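A minimal sketch of what that intersection amounts to, written against parquet-mr's schema API; this is an illustrative approximation, not the actual implementation added to ParquetReadSupport.

```scala
import org.apache.parquet.schema.{GroupType, Type}
import scala.collection.JavaConverters._

// Keep only the fields of the clipped schema that also exist in the file
// schema, recursing into nested groups so that requests like `name.middle`
// are dropped when the file has no such subfield.
def intersect(clipped: GroupType, fileSchema: GroupType): GroupType = {
  val kept = clipped.getFields.asScala.flatMap { field =>
    if (!fileSchema.containsField(field.getName)) {
      None // the file has no such field at this level: drop the request
    } else {
      (field, fileSchema.getType(field.getName)) match {
        case (c: GroupType, f: GroupType)  => Some(intersect(c, f)) // nested struct: recurse
        case (leaf, _) if leaf.isPrimitive => Some(leaf)            // leaf present in the file: keep
        case _                             => None                  // group/primitive mismatch: drop
      }
    }
  }
  clipped.withNewFields(kept.asJava)
}
```

Additional description (by @HyukjinKwon):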
Let's suppose that we have a Parquet file schema in which the nested field name.middle does not exist, while the Catalyst requested schema asks for it. Currently, the clipped schema still contains name.middle, as illustrated below.
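For illustration only, with hypothetical field names (first, middle, last) rather than the exact schemas from the original example, the two schemas could look like this, parsed with parquet-mr's MessageTypeParser:

```scala
import org.apache.parquet.schema.MessageTypeParser

// Hypothetical file schema: the nested field `name.middle` does not exist.
val fileSchema = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  optional group name {
    |    optional binary first (UTF8);
    |    optional binary last (UTF8);
    |  }
    |}
    |""".stripMargin)

// Hypothetical clipped schema built from the Catalyst requested schema for a
// query selecting name.first and name.middle: it still asks for name.middle.
val clippedSchema = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  optional group name {
    |    optional binary first (UTF8);
    |    optional binary middle (UTF8);
    |  }
    |}
    |""".stripMargin)
```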
Parquet MR does not support access to the nested non-existent field (name.middle). To work around this, this PR removes the name.middle request from the Parquet requested schema passed to the Parquet reader entirely, as illustrated below.
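Continuing the hypothetical example, the Parquet requested schema that actually reaches parquet-mr keeps only the subfields the file contains (name.first here); the name.middle request is gone.

```scala
import org.apache.parquet.schema.MessageTypeParser

// Hypothetical result of intersecting clippedSchema with fileSchema (e.g. via
// the intersect sketch above): only subfields that exist in the file remain.
val parquetRequestedSchema = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  optional group name {
    |    optional binary first (UTF8);
    |  }
    |}
    |""".stripMargin)
```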
The reader then produces the record properly, with name.middle populated as null, according to the requested Catalyst schema. I think technically this is what the Parquet library should support, since the Parquet library made a design decision to produce null for non-existent fields IIRC. This PR targets to work around it.

How was this patch tested?
A previously ignored test case which exercises the failure scenario this PR addresses has been enabled.
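As a rough, self-contained illustration of the scenario that test exercises (hypothetical paths, schema, and data, not the suite's actual fixtures): with nested schema pruning enabled and the parquet-mr reader in use, selecting a nested field that is absent from the file should return null rather than failing.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
// Force the parquet-mr (non-vectorized) read path.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Write a file whose `name` struct has no `middle` subfield.
spark.sql("SELECT named_struct('first', 'Jane', 'last', 'Doe') AS name")
  .write.mode("overwrite").parquet("/tmp/contacts")

// Read it back with a user-specified schema that does declare `name.middle`.
val df = spark.read
  .schema("name struct<first: string, middle: string, last: string>")
  .parquet("/tmp/contacts")

// Before this PR this query failed on the parquet-mr path; afterwards it
// should return a single row containing null.
df.select("name.middle").show()
```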
This closes #22880