[SPARK-25407][SQL] Allow nested access for non-existent field for Parquet file when nested pruning is enabled #24307
Conversation
cc @mallman, @dbtsai, @viirya, @maropu, @cloud-fan, @gatorsmile.
-  ignore("partial schema intersection - select missing subfield") {
+  testSchemaPruning("partial schema intersection - select missing subfield") {
Note that this test case has recently been tested in both Parquet and ORC. In addition, before this PR, it fails only in the Parquet path.
@mallman If you don't want to rebase your PR and do the diff, please see the commit log in this PR. Again, I prefer your contribution. It's not a good way to take over like this when you are active! Thanks!
Also, @HyukjinKwon, could you review this please?
It doesn't quite matter which PR is merged as of 51bee7a. Multiple people will be credited as authors together, and you will be the main author if this one gets merged.
@dongjoon-hyun, can we go ahead with this one since you already opened this? I will make @mallman its main author. Looks pretty good, as I reviewed before. Let me leave a sign-off today or tomorrow.
Thank you for the review, @HyukjinKwon. Sure, I guess it's okay to proceed because
LGTM otherwise.
cc @gatorsmile, @cloud-fan and @liancheng. Let me get this in if there are no more comments.
Thank you for the review, @HyukjinKwon.
Thank you @dongjoon-hyun for picking this up, and thank you @HyukjinKwon for your review.
Retest this please.
Merged to master.
Thank you, @HyukjinKwon!
What changes were proposed in this pull request?
As part of schema clipping in ParquetReadSupport.scala, we add fields that appear in the Catalyst requested schema but are missing from the Parquet file schema to the Parquet clipped schema. However, nested schema pruning requires that we ignore unrequested field data when reading from a Parquet file. Therefore we pass two schemas to ParquetRecordMaterializer: the schema of the file data we want to read and the schema of the rows we want to return. The reader is responsible for reconciling the differences between the two.

Aside from checking whether schema pruning is enabled, there is an additional complication to constructing the Parquet requested schema. The manner in which Spark's two Parquet readers reconcile the differences between the Parquet requested schema and the Catalyst requested schema differs. Spark's vectorized reader does not (currently) support reading Parquet files with complex types in their schema. Further, it assumes that the Parquet requested schema includes all fields requested in the Catalyst requested schema, and it includes logic in its read path to skip fields in the Parquet requested schema which are not present in the file.
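As a rough, conceptual sketch of that reconciliation (not Spark's actual ParquetRowConverter code; the names below are made up for illustration): any field declared by the Catalyst requested schema but absent from the Parquet requested schema is simply emitted as null.

```scala
import org.apache.spark.sql.types.StructType

// Conceptual only: emit a value for every field in the Catalyst requested
// schema, reading from the file when the Parquet requested schema contains
// the field and falling back to null when it does not.
def reconcile(
    parquetFields: Set[String],
    catalystSchema: StructType,
    readFromFile: String => Any): Seq[Any] = {
  catalystSchema.fieldNames.toSeq.map { name =>
    if (parquetFields.contains(name)) readFromFile(name) // present in the file: use the read value
    else null                                            // missing from the file: produce null
  }
}
```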
Spark's parquet-mr based reader supports reading Parquet files with any kind of complex schema, and it supports nested schema pruning as well. Unlike the vectorized reader, the parquet-mr reader requires that the Parquet requested schema include only those fields present in the underlying Parquet file's schema. Therefore, in the case where we use the parquet-mr reader, we intersect the Parquet clipped schema with the Parquet file's schema to construct the Parquet requested schema that's set in the ReadContext.
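A minimal sketch of what that intersection amounts to, written against parquet-mr's schema API; this is an illustrative approximation, not the actual implementation added to ParquetReadSupport.

```scala
import org.apache.parquet.schema.{GroupType, Type}
import scala.collection.JavaConverters._

// Keep only the fields of the clipped schema that also exist in the file
// schema, recursing into nested groups so that requests like `name.middle`
// are dropped when the file has no such subfield.
def intersect(clipped: GroupType, fileSchema: GroupType): GroupType = {
  val kept = clipped.getFields.asScala.flatMap { field =>
    if (!fileSchema.containsField(field.getName)) {
      None // the file has no such field at this level: drop the request
    } else {
      (field, fileSchema.getType(field.getName)) match {
        case (c: GroupType, f: GroupType)  => Some(intersect(c, f)) // nested struct: recurse
        case (leaf, _) if leaf.isPrimitive => Some(leaf)            // leaf present in the file: keep
        case _                             => None                  // group/primitive mismatch: drop
      }
    }
  }
  clipped.withNewFields(kept.asJava)
}
```

Additional description (by @HyukjinKwon):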
Let's suppose that we have a Parquet file schema in which the nested field name.middle does not exist, while the Catalyst requested schema asks for it. Currently, the clipped schema still contains name.middle, as illustrated below.
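For illustration only, with hypothetical field names (first, middle, last) rather than the exact schemas from the original example, the two schemas could look like this, parsed with parquet-mr's MessageTypeParser:

```scala
import org.apache.parquet.schema.MessageTypeParser

// Hypothetical file schema: the nested field `name.middle` does not exist.
val fileSchema = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  optional group name {
    |    optional binary first (UTF8);
    |    optional binary last (UTF8);
    |  }
    |}
    |""".stripMargin)

// Hypothetical clipped schema built from the Catalyst requested schema for a
// query selecting name.first and name.middle: it still asks for name.middle.
val clippedSchema = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  optional group name {
    |    optional binary first (UTF8);
    |    optional binary middle (UTF8);
    |  }
    |}
    |""".stripMargin)
```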
Parquet MR does not support access to the nested non-existent field (name.middle). To work around this, this PR removes the name.middle request from the Parquet requested schema passed to the Parquet reader entirely, as illustrated below.
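Continuing the hypothetical example, the Parquet requested schema that actually reaches parquet-mr keeps only the subfields the file contains (name.first here); the name.middle request is gone.

```scala
import org.apache.parquet.schema.MessageTypeParser

// Hypothetical result of intersecting clippedSchema with fileSchema (e.g. via
// the intersect sketch above): only subfields that exist in the file remain.
val parquetRequestedSchema = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  optional group name {
    |    optional binary first (UTF8);
    |  }
    |}
    |""".stripMargin)
```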
The reader then produces the record properly, with name.middle populated as null, according to the requested Catalyst schema. I think technically this is what the Parquet library should support, since the Parquet library made a design decision to produce null for non-existent fields IIRC. This PR targets to work around it.

How was this patch tested?
A previously ignored test case which exercises the failure scenario this PR addresses has been enabled.
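As a rough, self-contained illustration of the scenario that test exercises (hypothetical paths, schema, and data, not the suite's actual fixtures): with nested schema pruning enabled and the parquet-mr reader in use, selecting a nested field that is absent from the file should return null rather than failing.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
// Force the parquet-mr (non-vectorized) read path.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Write a file whose `name` struct has no `middle` subfield.
spark.sql("SELECT named_struct('first', 'Jane', 'last', 'Doe') AS name")
  .write.mode("overwrite").parquet("/tmp/contacts")

// Read it back with a user-specified schema that does declare `name.middle`.
val df = spark.read
  .schema("name struct<first: string, middle: string, last: string>")
  .parquet("/tmp/contacts")

// Before this PR this query failed on the parquet-mr path; afterwards it
// should return a single row containing null.
df.select("name.middle").show()
```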
This closes #22880