[SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups by dtenedor · Pull Request #37280 · apache/spark

dtenedor · 2022-07-25T22:39:48Z

What changes were proposed in this pull request?

Fix two bugs in existence DEFAULT value lookups during data source scans:

The ResolveDefaultColumns.getExistenceDefaultValues method should inspect the string in the column metadata at the EXISTS_DEFAULT_COLUMN_METADATA_KEY. It was erroneously using the CURRENT_DEFAULT_COLUMN_METADATA_KEY instead; this PR fixes that bug.
The Orc file scan operator should return NULL for any row/column where the Orc data itself indicates NULL. It was erroneously using the existence default value instead; this PR fixes that bug.

Why are the changes needed?

This fixes a bug in column DEFAULT analysis for V2 data source tables.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This PR changes existing unit tests and adds a new case covering this bug.

dtenedor · 2022-07-25T22:41:57Z

Hi @gengliangwang, this PR fixes a bug in V2 data source analysis support for column DEFAULT values in Apache Spark.

AmplabJenkins · 2022-07-26T05:19:05Z

Can one of the admins verify this patch?

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala

dtenedor

Thanks @beliefer for your review! Responded to changes, please look again when you have a chance.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala

beliefer · 2022-07-27T01:44:25Z

Thanks @beliefer for your review! Responded to changes, please look again when you have a chance.

Thank you for fix. Although my suggestions may contain my personal habits.

sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala

gengliangwang · 2022-07-27T22:15:51Z

LGTM. Thanks for the fix!

fetch from master

gengliangwang · 2022-07-28T20:31:03Z

@dtenedor could you also update the PR description about the ORC fix?

dtenedor · 2022-07-28T22:04:07Z

@gengliangwang Sure, this is done.

gengliangwang · 2022-07-28T23:06:40Z

Thanks, merging to master

YetiCuzMountain · 2022-09-22T02:11:28Z

I find a bug. When u show create table, there does not exist DEFAULT on some fields event if they were defined.

dongjoon-hyun

Hi, @dtenedor , @beliefer , @gengliangwang .
I found that this commit causes a huge regression on ORC native readers during Apache Spark 3.4.0 preparation via benchmark.

#39301 (comment)

I verified that this commit causes the regression.

cc @HyukjinKwon

beliefer · 2023-01-03T10:15:46Z

@dongjoon-hyun Thank you for you ping. @dtenedor Could you investigate the issue ?

dongjoon-hyun · 2023-01-03T10:49:11Z

Thanks. I created a PR to recover the ORC reader performance first by a partial and logical revert.

[SPARK-41858][SQL] Fix ORC reader perf regression due to DEFAULT value feature #39362

For the new DEFAULT-value feature's ORC bugs, we should handle separately more careful review in terms of the performance.

…e feature ### What changes were proposed in this pull request? This PR is a partial and logical revert of SPARK-39862, #37280, to fix the huge ORC reader perf regression (3x slower). SPARK-39862 should propose a fix without perf regression. ### Why are the changes needed? During Apache Spark 3.4.0 preparation, SPARK-41782 identified a perf regression. - #39301 (comment) ### Does this PR introduce _any_ user-facing change? After this PR, the regression is removed. However, the bug of DEFAULT value feature will remain. This should be handled separately. ### How was this patch tested? Pass the CI. Closes #39362 from dongjoon-hyun/SPARK-41858. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…rc reader ### What changes were proposed in this pull request? This PR fixes a correctness bug related to column DEFAULT values in Orc reader. * #37280 introduced a performance regression in the Orc reader. * #39362 fixed the performance regression, but stopped the column DEFAULT feature from working, causing a temporary correctness regression that we agreed for me to fix later. * This PR restores column DEFAULT functionality for Orc scans and fixes the correctness regression while not reintroducing the performance regression. ### Why are the changes needed? This PR fixes a correctness bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR updates a unit test to exercise that the Orc scan functionality is correct. Closes #39370 from dtenedor/fix-perf-bug-orc-reader. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

initial implementation

e706aa5

github-actions bot added the SQL label Jul 25, 2022

dtenedor added 2 commits July 25, 2022 17:09

fix test

e65b741

fix test

637769b