[SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups#37280
[SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups#37280dtenedor wants to merge 11 commits intoapache:masterfrom dtenedor:fix-bug-dsv2
Conversation
|
Hi @gengliangwang, this PR fixes a bug in V2 data source analysis support for column DEFAULT values in Apache Spark. |
|
Can one of the admins verify this patch? |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala
Outdated
Show resolved
Hide resolved
Thank you for fix. Although my suggestions may contain my personal habits. |
sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
Show resolved
Hide resolved
|
LGTM. Thanks for the fix! |
|
@dtenedor could you also update the PR description about the ORC fix? |
|
@gengliangwang Sure, this is done. |
|
Thanks, merging to master |
|
I find a bug. When u show create table, there does not exist DEFAULT on some fields event if they were defined. |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Hi, @dtenedor , @beliefer , @gengliangwang .
I found that this commit causes a huge regression on ORC native readers during Apache Spark 3.4.0 preparation via benchmark.
I verified that this commit causes the regression.
cc @HyukjinKwon
|
@dongjoon-hyun Thank you for you ping. @dtenedor Could you investigate the issue ? |
|
Thanks. I created a PR to recover the ORC reader performance first by a partial and logical revert. For the new DEFAULT-value feature's ORC bugs, we should handle separately more careful review in terms of the performance. |
…e feature ### What changes were proposed in this pull request? This PR is a partial and logical revert of SPARK-39862, #37280, to fix the huge ORC reader perf regression (3x slower). SPARK-39862 should propose a fix without perf regression. ### Why are the changes needed? During Apache Spark 3.4.0 preparation, SPARK-41782 identified a perf regression. - #39301 (comment) ### Does this PR introduce _any_ user-facing change? After this PR, the regression is removed. However, the bug of DEFAULT value feature will remain. This should be handled separately. ### How was this patch tested? Pass the CI. Closes #39362 from dongjoon-hyun/SPARK-41858. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…rc reader ### What changes were proposed in this pull request? This PR fixes a correctness bug related to column DEFAULT values in Orc reader. * #37280 introduced a performance regression in the Orc reader. * #39362 fixed the performance regression, but stopped the column DEFAULT feature from working, causing a temporary correctness regression that we agreed for me to fix later. * This PR restores column DEFAULT functionality for Orc scans and fixes the correctness regression while not reintroducing the performance regression. ### Why are the changes needed? This PR fixes a correctness bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR updates a unit test to exercise that the Orc scan functionality is correct. Closes #39370 from dtenedor/fix-perf-bug-orc-reader. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Fix two bugs in existence DEFAULT value lookups during data source scans:
The
ResolveDefaultColumns.getExistenceDefaultValuesmethod should inspect the string in the column metadata at theEXISTS_DEFAULT_COLUMN_METADATA_KEY. It was erroneously using theCURRENT_DEFAULT_COLUMN_METADATA_KEYinstead; this PR fixes that bug.The Orc file scan operator should return NULL for any row/column where the Orc data itself indicates NULL. It was erroneously using the existence default value instead; this PR fixes that bug.
Why are the changes needed?
This fixes a bug in column DEFAULT analysis for V2 data source tables.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
This PR changes existing unit tests and adds a new case covering this bug.