[SPARK-41858][SQL] Fix ORC reader perf regression due to DEFAULT value feature by dongjoon-hyun · Pull Request #39362 · apache/spark

dongjoon-hyun · 2023-01-03T10:43:00Z

What changes were proposed in this pull request?

This PR is a partial and logical revert of SPARK-39862, #37280, to fix the huge ORC reader perf regression (3x slower).

SPARK-39862 should propose a fix without perf regression.

Why are the changes needed?

During Apache Spark 3.4.0 preparation, SPARK-41782 identified a perf regression.

[SPARK-41782][TESTS] Regenerate benchmark results #39301 (comment)

Does this PR introduce any user-facing change?

After this PR, the regression is removed. However, the bug of DEFAULT value feature will remain. This should be handled separately.

How was this patch tested?

Pass the CI.

…e feature

LuciferYang

+1, LGTM if test pass

LuciferYang · 2023-01-03T13:09:23Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala

-                val writerFunc = newWriter(f.dataType, rowUpdater)
-                writerFunc(ordinal, value)
-              }
+          val writer = newWriter(f.dataType, rowUpdater)


A little magical, let me do some investigate too

dongjoon-hyun · 2023-01-03T17:03:17Z

Could you review this, @viirya ?

dongjoon-hyun · 2023-01-03T17:30:19Z

Also, cc @dtenedor , @beliefer , @HyukjinKwon , too

dtenedor

Thanks for doing the debugging investigation, and apologies for the regression! I'm happy to help with the fix if/where needed. I left a comment on the PR here, it looks like it may inadvertently break the DEFAULT value functionality for the Orc data source, possibly causing incorrectness for users. Let's figure it out together and come up with a fix.

dtenedor · 2023-01-03T17:29:30Z

sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala

              Seq(Map(true -> "xyz"))),
            Row(2,
-              null,
+              if (config.dataSource != "orc") {


It looks like this PR breaks the DEFAULT value functionality for the Orc data source (as shown by this unit test). If anyone is using this functionality, this PR will make their results incorrect. It would be better if we can fix the performance regression without changing the behavior. Is there some profile to show why the performance regression takes place? For example, is it because this change to the writer function introduces a new level of function call?

Thank you for review, @dtenedor .

Please see [SPARK-41782][TESTS] Regenerate benchmark results #39301 (comment) . We have a benchmark to detect this kind of perf regression. You can run it locally in your environment.

This is a partial revert to the original code which is the existing behavior before your PR like the previous Spark. As I mentioned in this PR description, SPARK-39862 should propose a new fix without perf regression.

New feature is good as long as not breaking the old behavior.

Thank you for review, @dtenedor .

Please see https://issues.apache.org/jira/browse/SPARK-41782 . We have a benchmark to detect this kind of perf regression. You can run it locally in your environment.

Thanks @dongjoon-hyun for the benchmark! The Jira simply comprises the title Regenerate benchmark results. Is there some instructions for how to run the benchmark?

This is a partial revert to the original code which is the existing behavior before your PR like the previous Spark. As I mentioned in the PR description, SPARK-39862 should propose a fix without perf regression.

New feature is good as long as not breaking the old behavior.

Agree on this. However, that bug fix was merged into Spark 3.3 on Jul. 28, 2022. Is it possible that users could have built pipelines since then using the new feature that would return incorrect results if we merged this PR?

If someone can tell me how to run the benchmark, I can play around with a fix for the perf regression that also does not change the behavior. I suspect that it's due to the extra function call overhead that takes place each time a value is written, but not sure without a profile :)

Sorry, I copied a wrong link. Here is the exact link, @dtenedor .
#39301 (comment)

Thanks, but I filed SPARK-41858 as the blocker of Apache Spark 3.4 because this really blocks Apache Spark 3.4 preparation from my side. If you don't mind, I'd prefer to merge this first and help you cleanly.

@dongjoon-hyun that is OK, but we should probably not release Spark 3.4 with a correctness regression either. It is equally important as performance. If we create another Jira blocking the release of 3.4 that covers fixing the correctness bug, it is fine to merge this PR.

At any rate, I am hoping to figure this out today. Then we should be unblocked 🤔

Thank you so much for your understanding. I'll file another blocker JIRA for that of course.

LGTM, let's merge this then

You can assign the new correctness Jira for me to fix.

viirya

Looks good to me as this reverts previous patch due to performance regression. Seems there are investigation going on.

Either we can wait for it, or, we can revert it first and apply updated patch later.

dongjoon-hyun · 2023-01-03T18:40:14Z

Thank you, @viirya , @dtenedor , @beliefer , @LuciferYang . Merged to master.

dongjoon-hyun · 2023-01-03T18:44:30Z

Here is the blocker-level correctness issue JIRA, SPARK-41862, for ORC reader part.

…rc reader ### What changes were proposed in this pull request? This PR fixes a correctness bug related to column DEFAULT values in Orc reader. * #37280 introduced a performance regression in the Orc reader. * #39362 fixed the performance regression, but stopped the column DEFAULT feature from working, causing a temporary correctness regression that we agreed for me to fix later. * This PR restores column DEFAULT functionality for Orc scans and fixes the correctness regression while not reintroducing the performance regression. ### Why are the changes needed? This PR fixes a correctness bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR updates a unit test to exercise that the Orc scan functionality is correct. Closes #39370 from dtenedor/fix-perf-bug-orc-reader. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

beliefer · 2023-01-04T01:43:23Z

LGTM. Although later.

[SPARK-41858][SQL] Fix ORC reader perf regression due to DEFAULT valu…

3a454e4

…e feature

github-actions bot added the SQL label Jan 3, 2023

This was referenced Jan 3, 2023

[SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups #37280

Closed

[SPARK-41782][TESTS] Regenerate benchmark results #39301

Closed

LuciferYang approved these changes Jan 3, 2023

View reviewed changes

dtenedor reviewed Jan 3, 2023

View reviewed changes

viirya approved these changes Jan 3, 2023

View reviewed changes

dtenedor approved these changes Jan 3, 2023

View reviewed changes

dongjoon-hyun closed this in d81e55e Jan 3, 2023

dongjoon-hyun deleted the SPARK-41858 branch January 3, 2023 18:44

dtenedor mentioned this pull request Jan 3, 2023

[SPARK-41862][SQL] Fix correctness bug related to DEFAULT values in Orc reader #39370

Closed

Conversation

dongjoon-hyun commented Jan 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

LuciferYang left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun commented Jan 3, 2023

Uh oh!

dongjoon-hyun commented Jan 3, 2023

Uh oh!

dtenedor left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun Jan 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun Jan 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viirya left a comment

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun commented Jan 3, 2023

Uh oh!

dongjoon-hyun commented Jan 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

beliefer commented Jan 4, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

dongjoon-hyun commented Jan 3, 2023 •

edited

Loading

dongjoon-hyun Jan 3, 2023 •

edited

Loading

dongjoon-hyun Jan 3, 2023 •

edited

Loading

dongjoon-hyun commented Jan 3, 2023 •

edited

Loading