Skip to content

[SPARK-41858][SQL] Fix ORC reader perf regression due to DEFAULT value feature#39362

Closed
dongjoon-hyun wants to merge 1 commit intoapache:masterfrom
dongjoon-hyun:SPARK-41858
Closed

[SPARK-41858][SQL] Fix ORC reader perf regression due to DEFAULT value feature#39362
dongjoon-hyun wants to merge 1 commit intoapache:masterfrom
dongjoon-hyun:SPARK-41858

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Jan 3, 2023

What changes were proposed in this pull request?

This PR is a partial and logical revert of SPARK-39862, #37280, to fix the huge ORC reader perf regression (3x slower).

SPARK-39862 should propose a fix without perf regression.

Why are the changes needed?

During Apache Spark 3.4.0 preparation, SPARK-41782 identified a perf regression.

Does this PR introduce any user-facing change?

After this PR, the regression is removed. However, the bug of DEFAULT value feature will remain. This should be handled separately.

How was this patch tested?

Pass the CI.

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM if test pass

val writerFunc = newWriter(f.dataType, rowUpdater)
writerFunc(ordinal, value)
}
val writer = newWriter(f.dataType, rowUpdater)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little magical, let me do some investigate too

@dongjoon-hyun
Copy link
Member Author

Could you review this, @viirya ?

@dongjoon-hyun
Copy link
Member Author

Also, cc @dtenedor , @beliefer , @HyukjinKwon , too

Copy link
Contributor

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing the debugging investigation, and apologies for the regression! I'm happy to help with the fix if/where needed. I left a comment on the PR here, it looks like it may inadvertently break the DEFAULT value functionality for the Orc data source, possibly causing incorrectness for users. Let's figure it out together and come up with a fix.

Seq(Map(true -> "xyz"))),
Row(2,
null,
if (config.dataSource != "orc") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this PR breaks the DEFAULT value functionality for the Orc data source (as shown by this unit test). If anyone is using this functionality, this PR will make their results incorrect. It would be better if we can fix the performance regression without changing the behavior. Is there some profile to show why the performance regression takes place? For example, is it because this change to the writer function introduces a new level of function call?

image

Copy link
Member Author

@dongjoon-hyun dongjoon-hyun Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for review, @dtenedor .

  • Please see [SPARK-41782][TESTS] Regenerate benchmark results #39301 (comment) . We have a benchmark to detect this kind of perf regression. You can run it locally in your environment.
  • This is a partial revert to the original code which is the existing behavior before your PR like the previous Spark. As I mentioned in this PR description, SPARK-39862 should propose a new fix without perf regression.

New feature is good as long as not breaking the old behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for review, @dtenedor .

Thanks @dongjoon-hyun for the benchmark! The Jira simply comprises the title Regenerate benchmark results. Is there some instructions for how to run the benchmark?

  • This is a partial revert to the original code which is the existing behavior before your PR like the previous Spark. As I mentioned in the PR description, SPARK-39862 should propose a fix without perf regression.

New feature is good as long as not breaking the old behavior.

Agree on this. However, that bug fix was merged into Spark 3.3 on Jul. 28, 2022. Is it possible that users could have built pipelines since then using the new feature that would return incorrect results if we merged this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone can tell me how to run the benchmark, I can play around with a fix for the perf regression that also does not change the behavior. I suspect that it's due to the extra function call overhead that takes place each time a value is written, but not sure without a profile :)

Copy link
Member Author

@dongjoon-hyun dongjoon-hyun Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I copied a wrong link. Here is the exact link, @dtenedor .
#39301 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but I filed SPARK-41858 as the blocker of Apache Spark 3.4 because this really blocks Apache Spark 3.4 preparation from my side. If you don't mind, I'd prefer to merge this first and help you cleanly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun that is OK, but we should probably not release Spark 3.4 with a correctness regression either. It is equally important as performance. If we create another Jira blocking the release of 3.4 that covers fixing the correctness bug, it is fine to merge this PR.

At any rate, I am hoping to figure this out today. Then we should be unblocked 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for your understanding. I'll file another blocker JIRA for that of course.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's merge this then

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can assign the new correctness Jira for me to fix.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me as this reverts previous patch due to performance regression. Seems there are investigation going on.

Either we can wait for it, or, we can revert it first and apply updated patch later.

@dongjoon-hyun
Copy link
Member Author

Thank you, @viirya , @dtenedor , @beliefer , @LuciferYang . Merged to master.

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Jan 3, 2023

Here is the blocker-level correctness issue JIRA, SPARK-41862, for ORC reader part.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-41858 branch January 3, 2023 18:44
HyukjinKwon pushed a commit that referenced this pull request Jan 4, 2023
…rc reader

### What changes were proposed in this pull request?

This PR fixes a correctness bug related to column DEFAULT values in Orc reader.

* #37280 introduced a performance regression in the Orc reader.
* #39362 fixed the performance regression, but stopped the column DEFAULT feature from working, causing a temporary correctness regression that we agreed for me to fix later.
* This PR restores column DEFAULT functionality for Orc scans and fixes the correctness regression while not reintroducing the performance regression.

### Why are the changes needed?

This PR fixes a correctness bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR updates a unit test to exercise that the Orc scan functionality is correct.

Closes #39370 from dtenedor/fix-perf-bug-orc-reader.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@beliefer
Copy link
Contributor

beliefer commented Jan 4, 2023

LGTM. Although later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants