[SPARK-41858][SQL] Fix ORC reader perf regression due to DEFAULT value feature#39362
[SPARK-41858][SQL] Fix ORC reader perf regression due to DEFAULT value feature#39362dongjoon-hyun wants to merge 1 commit intoapache:masterfrom dongjoon-hyun:SPARK-41858
Conversation
LuciferYang
left a comment
There was a problem hiding this comment.
+1, LGTM if test pass
| val writerFunc = newWriter(f.dataType, rowUpdater) | ||
| writerFunc(ordinal, value) | ||
| } | ||
| val writer = newWriter(f.dataType, rowUpdater) |
There was a problem hiding this comment.
A little magical, let me do some investigate too
|
Could you review this, @viirya ? |
|
Also, cc @dtenedor , @beliefer , @HyukjinKwon , too |
dtenedor
left a comment
There was a problem hiding this comment.
Thanks for doing the debugging investigation, and apologies for the regression! I'm happy to help with the fix if/where needed. I left a comment on the PR here, it looks like it may inadvertently break the DEFAULT value functionality for the Orc data source, possibly causing incorrectness for users. Let's figure it out together and come up with a fix.
| Seq(Map(true -> "xyz"))), | ||
| Row(2, | ||
| null, | ||
| if (config.dataSource != "orc") { |
There was a problem hiding this comment.
It looks like this PR breaks the DEFAULT value functionality for the Orc data source (as shown by this unit test). If anyone is using this functionality, this PR will make their results incorrect. It would be better if we can fix the performance regression without changing the behavior. Is there some profile to show why the performance regression takes place? For example, is it because this change to the writer function introduces a new level of function call?
There was a problem hiding this comment.
Thank you for review, @dtenedor .
- Please see [SPARK-41782][TESTS] Regenerate benchmark results #39301 (comment) . We have a benchmark to detect this kind of perf regression. You can run it locally in your environment.
- This is a partial revert to the original code which is the existing behavior before your PR like the previous Spark. As I mentioned in this PR description, SPARK-39862 should propose a new fix without perf regression.
New feature is good as long as not breaking the old behavior.
There was a problem hiding this comment.
Thank you for review, @dtenedor .
- Please see https://issues.apache.org/jira/browse/SPARK-41782 . We have a benchmark to detect this kind of perf regression. You can run it locally in your environment.
Thanks @dongjoon-hyun for the benchmark! The Jira simply comprises the title Regenerate benchmark results. Is there some instructions for how to run the benchmark?
- This is a partial revert to the original code which is the existing behavior before your PR like the previous Spark. As I mentioned in the PR description, SPARK-39862 should propose a fix without perf regression.
New feature is good as long as not breaking the old behavior.
Agree on this. However, that bug fix was merged into Spark 3.3 on Jul. 28, 2022. Is it possible that users could have built pipelines since then using the new feature that would return incorrect results if we merged this PR?
There was a problem hiding this comment.
If someone can tell me how to run the benchmark, I can play around with a fix for the perf regression that also does not change the behavior. I suspect that it's due to the extra function call overhead that takes place each time a value is written, but not sure without a profile :)
There was a problem hiding this comment.
Sorry, I copied a wrong link. Here is the exact link, @dtenedor .
#39301 (comment)
There was a problem hiding this comment.
Thanks, but I filed SPARK-41858 as the blocker of Apache Spark 3.4 because this really blocks Apache Spark 3.4 preparation from my side. If you don't mind, I'd prefer to merge this first and help you cleanly.
There was a problem hiding this comment.
@dongjoon-hyun that is OK, but we should probably not release Spark 3.4 with a correctness regression either. It is equally important as performance. If we create another Jira blocking the release of 3.4 that covers fixing the correctness bug, it is fine to merge this PR.
At any rate, I am hoping to figure this out today. Then we should be unblocked 🤔
There was a problem hiding this comment.
Thank you so much for your understanding. I'll file another blocker JIRA for that of course.
There was a problem hiding this comment.
LGTM, let's merge this then
There was a problem hiding this comment.
You can assign the new correctness Jira for me to fix.
viirya
left a comment
There was a problem hiding this comment.
Looks good to me as this reverts previous patch due to performance regression. Seems there are investigation going on.
Either we can wait for it, or, we can revert it first and apply updated patch later.
|
Thank you, @viirya , @dtenedor , @beliefer , @LuciferYang . Merged to master. |
|
Here is the blocker-level correctness issue JIRA, SPARK-41862, for ORC reader part. |
…rc reader ### What changes were proposed in this pull request? This PR fixes a correctness bug related to column DEFAULT values in Orc reader. * #37280 introduced a performance regression in the Orc reader. * #39362 fixed the performance regression, but stopped the column DEFAULT feature from working, causing a temporary correctness regression that we agreed for me to fix later. * This PR restores column DEFAULT functionality for Orc scans and fixes the correctness regression while not reintroducing the performance regression. ### Why are the changes needed? This PR fixes a correctness bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR updates a unit test to exercise that the Orc scan functionality is correct. Closes #39370 from dtenedor/fix-perf-bug-orc-reader. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
|
LGTM. Although later. |

What changes were proposed in this pull request?
This PR is a partial and logical revert of SPARK-39862, #37280, to fix the huge ORC reader perf regression (3x slower).
SPARK-39862 should propose a fix without perf regression.
Why are the changes needed?
During Apache Spark 3.4.0 preparation, SPARK-41782 identified a perf regression.
Does this PR introduce any user-facing change?
After this PR, the regression is removed. However, the bug of DEFAULT value feature will remain. This should be handled separately.
How was this patch tested?
Pass the CI.