[SPARK-33273][SQL] Fix a race condition in subquery execution #30765
Conversation
Kubernetes integration test starting
dongjoon-hyun left a comment:
Thank you, @cloud-fan .
cc @HyukjinKwon
viirya left a comment:
Nice catch!
Kubernetes integration test status success

Oh, it becomes more complex than the initial one. @cloud-fan, is this final?

BTW, cc @sarutak since he has another proposal.

Test build #132780 has finished for PR 30765 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
3aa84c3 to d66caca
viirya left a comment:
lgtm
Added more comments, it's final now. Sorry I missed the perf impact in the first commit.
dongjoon-hyun left a comment:
+1, LGTM (Pending CIs). Thanks!
Test build #132777 has finished for PR 30765 at commit
HyukjinKwon left a comment:
LGTM, thanks for fixing this.
Test build #132781 has finished for PR 30765 at commit

Kubernetes integration test starting

Kubernetes integration test status success
dongjoon-hyun left a comment:
Could you fix the explain test failures?
- org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite.explain.sql
- org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite.explain-aqe.sql
Also, please rebase onto master to bring in the lint recovery patch.
Good catch!

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #132803 has finished for PR 30765 at commit

Merged to master and branch-3.1.
### What changes were proposed in this pull request?

If we call `SubqueryExec.executeTake`, it will call `SubqueryExec.execute`, which triggers codegen of the query plan and creates an RDD. However, `SubqueryExec` already has a thread (`SubqueryExec.relationFuture`) to execute the query plan, which means we have two threads triggering codegen of the same query plan at the same time.

Spark codegen is not thread-safe, as we have places like `HashAggregateExec.bufferVars` that are shared variables. The bug in `SubqueryExec` may lead to correctness bugs.

Since https://issues.apache.org/jira/browse/SPARK-33119, `ScalarSubquery` calls `SubqueryExec.executeTake`, so flaky tests started to appear.

This PR fixes the bug by reimplementing #30016. We should pass the number of rows we want to collect to `SubqueryExec` at planning time, so that we can use `executeTake` inside `SubqueryExec.relationFuture`, and the caller side should always call `SubqueryExec.executeCollect`. This PR also adds checks so that we can make sure only `SubqueryExec.executeCollect` is called.

### Why are the changes needed?

Fix a correctness bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Run `build/sbt "sql/testOnly *SQLQueryTestSuite -- -z scalar-subquery-select"` more than 10 times. Previously it failed; now it passes.

Closes #30765 from cloud-fan/bug.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
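The fix described above can be sketched as a general pattern. The sketch below uses simplified, hypothetical names (`SubqueryLike`, `executionCount`), not the actual Spark classes: the row limit is fixed at construction (the analogue of planning time), the expensive non-thread-safe work runs exactly once inside a single future, and every caller goes through `executeCollect`, which only awaits that future instead of re-triggering execution.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical, simplified sketch of the pattern this PR describes
// (not the real SubqueryExec implementation).
class SubqueryLike(limit: Int) {
  private val executions = new AtomicInteger(0)

  // Analogous to SubqueryExec.relationFuture: the only place that
  // triggers plan execution (codegen + executeTake(limit)).
  private val relationFuture: Future[Seq[Int]] = Future {
    executions.incrementAndGet() // would race if triggered concurrently
    (1 to 100).take(limit)       // stand-in for executeTake(limit)
  }

  // Analogous to SubqueryExec.executeCollect: callers only await the
  // precomputed result and never re-trigger execution themselves.
  def executeCollect(): Seq[Int] = Await.result(relationFuture, 10.seconds)

  def executionCount: Int = executions.get()
}
```

Even with many concurrent callers of `executeCollect`, the plan body runs exactly once, which is what removes the two-threads-doing-codegen race.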
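The PR description also mentions adding checks so that only `SubqueryExec.executeCollect` is called. A minimal, hypothetical sketch of such a guard (simplified names, not the actual Spark code): direct `executeTake` on the subquery node fails fast instead of silently re-triggering the non-thread-safe execution path.

```scala
// Hypothetical sketch of the guard mentioned in this PR (not the real
// Spark implementation): reject executeTake on subquery-like nodes so
// that all callers must go through executeCollect.
class ExecuteTakeNotAllowed extends IllegalStateException(
  "call executeCollect instead of executeTake on subquery nodes")

trait PlanNodeLike {
  def executeTake(n: Int): Seq[Int]
  def executeCollect(): Seq[Int]
}

class GuardedSubquery(precomputed: Seq[Int]) extends PlanNodeLike {
  // Guard: fail fast rather than re-trigger (non-thread-safe) execution.
  override def executeTake(n: Int): Seq[Int] =
    throw new ExecuteTakeNotAllowed
  // The only supported entry point: return the already-computed rows.
  override def executeCollect(): Seq[Int] = precomputed
}
```

Failing loudly here turns a subtle, timing-dependent correctness bug into an immediate, easily debuggable exception at the call site.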