[SPARK-33273][SQL] Fix a race condition in subquery execution #30765
Conversation
Kubernetes integration test starting
dongjoon-hyun left a comment:
Thank you, @cloud-fan .
cc @HyukjinKwon
viirya left a comment:
Nice catch!
Kubernetes integration test status success

Oh, it becomes more complex than the initial one. @cloud-fan, is this final?

BTW, cc @sarutak since he has another proposal.

Test build #132780 has finished for PR 30765 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
3aa84c3 to d66caca
viirya left a comment:
lgtm
Added more comments, it's final now. Sorry I missed the perf impact in the first commit.
dongjoon-hyun left a comment:
+1, LGTM (Pending CIs). Thanks!
Test build #132777 has finished for PR 30765 at commit
HyukjinKwon left a comment:
LGTM, thanks for fixing this.
Test build #132781 has finished for PR 30765 at commit

Kubernetes integration test starting

Kubernetes integration test status success
dongjoon-hyun left a comment:
Could you fix the explain test failures?
- org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite.explain.sql
- org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite.explain-aqe.sql
Also, please rebase onto master to bring in the lint recovery patch.
Good catch!

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #132803 has finished for PR 30765 at commit

Merged to master and branch-3.1.
### What changes were proposed in this pull request?

If we call `SubqueryExec.executeTake`, it will call `SubqueryExec.execute`, which triggers codegen of the query plan and creates an RDD. However, `SubqueryExec` already has a thread (`SubqueryExec.relationFuture`) to execute the query plan, which means we have two threads triggering codegen of the same query plan at the same time.

Spark codegen is not thread-safe, as we have places like `HashAggregateExec.bufferVars` that are shared variables. The bug in `SubqueryExec` may lead to correctness bugs.

Since https://issues.apache.org/jira/browse/SPARK-33119, `ScalarSubquery` calls `SubqueryExec.executeTake`, so flaky tests started to appear.

This PR fixes the bug by reimplementing #30016. We should pass the number of rows we want to collect to `SubqueryExec` at planning time, so that we can use `executeTake` inside `SubqueryExec.relationFuture`, and the caller side should always call `SubqueryExec.executeCollect`. This PR also adds checks so that we can make sure only `SubqueryExec.executeCollect` is called.

### Why are the changes needed?

Fix a correctness bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Run `build/sbt "sql/testOnly *SQLQueryTestSuite -- -z scalar-subquery-select"` more than 10 times. Previously it failed; now it passes.

Closes #30765 from cloud-fan/bug.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
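The fix described above can be sketched as a general pattern. The sketch below uses simplified, hypothetical names (`SubqueryLike`, `executionCount`), not the actual Spark classes: the row limit is fixed at construction (the analogue of planning time), the expensive non-thread-safe work runs exactly once inside a single future, and every caller goes through `executeCollect`, which only awaits that future instead of re-triggering execution.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical, simplified sketch of the pattern this PR describes
// (not the real SubqueryExec implementation).
class SubqueryLike(limit: Int) {
  private val executions = new AtomicInteger(0)

  // Analogous to SubqueryExec.relationFuture: the only place that
  // triggers plan execution (codegen + executeTake(limit)).
  private val relationFuture: Future[Seq[Int]] = Future {
    executions.incrementAndGet() // would race if triggered concurrently
    (1 to 100).take(limit)       // stand-in for executeTake(limit)
  }

  // Analogous to SubqueryExec.executeCollect: callers only await the
  // precomputed result and never re-trigger execution themselves.
  def executeCollect(): Seq[Int] = Await.result(relationFuture, 10.seconds)

  def executionCount: Int = executions.get()
}
```

Even with many concurrent callers of `executeCollect`, the plan body runs exactly once, which is what removes the two-threads-doing-codegen race.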
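The PR description also mentions adding checks so that only `SubqueryExec.executeCollect` is called. A minimal, hypothetical sketch of such a guard (simplified names, not the actual Spark code): direct `executeTake` on the subquery node fails fast instead of silently re-triggering the non-thread-safe execution path.

```scala
// Hypothetical sketch of the guard mentioned in this PR (not the real
// Spark implementation): reject executeTake on subquery-like nodes so
// that all callers must go through executeCollect.
class ExecuteTakeNotAllowed extends IllegalStateException(
  "call executeCollect instead of executeTake on subquery nodes")

trait PlanNodeLike {
  def executeTake(n: Int): Seq[Int]
  def executeCollect(): Seq[Int]
}

class GuardedSubquery(precomputed: Seq[Int]) extends PlanNodeLike {
  // Guard: fail fast rather than re-trigger (non-thread-safe) execution.
  override def executeTake(n: Int): Seq[Int] =
    throw new ExecuteTakeNotAllowed
  // The only supported entry point: return the already-computed rows.
  override def executeCollect(): Seq[Int] = precomputed
}
```

Failing loudly here turns a subtle, timing-dependent correctness bug into an immediate, easily debuggable exception at the call site.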