Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Dec 14, 2020

What changes were proposed in this pull request?

If we call SubqueryExec.executeTake, it will call SubqueryExec.execute which will trigger the codegen of the query plan and create an RDD. However, SubqueryExec already has a thread (SubqueryExec.relationFuture) to execute the query plan, which means we have 2 threads triggering codegen of the same query plan at the same time.

Spark codegen is not thread-safe, as we have places like HashAggregateExec.bufferVars that is a shared variable. The bug in SubqueryExec may lead to correctness bugs.

Since https://issues.apache.org/jira/browse/SPARK-33119, ScalarSubquery will call SubqueryExec.executeTake, so flaky tests start to appear.

This PR fixes the bug by reimplementing #30016 . We should pass the number of rows we want to collect to SubqueryExec at planning time, so that we can use executeTake inside SubqueryExec.relationFuture, and the caller side should always call SubqueryExec.executeCollect. This PR also adds checks so that we can make sure only SubqueryExec.executeCollect is called.

Why are the changes needed?

fix correctness bug.

Does this PR introduce any user-facing change?

no

How was this patch tested?

run build/sbt "sql/testOnly *SQLQueryTestSuite -- -z scalar-subquery-select" more than 10 times. Previously it fails, now it passes.

@github-actions github-actions bot added the SQL label Dec 14, 2020
@cloud-fan
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Dec 14, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37378/

dongjoon-hyun
dongjoon-hyun previously approved these changes Dec 14, 2020
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @cloud-fan .

cc @HyukjinKwon

viirya
viirya previously approved these changes Dec 14, 2020
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@SparkQA
Copy link

SparkQA commented Dec 14, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37378/

@dongjoon-hyun
Copy link
Member

Oh, it becomes more complex than the initial one. @cloud-fan , is this a final?

@dongjoon-hyun
Copy link
Member

BTW, cc @sarutak since he has an another proposal.

@dongjoon-hyun dongjoon-hyun dismissed their stale review December 14, 2020 19:17

Patch is changed.

@viirya viirya dismissed their stale review December 14, 2020 19:20

need to look new change.

@SparkQA
Copy link

SparkQA commented Dec 14, 2020

Test build #132780 has finished for PR 30765 at commit 7dbf490.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SubqueryExec(name: String, child: SparkPlan, maxNumRows: Option[Int])

@cloud-fan cloud-fan force-pushed the bug branch 2 times, most recently from 3aa84c3 to d66caca Compare December 14, 2020 19:29
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@cloud-fan
Copy link
Contributor Author

Added more comments, it's final now. Sorry I missed the perf impact in the first commit.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CIs). Thanks!

@SparkQA
Copy link

SparkQA commented Dec 14, 2020

Test build #132777 has finished for PR 30765 at commit a5bb677.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-33273][SQL] fix a race condition in subquery execution [SPARK-33273][SQL] Fix a race condition in subquery execution Dec 14, 2020
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing this.

@SparkQA
Copy link

SparkQA commented Dec 14, 2020

Test build #132781 has finished for PR 30765 at commit d66caca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SubqueryExec(name: String, child: SparkPlan, maxNumRows: Option[Int] = None)

@SparkQA
Copy link

SparkQA commented Dec 14, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37383/

@SparkQA
Copy link

SparkQA commented Dec 14, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37383/

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix explain test failures?

  • org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite.explain.sql
  • org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite.explain-aqe.sql

Also, please rebase to the master to bring the lint recovery patch.

@yaooqinn
Copy link
Member

Good catch!

@SparkQA
Copy link

SparkQA commented Dec 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37405/

@SparkQA
Copy link

SparkQA commented Dec 15, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37405/

@SparkQA
Copy link

SparkQA commented Dec 15, 2020

Test build #132803 has finished for PR 30765 at commit 30a0f3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master and branch-3.1.

HyukjinKwon pushed a commit that referenced this pull request Dec 15, 2020
### What changes were proposed in this pull request?

If we call `SubqueryExec.executeTake`, it will call `SubqueryExec.execute` which will trigger the codegen of the query plan and create an RDD. However, `SubqueryExec` already has a thread (`SubqueryExec.relationFuture`) to execute the query plan, which means we have 2 threads triggering codegen of the same query plan at the same time.

Spark codegen is not thread-safe, as we have places like `HashAggregateExec.bufferVars` that is a shared variable. The bug in `SubqueryExec` may lead to correctness bugs.

Since https://issues.apache.org/jira/browse/SPARK-33119, `ScalarSubquery` will call `SubqueryExec.executeTake`, so flaky tests start to appear.

This PR fixes the bug by reimplementing #30016 . We should pass the number of rows we want to collect to `SubqueryExec` at planning time, so that we can use `executeTake` inside `SubqueryExec.relationFuture`, and the caller side should always call `SubqueryExec.executeCollect`. This PR also adds checks so that we can make sure only `SubqueryExec.executeCollect` is called.

### Why are the changes needed?

fix correctness bug.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

run `build/sbt "sql/testOnly *SQLQueryTestSuite  -- -z scalar-subquery-select"` more than 10 times. Previously it fails, now it passes.

Closes #30765 from cloud-fan/bug.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants