[SPARK-33119][SQL] ScalarSubquery should returns the first two rows to avoid Driver OOM #30016

wangyum · 2020-10-12T10:37:30Z

What changes were proposed in this pull request?

ScalarSubquery should return only the first two rows of the subquery result instead of collecting every row.

Why are the changes needed?

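To avoid a driver OOM when the subquery returns a large number of rows. Two rows are enough to detect that a scalar subquery returned more than one row.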
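As a rough illustration of the idea (not the code in this patch), the sketch below evaluates a scalar subquery from a row iterator while materializing at most two rows. `ScalarSubquerySketch` and `scalarValue` are hypothetical names used only for this example, and a plain `Iterator` stands in for the subquery's physical plan.

```scala
// Hypothetical helper for illustration only; this is not Spark's ScalarSubquery.
object ScalarSubquerySketch {
  // Returns the single value of a scalar subquery, or None if it produced no rows.
  // Only the first two rows are ever pulled from the iterator, so the memory
  // used here does not grow with the size of the subquery result.
  def scalarValue[T](rows: Iterator[T]): Option[T] = {
    val firstTwo = rows.take(2).toList
    if (firstTwo.length > 1) {
      // A second row is all the evidence needed to raise the error.
      throw new RuntimeException(
        "more than one row returned by a subquery used as an expression")
    }
    firstTwo.headOption
  }
}
```

With this shape, even an unbounded input such as `Iterator.from(1)` fails fast with the "more than one row" error rather than being fully collected first.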

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing test:

spark/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

Lines 147 to 154 in d6f3138

    
    test("runtime error when the number of rows is greater than 1") {
      val error2 = intercept[RuntimeException] {
        sql("select (select a from (select 1 as a union all select 2 as a) t) as b").collect()
      }
      assert(error2.getMessage.contains(
        "more than one row returned by a subquery used as an expression")
      )
    }

SparkQA · 2020-10-12T11:43:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34289/

SparkQA · 2020-10-12T12:08:30Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34289/

SparkQA · 2020-10-12T15:03:48Z

Test build #129683 has finished for PR 30016 at commit b4d7ffc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-10-13T08:41:38Z

cc @cloud-fan and @maryannxue FYI.

Merged to master.

### What changes were proposed in this pull request?

If we call `SubqueryExec.executeTake`, it will call `SubqueryExec.execute` which will trigger the codegen of the query plan and create an RDD. However, `SubqueryExec` already has a thread (`SubqueryExec.relationFuture`) to execute the query plan, which means we have 2 threads triggering codegen of the same query plan at the same time.

Spark codegen is not thread-safe, as we have places like `HashAggregateExec.bufferVars` that is a shared variable. The bug in `SubqueryExec` may lead to correctness bugs.

Since https://issues.apache.org/jira/browse/SPARK-33119, `ScalarSubquery` will call `SubqueryExec.executeTake`, so flaky tests start to appear.

This PR fixes the bug by reimplementing #30016. We should pass the number of rows we want to collect to `SubqueryExec` at planning time, so that we can use `executeTake` inside `SubqueryExec.relationFuture`, and the caller side should always call `SubqueryExec.executeCollect`. This PR also adds checks so that we can make sure only `SubqueryExec.executeCollect` is called.

### Why are the changes needed?

fix correctness bug.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

run `build/sbt "sql/testOnly *SQLQueryTestSuite -- -z scalar-subquery-select"` more than 10 times. Previously it fails, now it passes.

Closes #30765 from cloud-fan/bug.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
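The follow-up design described above can be sketched in plain Scala, assuming a simplified model in which a function returning an `Iterator` stands in for the query plan. `SubquerySketch`, its `plan` parameter, and the 10-second timeout are hypothetical and are not Spark's `SubqueryExec`. The names `maxNumRows`, `relationFuture`, `executeCollect`, and `executeTake` are used only to mirror the description. The point of the sketch is that the row limit is fixed at construction time, the plan is run by a single future, and callers can only wait on that future.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for a subquery operator; this is not Spark's SubqueryExec.
final class SubquerySketch[T](plan: () => Iterator[T], maxNumRows: Option[Int])(
    implicit ec: ExecutionContext) {

  // The plan runs exactly once, inside this future. The row limit chosen at
  // construction ("planning") time is applied here, so no caller needs to
  // trigger a second execution of the same plan from another thread.
  private lazy val relationFuture: Future[List[T]] = Future {
    val rows = plan()
    maxNumRows match {
      case Some(n) => rows.take(n).toList
      case None    => rows.toList
    }
  }

  // The only supported entry point: wait for the single background execution.
  def executeCollect(): List[T] = Await.result(relationFuture, 10.seconds)

  // Guard against the racy path: callers must not start their own execution.
  def executeTake(n: Int): List[T] =
    throw new IllegalStateException("only executeCollect is supported")
}
```

Because every caller goes through `executeCollect`, repeated calls reuse the same result and the plan function is invoked only once.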

Only return the first two rows as an array to avoid Driver OOM.

b4d7ffc

maropu approved these changes Oct 12, 2020


HyukjinKwon approved these changes Oct 13, 2020


HyukjinKwon closed this in e34f2d8 Oct 13, 2020

wangyum deleted the SPARK-33119 branch October 13, 2020 08:55

cloud-fan mentioned this pull request Dec 14, 2020

[SPARK-33273][SQL] Fix a race condition in subquery execution #30765

Closed
