
Conversation

@Hisoka-X Hisoka-X (Member) commented Jun 8, 2023

What changes were proposed in this pull request?

When we use the Spark shell to submit a job like this:

$ spark-shell --conf spark.driver.memory=1g

val df = spark.range(5000000).withColumn("str", lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf"))
val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer")

df2.collect

This will cause the driver to hang indefinitely.
When we disable AQE, the java.lang.OutOfMemoryError is thrown as expected.

After checking the code, the root cause is an incorrect use of Throwable::initCause. It happens when the OOM is thrown at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L184 . Then https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L2401 is executed.
That code calls new SparkException(..., cause = oe).initCause(oe.getCause).
The Javadoc for Throwable::initCause says:

This method can be called at most once. It is generally called from within the constructor,
or immediately after creating the throwable. If this throwable was created with Throwable(Throwable)
or Throwable(String, Throwable), this method cannot be called even once.

So when we call it, an IllegalStateException is thrown instead. As a result, promise.tryFailure(ex) is never called and the driver stays blocked.
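To make the hang concrete, below is a minimal, self-contained sketch using only standard JVM and Scala concurrency primitives. It is not the actual BroadcastExchangeExec / SparkException code path: RuntimeException stands in for SparkException and a plain Promise stands in for the broadcast relation future, both purely for illustration.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Sketch of the failure mode (standard library only, not Spark's AQE code):
// if the code that is supposed to fail a Promise itself throws, whoever awaits
// the corresponding Future never hears about the error.
object DriverHangSketch extends App {
  val promise = Promise[Unit]()

  val broadcastThread = new Thread(new Runnable {
    override def run(): Unit = {
      try {
        throw new OutOfMemoryError("Not enough memory to build and broadcast the table")
      } catch {
        case oe: OutOfMemoryError =>
          // Buggy pattern: the cause is set via the constructor AND via initCause.
          val ex = new RuntimeException("broadcast failed", oe)
          ex.initCause(oe.getCause) // throws IllegalStateException here ...
          promise.tryFailure(ex)    // ... so this line is never reached
      }
    }
  })
  broadcastThread.start()

  try {
    // Stand-in for the driver waiting on the broadcast future. With the real
    // bug it waits forever; here we time out after 5 seconds to show that the
    // failure never arrived.
    Await.result(promise.future, 5.seconds)
  } catch {
    case _: TimeoutException =>
      println("promise never completed -- this is where the real driver hangs")
  }
}
```

Setting the cause exactly once (for example only through the constructor) removes the IllegalStateException, so promise.tryFailure(ex) can run and the waiting side fails fast with the OOM instead of hanging.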

Why are the changes needed?

Fix the bug where the OOM error is never reported.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test.

@github-actions github-actions bot added the SQL label Jun 8, 2023
@Hisoka-X Hisoka-X (Member, Author) commented Jun 8, 2023:

cc @cloud-fan @itholic

@Hisoka-X Hisoka-X changed the title [SPARK-42290][SQL] The OOM error can't be reported when AQE on [SPARK-42290][SQL] Fix the OOM error can't be reported when AQE on Jun 8, 2023
}
}

test("SPARK-42290: NotEnoughMemory error can't be create") {

A Contributor commented on this new test:

Yeah, OutOfMemoryError has no constructor that takes a cause.
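For context, here is a quick REPL-style sketch of the plain JDK behaviour that remark refers to (nothing Spark-specific is assumed): OutOfMemoryError only has a no-arg and a message-only constructor, so a cause can only be attached afterwards with initCause, and only once.

```scala
// Plain JDK semantics, shown for context (not the actual Spark test code).
// OutOfMemoryError offers only OutOfMemoryError() and OutOfMemoryError(String),
// so its cause has to be attached via initCause -- which works exactly once.
val oom = new OutOfMemoryError("Not enough memory to build and broadcast the table")
oom.initCause(new RuntimeException("root cause"))  // first call succeeds
// oom.initCause(new RuntimeException("again"))    // a second call would throw IllegalStateException
```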

@LuciferYang LuciferYang (Contributor) left a comment:

+1, LGTM

@LuciferYang LuciferYang (Contributor) commented:

cc @MaxGekk FYI

@dongjoon-hyun dongjoon-hyun (Member) left a comment:

Thank you for reporting and fixing this, @Hisoka-X .

According to the JIRA, is v3.4.0 the only affected version?

dongjoon-hyun referenced this pull request Jun 8, 2023
…CY_ERROR_TEMP_2226-2250

### What changes were proposed in this pull request?

This PR proposes to migrate 25 execution errors onto temporary error classes with the prefix `_LEGACY_ERROR_TEMP_2226` to `_LEGACY_ERROR_TEMP_2250`.

The `_LEGACY_ERROR_TEMP_` prefix indicates dev-facing error messages that won't be exposed to end users.

### Why are the changes needed?

To speed up the error class migration.

Migrating onto temporary error classes allows us to analyze the errors so that we can detect the most common error classes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt "test:testOnly *SQLQuerySuite"
$ build/sbt -Phive-thriftserver "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite"
```

Closes #38173 from itholic/SPARK-40540-2226-2250.

Authored-by: itholic <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
@dongjoon-hyun dongjoon-hyun (Member) left a comment:

+1, LGTM.

I also verified that this issue doesn't exist in Apache Spark 3.3.2, where the OutOfMemoryError is reported:

scala> val df = spark.range(5000000).withColumn("str", lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf"))
df: org.apache.spark.sql.DataFrame = [id: bigint, str: string]

scala> val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer")
df2: org.apache.spark.sql.DataFrame = [id: bigint, str: string]

scala> df2.collect
java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value.
  at org.apache.spark.sql.errors.QueryExecutionErrors$.notEnoughMemoryToBuildAndBroadcastTableError(QueryExecutionErrors.scala:1838)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:183)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:191)
  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.base/java.lang.Thread.run(Thread.java:829)

scala> sc.version
res1: String = 3.3.2

@dongjoon-hyun dongjoon-hyun (Member) commented:

Also, cc @kazuyukitanimura too

@kazuyukitanimura kazuyukitanimura (Contributor) left a comment:

LGTM (non-binding)

@dongjoon-hyun dongjoon-hyun (Member) commented:

I verified manually. Merged to master/3.4.

$ build/sbt "sql/testOnly *.QueryExecutionErrorsSuite -- -z SPARK-42290"
[info] QueryExecutionErrorsSuite:
13:10:15.573 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[info] - SPARK-42290: NotEnoughMemory error can't be create (211 milliseconds)
13:10:16.741 WARN org.apache.spark.sql.errors.QueryExecutionErrorsSuite:

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.errors.QueryExecutionErrorsSuite, threads: rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true) =====
[info] Run completed in 2 seconds, 415 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 22 s, completed Jun 8, 2023, 1:10:16 PM

dongjoon-hyun pushed a commit that referenced this pull request Jun 8, 2023

Closes #41517 from Hisoka-X/SPARK-42290_OOM_AQE_On.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4168e1a)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun (Member) commented:

Thank you, @Hisoka-X , @LuciferYang , @kazuyukitanimura .

@cloud-fan cloud-fan (Contributor) commented:

late LGTM

@Hisoka-X Hisoka-X deleted the SPARK-42290_OOM_AQE_On branch June 9, 2023 00:55
@itholic itholic (Contributor) commented Jun 9, 2023:

late LGTM. Thanks for the fix

czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023

Closes apache#41517 from Hisoka-X/SPARK-42290_OOM_AQE_On.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023

Closes apache#41517 from Hisoka-X/SPARK-42290_OOM_AQE_On.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4168e1a)
Signed-off-by: Dongjoon Hyun <[email protected]>
GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023

Closes apache#41517 from Hisoka-X/SPARK-42290_OOM_AQE_On.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4168e1a)
Signed-off-by: Dongjoon Hyun <[email protected]>
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023

Closes apache#41517 from Hisoka-X/SPARK-42290_OOM_AQE_On.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4168e1a)
Signed-off-by: Dongjoon Hyun <[email protected]>
