-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-42290][SQL] Fix the OOM error can't be reported when AQE on #41517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
| } | ||
|
|
||
| test("SPARK-42290: NotEnoughMemory error can't be create") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, OutOfMemoryError has no with cause constructor.
LuciferYang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM
|
cc @MaxGekk FYI |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for reporting and fix, @Hisoka-X .
According to the JIRA, v3.4.0 is this only one affected?
…CY_ERROR_TEMP_2226-2250 ### What changes were proposed in this pull request? This PR proposes to migrate 25 execution errors onto temporary error classes with the prefix `_LEGACY_ERROR_TEMP_2226` to `_LEGACY_ERROR_TEMP_2250`. The error classes are prefixed with `_LEGACY_ERROR_TEMP_` indicates the dev-facing error messages, and won't be exposed to end users. ### Why are the changes needed? To speed-up the error class migration. The migration on temporary error classes allow us to analyze the errors, so we can detect the most popular error classes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ``` $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" $ build/sbt "test:testOnly *SQLQuerySuite" $ build/sbt -Phive-thriftserver "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite" ``` Closes #38173 from itholic/SPARK-40540-2226-2250. Authored-by: itholic <[email protected]> Signed-off-by: Max Gekk <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
I also verified that this doesn't exist in Apache Spark 3.3.2.
scala> val df = spark.range(5000000).withColumn("str", lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf"))
df: org.apache.spark.sql.DataFrame = [id: bigint, str: string]
scala> val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer")
df2: org.apache.spark.sql.DataFrame = [id: bigint, str: string]
scala> df2.collect
java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value.
at org.apache.spark.sql.errors.QueryExecutionErrors$.notEnoughMemoryToBuildAndBroadcastTableError(QueryExecutionErrors.scala:1838)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:183)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:191)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
scala> sc.version
res1: String = 3.3.2
|
Also, cc @kazuyukitanimura too |
kazuyukitanimura
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (non-binding)
|
I verified manually. Merged to master/3.4. |
### What changes were proposed in this pull request?
When we use spark shell to submit job like this:
```scala
$ spark-shell --conf spark.driver.memory=1g
val df = spark.range(5000000).withColumn("str", lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf"))
val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer")
df2.collect
```
This will cause the driver to hang indefinitely.
When we disable AQE, the `java.lang.OutOfMemoryError` will be throws.
After I check the code, the reason are wrong way to use `Throwable::initCause`. It happened when OOM be throw on https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L184 . Then https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L2401 will be executed.
It use `new SparkException(..., case=oe).initCause(oe.getCause)`.
The doc in `Throwable::initCause` say
```
This method can be called at most once. It is generally called from within the constructor,
or immediately after creating the throwable. If this throwable was created with Throwable(Throwable)
or Throwable(String, Throwable), this method cannot be called even once.
```
So when we call it, the `IllegalStateException` will be throw. Finally, the `promise.tryFailure(ex)` never be called. The driver will be blocked.
### Why are the changes needed?
Fix the OOM never be reported bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add new test
Closes #41517 from Hisoka-X/SPARK-42290_OOM_AQE_On.
Authored-by: Jia Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4168e1a)
Signed-off-by: Dongjoon Hyun <[email protected]>
|
Thank you, @Hisoka-X , @LuciferYang , @kazuyukitanimura . |
|
late LGTM |
|
late LGTM. Thanks for the fix |
### What changes were proposed in this pull request?
When we use spark shell to submit job like this:
```scala
$ spark-shell --conf spark.driver.memory=1g
val df = spark.range(5000000).withColumn("str", lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf"))
val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer")
df2.collect
```
This will cause the driver to hang indefinitely.
When we disable AQE, the `java.lang.OutOfMemoryError` will be throws.
After I check the code, the reason are wrong way to use `Throwable::initCause`. It happened when OOM be throw on https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L184 . Then https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L2401 will be executed.
It use `new SparkException(..., case=oe).initCause(oe.getCause)`.
The doc in `Throwable::initCause` say
```
This method can be called at most once. It is generally called from within the constructor,
or immediately after creating the throwable. If this throwable was created with Throwable(Throwable)
or Throwable(String, Throwable), this method cannot be called even once.
```
So when we call it, the `IllegalStateException` will be throw. Finally, the `promise.tryFailure(ex)` never be called. The driver will be blocked.
### Why are the changes needed?
Fix the OOM never be reported bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add new test
Closes apache#41517 from Hisoka-X/SPARK-42290_OOM_AQE_On.
Authored-by: Jia Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
When we use spark shell to submit job like this:
```scala
$ spark-shell --conf spark.driver.memory=1g
val df = spark.range(5000000).withColumn("str", lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf"))
val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer")
df2.collect
```
This will cause the driver to hang indefinitely.
When we disable AQE, the `java.lang.OutOfMemoryError` will be throws.
After I check the code, the reason are wrong way to use `Throwable::initCause`. It happened when OOM be throw on https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L184 . Then https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L2401 will be executed.
It use `new SparkException(..., case=oe).initCause(oe.getCause)`.
The doc in `Throwable::initCause` say
```
This method can be called at most once. It is generally called from within the constructor,
or immediately after creating the throwable. If this throwable was created with Throwable(Throwable)
or Throwable(String, Throwable), this method cannot be called even once.
```
So when we call it, the `IllegalStateException` will be throw. Finally, the `promise.tryFailure(ex)` never be called. The driver will be blocked.
### Why are the changes needed?
Fix the OOM never be reported bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add new test
Closes apache#41517 from Hisoka-X/SPARK-42290_OOM_AQE_On.
Authored-by: Jia Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4168e1a)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
When we use spark shell to submit job like this:
```scala
$ spark-shell --conf spark.driver.memory=1g
val df = spark.range(5000000).withColumn("str", lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf"))
val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer")
df2.collect
```
This will cause the driver to hang indefinitely.
When we disable AQE, the `java.lang.OutOfMemoryError` will be throws.
After I check the code, the reason are wrong way to use `Throwable::initCause`. It happened when OOM be throw on https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L184 . Then https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L2401 will be executed.
It use `new SparkException(..., case=oe).initCause(oe.getCause)`.
The doc in `Throwable::initCause` say
```
This method can be called at most once. It is generally called from within the constructor,
or immediately after creating the throwable. If this throwable was created with Throwable(Throwable)
or Throwable(String, Throwable), this method cannot be called even once.
```
So when we call it, the `IllegalStateException` will be throw. Finally, the `promise.tryFailure(ex)` never be called. The driver will be blocked.
### Why are the changes needed?
Fix the OOM never be reported bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add new test
Closes apache#41517 from Hisoka-X/SPARK-42290_OOM_AQE_On.
Authored-by: Jia Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4168e1a)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
When we use spark shell to submit job like this:
```scala
$ spark-shell --conf spark.driver.memory=1g
val df = spark.range(5000000).withColumn("str", lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf"))
val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer")
df2.collect
```
This will cause the driver to hang indefinitely.
When we disable AQE, the `java.lang.OutOfMemoryError` will be throws.
After I check the code, the reason are wrong way to use `Throwable::initCause`. It happened when OOM be throw on https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L184 . Then https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L2401 will be executed.
It use `new SparkException(..., case=oe).initCause(oe.getCause)`.
The doc in `Throwable::initCause` say
```
This method can be called at most once. It is generally called from within the constructor,
or immediately after creating the throwable. If this throwable was created with Throwable(Throwable)
or Throwable(String, Throwable), this method cannot be called even once.
```
So when we call it, the `IllegalStateException` will be throw. Finally, the `promise.tryFailure(ex)` never be called. The driver will be blocked.
### Why are the changes needed?
Fix the OOM never be reported bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add new test
Closes apache#41517 from Hisoka-X/SPARK-42290_OOM_AQE_On.
Authored-by: Jia Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4168e1a)
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
When we use spark shell to submit job like this:
This will cause the driver to hang indefinitely.
When we disable AQE, the
java.lang.OutOfMemoryErrorwill be throws.After I check the code, the reason are wrong way to use
Throwable::initCause. It happened when OOM be throw on https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L184 . Then https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L2401 will be executed.It use
new SparkException(..., case=oe).initCause(oe.getCause).The doc in
Throwable::initCausesaySo when we call it, the
IllegalStateExceptionwill be throw. Finally, thepromise.tryFailure(ex)never be called. The driver will be blocked.Why are the changes needed?
Fix the OOM never be reported bug
Does this PR introduce any user-facing change?
No
How was this patch tested?
Add new test