[SPARK-21492][SQL][WIP] Memory leak in SortMergeJoin #23762
If SortMergeJoinScanner doesn't consume the iterator from UnsafeExternalRowSorter entirely, the memory that UnsafeExternalSorter acquired from TaskMemoryManager is not released. This leads to a memory leak, extra spills, and OOME: one page is held per partition of the unused iterator. This patch allows SortMergeJoinScanner to explicitly close the iterators (for non-generated code only). Tested manually, with profiling, using the scripts in the SPARK-21492 comments.
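A toy Python model may make the leak concrete. It is a sketch under stated assumptions, not Spark code: `ToyMemoryManager`, `SortedRunIterator` and `merge_join` are hypothetical stand-ins for TaskMemoryManager, the sorted iterator and SortMergeJoinScanner, join keys are assumed unique, and a "page" is only a counter. The iterator frees its page only when it is drained or explicitly closed, so an inner join that stops as soon as one side runs out leaves the other side's page held unless the scanner closes it.

```python
# Hypothetical toy model, not Spark code: a "page" is just a counter.


class ToyMemoryManager:
    """Counts pages handed out (stand-in for TaskMemoryManager)."""

    def __init__(self):
        self.pages_in_use = 0

    def acquire_page(self):
        self.pages_in_use += 1

    def release_page(self):
        self.pages_in_use -= 1


class SortedRunIterator:
    """Iterates over sorted rows; frees its page only when drained or closed."""

    def __init__(self, rows, mm):
        self._rows = iter(sorted(rows))
        self._mm = mm
        self._closed = False
        mm.acquire_page()

    def __iter__(self):
        return self

    def __next__(self):
        if self._closed:
            raise StopIteration
        try:
            return next(self._rows)
        except StopIteration:
            self.close()  # fully consumed: the page is released automatically
            raise

    def close(self):
        # Idempotent, so closing an already drained iterator is harmless.
        if not self._closed:
            self._closed = True
            self._mm.release_page()


def merge_join(left, right, close_when_done):
    """Inner merge join on unique sorted keys; stops once either side is exhausted."""
    out = []
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l < r:
            l = next(left, None)
        elif l > r:
            r = next(right, None)
        else:
            out.append(l)
            l, r = next(left, None), next(right, None)
    if close_when_done:
        # The fix in spirit: explicitly close whatever was left unconsumed.
        left.close()
        right.close()
    return out


mm = ToyMemoryManager()
rows = merge_join(SortedRunIterator(range(1, 1001), mm), SortedRunIterator([5], mm), close_when_done=False)
print(rows, mm.pages_in_use)  # [5] 1  (the left side's page is still held)

mm = ToyMemoryManager()
rows = merge_join(SortedRunIterator(range(1, 1001), mm), SortedRunIterator([5], mm), close_when_done=True)
print(rows, mm.pages_in_use)  # [5] 0
```

When both sides happen to be drained, the pages are released even without an explicit close. The leak only appears when the join finishes early, which is the case the PR description targets.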
srowen left a comment:
I think this is a good idea in principle; I don't know this code well.
Outdated, resolved review threads on:
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala (two threads)
sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowIterator.java
Thanks for reviewing, @srowen @kiszk @tejasapatil. I would appreciate any input you have on how to approach the code-generation version of the fix.
```
import java.io.Closeable;
import java.io.IOException;

public abstract class UnsafeExternalRowIterator extends AbstractIterator<UnsafeRow> implements Closeable {
  // ... (rest of the class not shown in this review excerpt)
}
```
So we need an iterator API with a close method. Shall we reuse RowIterator and add a close method to it?
I think this would require introducing a dependency on the spark-sql module from catalyst, as well as more refactoring of RowIterator and its subclasses.
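One way the suggested API (a row iterator with a `close` method) might look, sketched in Python rather than Scala or Java. `CloseableRowIterator`, `ListRowIterator` and `first_match` are hypothetical names, and the `advance_next`/`get_row` pull style is only an approximation of RowIterator. Giving `close` a no-op default is one possible way to limit how many existing subclasses must change; it does not address the module-dependency concern.

```python
from abc import ABC, abstractmethod


class CloseableRowIterator(ABC):
    """Hypothetical pull-style iterator: advance_next(), then get_row(), plus close().

    close() defaults to a no-op, so subclasses that hold no resources need no change.
    """

    @abstractmethod
    def advance_next(self) -> bool: ...

    @abstractmethod
    def get_row(self): ...

    def close(self) -> None:
        pass

    # Context-manager support lets callers guarantee close() on early exit.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions


class ListRowIterator(CloseableRowIterator):
    """Iterates over an in-memory list and records whether it was closed."""

    def __init__(self, rows):
        self._rows = rows
        self._pos = -1
        self.closed = False

    def advance_next(self) -> bool:
        if self.closed or self._pos + 1 >= len(self._rows):
            return False
        self._pos += 1
        return True

    def get_row(self):
        return self._rows[self._pos]

    def close(self) -> None:
        self.closed = True


def first_match(it: CloseableRowIterator, key):
    """Returns on the first matching row; the with-block still closes `it`."""
    with it:
        while it.advance_next():
            if it.get_row() == key:
                return it.get_row()
    return None


it = ListRowIterator([1, 2, 3])
print(first_match(it, 2), it.closed)  # 2 True
```

The consumer stops after the second row, yet `close` still runs, which mirrors the early-exit situation described in the PR.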
@taosaildrone, the sorted iterators consumed by SMJ should be cleaned up when the task finishes. Do you have any idea why this is not working?
IIUC, the sorted iterator will be cleaned up eventually; this PR just cleans it up earlier. I think this is not a memory-leak fix but an optimization.
There's a task completion listener, but (1) we could hit an OOME before the task completes, or (2) we hurt performance by holding unnecessary memory and causing many unneeded spills. Something as simple as joining a DF of 1000 rows with another DF of 2 rows (with one overlapping row_id) causes an OOME, both locally and on a cluster, and makes the job fail. To reproduce locally: ./bin/pyspark --master local[10] (tested on 2.4.0 and 3.0.0-master).
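The trade-off debated here (eventual cleanup at task completion versus early release) can be sketched with a hypothetical single-task model; `ToyTask` and `run_task` are illustrative names, not Spark APIs. It assumes one task processes many join partitions in sequence (presumably what the `coalesce(1)` after the join in the reproduction script arranges) and that each partition leaves one sorter page unconsumed.

```python
class ToyTask:
    """Hypothetical single task: tracks page usage and runs completion listeners at the end."""

    def __init__(self):
        self.pages = 0
        self.peak = 0
        self._listeners = []

    def acquire(self):
        self.pages += 1
        self.peak = max(self.peak, self.pages)

    def release(self):
        self.pages -= 1

    def add_completion_listener(self, fn):
        self._listeners.append(fn)

    def complete(self):
        for fn in self._listeners:
            fn()
        self._listeners.clear()


def run_task(num_partitions, eager_cleanup):
    """Joins `num_partitions` partitions in one task; each leaves one page unconsumed.

    Returns (peak pages held, pages still held after task completion).
    """
    task = ToyTask()
    for _ in range(num_partitions):
        task.acquire()  # this partition's sorter grabs a page
        released = [False]

        def free(flag=released):
            # Guarded so the page is released at most once.
            if not flag[0]:
                flag[0] = True
                task.release()

        task.add_completion_listener(free)  # safety net: always runs at task end
        # ... the join stops early here, leaving the sorted iterator unconsumed ...
        if eager_cleanup:
            free()  # release as soon as the join is done with this partition
    task.complete()
    return task.peak, task.pages


print(run_task(200, eager_cleanup=False))  # (200, 0)
print(run_task(200, eager_cleanup=True))   # (1, 0)
```

With 200 partitions (the default value of `spark.sql.shuffle.partitions`), both variants end with zero pages held, which supports the point that the memory is eventually freed; the difference is the peak, which grows with the partition count when only the completion listener cleans up.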
@taosaildrone I have two problems with this approach:
We should discuss this on the dev list.
@srowen @hvanhovell @cloud-fan http://apache-spark-developers-list.1001551.n3.nabble.com/Memory-leak-in-SortMergeJoin-td27152.html
Can one of the admins verify this patch?
### What changes were proposed in this pull request?

We add a new mechanism through which downstream operators can notify their upstream (child) operators that they may release their output data stream. In this PR, the mechanism is implemented as follows:

- Add a function named `cleanupResources` to SparkPlan, which by default calls the children's `cleanupResources`. An operator that needs resource cleanup should override it with its own cleanup and also call `super.cleanupResources`, as SortExec does in this PR.
- Add logic on the trigger side (in this PR, SortMergeJoinExec) that makes sure `cleanupResources` is called to clean up all of its upstream (child) operators.

### Why are the changes needed?

To fix the SortMergeJoin memory leak, and to provide a general framework for SparkPlan resource cleanup.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

UT: Added a new test suite, JoinWithResourceCleanSuite, which checks both the standard and the code-generation scenarios.

Integration test: with the driver/executor default memory set to 1g and local mode with 10 threads, the test below (thanks to taosaildrone for providing it [here](#23762 (comment))) passes with this PR.

```
# Run in the PySpark shell (./bin/pyspark), where `spark` is already defined.
from pyspark.sql.functions import rand, col

# Force a sort-merge join: prefer SMJ and disable broadcast joins.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
# spark.conf.set("spark.sql.sortMergeJoinExec.eagerCleanupResources", "true")

# r1 holds ids 1..1000; r2 holds the single id 1000, so exactly one row matches.
r1 = spark.range(1, 1001).select(col("id").alias("timestamp1"))
r1 = r1.withColumn('value', rand())
r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2"))
r2 = r2.withColumn('value2', rand())
joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner")
# Coalesce the joined result into a single partition.
joined = joined.coalesce(1)
joined.explain()
joined.show()
```

Closes #26164 from xuanyuanking/SPARK-21492.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
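The `cleanupResources` contract described in this commit message can be sketched in Python as a toy plan tree. `PlanNode`, `SortNode` and `SortMergeJoinNode` are hypothetical stand-ins for SparkPlan, SortExec and SortMergeJoinExec, and appending to a log list replaces real memory release; this illustrates the override-then-call-super pattern, not Spark's actual implementation.

```python
class PlanNode:
    """Hypothetical stand-in for SparkPlan: default cleanup just recurses into children."""

    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)

    def cleanup_resources(self, log):
        for child in self.children:
            child.cleanup_resources(log)


class SortNode(PlanNode):
    """Like SortExec in the description: release its own sorter, then call super."""

    def __init__(self, name, *children):
        super().__init__(name, *children)
        self.sorter_open = True

    def cleanup_resources(self, log):
        if self.sorter_open:  # guard makes repeated cleanup harmless
            self.sorter_open = False
            log.append(f"released sorter of {self.name}")
        super().cleanup_resources(log)


class SortMergeJoinNode(PlanNode):
    """Trigger side: once the join is done, clean up the whole upstream subtree."""

    def on_join_finished(self, log):
        self.cleanup_resources(log)


log = []
plan = SortMergeJoinNode(
    "smj",
    SortNode("sort_left", PlanNode("scan_left")),
    PlanNode("project", SortNode("sort_right", PlanNode("scan_right"))),
)
plan.on_join_finished(log)
print(log)  # ['released sorter of sort_left', 'released sorter of sort_right']
```

Because the default implementation only recurses, cleanup also reaches sorts that sit below operators holding no resources themselves (the `project` node here), and the guard flag keeps a second call from releasing anything twice.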
Backport commit with the same description as above: Closes #26210 from xuanyuanking/SPARK-21492-backport. Authored-by: Yuanjian Li <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
If SortMergeJoinScanner doesn't consume the iterator from
UnsafeExternalRowSorter entirely, the memory that
UnsafeExternalSorter acquired from TaskMemoryManager will not
be released. This leads to a memory leak, spills, and OOME. A
page will be held per partition of the unused iterator.
This patch allows SortMergeJoinScanner to explicitly close the iterators (for non-generated code only).
How was this patch tested?
Manual testing and profiling with scripts in SPARK-21492 comments.