[SPARK-50748][SPARK-50889][CONNECT] Fix a race condition issue which happens when operations are interrupted #51638
Conversation
…nses are cleaned up before consumed
GA failure seems related to this change.
Merged to master.
I noticed that this change fixes SPARK-50748 too.
I reproduced the issue reported in SPARK-50748 by reverting this change and inserting sleeps as follows. @HyukjinKwon @dongjoon-hyun What do you think?
Thank you so much, @sarutak. You can resolve that too.
I resolved SPARK-50748 with this PR and assigned it to you.
Oh, @sarutak. I realized that this is a main code change.
```diff
         // 3. sent everything from the stream and the stream is finished
-        def streamFinished = executionObserver.getLastResponseIndex().exists(nextIndex > _)
+        def streamFinished = executionObserver.getLastResponseIndex().exists(nextIndex > _) ||
+          executionObserver.isCleaned()
```
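The effect of the extra `isCleaned()` clause can be modeled with a small, self-contained sketch. The class and method names below mirror the ones in the diff but are hypothetical simplifications, not the actual Spark source:

```scala
// Minimal model of the observer state relevant to the termination check.
// In the interruption scenario, removeAll() discards responses before
// onCompleted() ever records a last response index.
final class ObserverModel {
  private var responses = Map.empty[Long, String]
  private var lastIndex: Option[Long] = None
  private var cleaned = false

  def onNext(index: Long, response: String): Unit = responses += index -> response
  def onCompleted(last: Long): Unit = lastIndex = Some(last)
  def removeAll(): Unit = { responses = Map.empty; cleaned = true }

  def getResponse(index: Long): Option[String] = responses.get(index)
  def getLastResponseIndex(): Option[Long] = lastIndex
  def isCleaned(): Boolean = cleaned
}

// Termination check before the fix: stays false forever if the stream was
// cleaned up without completing, so the sender loop can never exit.
def streamFinishedBefore(obs: ObserverModel, nextIndex: Long): Boolean =
  obs.getLastResponseIndex().exists(nextIndex > _)

// Termination check after the fix: also true once the observer is cleaned.
def streamFinishedAfter(obs: ObserverModel, nextIndex: Long): Boolean =
  obs.getLastResponseIndex().exists(nextIndex > _) || obs.isCleaned()
```

With `onNext(1, ...)` followed by `removeAll()` (interruption winning the race), `streamFinishedBefore` remains false while `streamFinishedAfter` returns true, letting the sender exit its loop.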
In addition, this is a bug fix, isn't it?
I guess we need to backport this, @sarutak.
cc @grundprinzip , @hvanhovell , @peter-toth
@dongjoon-hyun Also, I'll check whether this bug affects
You can reuse the Jira issue. You don't need to file a new one.
OK, just change the title of this PR.
`SparkSessionE2ESuite` "interrupt operation" (hang)
@dongjoon-hyun
Thank you so much for verifying that.
[SPARK-50748][SPARK-50889][CONNECT] Fix a race condition issue which happens when operations are interrupted

### What changes were proposed in this pull request?

This PR backports #51638 to `branch-4.0`. It fixes an issue that happens when operations are interrupted, related to SPARK-50748 and SPARK-50889.

Regarding SPARK-50889, the issue happens if the execution thread for an operation id cleans up the corresponding `ExecutionHolder` as a result of interruption [here](https://github.com/apache/spark/blob/a81d79256027708830bf714105f343d085a2f20c/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala#L175) before the response sender thread consumes a response [here](https://github.com/apache/spark/blob/a81d79256027708830bf714105f343d085a2f20c/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteResponseObserver.scala#L183). In this case, the cleanup eventually calls `ExecuteResponseObserver.removeAll()`, all the responses are discarded, and the response sender thread can't escape [this loop](https://github.com/apache/spark/blob/a81d79256027708830bf714105f343d085a2f20c/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala#L245) because neither `gotResponse` nor `streamFinished` ever becomes true.

The solution this PR proposes is changing the definition of `streamFinished` in `ExecuteGrpcResponseSender` so that a stream is also regarded as finished when the `ExecuteResponseObserver` has been cleaned up and all of its responses discarded. `ExecuteResponseObserver.removeAll` is called when the corresponding `ExecutionHolder` is closed or cleaned up by interruption, so this is a reasonable termination condition.

### Why are the changes needed?

To fix a potential issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually. You can easily reproduce the issue without this change by inserting a sleep into the test as follows:

```diff
--- a/sql/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/SparkSessionE2ESuite.scala
+++ b/sql/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/SparkSessionE2ESuite.scala
@@ -331,6 +331,7 @@ class SparkSessionE2ESuite extends ConnectFunSuite with RemoteSparkSession {
     // cancel
     val operationId = result.operationId
     val canceledId = spark.interruptOperation(operationId)
+    Thread.sleep(1000)
     assert(canceledId == Seq(operationId))
     // and check that it got canceled
     val e = intercept[SparkException] {
```

With this change applied, the test above no longer hangs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #51671 from sarutak/connect-race-condition.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
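The hang described in the commit message can be illustrated with a simplified wait loop. The names below are hypothetical stand-ins; the real sender loop in `ExecuteGrpcResponseSender` is more involved:

```scala
// Simplified sketch of the sender/observer handshake: the sender waits for
// either a new response, stream completion, or (with the fix) cleanup.
object SenderLoopSketch {
  private val lock = new Object
  // The sender has already consumed everything produced so far.
  private var response: Option[String] = None
  private var completed = false
  private var cleaned = false

  // Interruption path: discard pending responses, mark the observer cleaned,
  // and wake any waiting sender (mirrors the effect of removeAll()).
  def cleanupOnInterrupt(): Unit = lock.synchronized {
    response = None
    cleaned = true
    lock.notifyAll()
  }

  // Returns true if the loop terminated; false means it timed out, i.e. it
  // would have waited forever without the timeout. Dropping the `cleaned`
  // check from the condition reproduces the reported hang.
  def sendLoop(timeoutMs: Long): Boolean = lock.synchronized {
    val deadline = System.currentTimeMillis() + timeoutMs
    var remaining = timeoutMs
    while (response.isEmpty && !completed && !cleaned && remaining > 0) {
      lock.wait(remaining)
      remaining = deadline - System.currentTimeMillis()
    }
    response.nonEmpty || completed || cleaned
  }
}
```

Running `sendLoop` on one thread while another thread calls `cleanupOnInterrupt()` shortly afterwards shows the loop terminating via the `cleaned` flag instead of spinning until the timeout.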
What changes were proposed in this pull request?
This PR fixes an issue that happens when operations are interrupted, related to SPARK-50748 and SPARK-50889.
Regarding SPARK-50889, the issue happens if the execution thread for an operation id cleans up the corresponding `ExecutionHolder` as a result of interruption [here](https://github.com/apache/spark/blob/a81d79256027708830bf714105f343d085a2f20c/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala#L175) before the response sender thread consumes a response [here](https://github.com/apache/spark/blob/a81d79256027708830bf714105f343d085a2f20c/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteResponseObserver.scala#L183). In this case, the cleanup eventually calls `ExecuteResponseObserver.removeAll()`, all the responses are discarded, and the response sender thread can't escape [this loop](https://github.com/apache/spark/blob/a81d79256027708830bf714105f343d085a2f20c/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala#L245) because neither `gotResponse` nor `streamFinished` ever becomes true.
The solution this PR proposes is changing the definition of `streamFinished` in `ExecuteGrpcResponseSender` so that a stream is also regarded as finished when the `ExecuteResponseObserver` has been cleaned up and all of its responses discarded. `ExecuteResponseObserver.removeAll` is called when the corresponding `ExecutionHolder` is closed or cleaned up by interruption, so this is a reasonable termination condition.
Why are the changes needed?
To fix a potential issue.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tested manually.
You can easily reproduce this issue without this change by inserting a sleep into the test, as in the diff quoted in the commit message above.
With this change applied, the test no longer hangs.
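For reference, this is the reproduction diff (the same change quoted in the commit message above):

```diff
--- a/sql/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/SparkSessionE2ESuite.scala
+++ b/sql/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/SparkSessionE2ESuite.scala
@@ -331,6 +331,7 @@ class SparkSessionE2ESuite extends ConnectFunSuite with RemoteSparkSession {
     // cancel
     val operationId = result.operationId
     val canceledId = spark.interruptOperation(operationId)
+    Thread.sleep(1000)
     assert(canceledId == Seq(operationId))
     // and check that it got canceled
     val e = intercept[SparkException] {
```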
Was this patch authored or co-authored using generative AI tooling?
No.