[SPARK-37043][SQL] Cancel all running job after AQE plan finished #34316
|
cc @yaooqinn @cloud-fan @maryannxue @viirya @HyukjinKwon if you have time to take a look |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #144368 has finished for PR 34316 at commit
|
    Some((planChangeLogger, "AQE Post Stage Creation")))
  isFinalPlan = true
  executionId.foreach(onUpdatePlan(_, Seq(currentPhysicalPlan)))
  cancelRunningStages()
If a stage is still running, why does execution escape from the loop? Isn't allChildStagesMaterialized false?
The currentPhysicalPlan is converted to LocalTableScanExec during re-optimization, and LocalTableScanExec is a leaf node, so the allChildStagesMaterialized flag is always true.
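To illustrate the reply above, here is a minimal toy model in plain Scala. It is not Spark's actual code: `Plan`, `LocalTableScan`, `QueryStage`, `Join`, and `ToyAqe` are hypothetical stand-ins, and the only assumption carried over from the thread is that a leaf node has no child stages to wait for.

```scala
// Toy plan tree; these are NOT Spark's classes, only illustrative stand-ins.
sealed trait Plan { def children: Seq[Plan] }

// Leaf produced by re-optimization when one join side is empty.
final case class LocalTableScan() extends Plan { val children: Seq[Plan] = Nil }

// A query stage that may or may not have finished materializing.
final case class QueryStage(id: Int, materialized: Boolean) extends Plan {
  val children: Seq[Plan] = Nil
}

final case class Join(left: Plan, right: Plan) extends Plan {
  val children: Seq[Plan] = Seq(left, right)
}

object ToyAqe {
  // True when every query stage reachable from `plan` has materialized.
  // A leaf that is not a stage has nothing to wait for, so the result is true.
  def allChildStagesMaterialized(plan: Plan): Boolean = plan match {
    case QueryStage(_, done) => done
    case other               => other.children.forall(allChildStagesMaterialized)
  }
}
```

In this sketch, `Join(QueryStage(0, true), QueryStage(1, false))` is not fully materialized, but once the whole plan is replaced by `LocalTableScan()` the check passes even though stage 1 was never finished or cancelled.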
|
I see that some tests failed in GA and the failures are related to this PR, so let me convert it to a draft for now. |
|
Any update on this issue? |
|
I have realized that we cannot easily cancel the running stage. Many places in the code determine the SQL execution status by checking whether the execution has a failed stage or job, so if we cancel the running stage, the SQL execution will be reported as failed, e.g. in the UI. Given this, I don't have a good idea for now. |
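The concern above can be sketched with a small toy model. It assumes, as the comment describes, that a cancelled job is recorded the same way as a failed one; `JobResult`, `JobSucceeded`, `JobFailed`, and `ToyExecutionStatus` here are hypothetical local definitions, not Spark's scheduler classes.

```scala
// Hypothetical job outcomes for illustration; defined locally, not imported from Spark.
sealed trait JobResult
case object JobSucceeded extends JobResult
final case class JobFailed(reason: String) extends JobResult

object ToyExecutionStatus {
  // Status check of the kind described in the comment:
  // any failed job marks the whole SQL execution as failed.
  def status(jobs: Seq[JobResult]): String = {
    val anyFailed = jobs.exists {
      case JobFailed(_) => true
      case _            => false
    }
    if (anyFailed) "FAILED" else "SUCCEEDED"
  }
}
```

Under this assumption, a query whose result is already complete would still show as failed once the leftover shuffle map job is cancelled, for example `ToyExecutionStatus.status(Seq(JobSucceeded, JobFailed("cancelled")))` yields `"FAILED"`.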
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Cancel running jobs after the AQE plan has finished. To do this, this PR adds runningStages to AdaptiveExecutionContext to record the running stages.
Why are the changes needed?
We have seen a stage still running after the AQE plan finished. This happens because a plan containing a join with one empty side is converted to LocalTableScanExec by AQEOptimizer while the other side of the join (a shuffle map stage) is still running. There is no point in letting that stage keep running, so it is better to cancel it once the AQE plan has finished to avoid wasting task resources.
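As a rough illustration of the bookkeeping idea, the sketch below tracks running stages in a registry and cancels whatever is left once the final plan is decided. It is plain Scala and not the PR's actual implementation: `RunningStage`, `RunningStages`, and their methods are hypothetical names, and the cancel callback stands in for whatever would stop the underlying job.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for a running query stage; `cancelFn` is whatever stops its job.
final class RunningStage(val id: Int, cancelFn: () => Unit) {
  def cancel(): Unit = cancelFn()
}

// Hypothetical registry, loosely modelled on the runningStages idea in the description.
final class RunningStages {
  private val stages = TrieMap.empty[Int, RunningStage]

  // Record a stage when it starts running.
  def started(stage: RunningStage): Unit = stages.update(stage.id, stage)

  // Forget a stage once it has finished on its own.
  def finished(id: Int): Unit = { stages.remove(id); () }

  // Cancel every stage still registered and return the ids that were cancelled.
  def cancelAll(): Seq[Int] =
    stages.keys.toSeq.sorted.flatMap { id =>
      stages.remove(id).map { stage => stage.cancel(); id }
    }
}
```

A thread-safe map is used here on the assumption that stages can complete concurrently with the final-plan decision; removing each entry before cancelling it keeps a stage from being cancelled twice.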
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a test.