Skip to content

Conversation

@ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented Oct 18, 2021

What changes were proposed in this pull request?

Cancel running job after AQE plan finished, so this PR add a runningStages in AdaptiveExecutionContext to record the running stages.

Why are the changes needed?

We see stage was still running after AQE plan finished. This is because the plan which contains a join with one empty side has been converted to LocalTableScanExec during AQEOptimizer, but the other side of this join is still running (shuffle map stage).

It's no meaning to keep running the stage, so It's better to cancel the running stage after AQE plan finished in case wasting the task resource.

Does this PR introduce any user-facing change?

no

How was this patch tested?

add test.

@github-actions github-actions bot added the SQL label Oct 18, 2021
@ulysses-you
Copy link
Contributor Author

cc @yaooqinn @cloud-fan @maryannxue @viirya @HyukjinKwon if you have time to take a look

@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48843/

@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48843/

@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Test build #144368 has finished for PR 34316 at commit de0d183.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Some((planChangeLogger, "AQE Post Stage Creation")))
isFinalPlan = true
executionId.foreach(onUpdatePlan(_, Seq(currentPhysicalPlan)))
cancelRunningStages()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is still a stage running, why it escapes from the loop? Isn't allChildStagesMaterialized false?

Copy link
Contributor Author

@ulysses-you ulysses-you Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the currentPhysicalPlan converted to LocalTableScanExec during re-optimize and the LocalTableScanExec is a leaf node, then the flag of allChildStagesMaterialized is awlays true.

@ulysses-you ulysses-you marked this pull request as draft October 19, 2021 10:41
@ulysses-you
Copy link
Contributor Author

I see some test failed in GA, and it related to this PR. So let me convert to draft now.

@AngersZhuuuu
Copy link
Contributor

Any update on this issue?

@ulysses-you
Copy link
Contributor Author

I realize that we can not cancel the running stage easily. Many code place check the SQL execution status based on whether the SQL exists failed stage/job so if we cancel the running stage the status of the SQL will be failure, e.g. in UI. Given this, I don't have a good idea now.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants