
[SPARK-27164][Core] RDD.countApprox on empty RDDs schedules jobs which never complete #24100

Closed
ajithme wants to merge 6 commits into apache:master from ajithme:emptyRDD

Conversation

@ajithme (Contributor) commented Mar 15, 2019

What changes were proposed in this pull request?

When a ResultStage has zero tasks, the job end event is never fired, so the job shows as running forever in the UI. Example: sc.emptyRDD[Int].countApprox(1000) never finishes even though it has no tasks to launch.
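As context for reviewers, a minimal reproduction might look like the sketch below (assuming a local SparkContext; the `local[2]` master and app name are arbitrary). On affected versions the approximate job scheduled here never posts a job end event, so it never shows as completed:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CountApproxRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("countApprox-repro"))

    // countApprox schedules an approximate job. With an empty RDD the
    // ResultStage has zero tasks, so on affected versions no job end
    // event is ever fired and the job stays "running" in the UI.
    val result = sc.emptyRDD[Int].countApprox(1000)
    println(result)

    sc.stop()
  }
}
```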

How was this patch tested?

Added UT

@ajithme (Contributor, Author) commented Mar 15, 2019

@cloud-fan (Contributor)

ok to test

@srowen (Member) commented Mar 15, 2019

CC @squito @jinxing64 for a similar change earlier. It looks plausible but I'm not sure if it's the best way to do it in this code or not.

@@ -1239,6 +1239,9 @@ private[spark] class DAGScheduler(
markMapStageJobsAsFinished(stage)
case stage : ResultStage =>
logDebug(s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})")
@cloud-fan (Contributor):

would this happen for normal jobs?

@ajithme (Contributor, Author):

This condition is guarded by a check that only applies when the ResultStage itself has zero tasks to run, so it would be skipped for normal jobs.

@cloud-fan (Contributor):

I mean, does RDD#count have the same bug when rdd is empty? I think it doesn't but I'm trying to understand why it's an issue only for approximate jobs.

@squito (Contributor):

I think this is related to SPARK-26714 / #23637 -- maybe the same change from DAGScheduler.submitJob is needed in DAGScheduler.runApproximateJob
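As a sketch of what that suggestion could look like (assuming the same shape as the SPARK-26714 / #23637 guard in submitJob; `clock`, `listenerBus`, and `clonedProperties` here stand in for the DAGScheduler's existing fields and are illustrative, not the final patch):

```scala
// Hypothetical early return at the top of DAGScheduler.runApproximateJob:
// if the RDD has no partitions there is nothing to run, so post a matching
// pair of job start/end events and return the evaluator's current result
// immediately instead of scheduling a stage with zero tasks.
if (rdd.partitions.isEmpty) {
  val time = clock.getTimeMillis()
  listenerBus.post(SparkListenerJobStart(jobId, time, Seq.empty, clonedProperties))
  listenerBus.post(SparkListenerJobEnd(jobId, time, JobSucceeded))
  return new PartialResult(evaluator.currentResult(), isFinal = true)
}
```

Posting both events keeps the UI and any listeners consistent, since every job start is still matched by a job end.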

@ajithme (Contributor, Author):

@cloud-fan, the RDD#count scenario is handled in submitJob, which has a partitions.size == 0 check up front, just like @squito mentioned.

@squito, so do you suggest I add a similar check in runApproximateJob? Shall I change this PR?

@ajithme (Contributor, Author):

I think @squito's suggestion is better; I have changed my PR accordingly. Please review.

@SparkQA

SparkQA commented Mar 15, 2019

Test build #103532 has finished for PR 24100 at commit 0920e17.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2019

Test build #103567 has finished for PR 24100 at commit ad7b888.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-27164] RDD.countApprox on empty RDDs schedules jobs which never complete [SPARK-27164][Core] RDD.countApprox on empty RDDs schedules jobs which never complete Mar 17, 2019
@srowen
Member

srowen commented Mar 17, 2019

Merged to master

@srowen srowen closed this in fc88d3d Mar 17, 2019