@@ -1239,6 +1239,9 @@ private[spark] class DAGScheduler(
      markMapStageJobsAsFinished(stage)
    case stage : ResultStage =>
      logDebug(s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})")

Contributor

Would this happen for normal jobs?

Contributor Author

This condition is only reached when the ResultStage itself has zero tasks to run, so it would be skipped for normal jobs.

Contributor

I mean, does RDD#count have the same bug when the RDD is empty? I think it doesn't, but I'm trying to understand why this is an issue only for approximate jobs.

Contributor

I think this is related to SPARK-26714 / #23637 -- maybe the same change from DAGScheduler.submitJob is needed in DAGScheduler.runApproximateJob

Contributor Author

@cloud-fan, the RDD#count scenario is handled in submitJob, which starts with a partitions.size == 0 check, as @squito mentioned.

@squito, do you suggest I add a similar check in runApproximateJob? Shall I change this PR?

Contributor Author

I think @squito's suggestion is better; I have changed my PR accordingly. Please review.
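
For readers following the thread, here is a minimal sketch of the kind of guard being discussed: return early, and still report job start and end, when there are zero partitions. It is a toy model, not Spark code. `ToyScheduler`, `runApproximateCount`, `JobStart`, and `JobEnd` are hypothetical names invented for illustration, and the real change in `runApproximateJob` may look different.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for listener events; these are not Spark's classes.
sealed trait Event
final case class JobStart(jobId: Int) extends Event
final case class JobEnd(jobId: Int) extends Event

// Toy scheduler. Every name here is an assumption made for this sketch.
class ToyScheduler {
  val events: ArrayBuffer[Event] = ArrayBuffer.empty[Event]
  private var nextJobId = 0

  /** Counts elements across partitions, recording job start/end events. */
  def runApproximateCount(partitions: Seq[Seq[Int]]): Long = {
    val jobId = nextJobId
    nextJobId += 1
    if (partitions.isEmpty) {
      // Zero tasks to run: report start and end immediately and return a
      // final result, rather than waiting on task completions that never come.
      events += JobStart(jobId)
      events += JobEnd(jobId)
      0L
    } else {
      // Simplified normal path: "tasks" run synchronously in this toy model.
      events += JobStart(jobId)
      val total = partitions.map(_.size.toLong).sum
      events += JobEnd(jobId)
      total
    }
  }
}

object EmptyJobGuardDemo {
  def main(args: Array[String]): Unit = {
    val scheduler = new ToyScheduler
    val count = scheduler.runApproximateCount(Seq.empty)
    println(s"count=$count events=${scheduler.events.mkString(", ")}")
  }
}
```

The point of the sketch is only that the empty case emits the same start/end pair as the normal case, so a listener waiting on job end is always notified.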

      val jobId = stage.activeJob.get.jobId
      cleanupStateForJobAndIndependentStages(stage.activeJob.get)
      listenerBus.post(SparkListenerJobEnd(jobId, clock.getTimeMillis(), JobSucceeded))
  }
  submitWaitingChildStages(stage)
}
@@ -18,6 +18,7 @@
package org.apache.spark.scheduler

import java.util.Properties
import java.util.concurrent.{CountDownLatch, TimeUnit}
import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong}

import scala.annotation.meta.param
@@ -2849,6 +2850,18 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
    }
  }

  test("SPARK-27164: RDD.countApprox on empty RDDs schedules jobs which never complete") {
    val latch = new CountDownLatch(1)
    val jobListener = new SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
        latch.countDown()
      }
    }
    sc.addSparkListener(jobListener)
    sc.emptyRDD[Int].countApprox(10000).getFinalValue()
    assert(latch.await(10, TimeUnit.SECONDS))
  }

  /**
   * Assert that the supplied TaskSet has exactly the given hosts as its preferred locations.
   * Note that this checks only the host and not the executor ID.