[SPARK-24819][CORE] Fail fast when no enough slots to launch the barrier stage on job submitted #22001
Conversation
so this breaks barrier execution on Mesos completely? (since the available slots would be 0, it will just fail)
but finegrained is being deprecated...
Only MesosFineGrainedSchedulerBackend will break; we still support MesosCoarseGrainedSchedulerBackend.
@jiangxb1987 Could you create a JIRA and link here?
should this be saved instead of re-computed for each stage?
As mentioned in the method description of SchedulerBackend.getNumSlots():
* Note that please don't cache the value returned by this method, because the number can change
* due to add/remove executors.
It should be fine to cache the value across different stages of a job, but that requires a few more changes that would make the current PR more complicated.
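For context, a minimal sketch of the kind of query being discussed on the scheduler backend, using the maxNumConcurrentTasks name suggested later in this review (not the PR's exact diff):

```scala
// Sketch only: the real method lives on org.apache.spark.scheduler.SchedulerBackend.
trait BarrierSlotQuery {
  /**
   * Max number of tasks that can be launched concurrently at the time of the call.
   * Don't cache the returned value: it changes as executors are added or removed.
   */
  def maxNumConcurrentTasks(): Int
}
```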
Test build #94245 has finished for PR 22001 at commit
ok to test
retest this please
Test build #94278 has finished for PR 22001 at commit
retest this please
Test build #94283 has finished for PR 22001 at commit
Test build #94307 has finished for PR 22001 at commit
How about maxConcurrentTasks?
executorDataMap.values.map { executor =>
  executor.totalCores / scheduler.CPUS_PER_TASK
}.sum
We need a test that verifies that if the total number of slots is sufficient but some slots are running other jobs, we shouldn't fail the barrier job.
move this wait code to the barrier suite, because it is only required there
Add a unit test for getNumSlots.
We should tolerate temporary unavailability here by adding wait or retry logic.
@jiangxb1987 Could you create a JIRA and link here?
0df8f74 to bf0eccc
Test build #94490 has finished for PR 22001 at commit
Test build #94491 has finished for PR 22001 at commit
825d2d9 to eb689ac
}

/**
 * Get the max number of tasks that can be concurrent launched currently.
How about like this?
* Get the max number of tasks that can be concurrently launched when the method is called.
* Note that please don't cache the value returned by this method, because the number can be
* changed by adding/removing executors.
private[spark] val BARRIER_MAX_CONCURRENT_TASKS_CHECK_INTERVAL =
  ConfigBuilder("spark.scheduler.barrier.maxConcurrentTasksCheck.interval")
    .doc("Time in seconds to wait between a max concurrent tasks check failure and the next " +
nit: a max -> max?
"a ... failure"
private[scheduler] val jobIdToNumTasksCheckFailures = new ConcurrentHashMap[Int, Int]

/**
 * Time in seconds to wait between a max concurrent tasks check failure and the next check.
nit: a max -> max?
logWarning("The job requires to run a barrier stage that requires more slots than the " +
  "total number of slots in the cluster currently.")
jobIdToNumTasksCheckFailures.putIfAbsent(jobId, 0)
val numCheckFailures = jobIdToNumTasksCheckFailures.get(jobId) + 1
Is it OK that this increment is not atomic? In the following scenario, the value may not be correct:
- We assume jobIdToNumTasksCheckFailures(jobId) = 1
- Thread A executes L963, then numCheckFailures = 2
- Thread B executes L963, then numCheckFailures = 2
- Thread B executes L964 and L965, then jobIdToNumTasksCheckFailures(jobId) has 2.
- Thread A executes L964 and L965, then jobIdToNumTasksCheckFailures(jobId) has 2.

Since two threads detected a failure, we expect listener.jobFailed(e) to be called, but it is not.
+1. Use atomic updates from ConcurrentHashMap. Update the counter and then check max failures.
@kiszk IIUC, there's exactly one thread in the eventLoop, so the scenario mentioned above will not happen. And I even feel there is no need to use ConcurrentHashMap for jobIdToNumTasksCheckFailures at all. @jiangxb1987
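For reference, if the increment ever had to be thread-safe (e.g. run outside the single-threaded event loop), a sketch of an atomic per-key update; CheckFailureCounter and recordCheckFailure are illustrative names, not code from this PR:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.BiFunction

object CheckFailureCounter {
  private val jobIdToNumTasksCheckFailures = new ConcurrentHashMap[Int, Int]()

  private val addOne = new BiFunction[Int, Int, Int] {
    override def apply(current: Int, one: Int): Int = current + one
  }

  /** Atomically increments and returns the failure count for the given job. */
  def recordCheckFailure(jobId: Int): Int =
    jobIdToNumTasksCheckFailures.merge(jobId, 1, addOne)
}
```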
Test build #94489 has finished for PR 22001 at commit
Test build #94493 has finished for PR 22001 at commit
Test build #94495 has finished for PR 22001 at commit
test this please
private[spark] val BARRIER_MAX_CONCURRENT_TASKS_CHECK_INTERVAL =
  ConfigBuilder("spark.scheduler.barrier.maxConcurrentTasksCheck.interval")
    .doc("Time in seconds to wait between a max concurrent tasks check failure and the next " +
| "to jobs that contain one or more barrier stages, we won't perform the check on " + | ||
| "non-barrier jobs.") | ||
| .timeConf(TimeUnit.SECONDS) | ||
| .createWithDefaultString("10s") |
Would you make the default higher, like 30s? This is to cover the case when an application starts immediately with a barrier stage while the master is still adding new executors. Let me know if this won't happen.
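If the default stays at 10s, an application that needs more headroom could raise it itself; a minimal spark-shell-style sketch, assuming the config key lands exactly as spelled in the hunk above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Raise the wait between consecutive max-concurrent-tasks checks from the 10s default to 30s,
// e.g. for applications that submit a barrier stage while executors are still being allocated.
val conf = new SparkConf()
  .setAppName("barrier-demo")
  .set("spark.scheduler.barrier.maxConcurrentTasksCheck.interval", "30s")
val sc = new SparkContext(conf)
```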
// HadoopRDD whose underlying HDFS files have been deleted.
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
  case e: Exception if e.getMessage ==
== -> .contains() in case the error message is nested
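For example, a small self-contained sketch of the suggested guard (the message text here is a stand-in for the DAGScheduler constant, not its real value):

```scala
object MessageMatchSketch {
  // Stand-in for DAGScheduler.ERROR_MESSAGE_BARRIER_REQUIRE_MORE_SLOTS_THAN_CURRENT_TOTAL_NUMBER.
  val BarrierSlotsError = "requires more slots than the total number of slots"

  // contains() instead of == still matches when the message is nested or prefixed by a wrapper.
  def isBarrierSlotsError(e: Exception): Boolean =
    e.getMessage != null && e.getMessage.contains(BarrierSlotsError)
}
```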
} catch {
  case e: Exception if e.getMessage ==
      DAGScheduler.ERROR_MESSAGE_BARRIER_REQUIRE_MORE_SLOTS_THAN_CURRENT_TOTAL_NUMBER =>
    logWarning("The job requires to run a barrier stage that requires more slots than the " +
Please include jobId, stageId, requested slots, and total slots in the log message.
  )
  return
} else {
  listener.jobFailed(e)
do you expect the same job to be submitted again? if not, we should remove the key from the hashmap.
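A hypothetical, trimmed-down shape of the bookkeeping to illustrate the suggested cleanup (CheckFailureTracker and onCheckFailure are illustrative names, not code from this PR):

```scala
import java.util.concurrent.ConcurrentHashMap

class CheckFailureTracker(maxFailures: Int) {
  private val failures = new ConcurrentHashMap[Int, Int]()

  /** Returns true if the job may be retried; false if it should be failed (and forgotten). */
  def onCheckFailure(jobId: Int): Boolean = {
    val n = failures.getOrDefault(jobId, 0) + 1
    if (n < maxFailures) {
      failures.put(jobId, n)
      true
    } else {
      failures.remove(jobId) // drop the counter so the map cannot grow without bound
      false
    }
  }
}
```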
/**
 * Number of max concurrent tasks check failures for each job.
 */
private[scheduler] val jobIdToNumTasksCheckFailures = new ConcurrentHashMap[Int, Int]
How do entries in this map get cleaned?
// Submit a job to trigger some tasks on active executors.
testSubmitJob(sc, rdd)

eventually(timeout(5.seconds)) {
Maybe safer to let the task sleep longer and cancel the task once the conditions are met.
}

override def maxNumConcurrentTasks(): Int = {
  // TODO support this method for MesosFineGrainedSchedulerBackend
link to a JIRA
| "total number of slots in the cluster currently.") | ||
| jobIdToNumTasksCheckFailures.putIfAbsent(jobId, 0) | ||
| val numCheckFailures = jobIdToNumTasksCheckFailures.get(jobId) + 1 | ||
| if (numCheckFailures < DAGScheduler.DEFAULT_MAX_CONSECUTIVE_NUM_TASKS_CHECK_FAILURES) { |
Should make DEFAULT_MAX_CONSECUTIVE_NUM_TASKS_CHECK_FAILURES configurable so users can specify unlimited retries if needed. Instead, we might want to fix the timeout since it is only relevant to cost.
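A sketch of how the limit could be exposed, following the ConfigBuilder pattern of the interval config above; the key name and default here are assumptions, not something defined in this PR, and the entry would live in Spark's internal config package:

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical key and default; the PR itself only has the hard-coded constant.
val BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES =
  ConfigBuilder("spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures")
    .doc("Number of consecutive max concurrent tasks check failures allowed before failing " +
      "a job submission.")
    .intConf
    .createWithDefault(40)
```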
Test build #94649 has finished for PR 22001 at commit
test this please
Test build #94658 has finished for PR 22001 at commit
Test build #94672 has finished for PR 22001 at commit
retest this please
Test build #94676 has finished for PR 22001 at commit
retest this please
@shaneknapp Is the timeout due to concurrent workload on Jenkins workers? If so, shall we reduce the concurrency (more wait in the queue but more robust test results)?
 * Check whether the barrier stage requires more slots (to be able to launch all tasks in the
 * barrier stage together) than the total number of active slots currently. Fail current check
 * if trying to submit a barrier stage that requires more slots than current total number. If
 * the check fails consecutively for three times for a job, then fail current job submission.
It seems I cannot find the code for "consecutively for three times", only maxFailureNumTasksCheck?
@mengxr it looks like the builds are just taking longer and longer. :( if this continues to be an issue, we'll need to bump the timeout in dev/run-tests-jenkins.py again. also, we JUST bumped the timeout ~20 days ago:
Test build #94687 has finished for PR 22001 at commit
test this please
@shaneknapp Maybe we could scan the test history and move some super stable tests to nightly. Apparently, it is not a solution for now. I'm giving another try:)
@mengxr that is easier said than done... :) once the 2.4 cut is done, it might be time to have a discussion on the dev@ list about build strategies and how we should proceed w/PRB testing.
test this please
Test build #94705 has finished for PR 22001 at commit
retest this please
Test build #94716 has finished for PR 22001 at commit
Just curious. It is very interesting to me since the recent three tries consistently cause a timeout failure at the same test. In addition, other PRs look successful without timeout.
Test build #94721 has finished for PR 22001 at commit
@kiszk Thanks for the note! I reverted the change in DAGSchedulerSuite. Let's try Jenkins again.
Test build #94754 has finished for PR 22001 at commit
Test build #94752 has finished for PR 22001 at commit
Test build #94801 has finished for PR 22001 at commit
LGTM. Merged into master. Thanks!
What changes were proposed in this pull request?
We shall check whether the barrier stage requires more slots (to be able to launch all tasks in the barrier stage together) than the total number of active slots currently, and fail fast if trying to submit a barrier stage that requires more slots than the current total number.
This PR proposes to add a new method getNumSlots() to get the total number of currently active slots in SchedulerBackend; support for this new method has been added to all the first-class scheduler backends except MesosFineGrainedSchedulerBackend.
How was this patch tested?
Added new test cases in BarrierStageOnSubmittedSuite.
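For context, a minimal sketch of the kind of job this check protects, using the RDD barrier API; with 4 barrier tasks and only 2 local slots, submission is now expected to fail fast (after the configured number of checks) rather than hang:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BarrierFailFastDemo {
  def main(args: Array[String]): Unit = {
    // Only 2 task slots available locally.
    val conf = new SparkConf().setAppName("barrier-fail-fast").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // A barrier stage with 4 tasks cannot launch all tasks together on 2 slots,
    // so the job submission should fail fast instead of waiting forever for slots.
    val rdd = sc.parallelize(1 to 100, numSlices = 4)
      .barrier()
      .mapPartitions(iter => iter.map(_ * 2))

    try {
      rdd.collect()
    } finally {
      sc.stop()
    }
  }
}
```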