Skip to content

Conversation

@jiangxb1987
Copy link
Contributor

What changes were proposed in this pull request?

We shall check whether the barrier stage requires more slots (to be able to launch all tasks in the barrier stage together) than the total number of active slots currently, and fail fast if trying to submit a barrier stage that requires more slots than current total number.

This PR proposes to add a new method getNumSlots() to try to get the total number of currently active slots in SchedulerBackend, support of this new method has been added to all the first-class scheduler backends except MesosFineGrainedSchedulerBackend.

How was this patch tested?

Added new test cases in BarrierStageOnSubmittedSuite.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this breaks barrier execution on mesos completely? (since available slot is 0 it will just fail)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but finegrained is being deprecated...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only MesosFineGrainedSchedulerBackend shall break, we still support MesosCoarseGrainedSchedulerBackend

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiangxb1987 Could you create a JIRA and link here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be saved instead of re-compute on each stage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the method description of SchedulerBackend.getNumSlots():

   * Note that please don't cache the value returned by this method, because the number can change
   * due to add/remove executors.

It shall be fine to cache that within different stages of a job, but it requires a few more changes that will make the current PR more complicated.

@SparkQA
Copy link

SparkQA commented Aug 5, 2018

Test build #94245 has finished for PR 22001 at commit 5253005.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

ok to test

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94278 has finished for PR 22001 at commit cc6c572.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94283 has finished for PR 22001 at commit cc6c572.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94307 has finished for PR 22001 at commit cc6c572.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about maxConcurrentTasks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executorDataMap.values.map { executor =>
  executor.totalCores / scheduler.CPUS_PER_TASK
}.sum

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a test verifies if total slots are good but some are running other jobs, we shouldn't fail the barrier job.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this wait code to barrier suite, because it is only required there

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a unit test for getNumSlots.

Copy link
Contributor

@mengxr mengxr Aug 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should tolerate temporarily unavailability here by adding a wait or retry logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiangxb1987 Could you create a JIRA and link here?

@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94490 has finished for PR 22001 at commit bf0eccc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94491 has finished for PR 22001 at commit 825d2d9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Get the max number of tasks that can be concurrent launched currently.
Copy link
Member

@kiszk kiszk Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about like this?

 * Get the max number of tasks that can be concurrently launched when the method is called.
 * Note that please don't cache the value returned by this method, because the number can be
 * changed by adding/removing executors. 


private[spark] val BARRIER_MAX_CONCURRENT_TASKS_CHECK_INTERVAL =
ConfigBuilder("spark.scheduler.barrier.maxConcurrentTasksCheck.interval")
.doc("Time in seconds to wait between a max concurrent tasks check failure and the next " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a max -> max?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"a ... failure"

private[scheduler] val jobIdToNumTasksCheckFailures = new ConcurrentHashMap[Int, Int]

/**
* Time in seconds to wait between a max concurrent tasks check failure and the next check.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a max -> max?

logWarning("The job requires to run a barrier stage that requires more slots than the " +
"total number of slots in the cluster currently.")
jobIdToNumTasksCheckFailures.putIfAbsent(jobId, 0)
val numCheckFailures = jobIdToNumTasksCheckFailures.get(jobId) + 1
Copy link
Member

@kiszk kiszk Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it OK while this increment is not atomic?
In the following scenario, the value may not be correct

  1. We assume jobIdToNumTasksCheckFailures(jobId) = 1
  2. Thread A executes L963, then numCheckFailures = 2
  3. Thread B executes L963, then numCheckFailures = 2
  4. Thread B executes L964 and L965, then jobIdToNumTasksCheckFailures(jobId) has 2.
  5. Thread A executes L964 and L965, then jobIdToNumTasksCheckFailures(jobId) has 2.

Since two threads detected failure, we expect listener.jobFailed(e) is called. But, it is not called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Use atomic updates from ConcurrentHashMap. Update the counter and then check max failures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiszk IIUC, there's exactly only one thread in eventLoop, so, the scenario mentioned above will not happen. And I even feel it is no need to use ConcurrentHashMap for jobIdToNumTasksCheckFailures at all. @jiangxb1987

@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94489 has finished for PR 22001 at commit 0df8f74.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94493 has finished for PR 22001 at commit eb689ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94495 has finished for PR 22001 at commit 8de1a4b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Aug 10, 2018

test this please


private[spark] val BARRIER_MAX_CONCURRENT_TASKS_CHECK_INTERVAL =
ConfigBuilder("spark.scheduler.barrier.maxConcurrentTasksCheck.interval")
.doc("Time in seconds to wait between a max concurrent tasks check failure and the next " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"a ... failure"

"to jobs that contain one or more barrier stages, we won't perform the check on " +
"non-barrier jobs.")
.timeConf(TimeUnit.SECONDS)
.createWithDefaultString("10s")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you make the default higher like 30s? This is to cover the case when applications starts immediately with a barrier while master is adding new executors. Let me know if this won't happen.

// HadoopRDD whose underlying HDFS files have been deleted.
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
case e: Exception if e.getMessage ==
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

== -> .contains() in case the error message is nested

} catch {
case e: Exception if e.getMessage ==
DAGScheduler.ERROR_MESSAGE_BARRIER_REQUIRE_MORE_SLOTS_THAN_CURRENT_TOTAL_NUMBER =>
logWarning("The job requires to run a barrier stage that requires more slots than the " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include jobId, stageId, request slots, and total slots in the log message.

logWarning("The job requires to run a barrier stage that requires more slots than the " +
"total number of slots in the cluster currently.")
jobIdToNumTasksCheckFailures.putIfAbsent(jobId, 0)
val numCheckFailures = jobIdToNumTasksCheckFailures.get(jobId) + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Use atomic updates from ConcurrentHashMap. Update the counter and then check max failures.

)
return
} else {
listener.jobFailed(e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you expect the same job submitted again? if not, we should remove the key from the hashmap.

/**
* Number of max concurrent tasks check failures for each job.
*/
private[scheduler] val jobIdToNumTasksCheckFailures = new ConcurrentHashMap[Int, Int]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do entries in this map get cleaned?

// Submit a job to trigger some tasks on active executors.
testSubmitJob(sc, rdd)

eventually(timeout(5.seconds)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe safer to let the task sleep longer and cancel the task one the conditions are met.

}

override def maxNumConcurrentTasks(): Int = {
// TODO support this method for MesosFineGrainedSchedulerBackend
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link to a JIRA

"total number of slots in the cluster currently.")
jobIdToNumTasksCheckFailures.putIfAbsent(jobId, 0)
val numCheckFailures = jobIdToNumTasksCheckFailures.get(jobId) + 1
if (numCheckFailures < DAGScheduler.DEFAULT_MAX_CONSECUTIVE_NUM_TASKS_CHECK_FAILURES) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should make DEFAULT_MAX_CONSECUTIVE_NUM_TASKS_CHECK_FAILURES configurable so users can specify unlimited retry if needed. Instead, we might want to fix the timeout since it is only relevant to cost.

@SparkQA
Copy link

SparkQA commented Aug 12, 2018

Test build #94649 has finished for PR 22001 at commit 8b16c57.

  • This patch fails from timeout after a configured wait of `340m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Aug 13, 2018

test this please

@SparkQA
Copy link

SparkQA commented Aug 13, 2018

Test build #94658 has finished for PR 22001 at commit 9d4e232.

  • This patch fails from timeout after a configured wait of `340m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 13, 2018

Test build #94672 has finished for PR 22001 at commit 9d4e232.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Aug 13, 2018

Test build #94676 has finished for PR 22001 at commit 9d4e232.

  • This patch fails from timeout after a configured wait of `340m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

retest this please

@mengxr
Copy link
Contributor

mengxr commented Aug 13, 2018

@shaneknapp Is the timeout due to concurrent workload on Jenkins workers? If so, shall we reduce the concurrency (more wait in the queue but more robust test result)?

* Check whether the barrier stage requires more slots (to be able to launch all tasks in the
* barrier stage together) than the total number of active slots currently. Fail current check
* if trying to submit a barrier stage that requires more slots than current total number. If
* the check fails consecutively for three times for a job, then fail current job submission.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems I do not find the code about "consecutively for three times", but only maxFailureNumTasksCheck ?

@shaneknapp
Copy link
Contributor

shaneknapp commented Aug 13, 2018

@mengxr it looks like the builds are just taking longer and longer. :(

if this continues to be an issue, we'll need to bump the timeout in dev/run-tests-jenkins.py again. also, we JUST bumped the timeout ~20 days ago:
https://github.com/apache/spark/pull/21845/files/08b4ebe6a278f4e12eff95a9109803ed88a2c25b..51f8792007672899324c6a615be55f0179284401

@SparkQA
Copy link

SparkQA commented Aug 13, 2018

Test build #94687 has finished for PR 22001 at commit 9d4e232.

  • This patch fails from timeout after a configured wait of `340m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Aug 13, 2018

test this please

@mengxr
Copy link
Contributor

mengxr commented Aug 13, 2018

@shaneknapp Maybe we could scan the test history and move some super stable tests to nightly. Apparently, it is not a solution for now. I'm giving another try:)

@shaneknapp
Copy link
Contributor

@mengxr that is easier said than done... :)

once the 2.4 cut is done, it might be time to have a discussion on the dev@ list about build strategies and how we should proceed w/PRB testing.

@mengxr
Copy link
Contributor

mengxr commented Aug 14, 2018

test this please

@SparkQA
Copy link

SparkQA commented Aug 14, 2018

Test build #94705 has finished for PR 22001 at commit 9d4e232.

  • This patch fails from timeout after a configured wait of `340m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 14, 2018

Test build #94716 has finished for PR 22001 at commit 9d4e232.

  • This patch fails from timeout after a configured wait of `340m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Aug 14, 2018

Just curious.

It is very interesting to me since the recent three tries consistently cause a timeout failure at the same test.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94687
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94705
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94716

In addition, other PRs look successful without timeout.

[info] - abort the job if total size of results is too large (1 second, 122 milliseconds)
Exception in thread "task-result-getter-3" java.lang.Error: java.lang.InterruptedException
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
	at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:206)
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:222)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
	at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:115)
	at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:701)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	... 2 more

@SparkQA
Copy link

SparkQA commented Aug 14, 2018

Test build #94721 has finished for PR 22001 at commit 9d4e232.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Aug 14, 2018

@kiszk Thanks for the note! I reverted the change in DAGSchedulerSuite. Let's try Jenkins again.

@SparkQA
Copy link

SparkQA commented Aug 14, 2018

Test build #94754 has finished for PR 22001 at commit cb420e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DAGSchedulerSuiteDummyException extends Exception

@SparkQA
Copy link

SparkQA commented Aug 14, 2018

Test build #94752 has finished for PR 22001 at commit 79330f4.

  • This patch fails from timeout after a configured wait of `340m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94801 has finished for PR 22001 at commit c9036aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Aug 15, 2018

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in bfb7439 Aug 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants