[SPARK-24819][CORE] Fail fast when no enough slots to launch the barrier stage on job submitted #22001
Closed
Commits (12)
fca0176 Fail fast when no enough slots to launch the barrier stage on job sub… (jiangxb1987)
968975f fix test failure (jiangxb1987)
7acc9dd update (jiangxb1987)
6998b21 update tests (jiangxb1987)
eb689ac update (jiangxb1987)
8de1a4b add log (jiangxb1987)
458c78f update (jiangxb1987)
8b16c57 update (jiangxb1987)
9d4e232 minor updates (mengxr)
79330f4 revert DAGSchedulerSuite change (mengxr)
cb420e3 update (jiangxb1987)
c9036aa update (jiangxb1987)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -577,4 +577,31 @@ package object config { | |
| .timeConf(TimeUnit.SECONDS) | ||
| .checkValue(v => v > 0, "The value should be a positive time value.") | ||
| .createWithDefaultString("365d") | ||
|
|
||
| private[spark] val BARRIER_MAX_CONCURRENT_TASKS_CHECK_INTERVAL = | ||
| ConfigBuilder("spark.scheduler.barrier.maxConcurrentTasksCheck.interval") | ||
| .doc("Time in seconds to wait between a max concurrent tasks check failure and the next " + | ||
|
Member (review comment): nit:
Contributor (review comment): "a ... failure" |
||
| "check. A max concurrent tasks check ensures the cluster can launch more concurrent " + | ||
| "tasks than required by a barrier stage when a job is submitted. The check can fail if " + | ||
| "a cluster has just started and not enough executors have registered, so we wait for a " + | ||
| "little while and try to perform the check again. If the check fails more than the " + | ||
| "configured maximum number of times for a job, the current job submission fails. Note " + | ||
| "that this config only applies to jobs that contain one or more barrier stages; we " + | ||
| "won't perform the check on non-barrier jobs.") | ||
| .timeConf(TimeUnit.SECONDS) | ||
| .createWithDefaultString("15s") | ||
|
|
||
| private[spark] val BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES = | ||
| ConfigBuilder("spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures") | ||
| .doc("Number of max concurrent tasks check failures allowed before failing a job " + | ||
| "submission. A max concurrent tasks check ensures the cluster can launch more concurrent " + | ||
| "tasks than required by a barrier stage when a job is submitted. The check can fail if " + | ||
| "a cluster has just started and not enough executors have registered, so we wait for a " + | ||
| "little while and try to perform the check again. If the check fails more than the " + | ||
| "configured maximum number of times for a job, the current job submission fails. Note " + | ||
| "that this config only applies to jobs that contain one or more barrier stages; we " + | ||
| "won't perform the check on non-barrier jobs.") | ||
| .intConf | ||
| .checkValue(v => v > 0, "The max failures should be a positive value.") | ||
| .createWithDefault(40) | ||
| } | ||
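The retry behavior described by the two config docs above can be sketched in plain Scala. This is a minimal, hypothetical model and not the scheduler code from this PR: `CheckConfig`, `submitWithCheck`, `Submitted`, and `Rejected` are illustrative names, the current slot count comes from a caller-supplied function, and treating "enough" as at least as many slots as required is an assumption about the boundary case. The defaults mirror the `15s` interval and the `40` failures shown in the diff.

```scala
import scala.annotation.tailrec

// Hypothetical sketch of "check, wait, retry, then fail the submission".
object BarrierSlotCheckSketch {
  // Defaults mirror the values in the diff above.
  final case class CheckConfig(intervalSeconds: Long = 15L, maxFailures: Int = 40)

  sealed trait Outcome
  case object Submitted extends Outcome
  final case class Rejected(failures: Int) extends Outcome

  def submitWithCheck(
      requiredSlots: Int,
      currentSlots: () => Int,
      conf: CheckConfig,
      sleep: Long => Unit = s => Thread.sleep(s * 1000L)): Outcome = {
    @tailrec
    def loop(failures: Int): Outcome = {
      // Assumption: the check passes when the cluster has at least the required slots.
      if (currentSlots() >= requiredSlots) {
        Submitted
      } else {
        val newFailures = failures + 1
        // Fail the submission once the check has failed more than maxFailures times.
        if (newFailures > conf.maxFailures) {
          Rejected(newFailures)
        } else {
          sleep(conf.intervalSeconds) // wait before the next check
          loop(newFailures)
        }
      }
    }
    loop(0)
  }
}
```

Passing `sleep` as a parameter is a design choice for this sketch only: it lets the loop be exercised without real waiting.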
62 changes: 62 additions & 0 deletions
core/src/main/scala/org/apache/spark/scheduler/BarrierJobAllocationFailed.scala
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.scheduler | ||
|
|
||
| import org.apache.spark.SparkException | ||
|
|
||
| /** | ||
| * Exception thrown when a submitted job with barrier stage(s) fails a required check. | ||
| */ | ||
| private[spark] class BarrierJobAllocationFailed(message: String) extends SparkException(message) | ||
|
|
||
| private[spark] class BarrierJobUnsupportedRDDChainException | ||
| extends BarrierJobAllocationFailed( | ||
| BarrierJobAllocationFailed.ERROR_MESSAGE_RUN_BARRIER_WITH_UNSUPPORTED_RDD_CHAIN_PATTERN) | ||
|
|
||
| private[spark] class BarrierJobRunWithDynamicAllocationException | ||
| extends BarrierJobAllocationFailed( | ||
| BarrierJobAllocationFailed.ERROR_MESSAGE_RUN_BARRIER_WITH_DYN_ALLOCATION) | ||
|
|
||
| private[spark] class BarrierJobSlotsNumberCheckFailed | ||
| extends BarrierJobAllocationFailed( | ||
| BarrierJobAllocationFailed.ERROR_MESSAGE_BARRIER_REQUIRE_MORE_SLOTS_THAN_CURRENT_TOTAL_NUMBER) | ||
|
|
||
| private[spark] object BarrierJobAllocationFailed { | ||
|
|
||
| // Error message when running a barrier stage that has an unsupported RDD chain pattern. | ||
| val ERROR_MESSAGE_RUN_BARRIER_WITH_UNSUPPORTED_RDD_CHAIN_PATTERN = | ||
| "[SPARK-24820][SPARK-24821]: Barrier execution mode does not allow the following pattern of " + | ||
| "RDD chain within a barrier stage:\n1. Ancestor RDDs that have a different number of " + | ||
| "partitions from the resulting RDD (e.g. union()/coalesce()/first()/take()/" + | ||
| "PartitionPruningRDD). A workaround for first()/take() can be barrierRdd.collect().head " + | ||
| "(scala) or barrierRdd.collect()[0] (python).\n" + | ||
| "2. An RDD that depends on multiple barrier RDDs (e.g. barrierRdd1.zip(barrierRdd2))." | ||
|
|
||
| // Error message when running a barrier stage with dynamic resource allocation enabled. | ||
| val ERROR_MESSAGE_RUN_BARRIER_WITH_DYN_ALLOCATION = | ||
| "[SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for " + | ||
| "now. You can disable dynamic resource allocation by setting Spark conf " + | ||
| "\"spark.dynamicAllocation.enabled\" to \"false\"." | ||
|
|
||
| // Error message when running a barrier stage that requires more slots than the current total. | ||
| val ERROR_MESSAGE_BARRIER_REQUIRE_MORE_SLOTS_THAN_CURRENT_TOTAL_NUMBER = | ||
| "[SPARK-24819]: Barrier execution mode does not allow running a barrier stage that requires " + | ||
| "more slots than the total number of slots currently in the cluster. Please init a new " + | ||
| "cluster with more CPU cores or repartition the input RDD(s) to reduce the number of " + | ||
| "slots required to run this barrier stage." | ||
| } |
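The slot-related error message above offers two remedies: more CPU cores, or fewer partitions. A small arithmetic sketch may make the trade-off concrete. It is standalone Scala and does not call Spark; `BarrierSlotMath`, `totalSlots`, and `checkBarrierStage` are hypothetical names, and the model assumes that each executor contributes its core count divided by the CPUs reserved per task (rounded down) and that a barrier stage needs one slot per partition.

```scala
// Hypothetical slot arithmetic; an illustration, not Spark's implementation.
object BarrierSlotMath {
  // Assumption: an executor with c cores offers floor(c / cpusPerTask) slots.
  def totalSlots(executorCores: Seq[Int], cpusPerTask: Int): Int = {
    require(cpusPerTask > 0, "cpusPerTask must be positive")
    executorCores.map(_ / cpusPerTask).sum
  }

  // Assumption: a barrier stage needs one slot per partition, all at the same time.
  def checkBarrierStage(numPartitions: Int, slots: Int): Either[String, Unit] =
    if (numPartitions <= slots) {
      Right(())
    } else {
      Left(s"barrier stage needs $numPartitions slots but only $slots are available; " +
        s"repartition to at most $slots partitions or add CPU cores")
    }
}
```

For example, under these assumptions three executors with 4, 4, and 3 cores and 2 CPUs per task give 2 + 2 + 1 = 5 slots, so an 8-partition barrier stage would be rejected while a 5-partition one would fit.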
How about like this?