[WIP][SPARK-24375][Prototype] Support barrier scheduling #21494
Conversation
Test build #91471 has finished for PR 21494 at commit
Hi @jiangxb1987, can you explain a bit more about what this is?
@Ngone51 You can refer to the SPIP that xiangrui proposed in SPARK-24374 for the basic background and major goal of barrier scheduling, and to SPARK-24375 for a design sketch. If you have further comments, please feel free to discuss them on the JIRA (recommended, since that works better for anything we may want to revisit later) or here :)
galv left a comment:
I haven't understood everything yet. I'll have to return to this later to review fully.
val tc = TaskContext.get.asInstanceOf[org.apache.spark.barrier.BarrierTaskContext]
// If we don't get the expected taskInfos, the job shall abort due to stage failure.
if (tc.hosts().length != 2) {
  throw new SparkException("Expected taksInfos length is 2, actual length is " +
taksInfos -> taskInfos
  throw new SparkException("Expected taksInfos length is 2, actual length is " +
    s"${tc.hosts().length}.")
}
// println(tc.getTaskInfos().toList)
Remove comment
shuffle.DiskBytesSpilled = 0
_accumulatorRegistry.clear()

if (isBarrier):
Style: if (isBarrier): -> if isBarrier:
if (isBarrier):
    port = 25333 + 2 + 2 * taskContext._partitionId
    paras = GatewayParameters(port=port)
paras -> params
_accumulatorRegistry.clear()

if (isBarrier):
    port = 25333 + 2 + 2 * taskContext._partitionId
I recommend using DEFAULT_PORT and DEFAULT_PYTHON_PORT. They are exposed as part of the public API of py4j: https://github.com/bartdag/py4j/blob/216432d859de41441f0d1a0d55b31b5d8d09dd28/py4j-python/src/py4j/java_gateway.py#L54
By the way, acquiring ports like this is a little hacky and may require more thought.
// TODO: We should kill any running task attempts when the task set manager becomes a zombie.
private[scheduler] var isZombie = false

private[scheduler] lazy val barrierCoordinator = {
I recommend adding a return type here for readability.
+1 @galv
We also have barrierCoordinator with type RpcEndpointRef at each TaskContext, so it's better to add return type for both.
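For illustration, a minimal sketch of what the suggested explicit type could look like, assuming the coordinator ref is resolved through `RpcUtils.makeDriverRef`; the endpoint name and the surrounding class here are hypothetical, not the PR's actual wiring:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rpc.{RpcEndpointRef, RpcEnv}
import org.apache.spark.util.RpcUtils

class BarrierCoordinatorRefSketch(conf: SparkConf, rpcEnv: RpcEnv) {
  // Declaring the type up front tells readers the lazy val resolves to an RpcEndpointRef
  // without having to inspect its body.
  private lazy val barrierCoordinator: RpcEndpointRef =
    RpcUtils.makeDriverRef("barrierSync", conf, rpcEnv) // endpoint name is illustrative
}
```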
import org.apache.spark.metrics.MetricsSystem
import org.apache.spark.util.RpcUtils

class BarrierTaskContext(
BarrierTaskContextImpl?
case IncreaseEpoch(previousEpoch) =>
  if (previousEpoch == epoch) {
    syncRequests.foreach(_.sendFailure(new RuntimeException(
      s"The coordinator cannot get all barrier sync requests within $timeout ms.")))
Have we considered increasing the timeout incrementally when we can't get all barrier sync requests within an epoch?
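For what it's worth, a rough sketch of what an incrementally increasing timeout could look like; the class and field names below are hypothetical and not part of this PR:

```scala
// Hypothetical helper: grow the barrier-sync timeout after each failed epoch
// instead of failing every retry at the same fixed value.
class BackoffSyncTimeout(initialMs: Long, maxMs: Long) {
  private var currentMs: Long = initialMs

  // Timeout to use for the current epoch.
  def timeoutMs: Long = currentMs

  // Called when the coordinator fails to collect all sync requests in time.
  def onEpochTimeout(): Unit = {
    currentMs = math.min(currentMs * 2, maxMs)
  }
}
```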
  syncRequests += context
  replyIfGetAllSyncRequest()
}
How about something like:

if (epoch == this.epoch) {
  ...
} else { // Received RpcCallContext from failed previousEpoch.
  context.sendFailure(new RuntimeException(
    s"The coordinator cannot get all barrier sync requests within $timeout ms."))
}

  taskScheduler.cancelTasks(stageId, interruptThread = false)
} catch {
  case e: UnsupportedOperationException =>
    logInfo(s"Could not cancel tasks for stage $stageId", e)
Under barrier execution, will it be a problem if we cannot cancel tasks?
taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
// Skip the barrier taskSet if the available slots are less than the number of pending tasks.
if (taskSet.isBarrier && availableSlots < taskSet.numTasks) {
  // Skip the launch process.
Logging something instead of silently passing?
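A small sketch of what that logging could look like, continuing the snippet above (the message wording is just an example; `logInfo` is Spark's standard `Logging` helper):

```scala
if (taskSet.isBarrier && availableSlots < taskSet.numTasks) {
  // Surface why the barrier stage is not being launched in this scheduling round.
  logInfo(s"Skipping barrier taskSet ${taskSet.name}: only $availableSlots slots are " +
    s"available, but all ${taskSet.numTasks} tasks must be launched together.")
}
```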
Is there a way to propagate this info to the scheduler and resource manager layer for preemptive scheduling?
timer.schedule(new TimerTask {
  override def run(): Unit = {
    // self can be null after this RPC endpoint is stopped.
    if (self != null) self.send(IncreaseEpoch(currentEpoch))
Once this epoch fails to sync, the stage will be failed and resubmitted. I think it will begin with a new task set, so IncreaseEpoch seems useless because it doesn't really increase the epoch?
Maybe register a sequence and hierarchy of task-level barriers?
// Write out the TaskContextInfo
val isBarrier = context.isInstanceOf[BarrierTaskContext]
dataOut.writeBoolean(isBarrier)
if (isBarrier) {
So this would be language-dependent? Would we need something for the R runner too?
    timeout: Long,
    override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint {

  private var epoch = 0
Will the epoch value be logged on the driver and executors? It would be useful for diagnosing the upper-level MPI program.
/** Returns the default Spark timeout to use for RPC ask operations. */
def askRpcTimeout(conf: SparkConf): RpcTimeout = {
-  RpcTimeout(conf, Seq("spark.rpc.askTimeout", "spark.network.timeout"), "120s")
+  RpcTimeout(conf, Seq("spark.rpc.askTimeout", "spark.network.timeout"), "900s")
Why hard-code this change? Couldn't you have set this at runtime if you needed it increased? I'm concerned about it breaking backwards compatibility with jobs that for whatever reason depend on the 120 second timeout.
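For context, the ask timeout can already be raised per application without changing the built-in default, e.g. via SparkConf or the equivalent `--conf` flag on spark-submit; this is standard Spark configuration, not something introduced by this PR:

```scala
import org.apache.spark.SparkConf

// Raise the RPC ask timeout for this job only, leaving the 120s default untouched.
val conf = new SparkConf()
  .setAppName("barrier-prototype-job")
  .set("spark.rpc.askTimeout", "900s")
```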
/**
 * An RDD that supports running MPI programme.
programme -> program
What changes were proposed in this pull request?
This PR adds the new RDDBarrier and BarrierTaskContext to support barrier scheduling in Spark, and modifies how job scheduling works to accommodate the new feature.
Note: this is a prototype to facilitate discussion; it's not meant to be the final design. It just shows one way that might work.
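For readers skimming the diff, here is a rough sketch of how a barrier stage might be written against the prototype, pieced together from the test code quoted above; the `rdd.barrier()` entry point is assumed from the RDDBarrier name and may not match the prototype's exact API:

```scala
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.barrier.BarrierTaskContext

// All tasks of the barrier stage are launched together (or the launch is skipped),
// and each task can see the hosts of its peers, e.g. to bootstrap an MPI job.
def runBarrierStage(sc: SparkContext): Array[String] = {
  sc.parallelize(1 to 100, numSlices = 2)
    .barrier()                                  // assumed entry point returning an RDDBarrier
    .mapPartitions { iter =>
      val tc = TaskContext.get.asInstanceOf[BarrierTaskContext]
      Iterator.single(tc.hosts().mkString(","))  // hosts() appears in the PR's test code
    }
    .collect()
}
```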
How was this patch tested?
Simple unit test and integration test.