[SPARK-8881][SPARK-9260] Fix algorithm for scheduling executors on workers #7274
Changes from 11 commits
@@ -533,6 +533,7 @@ private[master] class Master(

/**
 * Schedule executors to be launched on the workers.
 * Returns an array containing number of cores assigned to each worker (None if scheduling fails)
 *
 * There are two modes of launching executors. The first attempts to spread out an application's
 * executors on as many workers as possible, while the second does the opposite (i.e. launch them

@@ -543,59 +544,108 @@ private[master] class Master(
 * multiple executors from the same application may be launched on the same worker if the worker
 * has enough cores and memory. Otherwise, each executor grabs all the cores available on the
 * worker by default, in which case only one executor may be launched on each worker.

Contributor
Can you add to this javadoc what the new return type is? Also, it would be good to add a short paragraph explaining the necessity of allocating N cores at a time, and give an example of where allocating 1 core at a time fails.
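(For reference, a minimal sketch of the failure mode that the example added to the javadoc below describes, using the same numbers; the value names here are illustrative only and not part of the patch.)

```scala
// Hypothetical numbers from the javadoc example: 4 workers x 16 cores,
// spark.cores.max = 48, spark.executor.cores = 16.
val numWorkers = 4
val coresMax = 48
val coresPerExecutor = 16

// One-core-at-a-time round-robin spreads the 48 cores evenly: 12 per worker.
val coresPerWorker = coresMax / numWorkers // 12

// 12 < 16, so no worker ever accumulates enough cores for even one executor.
val executorsLaunched = numWorkers * (coresPerWorker / coresPerExecutor) // 4 * 0 = 0
```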
 *
 * It is important to allocate coresPerExecutor on each worker at a time (instead of 1 core
 * at a time). Consider the following example: cluster has 4 workers with 16 cores each.
 * User requests 3 executors (spark.cores.max = 48, spark.executor.cores = 16). If 1 core is
 * allocated at a time, 12 cores from each worker would be assigned to each executor.
 * Since 12 < 16, no executors would launch [SPARK-8881].
 */
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
private[master] def scheduleExecutorsOnWorkers(
    app: ApplicationInfo,
    usableWorkers: Array[WorkerInfo],
    spreadOutApps: Boolean): Array[Int] = {
  // If the number of cores per executor is not specified, then we can just schedule
  // 1 core at a time since we expect a single executor to be launched on each worker
  val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)

Contributor
Then we should add a comment here to say that: […]
  val memoryPerExecutor = app.desc.memoryPerExecutorMB
  val numUsable = usableWorkers.length
  val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
  val assignedMemory = new Array[Int](numUsable) // Amount of memory to give to each worker
  var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
  var pos = 0
  var lastCoresToAssign = coresToAssign
  if (spreadOutApps) {
    // Try to spread out each app among all the workers, until it has all its cores
    for (app <- waitingApps if app.coresLeft > 0) {
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
          worker.coresFree >= app.desc.coresPerExecutor.getOrElse(1))
        .sortBy(_.coresFree).reverse
      val numUsable = usableWorkers.length
      val assigned = new Array[Int](numUsable) // Number of cores to give on each node
      var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
      var pos = 0
      while (toAssign > 0) {
        if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
          toAssign -= 1
          assigned(pos) += 1
        }
        pos = (pos + 1) % numUsable
    // Try to spread out executors among workers (sparse scheduling)
    while (coresToAssign > 0) {
      if (usableWorkers(pos).coresFree - assignedCores(pos) >= coresPerExecutor &&
          usableWorkers(pos).memoryFree - assignedMemory(pos) >= memoryPerExecutor) {
        coresToAssign -= coresPerExecutor
        assignedCores(pos) += coresPerExecutor
        assignedMemory(pos) += memoryPerExecutor

Contributor
So I stared at this loop for a little bit and I think it could bring us into an infinite loop. E.g. we have 3 workers, with 3, 3, and 4 free cores respectively, so that […]. Let me know if I'm missing something.

Contributor
(Same for the non-spread-out case.)

Member
Wouldn't […]? I haven't thought this through either, but are there race condition problems here too? As long as the worst case is just that resources that looked available aren't anymore and we fail to schedule, that's fine.

Contributor (Author)
"Resources that looked available aren't anymore and fail to schedule, that's fine." This is the assumption being made here. If the user didn't care about the size of the executor, they would skip executor.cores and the algorithm would proceed as before (best effort: one core at a time). If they do, we should either schedule as requested or not at all. If we care to be extra friendly, we could add a check to log a message from within the loop: "Not enough resources, please check spark.cores.max and spark.executor.cores"?

Member
Seems OK, but I think there is still an infinite loop problem here?

Contributor (Author)
We could potentially return the assignedCores we have thus far and proceed with scheduling. But as discussed earlier, we are better off failing than scheduling incorrectly. Do you feel otherwise?

Contributor (Author)
Didn't see your note. I would think that by failing and allowing the user to reconfigure, we would be doing them a favor. But I can see the value in scheduling whatever we can as well.

Contributor (Author)
Now we have both versions. We can choose to keep this or revert to the previous one.

Member
I think the previous behavior was to schedule as much as possible, since before it would only try to assign as many cores as are available, not necessarily as many as are requested. If so, I think it's best to retain that behavior.

Contributor
Yeah, I don't think it's really a user error. The contract of setting […]
      }
      // Now that we've decided how many cores to give on each node, let's actually give them
      for (pos <- 0 until numUsable if assigned(pos) > 0) {
        allocateWorkerResourceToExecutors(app, assigned(pos), usableWorkers(pos))
      pos = (pos + 1) % numUsable
      if (pos == 0) {
        if (lastCoresToAssign == coresToAssign) {
          return assignedCores

Member
I was going to say just break here, but you can't in Scala, eh? I think the doc about the return type needs to be updated if this is returned now.
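(Aside, not part of the patch: Scala indeed has no built-in break statement; the usual options are an early return from the enclosing method, as done here, or scala.util.control.Breaks, roughly:)

```scala
import scala.util.control.Breaks.{break, breakable}

breakable {
  for (i <- 1 to 10) {
    if (i == 3) break() // exits the enclosing breakable block
  }
}
```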
        }
        lastCoresToAssign = coresToAssign
      }
    }
  } else {
    // Pack each app into as few workers as possible until we've assigned all its cores
    for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
      for (app <- waitingApps if app.coresLeft > 0) {
        allocateWorkerResourceToExecutors(app, app.coresLeft, worker)
    // Pack executors into as few workers as possible (dense scheduling)
    while (coresToAssign > 0) {
      while (usableWorkers(pos).coresFree - assignedCores(pos) >= coresPerExecutor &&
          usableWorkers(pos).memoryFree - assignedMemory(pos) >= memoryPerExecutor &&
          coresToAssign > 0) {
        coresToAssign -= coresPerExecutor
        assignedCores(pos) += coresPerExecutor
        assignedMemory(pos) += memoryPerExecutor
      }
      pos = (pos + 1) % numUsable
      if (pos == 0) {
        if (lastCoresToAssign == coresToAssign) {
          return assignedCores
        }
        lastCoresToAssign = coresToAssign

Contributor
There's a lot of duplicated code between the two cases now. I think there's a nicer way to express this. The termination condition is really (1) we have allocated all the cores, or (2) there are no more workers with enough resources. Something like the following: this basically says you just keep narrowing the available set of workers until you have assigned all the cores. If it turns out that there are no more available workers, then we should just stop. The other thing is that I merged the spread-out vs. non-spread-out cases into a common while loop.
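(The snippet that originally accompanied this comment is not preserved in this export; a rough sketch of the shape being described, reusing the patch's local variables and with a hypothetical canLaunchExecutor helper, might look like:)

```scala
// Returns true if worker `pos` can still accept one more executor for this app.
def canLaunchExecutor(pos: Int): Boolean = {
  coresToAssign >= coresPerExecutor &&
  usableWorkers(pos).coresFree - assignedCores(pos) >= coresPerExecutor &&
  usableWorkers(pos).memoryFree - assignedMemory(pos) >= memoryPerExecutor
}

// Keep narrowing the set of workers that can still launch an executor;
// stop once all cores are assigned or no worker has enough resources left.
var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
while (freeWorkers.nonEmpty) {
  freeWorkers.foreach { pos =>
    var keepScheduling = true
    while (keepScheduling && canLaunchExecutor(pos)) {
      coresToAssign -= coresPerExecutor
      assignedCores(pos) += coresPerExecutor
      assignedMemory(pos) += memoryPerExecutor
      // When spreading out, move on to the next worker after one executor's worth of cores.
      if (spreadOutApps) {
        keepScheduling = false
      }
    }
  }
  freeWorkers = freeWorkers.filter(canLaunchExecutor)
}
```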

Contributor (Author)
Definitely cleaner. Will try and incorporate this. Thanks!
      }
    }
  }
  assignedCores
}

/**
 * Schedule and launch executors on workers
 */
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // Filter out workers that don't have enough resources to launch an executor
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

Contributor
The […] I think it would simplify things if we just moved the declaration of usableWorkers into scheduleExecutorsOnWorkers.

Contributor (Author)
usableWorkers is being used in startExecutorsOnWorkers as well.

Contributor
Hm, you're right. In the latest comment I've suggested an alternative that removes this logical coupling.

    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}

/**
 * Allocate a worker's resources to one or more executors.
 * @param app the info of the application which the executors belong to
 * @param coresToAllocate cores on this worker to be allocated to this application
 * @param assignedCores number of cores on this worker for this application
 * @param coresPerExecutor number of cores per executor
 * @param worker the worker info
 */
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    coresToAllocate: Int,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  val memoryPerExecutor = app.desc.memoryPerExecutorMB
  val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(coresToAllocate)
  var coresLeft = coresToAllocate
  while (coresLeft >= coresPerExecutor && worker.memoryFree >= memoryPerExecutor) {
    val exec = app.addExecutor(worker, coresPerExecutor)
    coresLeft -= coresPerExecutor
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {

Contributor
Actually, doesn't this change default behavior? Previously a single executor would acquire all cores on a worker. Now there will be N executors on the worker, each occupying exactly one core.

Contributor
We might have to pass in an option of coresPerExecutor […]

Contributor (Author)
Assignment of cores per worker is done by the time this method is invoked. coresPerExecutor is 1 by default. If we have N workers and N executors with coresPerExecutor = 1 and we are spreading out, assignedCores = 1 and we would launch 1 executor per worker with 1 core each, as expected.

Contributor
That's not true (see documentation). If […]. Example: we have 3 workers with 8 cores each, […]

Contributor (Author)
Ah, I see. I thought the default behavior had inherited the bug, didn't realize it was intentional. Will fix this in a few mins.
    val exec = app.addExecutor(worker, coresToAssign)
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
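(To make the explicit-vs-default split discussed in the thread above concrete, here is a tiny standalone sketch of the same arithmetic as the numExecutors / coresToAssign lines; the executorsFor helper is illustrative only, not part of the patch.)

```scala
// How many executors to launch on one worker, and with how many cores each,
// given the cores assigned to that worker and the optional per-executor setting.
def executorsFor(assignedCores: Int, coresPerExecutor: Option[Int]): (Int, Int) = {
  val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
  val coresEach = coresPerExecutor.getOrElse(assignedCores)
  (numExecutors, coresEach)
}

executorsFor(8, Some(2)) // (4, 2): four 2-core executors on this worker
executorsFor(8, None)    // (1, 8): one executor grabs all 8 assigned cores (the default)
```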

Nit, in case another change is needed: this could be a @return tag.

Also, this shouldn't say None anymore.
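(One possible way to address both nits, shown only as an illustration of the suggested shape, not the wording that was eventually merged:)

```scala
/**
 * Schedule executors to be launched on the workers.
 * ...
 * @return an array containing the number of cores assigned to each usable worker
 */
```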