[SPARK-8881][SPARK-9260] Fix algorithm for scheduling executors on workers #7274
Changes from 7 commits
@@ -544,58 +544,82 @@ private[master] class Master(
 * has enough cores and memory. Otherwise, each executor grabs all the cores available on the
 * worker by default, in which case only one executor may be launched on each worker.
 */
private def startExecutorsOnWorkers(): Unit = {
// Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
// in the queue, then the second app, etc.
Contributor: remove this
private[master] def scheduleExecutorsOnWorkers(app: ApplicationInfo,
Contributor: please keep this private (or do you plan to add tests for this?)
    usableWorkers: Array[WorkerInfo], spreadOutApps: Boolean): Array[Int] = {
Contributor: style: ...
val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
Contributor: then we should add a comment here to say that: ...
val memoryPerExecutor = app.desc.memoryPerExecutorMB
val numUsable = usableWorkers.length
val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
val assignedMemory = new Array[Int](numUsable) // Amount of memory to give to each worker
var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
Contributor: can you call this ...
var pos = 0
if (spreadOutApps) {
// Try to spread out each app among all the workers, until it has all its cores
for (app <- waitingApps if app.coresLeft > 0) {
val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
.filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
worker.coresFree >= app.desc.coresPerExecutor.getOrElse(1))
.sortBy(_.coresFree).reverse
val numUsable = usableWorkers.length
val assigned = new Array[Int](numUsable) // Number of cores to give on each node
var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
var pos = 0
while (toAssign > 0) {
if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
toAssign -= 1
assigned(pos) += 1
}
pos = (pos + 1) % numUsable
}
// Now that we've decided how many cores to give on each node, let's actually give them
for (pos <- 0 until numUsable if assigned(pos) > 0) {
allocateWorkerResourceToExecutors(app, assigned(pos), usableWorkers(pos))
// Try to spread out executors among workers (sparse scheduling)
while (toAssign > 0) {
if (usableWorkers(pos).coresFree - assignedCores(pos) >= coresPerExecutor &&
usableWorkers(pos).memoryFree - assignedMemory(pos) >= memoryPerExecutor) {
toAssign -= coresPerExecutor
Member: If I understand what you're trying to change, this won't help. If there aren't enough cores on any worker, then this becomes an infinite loop.

Contributor (author): Please read the code carefully.

Member: Eh, sorry, that is not addressing my question, but I think I see the situation now, i.e. I need 16 cores for 2 8-core executors and I have 4 workers, so each fails to be enough to cause an executor to launch anywhere? An example would be really helpful, or a test. If so, I think this is also fixable by just considering no more workers than executors.

Member: Also, yes, I see you still have the filtering on cores available, so this shouldn't keep looping over workers, right? Unless the available count can drop while this is in progress, but that is either not a problem or already a problem, so not directly relevant.

Contributor (author): Consider the following: 4 workers, each with 16 cores, spark.cores.max=48, spark.executor.cores=16. When we spread out, we allocate one core at a time and in doing so end up allocating 12 cores from each worker. First, we ended up ignoring spark.executor.cores during allocation, which isn't right. Second, when the condition while (coresLeft >= coresPerExecutor) is checked, coresLeft is 12 and coresPerExecutor is 16. As a result, executors don't launch.

Contributor (author): Yep, that's right.

Member: Yes, makes sense. Is it maybe more direct to never spread the allocation over more than 3 workers in this case, since only 3 executors are needed? Same effect, but I also see the value in allocating whole executors of cores at a time for clarity.

Contributor (author): Yep. Allocating spark.executor.cores at a time is cleaner and directly enforces the semantics.
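To make the 4-worker, 16-core scenario above concrete, here is a small, self-contained Scala sketch (not the PR code; the array bookkeeping only loosely mirrors the patch) contrasting one-core-at-a-time allocation with executor-sized allocation under spread-out scheduling:

```scala
// Assumed setup from the discussion: 4 workers with 16 free cores each,
// spark.cores.max = 48, spark.executor.cores = 16.
object SpreadOutExample {
  def main(args: Array[String]): Unit = {
    val coresFree = Array(16, 16, 16, 16)
    val coresPerExecutor = 16
    val coresMax = 48

    // Old behavior: hand out one core at a time, round-robin.
    val oneAtATime = new Array[Int](coresFree.length)
    var toAssign = math.min(coresMax, coresFree.sum) // 48
    var pos = 0
    while (toAssign > 0) {
      if (coresFree(pos) - oneAtATime(pos) > 0) {
        toAssign -= 1
        oneAtATime(pos) += 1
      }
      pos = (pos + 1) % coresFree.length
    }
    // Each worker ends up with 12 assigned cores; 12 < 16, so no
    // 16-core executor can ever be launched from these assignments.
    println(oneAtATime.mkString(", ")) // 12, 12, 12, 12

    // New behavior: hand out cores in executor-sized chunks.
    val perExecutor = new Array[Int](coresFree.length)
    toAssign = math.min(coresMax, coresFree.sum)
    pos = 0
    while (toAssign > 0) {
      if (coresFree(pos) - perExecutor(pos) >= coresPerExecutor) {
        toAssign -= coresPerExecutor
        perExecutor(pos) += coresPerExecutor
      }
      pos = (pos + 1) % coresFree.length
    }
    println(perExecutor.mkString(", ")) // 16, 16, 16, 0
  }
}
```

With executor-sized chunks, three whole 16-core executors fit exactly; with single cores, every worker ends up 4 cores short of an executor.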
assignedCores(pos) += coresPerExecutor
assignedMemory(pos) += memoryPerExecutor
Contributor: So I stared at this loop for a little bit and I think it could bring us into an infinite loop. E.g. we have 3 workers with 3, 3, and 4 free cores respectively, so that ... Let me know if I'm missing something.

Contributor: (same for the non-spread-out case)

Member: Wouldn't ...? I haven't thought this through either, but are there race condition problems here too? As long as the worst case is just that resources that looked available aren't anymore and we fail to schedule, that's fine.

Contributor (author): "Resources that looked available aren't anymore and fail to schedule, that's fine." This is the assumption being made here. If the user didn't care about the size of the executor, they would skip executor.cores and the algorithm would proceed as before (best-effort: one core at a time). If they do, we should either schedule as requested or not at all. If we care to be extra friendly, we could add a check to log a message from within the loop: "Not enough resources, please check spark.cores.max and spark.executor.cores"?

Member: Seems OK, but I think there is an infinite loop problem here still?

Contributor (author): We could potentially return the assignedCores that we have thus far and proceed with scheduling. But as discussed earlier, we are better off failing than scheduling incorrectly. Do you feel otherwise?

Contributor (author): Didn't see your note. I would think that by failing and allowing the user to reconfigure, we would be doing them a favor. But I can see the value in scheduling whatever we can as well.

Contributor (author): Now we have both versions. We can choose to keep this or revert to the previous one.

Member: I think the previous behavior was to schedule as much as possible, since before it would only try to assign as many cores as are available, not necessarily as many as are requested. If so, I think it's best to retain that behavior.

Contributor: Yeah, I don't think it's really a user error. The contract of setting ...
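For reference, a minimal sketch of one way the spread-out loop could be made to terminate while still scheduling as much as possible (the best-effort behavior argued for above). The function name and parameters are illustrative, not the Master's actual fields:

```scala
// Stop when either all requested cores are assigned or a full pass over the
// workers makes no progress (no worker can fit another whole executor).
def assignSpreadOut(
    coresFree: Array[Int],
    memoryFree: Array[Int],
    coresPerExecutor: Int,
    memoryPerExecutor: Int,
    coresToAssign: Int): Array[Int] = {
  val assignedCores = new Array[Int](coresFree.length)
  val assignedMemory = new Array[Int](coresFree.length)
  var toAssign = coresToAssign
  var progress = true
  while (toAssign >= coresPerExecutor && progress) {
    progress = false
    // One round-robin pass: give at most one executor's worth to each worker.
    for (pos <- coresFree.indices if toAssign >= coresPerExecutor) {
      if (coresFree(pos) - assignedCores(pos) >= coresPerExecutor &&
          memoryFree(pos) - assignedMemory(pos) >= memoryPerExecutor) {
        toAssign -= coresPerExecutor
        assignedCores(pos) += coresPerExecutor
        assignedMemory(pos) += memoryPerExecutor
        progress = true
      }
    }
  }
  assignedCores
}
```

The `progress` flag is the key difference from the loop in the diff: leftover cores that cannot form a whole executor no longer keep the loop spinning.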
}
pos = (pos + 1) % numUsable
}
} else {
// Pack each app into as few workers as possible until we've assigned all its cores
for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
for (app <- waitingApps if app.coresLeft > 0) {
allocateWorkerResourceToExecutors(app, app.coresLeft, worker)
// Pack executors into as few workers as possible (dense scheduling)
while (toAssign > 0) {
Member: This branch shouldn't change, right? It's now identical to the above.

Contributor (author): Not identical. In one we are spreading out, in the other we aren't.

Member: It is the while vs. if, I see. But then this is missing a condition on toAssign in the inner loop, and over-allocates.

Member: This still has a problem: say I want to assign 4 cores for one executor but there are 24 free (and plenty of memory). The inner while loop will assign 24 cores to one worker and get 6 executors.

Contributor (author): Good catch. Missing a check on toAssign in the inner loop.
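A tiny standalone trace of the over-allocation case described above (one 4-core executor requested, 24 cores free on the first worker); all names here are local to the sketch and can be run as a Scala script:

```scala
val coresPerExecutor = 4
var toAssign = 4          // only one executor's worth of cores is wanted
val workerCoresFree = 24
var assignedCores = 0

// Without the toAssign check, the condition stays true until all 24 free
// cores are taken (6 executors). With the check, the loop stops after one
// executor, as intended.
while (workerCoresFree - assignedCores >= coresPerExecutor && toAssign > 0) {
  toAssign -= coresPerExecutor
  assignedCores += coresPerExecutor
}
println(assignedCores)    // 4
```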
while (usableWorkers(pos).coresFree - assignedCores(pos) >= coresPerExecutor &&
usableWorkers(pos).memoryFree - assignedMemory(pos) >= memoryPerExecutor &&
toAssign > 0) {
toAssign -= coresPerExecutor
assignedCores(pos) += coresPerExecutor
assignedMemory(pos) += memoryPerExecutor
}
pos = (pos + 1) % numUsable
}
}
assignedCores
}
/**
 * Schedule and launch executors on workers
 */
private def startExecutorsOnWorkers(): Unit = {
// Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
// in the queue, then the second app, etc.
for (app <- waitingApps if app.coresLeft > 0) {
val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
.filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
worker.coresFree >= coresPerExecutor)
.sortBy(_.coresFree).reverse
val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
Contributor: The ... I think it would simplify things if we just move the declaration of ...

Contributor (author): usableWorkers is being used in startExecutorsOnWorkers as well.

Contributor: Hm, you're right. In the latest comment I've suggested an alternative that removes this logical coupling.
// Now that we've decided how many cores to allocate on each worker, let's allocate them
var pos = 0
Contributor: not needed?

Contributor: ping, can you remove this variable? The for loop down there already declares it, so you don't need to declare it here again.

Contributor (author): Sure, missed this one somehow.
for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
allocateWorkerResourceToExecutors(app, assignedCores(pos), coresPerExecutor,
usableWorkers(pos))
Contributor: style: ...
}
}
}

/**
 * Allocate a worker's resources to one or more executors.
 * Allocate a worker's resources to one or more executors
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you revert some of these changes? They have nothing to do with the issue (I actually prefer the period being there, but that's a minor point). |
||
 * @param app the info of the application which the executors belong to
 * @param coresToAllocate cores on this worker to be allocated to this application
 * @param assignedCores number of cores on this worker for this application
 * @param coresPerExecutor number of cores per executor
 * @param worker the worker info
 */
private def allocateWorkerResourceToExecutors(
app: ApplicationInfo,
coresToAllocate: Int,
assignedCores: Int,
coresPerExecutor: Int,
worker: WorkerInfo): Unit = {
val memoryPerExecutor = app.desc.memoryPerExecutorMB
val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(coresToAllocate)
var coresLeft = coresToAllocate
while (coresLeft >= coresPerExecutor && worker.memoryFree >= memoryPerExecutor) {

var numExecutors = assignedCores/coresPerExecutor
Contributor: please add spaces around /

Contributor: also, this can be a val

Contributor: can you add a comment here stating your implicit assumptions: ...
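The requested comment text is truncated in this capture. One implicit assumption the integer division does rely on, and that such a comment could state, is sketched below (wording is mine, not the author's):

```scala
// Each worker's assignedCores is built up in coresPerExecutor-sized chunks by
// scheduleExecutorsOnWorkers, so assignedCores is always a multiple of
// coresPerExecutor and the integer division leaves no remainder.
val numExecutors = assignedCores / coresPerExecutor
```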
for (i <- 1 to numExecutors) {
Contributor: actually, doesn't this change the default behavior? Previously a single executor would acquire all cores on a worker. Now there will be N executors on the worker, each occupying exactly one core.

Contributor: we might have to pass in an option of ...

Contributor (author): Assignment of cores per worker is done by the time this method is invoked. coresPerExecutor is 1 by default. If we have N workers and N executors with coresPerExecutor = 1 and we are spreading out, assignedCores = 1 and we would launch 1 executor per worker with 1 core each, as expected.

Contributor: That's not true (see the documentation). If ... Example: we have 3 workers with 8 cores each, ...

Contributor (author): Ah, I see. I thought the default behavior had inherited the bug; I didn't realize it was intentional. Will fix this in a few minutes.
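A hedged sketch of the direction this exchange points to: if coresPerExecutor were passed through as an Option (the name and signature below are illustrative, not the final patch), the unset case could keep its old meaning of one executor grabbing all cores assigned on the worker:

```scala
// Illustrative only: number and size of executors to launch on one worker.
def executorsToLaunch(assignedCores: Int, coresPerExecutor: Option[Int]): Seq[Int] = {
  // If an executor size is given, launch as many full executors as fit in the
  // assigned cores; otherwise launch one executor with everything assigned.
  val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
  val coresPerLaunch = coresPerExecutor.getOrElse(assignedCores)
  Seq.fill(numExecutors)(coresPerLaunch)
}

println(executorsToLaunch(8, None))     // List(8)         -> one executor, all cores
println(executorsToLaunch(8, Some(2)))  // List(2, 2, 2, 2) -> four 2-core executors
```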
val exec = app.addExecutor(worker, coresPerExecutor)
coresLeft -= coresPerExecutor
launchExecutor(worker, exec)
app.state = ApplicationState.RUNNING
}
Can you add to this ScalaDoc what the new return type is? Also, it would be good to add a short paragraph explaining the necessity of allocating N cores at a time, and give an example of where allocating 1 core at a time fails.
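For illustration, a sketch of the kind of doc comment this asks for, built from the return type and the 4-worker example already given in the thread (the wording is mine, not what was eventually committed):

```scala
/**
 * Schedule executors to be launched on the workers and return an array containing,
 * for each usable worker, the number of cores assigned to it.
 *
 * Cores are handed out coresPerExecutor at a time rather than one at a time, because
 * one-core-at-a-time allocation can strand cores: with 4 workers of 16 cores each,
 * spark.cores.max = 48 and spark.executor.cores = 16, spreading out single cores
 * leaves 12 cores on every worker, and no 16-core executor can ever launch even
 * though 48 cores were reserved for the application.
 */
```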