[SPARK-21834] Incorrect executor request in case of dynamic allocation #19048
Conversation
120f383 to e30bbac (force-push)
Test build #81110 has finished for PR 19048 at commit
cc @markhamstra, @sameeragarwal, @rxin, @vanzin
Test build #81116 has finished for PR 19048 at commit
Jenkins retest this please.
I'm not sure I understand why this is a problem. What is the undesired behavior that happens because of this? That's not explained in either the PR or the bug. The way I understand the code, yes, there are potentially redundant calls to update the target number of executors; but then,
Looking at the scheduler and the dynamic executor allocator code, this is my understanding; correct me if I am wrong. Let's say the dynamic executor allocator is ramping down the number of executors. There are 10 executors running and it needs only 4. Then
Have you actually observed that behavior? The way I understand the code, both
Is that not what you're seeing?
Test build #81132 has finished for PR 19048 at commit
That's not really true.
I think I'm starting to understand what you're getting at, but I still don't see why this has anything to do with the CGSB. What I understand from your comment is that the EAM may reduce its target and at the same time try to kill idle executors, basically doubling down and killing too many executors in the process. Isn't this what this piece of code is trying to prevent? If killing an idle executor would bring the number below the current target, then it won't be killed. That's a pretty recent fix, so maybe you haven't seen it (SPARK-21656).
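The guard referenced here (SPARK-21656) can be sketched as follows. This is an illustrative model, not the actual Spark code; all names are made up for the example:

```scala
// Illustrative model of the SPARK-21656-style guard: idle executors are
// only eligible for killing while the running count stays at or above the
// current target. All names here are hypothetical.
object IdleKillGuard {
  def selectKillable(runningCount: Int,
                     idleExecutors: Seq[String],
                     targetNumExecutors: Int): Seq[String] = {
    // Headroom is how many executors we can lose without going below target.
    val headroom = runningCount - targetNumExecutors
    if (headroom <= 0) Seq.empty else idleExecutors.take(headroom)
  }
}
```

With 5 running and a target of 5, nothing is killable; with a target of 3, both idle executors can go.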
To be clear, there is no issue on the EAM side. Consider the following situation -
Why? Because of the idle timeout? If that's your point, then the change I referenced above should avoid that.
How? The scheduler (a.k.a. CGSB) does not kill executors on its own. It has to be told to do so in some way.
If you can actually provide logs that show what you're trying to say, that would probably be easier.
Yes, because of idle timeout. Note that the
Because the EAM asks it to kill 2 of them. But please note that while killing 2 executors the EAM did not reduce its target to 3; it is still 5. But since the scheduler keeps its own internal target, it reduces its target from 5 to 3, and the EAM and the scheduler get out of sync.
Actually, I added a lot of debug logging to find this issue, so the log is probably not going to be of any help to you.
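The drift described in this comment can be reproduced with a toy model of the two targets. These classes are simplifications invented for the example, not the real EAM/CGSB:

```scala
// Toy model of the desync: the EAM keeps its own target, while the CGSB
// decrements a separate internal target on every kill request, even when
// the EAM intentionally left its target unchanged.
class ToyEam(var target: Int)

class ToyCgsb(var target: Int) {
  def killExecutors(ids: Seq[String]): Unit = {
    target -= ids.size // scheduler assumes the caller wants a lower total
  }
}

object DesyncDemo {
  def run(): (Int, Int) = {
    val eam = new ToyEam(5)
    val cgsb = new ToyCgsb(5)
    // EAM kills 2 idle executors but keeps its own target at 5.
    cgsb.killExecutors(Seq("exec-1", "exec-2"))
    (eam.target, cgsb.target) // the two components now disagree: (5, 3)
  }
}
```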
I think I see what you're saying. But I still think it's the fault of the EAM.
And I think the problem here is that the EAM should not be telling the CGSB that the target is 5 when 5 is actually the "minimum" the EAM wants, but there may be more executors running that haven't timed out yet. Basically, this code in the EAM:

Should be changed to account for the current number of executors, so that the EAM doesn't tell the CGSB that it wants fewer executors than currently exist. Because even if the EAM may not currently "need" the extra executors, it hasn't timed them out, so they need to be counted towards the "number of executors that I expect to be active". Your solution (the new

Now if the EAM tells the CGSB the correct amount of executors it expects to be active (which means something like
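One way to express the suggestion in this comment: when reporting downstream, the EAM should never claim it wants fewer executors than are currently alive and not yet timed out. A sketch, using a hypothetical helper rather than the actual EAM code:

```scala
// Hypothetical helper: the number the EAM communicates to the CGSB is the
// max of its computed target and the executors still considered active
// (i.e. not yet removed by the idle timeout).
object EamReport {
  def executorsToReport(target: Int, activeNotTimedOut: Int): Int =
    math.max(target, activeNotTimedOut)
}
```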
Actually if you look at the api,
I agree with you on this. Maybe it would be cleaner if we provided a new API like this -
I think the main thing that bothers me is that adding anything to the API makes all this code even more complicated and confusing than it already is. Having two (3 if you count the YARN allocator) places track all this state is bound to lead to these issues. Optimally only the EAM would keep track of these things; the CGSB shouldn't really be dealing with executor allocation and de-allocation, just with managing the existing executors that connect to it. But fixing things like that is probably a much larger change (the words "hornets' nest" come to mind).

Barring that, I think that we should make the change that leads to the correct behavior without making the internal interface more complicated than it needs to be. If changing the semantics of

Or maybe you can reach the same thing through other means. For example, maybe if you get rid of the

If none of those work, then we can talk about adding new things.
|
At a high level I agree that keeping the state in 3 places is creating a mess, but changing that would require a big refactoring, which is probably outside the scope of this change.
That might work. But there is a race condition in doing that. In order to do that, we need to have a
It is possible that the total executors value is changed by another thread between

Instead of doing that, how about we add a new API to
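The race being described is a read-then-act gap: if the current total is read and the kill issued as two separate, unsynchronized steps, another thread can change the total in between. A minimal sketch of the hazard and a lock-based alternative, illustrative only:

```scala
// Illustrative bookkeeping with both a racy and a synchronized update path.
class TargetBookkeeping(initial: Int) {
  private var total = initial
  private val lock = new Object

  // Racy: another thread may call requestTotal() between the read and write.
  def killRacy(n: Int): Unit = {
    val snapshot = total // read
    total = snapshot - n // write based on a possibly stale read
  }

  // Safe: the read-modify-write happens atomically under one lock.
  def killAtomic(n: Int): Unit = lock.synchronized { total -= n }

  def requestTotal(n: Int): Unit = lock.synchronized { total = n }

  def get: Int = lock.synchronized { total }
}
```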
Well, that's adding an API that does the same thing as existing APIs but a little bit differently. In my view that adds to the problem instead of fixing it. Now every caller into the

For example: the
(Or it can call |
Okay, that seems like a reasonable hack. The only downside, as you mentioned, is the extra trip to the CM and adding more confusion to the usage of
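The hack agreed on here can be sketched as: kill, then immediately re-send the EAM's unchanged target so the scheduler's internal count snaps back, at the cost of one extra round trip to the cluster manager. A toy model with invented names:

```scala
// Toy scheduler whose internal target drops on every kill request.
class ToyScheduler(var target: Int) {
  def killExecutors(ids: Seq[String]): Unit = target -= ids.size
  def requestTotalExecutors(n: Int): Unit = target = n // extra CM round trip
}

object ResyncHack {
  // Kill idle executors, then restore the scheduler's target to the EAM's.
  def killAndResync(sched: ToyScheduler, ids: Seq[String], eamTarget: Int): Int = {
    sched.killExecutors(ids)             // scheduler drops its target
    sched.requestTotalExecutors(eamTarget) // re-sync with the EAM's target
    sched.target
  }
}
```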
One thing I don't understand clearly is why we should update the
@jiangxb1987 - I agree with you. I do not have the context or history to comment on that. Unfortunately, the API has been designed that way, and bookkeeping of the target number of executors is done by the CGSB. Changing the existing scheduler behavior would require a bigger change, possibly breaking some existing API behavior, which I think is out of the scope of this PR.
297059f to 22c3596 (force-push)
Test build #81193 has finished for PR 19048 at commit
Test build #81194 has finished for PR 19048 at commit
22c3596 to 6cc5fab (force-push)
Test build #81195 has finished for PR 19048 at commit
Yea, I agree the change made in this PR looks good for your issue; I'm just suggesting maybe we could refactor the code further, perhaps as a follow-up.
Not sure why the test failed. Maybe the build is unstable? cc @vanzin
SparkR tests have been super flaky lately. retest this please
retest this please
Not sure why the PRB is not picking up my requests. @sitalkedia, can you close and re-open the PR to see if that does it? (The change looks fine; it just would be nice to get a clean test run.)
jenkins retest this please.
Created #19081. |
What changes were proposed in this pull request?
The killExecutor API currently does not allow killing an executor without updating the total number of executors needed. When dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler reduces the total number of executors needed (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), which is incorrect because the allocator already takes care of setting the required number of executors itself.
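The behavior change described above can be sketched as a flag on the kill path, so that the scheduler only lowers its requested total when the caller asks it to. This is a simplified model; the parameter name is illustrative, not necessarily the exact signature in the patch:

```scala
// Simplified model: killExecutors only adjusts the scheduler's requested
// total when the caller (e.g. the dynamic allocator) opts in.
class ToyBackend(var requestedTotalExecutors: Int) {
  def killExecutors(ids: Seq[String], adjustTargetNumExecutors: Boolean): Unit = {
    if (adjustTargetNumExecutors) {
      requestedTotalExecutors -= ids.size
    }
    // ... the actual kill of `ids` would happen here ...
  }
}
```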
How was this patch tested?
Ran a job on the cluster and made sure the executor request is correct.