[SPARK-2687] [yarn] amClient should remove ContainerRequest #1589
lianhuiwang wants to merge 2 commits into apache:master from
|
QA tests have started for PR 1589. This patch merges cleanly. |
|
QA results for PR 1589: |
|
QA tests have started for PR 1589. This patch merges cleanly. |
|
@witgo @andrewor14 please take a look. Thanks. |
|
QA results for PR 1589: |
|
@lianhuiwang could you take a look at the latest version of the code to see if we still need this? The description on the YARN JIRA isn't very clear to me and I haven't had a chance to look at the patch. If we do still need it, can you please give more details on when you would hit it? Note that we only ask once, upfront, for the number of containers we need, and after that we only add requests for the ones that are missing (i.e. allocated and then perhaps died). |
|
@lianhuiwang can you answer my last comment? If not can we close this? |
|
@tgravescs I have taken a look at the latest version and confirmed that the problem still exists. When amClient receives containers from YARN's RM, it needs to call removeContainerRequest. If amClient does not call removeContainerRequest, then when a container fails, amClient will report numExecutorsRunning+1 ResourceRequests to YARN. Does that make sense? I think I should submit a new PR based on the latest code, because this PR's code is out of date. |
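To make the bookkeeping above concrete, here is a toy Python model of the client-side request table. It is only a sketch under stated assumptions: `ToyAMClient`, `simulate`, and the request names are hypothetical and do not use the real YARN client API. It assumes, as described in this thread, that the client reports its full outstanding request count to the RM rather than a delta.

```python
class ToyAMClient:
    """Hypothetical stand-in for the client-side request table (not the real amClient)."""

    def __init__(self):
        # Outstanding container requests as tracked by the client.
        self.pending = []

    def add_container_request(self, req):
        self.pending.append(req)

    def remove_container_request(self, req):
        self.pending.remove(req)

    def ask(self):
        # Assumption: the client sends its whole outstanding count, not a delta.
        return len(self.pending)


def simulate(num_executors, remove_on_allocation):
    """Return the request count reported to the RM after one executor fails."""
    client = ToyAMClient()
    requests = [f"req-{i}" for i in range(num_executors)]
    for req in requests:
        client.add_container_request(req)

    # The RM allocates every requested container.
    if remove_on_allocation:
        for req in requests:
            client.remove_container_request(req)

    # One executor dies, so a single replacement is requested.
    client.add_container_request("req-replacement")
    return client.ask()


print(simulate(3, remove_on_allocation=False))  # 4, i.e. numExecutorsRunning + 1
print(simulate(3, remove_on_allocation=True))   # 1, only the replacement
```

In this model, the stale entries are what inflate the count after a failure; removing a request for each allocated container keeps the reported number equal to what is actually missing.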
|
I think I understand what you are saying, but YARN handles removing container requests once they have been allocated. Here is a scenario:
If this is not the scenario, please clarify. |
|
Yes, the scenario you described is one of the situations. The other is: |
|
OK, so are you saying Spark isn't properly adding one back in when an executor fails? Have you verified on the YARN side the number of requests it shows? I don't see how removing requests is going to help if the number of requests on the YARN side is already 0, so I just want to understand the scenario. Do you have any steps to reproduce? |
|
I think the RM will allocate more than one container to Spark's AM when an executor fails.
|
|
Ah OK, I understand now. Thanks for the explanation. Yeah, if you could upmerge this to the latest, that would be great. Ideally, instead of just removing the first request on the list, it would check whether the container fulfilled one of its host/rack-level requests and remove that one first. But then again, the preferred host stuff is broken right now, so it's kind of hard to test. |
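The removal order suggested above could be sketched roughly as follows. This is a toy Python illustration, not Spark or YARN code: `ToyRequest` and `pick_request_to_remove` are hypothetical names, and the host, rack, then first-on-the-list fallback order is an assumption drawn from the comment.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyRequest:
    """Hypothetical pending request with optional locality preferences."""
    name: str
    hosts: List[str] = field(default_factory=list)
    racks: List[str] = field(default_factory=list)


def pick_request_to_remove(pending: List[ToyRequest], host: str,
                           rack: str) -> Optional[ToyRequest]:
    """Choose which pending request an allocated container satisfies."""
    # 1. Prefer a request that asked for this exact host.
    for req in pending:
        if host in req.hosts:
            return req
    # 2. Otherwise, a request that asked for this rack.
    for req in pending:
        if rack in req.racks:
            return req
    # 3. Otherwise fall back to the first request on the list.
    return pending[0] if pending else None


pending = [
    ToyRequest("any"),
    ToyRequest("rack-level", racks=["rack1"]),
    ToyRequest("host-level", hosts=["host1"]),
]
print(pick_request_to_remove(pending, "host1", "rack1").name)  # host-level
print(pick_request_to_remove(pending, "host9", "rack1").name)  # rack-level
print(pick_request_to_remove(pending, "host9", "rack9").name)  # any
```

Checking locality first keeps a container that landed on an arbitrary node from cancelling a request that still wants a specific host.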
|
Yes, I created a new PR, #3245, for the latest code. |
As described in https://issues.apache.org/jira/browse/YARN-1902, if amClient does not remove the ContainerRequest after receiving allocated containers, the RM will continue to allocate containers for the Spark AppMaster.