
[SPARK-2687] [yarn] amClient should remove ContainerRequest #1589

Closed
lianhuiwang wants to merge 2 commits into apache:master from lianhuiwang:amclient-remove-request

Conversation

@lianhuiwang
Contributor

As described in https://issues.apache.org/jira/browse/YARN-1902: after receiving allocated containers, if amClient does not remove the corresponding ContainerRequest, the RM will continually allocate containers for the Spark AppMaster.
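To make the reported behavior concrete, here is a toy Python model of the bookkeeping involved. `ToyAMClient`, its method names, and the list-based request table are hypothetical simplifications, not the real Hadoop `AMRMClient` API; the sketch only illustrates the pairing of allocations with request removal that this PR adds.

```python
class ToyAMClient:
    """Hypothetical stand-in for YARN's AMRMClient, illustrating the
    fix: each allocated container should be paired with one
    removeContainerRequest, otherwise the stale ask is re-sent to
    the RM on the next heartbeat and the RM keeps allocating."""

    def __init__(self):
        self.outstanding = []  # pending container requests

    def add_container_request(self, request):
        self.outstanding.append(request)

    def remove_container_request(self, request):
        self.outstanding.remove(request)

    def on_containers_allocated(self, containers):
        # The behavior this PR adds: drop one pending request per
        # allocated container so the next heartbeat does not re-ask
        # for work that has already been fulfilled.
        for _ in containers:
            if self.outstanding:
                self.remove_container_request(self.outstanding[0])
```

With this pairing, an AM that asked for 3 containers and received 1 re-sends an ask of 2, not 3, on its next heartbeat.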

@SparkQA

SparkQA commented Jul 25, 2014

QA tests have started for PR 1589. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17183/consoleFull

@SparkQA

SparkQA commented Jul 25, 2014

QA results for PR 1589:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17183/consoleFull

@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 1589. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17354/consoleFull

@lianhuiwang
Contributor Author

@witgo @andrewor14 please take a look at it. thanks.

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1589:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17354/consoleFull

@tgravescs
Contributor

@lianhuiwang could you take a look at the latest version of the code to see if we still need this? The description on the YARN jira isn't really clear to me, and I haven't had a chance to look at the patch. If we do still need it, can you please give more details on when you would hit it?

Note that we only ask once, upfront, for the number of containers we need, and then we only add in ones that are missing (i.e. allocated and then perhaps died).

@tgravescs
Contributor

@lianhuiwang can you answer my last comment? If not can we close this?

@lianhuiwang
Contributor Author

@tgravescs I have taken a look at the latest version and confirmed the problem still exists. When amClient receives containers from YARN's RM, it needs to call removeContainerRequest. If it does not, then when a container fails, amClient will report numExecutorsRunning+1 ResourceRequests to YARN. Does that make sense? I think I should submit a new PR based on the latest code, because this PR's code is out of date.

@tgravescs
Contributor

I think I understand what you are saying, but YARN handles removing a container request once it has been allocated.

Here is a scenario:

  • spark requests 2 containers (yarn request total = 2)
  • yarn allocates one and gives it to spark (yarn request total = 1)
  • spark task fails, so +1 to requests (yarn request total = 2)

If this is not the scenario please clarify.

@lianhuiwang
Contributor Author

Yes, the scenario you describe is one situation. The other is:

  • spark requests 2 containers (yarn request total = 2)
  • yarn allocates 2 and gives them to spark (yarn request total = 0)
  • spark task fails, so +1 to requests (yarn request total = 1)

But in my environment the last step goes wrong: the number of requests is incorrect.

@tgravescs
Contributor

Ok, so are you saying spark isn't properly adding 1 back in when an executor fails? Have you verified the number of requests shown on the YARN side? I don't see how removing requests is going to help if the number of requests on the yarn side is already 0, so I just want to understand the scenario.

Do you have any steps to reproduce?

@lianhuiwang
Contributor Author

I think the RM will allocate more than one container to spark's AM when an executor fails.
Here is a scenario:

  1. spark requests 3 containers (AM & RM request total = 3)
  2. RM allocates one and gives it to spark (AM request total = 3, but RM request total = 2)
  3. a spark task fails, so the AM adds +1 to requests (AM request total = 4)
  4. the RM receives the AM's request and allocates one to spark (RM request total = 3 (4-1)), but the actual need is 2
  5. so next the RM will allocate three more to spark, and the AppMaster's log will show:
     Ignoring container %s at host %s, since we already have the required number of containers for it.

In this scenario, when the AM receives an allocated container, the AM does not remove one request, so the RM allocates more containers to the AM.
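The five steps above can be sketched as a small simulation. This is a hedged, hypothetical model of the allocate protocol (assuming each heartbeat the AM re-sends its full outstanding ask, which replaces the RM's pending count, and that cluster capacity is not the bottleneck); the numbers, not the API, are the point.

```python
def simulate(remove_on_allocate):
    """Toy model of the AM/RM bookkeeping in the scenario above."""
    target = 3   # executors spark wants running
    am_ask = 3   # AM's local ContainerRequest table (step 1)
    running = 0
    ignored = 0

    # Heartbeat 1: the RM grants a single container (step 2).
    granted = 1
    running += granted
    if remove_on_allocate:
        am_ask -= granted  # pair the allocation with a removeContainerRequest

    # One executor fails; the AM asks for a replacement (step 3).
    running -= 1
    am_ask += 1

    # Heartbeat 2: the AM re-sends its ask and the RM grants it all
    # (steps 4-5); extras beyond `target` are ignored by the AM.
    granted = am_ask
    accepted = min(granted, target - running)
    ignored += granted - accepted
    running += accepted
    if remove_on_allocate:
        am_ask -= granted

    return running, ignored
```

Without the removal, the AM ends up ignoring a surplus container (the "Ignoring container ..." log line), and since the stale ask never drains, later heartbeats would repeat the over-allocation.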

@tgravescs
Contributor

Ah ok, I understand now. Thanks for the explanation.

Yeah, if you could upmerge this to the latest that would be great. Ideally, instead of just removing the first request on the list, it would check whether the allocation fulfilled one of its host/rack-level requests and remove that one first. But then again, the preferred-host stuff is broken right now, so that is kind of hard to test.
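The "remove the matching request first" idea can be sketched as follows. The `(locality, name)` tuples are a hypothetical stand-in for real ContainerRequest objects, and the host/rack/any fallback order mirrors the suggestion above rather than any actual Spark code.

```python
def pick_request_to_remove(requests, host, rack):
    """Given the pending requests and the host/rack a container landed
    on, prefer removing a request that named that host, then one that
    named its rack, then fall back to any remaining request."""
    for candidate in (("host", host), ("rack", rack)):
        if candidate in requests:
            requests.remove(candidate)
            return candidate
    if requests:
        return requests.pop(0)  # fall back to an ANY-locality request
    return None
```

The fallback order matters: removing the most specific matching request first keeps locality preferences alive for the containers still outstanding.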

@lianhuiwang
Contributor Author

Yes, I created a new PR, #3245, against the latest code.

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023