
depsolver concurrency limitations #36

Open
matthiasr opened this issue Dec 4, 2014 · 5 comments
Labels
Status: To be prioritized Indicates that product needs to prioritize this issue. Triage: Confirmed Indicates an issue has been confirmed as described.

Comments

@matthiasr

(cross-filing as requested in chef-boneyard/chef-provisioning#112)

We're using Open Source Chef Server 11.1.5 (but this has really been the case ever since Chef 11), and there seems to be a hard limit on the number of concurrent cookbook dependency resolutions.

We are running chef-client at regular intervals on all our hosts, but occasionally we want to kick them off immediately on one or more hosts, so we pdsh over them. This means that N (where N is the pdsh fanout) chef-client runs start at almost exactly the same time, and since they look up their cookbooks very early they nearly simultaneously hit the depsolvers.

Depending on general chef-server load and the number of cookbooks / cookbook versions, we used to be able to do this to no more than 5-10 nodes at a time, in line with the default number of depsolvers (5). I have now bumped erchef['depsolver_worker_count'] to 20 and can do this to ~30 nodes at once without hitting errors, but I see errors at 40. Since the runs are probably not perfectly in sync, it looks like we are just hitting the new, higher limit again.
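
(For anyone else hitting this, here is roughly the change I made, as a sketch only: the config file location depends on how your Chef Server is installed, and 20 is just the value that happened to work for our setup.)

```ruby
# /etc/chef-server/chef-server.rb on Open Source Chef 11
# (/etc/opscode/chef-server.rb on Chef Server 12); adjust for your install.
#
# Each depsolver worker handles one dependency resolution at a time, so this
# is effectively the number of chef-client runs the server can depsolve
# for simultaneously.
erchef['depsolver_worker_count'] = 20
```

followed by `chef-server-ctl reconfigure` to apply the change.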

To sum up, it looks like each depsolver worker can only do one dependency resolution at a time, and it is not possible to have more than depsolver_worker_count simultaneous chef-client runs. Is that so, and is that by design or a known issue, or (quoting @jkeiser) "bad mojo"?

@sdelano
Contributor

sdelano commented Dec 4, 2014

You've summarized the issue correctly. This bottleneck was introduced when we switched back from the pure Erlang implementation of the dependency solver to the Ruby/Gecode-based solver that shipped with Chef 10. The Chef Server keeps a pool of depsolver processes running and waiting for requests; when that pool is exhausted, the API returns 503 Service Unavailable. We don't recommend that the pool size exceed the number of CPUs on your Chef Server, since dependency solving is a compute-constrained workload.

Chef Server 12 ships with a new version of the pooler library (the library that handles said pooling) that allows requests to queue up when no workers are available. This hasn't been exposed for the dependency solver yet, but it wouldn't take a lot of work to add that ability.
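
To make the two behaviours concrete, here is a rough Ruby sketch of the idea. The real pool lives in the Erlang pooler library inside erchef, so the names and return values below are illustrative only, not the actual API:

```ruby
require 'timeout'

# Toy model of the depsolver worker pool. A fixed number of workers are started
# up front; each one can serve exactly one dependency resolution at a time.
class DepsolverPool
  def initialize(size)
    @workers = Queue.new
    size.times { |i| @workers << "depsolver-#{i}" }
  end

  # Current behaviour: if every worker is busy, fail the request immediately.
  # The API layer turns this into a 503 Service Unavailable.
  def take_worker
    @workers.pop(true)          # non-blocking pop raises ThreadError when empty
  rescue ThreadError
    :error_no_members           # caller maps this to HTTP 503
  end

  # Proposed behaviour: let the request wait in line for a worker, up to a deadline.
  def take_worker_with_queueing(timeout_seconds)
    Timeout.timeout(timeout_seconds) { @workers.pop }  # blocks until a worker frees up
  rescue Timeout::Error
    :error_no_members
  end

  def return_worker(worker)
    @workers << worker
  end
end

pool = DepsolverPool.new(5)
busy = Array.new(5) { pool.take_worker }  # all five workers checked out
pool.take_worker                          # => :error_no_members (today's 503)
pool.return_worker(busy.pop)
pool.take_worker                          # => "depsolver-4", a worker is free again
```

With queueing, a burst of simultaneous chef-client runs would wait briefly instead of failing outright, at the cost of holding requests open while the CPU-bound workers catch up.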

@sdelano
Contributor

sdelano commented May 13, 2015

Coming back to this issue again, the conclusion still stands that we should add queuing to the depsolver workers. I'll add the accepted-minor tag to this card, indicating that we should do this and that it's a minor issue.

We should also upgrade to the newest dep_selector gem if we haven't done so already. @danielsdeleo made some great improvements there to optimize the Ruby portion of the code so that it doesn't have to do as much work building up redundant objects.

@sdelano sdelano added this to the accepted-minor milestone May 13, 2015
@heaven

heaven commented Jul 11, 2016

Hi, why is it a minor issue? We have 25 nodes and they all constantly fail to run chef-client.

@matthiasr
Author

You can work around it by raising the number of dependency solvers (erchef['depsolver_worker_count'], see above). The default is very conservative.


@heaven

heaven commented Jul 11, 2016

Yeah, thank you, looking into that now.

@PrajaktaPurohit PrajaktaPurohit added the Status: Untriaged An issue that has yet to be triaged. label Oct 11, 2019
@PrajaktaPurohit PrajaktaPurohit added Status: To be prioritized Indicates that product needs to prioritize this issue. Triage: Confirmed Indicates an issue has been confirmed as described. and removed Status: Untriaged An issue that has yet to be triaged. labels Jul 24, 2020
@stevendanna stevendanna removed this from the accepted-minor milestone Sep 29, 2020