
[WIP] Limit the number of hosts simultaneously provisioned (BMO only version) #730

Closed

andfasano wants to merge 1 commit into metal3-io:master from andfasano:limit-hosts-local

Conversation

@andfasano
Member

While testing #725 with a high number of nodes I found it wasn't working as expected, so I developed an alternative implementation that takes a different approach with respect to the one originally discussed.

Adding some notes here to provide as many additional details as possible.


The problem encountered was an immediate surge when simultaneously provisioning a higher number of nodes, resulting in a breach of the configured threshold: the BMO reconcile loop processed the provisioning requests very rapidly, well before the Ironic subsystem was able to update its status and communicate it back. Things got worse as the BMO concurrency level was increased.

So, this PR sidesteps that limitation by observing the BMO states directly instead of the Ironic ones, and by preventing a host from transitioning into a provisionable state (Inspecting or Provisioning) if there isn't a free slot available. This results in a simpler and more reliable check.

A very small critical section has been introduced to cope with the BMO concurrency level, so that a synchronized counter (a set) tracking the hosts currently being provisioned is handled properly.

When a host cannot transition to a provisionable state due to the lack of free slots, the action is rescheduled with an exponential backoff (reusing the existing ErrorCount mechanism). A new ErrorType has been introduced to keep track of this situation. The current host state is left unchanged.

A preliminary check is applied when processing a host that has already been delayed: if there are still no free slots, the request is immediately rescheduled, to avoid putting unnecessary additional stress on the Ironic subsystem.
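For context, here is a rough sketch of the gate described above, pieced together from the diff excerpts quoted later in this thread; it is an illustrative reconstruction, not the exact PR code, and it assumes the "sync" package, the metal3.io/v1alpha1 API package, and a configured maxProvisioningHosts limit:

// Illustrative reconstruction of the slot check; the exact PR diff may differ.
func (r *BareMetalHostReconciler) checkProvisioningHost(host metal3v1alpha1.BareMetalHost, state metal3v1alpha1.ProvisioningState) bool {
    r.mu.Lock()
    defer r.mu.Unlock()

    switch state {
    case metal3v1alpha1.StateInspecting, metal3v1alpha1.StateProvisioning:
        if _, ok := r.currentProvisioningHosts[host.Name]; ok {
            // The host already holds a slot; let it continue.
            return true
        }
        if len(r.currentProvisioningHosts) >= maxProvisioningHosts {
            // No free slot: the caller records TooManyHostsError and
            // reschedules with exponential backoff.
            return false
        }
        r.currentProvisioningHosts[host.Name] = struct{}{}
    default:
        // Any other state frees the slot held by this host.
        delete(r.currentProvisioningHosts, host.Name)
    }
    return true
}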

@metal3-io-bot metal3-io-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 26, 2020
@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andfasano
To complete the pull request process, please assign hardys
You can assign the PR to them by writing /assign @hardys in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 26, 2020
Member

@zaneb zaneb left a comment


It makes me nervous that we are relying on an in-memory cache of states as a source of truth with no mechanism to ensure it is in sync after startup.

This is also not on the path toward likely future enhancements like automatically scaling with the number of conductors available. TBH I'd prefer an approach with this error management but the state being read from ironic more like in #725. The target_provisioning_state should be updated more or less immediately, so if we add that into consideration I wouldn't expect any major capacity overflow problems.

Comment thread apis/metal3.io/v1alpha1/baremetalhost_types.go
Comment thread controllers/metal3.io/baremetalhost_controller.go
ProvisionerFactory provisioner.Factory

currentProvisioningHosts map[string]struct{}
mu sync.Mutex
Member


It would be preferable to have a name that describes what the mutex is protecting.

Member Author


Ok. I can move them into a separate struct to make it even clearer, and it could help a bit with testing.
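A hypothetical shape for that refactor (names are illustrative, not from the PR): keeping the mutex next to the set it guards makes the intent explicit and lets the limiter be unit-tested on its own.

// Hypothetical sketch of the proposed refactor; names are illustrative.
type provisioningLimiter struct {
    mu       sync.Mutex // guards inFlight
    inFlight map[string]struct{}
    limit    int
}

func (l *provisioningLimiter) tryAcquire(name string) bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    if _, ok := l.inFlight[name]; ok {
        return true // already counted
    }
    if len(l.inFlight) >= l.limit {
        return false // no free slot
    }
    l.inFlight[name] = struct{}{}
    return true
}

func (l *provisioningLimiter) release(name string) {
    l.mu.Lock()
    defer l.mu.Unlock()
    delete(l.inFlight, name)
}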

if !hsm.updateHostStateFrom(initialState, info) {
actionRes = recordActionFailure(info, metal3v1alpha1.TooManyHostsError, "Delayed, too many hosts")
}
}()
Member


At one stage @dhellmann had a proposal for a generic way to attach actions to particular state transitions, but I don't remember the context...

Member


I suggested adding a transition(newState) method to the state machine to use instead of assigning the new state directly. By doing that, all of the transitions would pass through one place and we could have a switch statement for handling before & after calls consistently.
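One possible shape for that helper, sketched against the state machine in this PR; the hook bodies are placeholders and the field names (info.host.Status.Provisioning.State, reconcileInfo, hostStateMachine) are assumed from the excerpts in this thread:

// Hypothetical sketch of a single transition point with before/after hooks.
func (hsm *hostStateMachine) transition(info *reconcileInfo, newState metal3v1alpha1.ProvisioningState) {
    oldState := info.host.Status.Provisioning.State

    // "Before" hooks, keyed on the state being entered.
    switch newState {
    case metal3v1alpha1.StateInspecting, metal3v1alpha1.StateProvisioning:
        // e.g. reserve a provisioning slot here
    }

    info.host.Status.Provisioning.State = newState

    // "After" hooks, keyed on the state being left.
    switch oldState {
    case metal3v1alpha1.StateInspecting, metal3v1alpha1.StateProvisioning:
        // e.g. release the slot when leaving these states
    }
}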


if !hsm.Reconciler.checkProvisioningHost(*info.host, hsm.NextState) {
return false
}
Member


It's a little strange that we will be stuck in e.g. the Ready state with an error, rather than the Provisioning state. This does make things simpler though.

defer r.mu.Unlock()

switch state {
case metal3v1alpha1.StateInspecting, metal3v1alpha1.StateProvisioning:
Member


What about deprovisioning? Isn't that just as hard on ironic-conductor? I don't know that we ever want to block deprovisioning but we probably don't want to start provisioning a bunch of stuff when we already have a large amount of deprovisioning in-flight.

Member Author


That's a good question; in reality deprovisioning wasn't part of the initial proposal, which focused only on the provisioning phase. Maybe some Ironic expert could comment better (@dtantsur?); anyhow, the deleting part appeared not to be as hard as the provisioning part.

if len(r.currentProvisioningHosts) >= maxProvisioningHosts {
return false
}
r.currentProvisioningHosts[host.Name] = struct{}{}
Member


This assumes that we will successfully write the state change. If writing that fails we will end up back here checking for capacity again against a map that already contains the host.

The good news is that this is a map not a list so we can only store one copy, and a single host repeatedly failing will not fill up our quota and block everything. Nevertheless, this exposes us to systemic risk: e.g. if we try to provision 20 hosts and then the k8s API becomes briefly unavailable for some (presumably not unrelated) reason, the bmo is now deadlocked and cannot provision anything ever again until it is restarted.

We have a mechanism in the info struct for deferring function calls until after the write succeeds, and I think we need to use that here.

Member Author


Nice catch, I'll check how to use that
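For reference, the deferred-call mechanism @zaneb mentions could be used roughly as below; this is a sketch only, the postSaveCallbacks field name is assumed, and it presumes the reconcileInfo is passed down into the slot check so the map insertion can move into the callback and only happen once the status write has succeeded:

if len(r.currentProvisioningHosts) >= maxProvisioningHosts {
    return false
}
// Reserve the slot only after the state change is persisted; if the write
// fails, the set is left untouched and the capacity check stays accurate.
info.postSaveCallbacks = append(info.postSaveCallbacks, func() {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.currentProvisioningHosts[host.Name] = struct{}{}
})
return true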

// SetupWithManager registers the reconciler to be run by the manager
func (r *BareMetalHostReconciler) SetupWithManager(mgr ctrl.Manager) error {

r.syncProvisioningHosts(mgr)
Member


I believe there is a race here; we haven't yet started watching resources or obtained the leader lock, so Hosts can be modified after we build our initial state without us seeing the changes (potentially ever, since our updates are edge-triggered).

Member Author


Is there any other initialization point, then, that could be used (before reconciling)? Otherwise, the other approach could be to keep the set in sync at the beginning of every reconcile loop, in line with @dhellmann's suggestion.


// Resync the set at startup
func (r *BareMetalHostReconciler) syncProvisioningHosts(mgr ctrl.Manager) {
hosts := &metal3v1alpha1.BareMetalHostList{}
Member


The controller runtime uses a caching client, which means when we ask for resources that we're subscribed to we look at the cache and get a list back quickly. If we eliminate the mutex management and just write a function that returns a number, it should be plenty fast enough to call every time we start to reconcile.

Member Author


Got your point and I will give it a try, but I'm not sure it would work, since the cache is not updated immediately, thus still risking a surge (whereas with the synchronized set every thread always has an effectively up-to-date view). Anyhow, I'll check.

Member


This PR uses the same cache to initialize the data structure, right?

Member Author


Not really, I use the APIReader() returned by the manager to access the API server directly.
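As a point of comparison, the counting-function approach @dhellmann describes could look roughly like this; it is a hypothetical helper (not from the PR) that lists hosts through whatever controller-runtime client is available, cached or not, and assumes the metal3.io/v1alpha1 API package and sigs.k8s.io/controller-runtime/pkg/client are imported:

// Hypothetical helper: count hosts currently holding a provisioning slot by
// listing from the controller-runtime client at the start of each reconcile.
func countProvisioningHosts(ctx context.Context, c client.Client) (int, error) {
    hosts := &metal3v1alpha1.BareMetalHostList{}
    if err := c.List(ctx, hosts); err != nil {
        return 0, err
    }
    n := 0
    for _, h := range hosts.Items {
        switch h.Status.Provisioning.State {
        case metal3v1alpha1.StateInspecting, metal3v1alpha1.StateProvisioning:
            n++
        }
    }
    return n, nil
}

As noted in the reply above, the trade-off is that a cached client may lag behind the most recent writes, so a brief over-admission would still be possible with this approach.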


@andfasano
Member Author

> It makes me nervous that we are relying on an in-memory cache of states as a source of truth with no mechanism to ensure it is in sync after startup.
>
> This is also not on the path toward likely future enhancements like automatically scaling with the number of conductors available. TBH I'd prefer an approach with this error management but the state being read from ironic more like in #725. The target_provisioning_state should be updated more or less immediately, so if we add that into consideration I wouldn't expect any major capacity overflow problems.

I understand your concerns and the idea of using Ironic as much as possible as the only source of truth. The problem of the requests already on the wire still remains, for which Ironic couldn't be used effectively, with the risk of a complete overflow (whereas the cache worked properly). I'll try to work out the best of the two approaches, in order to guarantee that the configured limit is respected and to address the comments raised so far.

@andfasano
Member Author

Feedback addressed in #725

@andfasano andfasano closed this Dec 4, 2020