[WIP] Limit the number of hosts simultaneously provisioned (BMO only version) #730
andfasano wants to merge 1 commit into metal3-io:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: andfasano. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
Force-pushed from 61b8a20 to 26ca37a
zaneb left a comment:
It makes me nervous that we are relying on an in-memory cache of states as a source of truth with no mechanism to ensure it is in sync after startup.
This is also not on the path toward likely future enhancements, like automatically scaling with the number of conductors available. TBH I'd prefer an approach with this error management but with the state being read from Ironic, more like in #725. The target_provisioning_state should be updated more or less immediately, so if we take that into consideration I wouldn't expect any major capacity overflow problems.
```go
ProvisionerFactory provisioner.Factory

currentProvisioningHosts map[string]struct{}
mu                       sync.Mutex
```
zaneb:
It would be preferable to have a name that describes what the mutex is protecting.
andfasano:
OK. I can move them into a separate struct to make it even clearer, and it could help a little bit in testing.
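The refactoring discussed above could look something like the following sketch: the mutex lives in the same struct as the set it protects, so the relationship is explicit and the type can be tested in isolation. All names here (`provisioningTracker`, `Add`, `Remove`) are hypothetical, not the actual BMO code.

```go
package main

import (
	"fmt"
	"sync"
)

// provisioningTracker groups the mutex with the set it guards,
// making explicit what the lock protects (illustrative names only).
type provisioningTracker struct {
	mu    sync.Mutex
	hosts map[string]struct{}
}

func newProvisioningTracker() *provisioningTracker {
	return &provisioningTracker{hosts: make(map[string]struct{})}
}

// Add reserves a provisioning slot for the host if capacity allows.
func (t *provisioningTracker) Add(name string, max int) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if len(t.hosts) >= max {
		return false
	}
	t.hosts[name] = struct{}{}
	return true
}

// Remove releases the host's slot once provisioning finishes.
func (t *provisioningTracker) Remove(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.hosts, name)
}

func main() {
	tr := newProvisioningTracker()
	fmt.Println(tr.Add("host-0", 1)) // first host gets the only slot
	fmt.Println(tr.Add("host-1", 1)) // no slot free
	tr.Remove("host-0")
	fmt.Println(tr.Add("host-1", 1)) // slot released and reacquired
}
```

Because the map is a set keyed by host name, repeated `Add` calls for the same host cannot consume more than one slot.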
```go
if !hsm.updateHostStateFrom(initialState, info) {
	actionRes = recordActionFailure(info, metal3v1alpha1.TooManyHostsError, "Delayed, too many hosts")
}
}()
```
zaneb:
At one stage @dhellmann had a proposal for a generic way to attach actions to particular state transitions, but I don't remember the context...
dhellmann:
I suggested adding a transition(newState) method to the state machine to use instead of assigning the new state directly. By doing that, all of the transitions would pass through one place and we could have a switch statement for handling before & after calls consistently.
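A minimal sketch of that idea, assuming simplified stand-in types (the real BMO state machine has many more states and hooks): every state change goes through one `transition` method, which gives a single place for before/after handling via switch statements.

```go
package main

import "fmt"

type ProvisioningState string

const (
	StateReady        ProvisioningState = "ready"
	StateProvisioning ProvisioningState = "provisioning"
)

// hostStateMachine is a minimal stand-in for the BMO host state
// machine; names and states are illustrative only.
type hostStateMachine struct {
	current ProvisioningState
	log     []string
}

// transition funnels every state change through one place so that
// before/after actions can be attached consistently.
func (hsm *hostStateMachine) transition(next ProvisioningState) {
	switch next {
	case StateProvisioning:
		// "before" hook: e.g. reserve a provisioning slot here
		hsm.log = append(hsm.log, "before:provisioning")
	}
	hsm.current = next
	switch next {
	case StateReady:
		// "after" hook: e.g. release the slot here
		hsm.log = append(hsm.log, "after:ready")
	}
}

func main() {
	hsm := &hostStateMachine{current: StateReady}
	hsm.transition(StateProvisioning)
	hsm.transition(StateReady)
	fmt.Println(hsm.current, hsm.log)
}
```

The benefit is that no caller can bypass the hooks by assigning the state field directly, as long as direct assignment is disallowed by convention or by unexporting the field.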
```go
if !hsm.Reconciler.checkProvisioningHost(*info.host, hsm.NextState) {
	return false
}
```
zaneb:
It's a little strange that we will be stuck in e.g. the Ready state with an error, rather than the Provisioning state. This does make things simpler though.
```go
defer r.mu.Unlock()

switch state {
case metal3v1alpha1.StateInspecting, metal3v1alpha1.StateProvisioning:
```
zaneb:
What about deprovisioning? Isn't that just as hard on ironic-conductor? I don't know that we ever want to block deprovisioning but we probably don't want to start provisioning a bunch of stuff when we already have a large amount of deprovisioning in-flight.
andfasano:
That's a good question; in reality deprovisioning wasn't part of the initial proposal, which focused only on the provisioning phase. Maybe an Ironic expert could comment better (@dtantsur ?); in any case, the deleting part appeared not to be as hard on the conductor as the provisioning part.
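One way to act on zaneb's suggestion without ever blocking deprovisioning would be to count deprovisioning hosts against the same in-flight budget when deciding whether a new provisioning may start. This is only a sketch of that idea, with hypothetical state names and helper; it is not what the PR currently does.

```go
package main

import "fmt"

// mayStartProvisioning counts both provisioning-like and
// deprovisioning hosts against one budget, so that new provisioning
// is delayed while a large amount of deprovisioning is in flight,
// but deprovisioning itself is never refused (hypothetical helper).
func mayStartProvisioning(states map[string]string, limit int) bool {
	inFlight := 0
	for _, s := range states {
		switch s {
		case "inspecting", "provisioning", "deprovisioning":
			inFlight++
		}
	}
	return inFlight < limit
}

func main() {
	states := map[string]string{
		"host-0": "deprovisioning",
		"host-1": "deprovisioning",
		"host-2": "ready",
	}
	fmt.Println(mayStartProvisioning(states, 2)) // budget already consumed
	fmt.Println(mayStartProvisioning(states, 3)) // one slot still free
}
```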
```go
if len(r.currentProvisioningHosts) >= maxProvisioningHosts {
	return false
}
r.currentProvisioningHosts[host.Name] = struct{}{}
```
zaneb:
This assumes that we will successfully write the state change. If writing that fails we will end up back here checking for capacity again against a map that already contains the host.
The good news is that this is a map, not a list, so we can only store one copy, and a single host repeatedly failing will not fill up our quota and block everything. Nevertheless, this exposes us to systemic risk: e.g. if we try to provision 20 hosts and then the k8s API becomes briefly unavailable for some (presumably not unrelated) reason, the BMO is now deadlocked and cannot provision anything ever again until it is restarted.
We have a mechanism in the info struct for deferring function calls until after the write succeeds, and I think we need to use that here.
andfasano:
Nice catch, I'll check how to use that
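The mechanism zaneb refers to can be illustrated with a simplified model: the slot reservation is registered as a callback that only runs once the status write succeeds, so a failed write can never leak a slot. The `reconcileInfo`/`publishAfterSave`/`save` names below are a hypothetical simplification, not the actual `info` struct API.

```go
package main

import "fmt"

// reconcileInfo models deferring side effects until after the
// Kubernetes write succeeds (simplified, hypothetical API).
type reconcileInfo struct {
	postSaveCallbacks []func()
}

// publishAfterSave queues a side effect to run only after a
// successful save.
func (info *reconcileInfo) publishAfterSave(fn func()) {
	info.postSaveCallbacks = append(info.postSaveCallbacks, fn)
}

// save stands in for the status write; callbacks run only if it
// succeeds, so a failed write leaves the tracker untouched.
func (info *reconcileInfo) save() error {
	for _, fn := range info.postSaveCallbacks {
		fn()
	}
	return nil
}

func main() {
	inFlight := map[string]struct{}{}
	info := &reconcileInfo{}
	// Reserve the slot only once the state change is persisted.
	info.publishAfterSave(func() { inFlight["host-0"] = struct{}{} })
	fmt.Println(len(inFlight)) // nothing reserved before the save
	_ = info.save()
	fmt.Println(len(inFlight)) // slot reserved after a successful save
}
```

With this ordering, a retry after a failed write re-checks capacity against a set that does not yet contain the host, avoiding the deadlock scenario described above.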
```go
// SetupWithManager registers the reconciler to be run by the manager
func (r *BareMetalHostReconciler) SetupWithManager(mgr ctrl.Manager) error {

	r.syncProvisioningHosts(mgr)
```
zaneb:
I believe there is a race here; we haven't yet started watching resources or obtained the leader lock, so Hosts can be modified after we build our initial state without us seeing the changes (potentially ever, since our updates are edge-triggered).
andfasano:
Is there any other initialization point that could be used (before reconciling)? Otherwise, the alternative would be to resync the set at the beginning of every reconcile loop, in line with @dhellmann's suggestion.
```go
// Resync the set at startup
func (r *BareMetalHostReconciler) syncProvisioningHosts(mgr ctrl.Manager) {
	hosts := &metal3v1alpha1.BareMetalHostList{}
```
zaneb:
The controller runtime uses a caching client, which means when we ask for resources that we're subscribed to we look at the cache and get a list back quickly. If we eliminate the mutex management and just write a function that returns a number, it should be plenty fast enough to call every time we start to reconcile.
andfasano:
I get your point and will give it a try, but I'm not sure it will work, since the cache is not updated immediately, so a surge is still possible (whereas with a synchronized set every worker always has an effectively up-to-date view). Anyhow, I'll check.
zaneb:
This PR uses the same cache to initialize the data structure, right?
andfasano:
Not really; I use the APIReader() returned by the manager to access the API server directly.
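zaneb's alternative above, recomputing the count from a list on every reconcile, can be sketched as follows. In the real code the list would come from the controller-runtime cached client's `List()`; here a plain slice with a hypothetical `host` type stands in so the counting logic is self-contained.

```go
package main

import "fmt"

// host is a simplified stand-in for a BareMetalHost.
type host struct {
	Name  string
	State string
}

// countProvisioningHosts recomputes the in-flight count from a fresh
// list instead of maintaining a mutex-guarded set; with a cached
// client this is cheap enough to call at the start of every
// reconcile (sketch of zaneb's suggestion).
func countProvisioningHosts(hosts []host) int {
	n := 0
	for _, h := range hosts {
		switch h.State {
		case "inspecting", "provisioning":
			n++
		}
	}
	return n
}

func main() {
	hosts := []host{
		{"host-0", "ready"},
		{"host-1", "provisioning"},
		{"host-2", "inspecting"},
	}
	fmt.Println(countProvisioningHosts(hosts))
}
```

The trade-off discussed in the thread applies: the cache may lag slightly behind the latest writes, which is exactly the surge risk andfasano raises above.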
andfasano:
I understand your concerns and the idea of using Ironic as much as possible as the only source of truth. There still remains the problem of the requests on the wire, for which Ironic couldn't be used effectively, with the risk of a complete overflow, whereas the cache worked properly. I'll try to work out the best of the two approaches, in order to guarantee that the configured limit is respected and to address the comments mentioned so far.
Feedback addressed in #725
While testing #725 with a high number of nodes I found it wasn't working as expected, so I implemented an alternative that adopts a different approach from the one originally discussed.
Adding some notes here to provide as much additional detail as possible.
The problem encountered was an immediate surge when simultaneously provisioning a higher number of nodes, resulting in a breach of the configured threshold: the BMO reconcile loop processed the provisioning requests very rapidly, well before the Ironic subsystem was able to update its status and communicate it back. Things get worse when the BMO concurrency level is increased.
So this PR bypasses that limitation by observing the BMO states directly instead of the Ironic ones, and by preventing a host from transitioning to a provisionable state (Inspecting or Provisioning) if there isn't a free slot available. This results in a simpler and more reliable check.
A very small critical section has been introduced to cope with the BMO concurrency level, protecting a synchronized counter (a set) that tracks the hosts currently being provisioned.
When a host cannot transition to a provisionable state due to the lack of free slots, the action is rescheduled with an exponential backoff (reusing the existing ErrorCount mechanism). A new ErrorType has been introduced to keep track of this situation. The current host state is left unchanged.
A preliminary check is applied when processing a host that has already been delayed: if there are no free slots, the request is immediately rescheduled, to avoid putting unnecessary additional stress on the Ironic subsystem.