[WIP] Limit the number of hosts simultaneously provisioned (BMO only version) #730
andfasano wants to merge 1 commit into metal3-io:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: andfasano. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
Force-pushed from 61b8a20 to 26ca37a
zaneb left a comment:
It makes me nervous that we are relying on an in-memory cache of states as a source of truth with no mechanism to ensure it is in sync after startup.
This is also not on the path toward likely future enhancements, like automatically scaling with the number of conductors available. TBH I'd prefer an approach with this error management but with the state being read from Ironic, more like in #725. The target_provisioning_state should be updated more or less immediately, so if we take that into consideration I wouldn't expect any major capacity overflow problems.
```go
ProvisionerFactory provisioner.Factory

currentProvisioningHosts map[string]struct{}
mu                       sync.Mutex
```
zaneb:
It would be preferable to have a name that describes what the mutex is protecting.
andfasano:
OK. I can move them into a separate struct to make it even clearer, and it could help a little bit in testing.
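The refactoring discussed above could look something like the following sketch: the mutex lives in the same struct as the set it protects, so the relationship is explicit and the type can be tested in isolation. All names here (`provisioningTracker`, `Add`, `Remove`) are hypothetical, not the actual BMO code.

```go
package main

import (
	"fmt"
	"sync"
)

// provisioningTracker groups the mutex with the set it guards,
// making explicit what the lock protects (illustrative names only).
type provisioningTracker struct {
	mu    sync.Mutex
	hosts map[string]struct{}
}

func newProvisioningTracker() *provisioningTracker {
	return &provisioningTracker{hosts: make(map[string]struct{})}
}

// Add reserves a provisioning slot for the host if capacity allows.
func (t *provisioningTracker) Add(name string, max int) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if len(t.hosts) >= max {
		return false
	}
	t.hosts[name] = struct{}{}
	return true
}

// Remove releases the host's slot once provisioning finishes.
func (t *provisioningTracker) Remove(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.hosts, name)
}

func main() {
	tr := newProvisioningTracker()
	fmt.Println(tr.Add("host-0", 1)) // first host gets the only slot
	fmt.Println(tr.Add("host-1", 1)) // no slot free
	tr.Remove("host-0")
	fmt.Println(tr.Add("host-1", 1)) // slot released and reacquired
}
```

Because the map is a set keyed by host name, repeated `Add` calls for the same host cannot consume more than one slot.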
```go
if !hsm.updateHostStateFrom(initialState, info) {
	actionRes = recordActionFailure(info, metal3v1alpha1.TooManyHostsError, "Delayed, too many hosts")
}
}()
```
zaneb:
At one stage @dhellmann had a proposal for a generic way to attach actions to particular state transitions, but I don't remember the context...
dhellmann:
I suggested adding a transition(newState) method to the state machine to use instead of assigning the new state directly. By doing that, all of the transitions would pass through one place and we could have a switch statement for handling before & after calls consistently.
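A minimal sketch of that idea, assuming simplified stand-in types (the real BMO state machine has many more states and hooks): every state change goes through one `transition` method, which gives a single place for before/after handling via switch statements.

```go
package main

import "fmt"

type ProvisioningState string

const (
	StateReady        ProvisioningState = "ready"
	StateProvisioning ProvisioningState = "provisioning"
)

// hostStateMachine is a minimal stand-in for the BMO host state
// machine; names and states are illustrative only.
type hostStateMachine struct {
	current ProvisioningState
	log     []string
}

// transition funnels every state change through one place so that
// before/after actions can be attached consistently.
func (hsm *hostStateMachine) transition(next ProvisioningState) {
	switch next {
	case StateProvisioning:
		// "before" hook: e.g. reserve a provisioning slot here
		hsm.log = append(hsm.log, "before:provisioning")
	}
	hsm.current = next
	switch next {
	case StateReady:
		// "after" hook: e.g. release the slot here
		hsm.log = append(hsm.log, "after:ready")
	}
}

func main() {
	hsm := &hostStateMachine{current: StateReady}
	hsm.transition(StateProvisioning)
	hsm.transition(StateReady)
	fmt.Println(hsm.current, hsm.log)
}
```

The benefit is that no caller can bypass the hooks by assigning the state field directly, as long as direct assignment is disallowed by convention or by unexporting the field.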
```go
if !hsm.Reconciler.checkProvisioningHost(*info.host, hsm.NextState) {
	return false
}
```
zaneb:
It's a little strange that we will be stuck in e.g. the Ready state with an error, rather than the Provisioning state. This does make things simpler though.
```go
defer r.mu.Unlock()

switch state {
case metal3v1alpha1.StateInspecting, metal3v1alpha1.StateProvisioning:
```
zaneb:
What about deprovisioning? Isn't that just as hard on ironic-conductor? I don't know that we ever want to block deprovisioning but we probably don't want to start provisioning a bunch of stuff when we already have a large amount of deprovisioning in-flight.
andfasano:
That's a good question; in reality deprovisioning wasn't part of the initial proposal, which focused only on the provisioning phase. Maybe an Ironic expert could comment better (@dtantsur ?); in any case, the deleting part appeared not to be as hard on the conductor as the provisioning part.
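One way to act on zaneb's suggestion without ever blocking deprovisioning would be to count deprovisioning hosts against the same in-flight budget when deciding whether a new provisioning may start. This is only a sketch of that idea, with hypothetical state names and helper; it is not what the PR currently does.

```go
package main

import "fmt"

// mayStartProvisioning counts both provisioning-like and
// deprovisioning hosts against one budget, so that new provisioning
// is delayed while a large amount of deprovisioning is in flight,
// but deprovisioning itself is never refused (hypothetical helper).
func mayStartProvisioning(states map[string]string, limit int) bool {
	inFlight := 0
	for _, s := range states {
		switch s {
		case "inspecting", "provisioning", "deprovisioning":
			inFlight++
		}
	}
	return inFlight < limit
}

func main() {
	states := map[string]string{
		"host-0": "deprovisioning",
		"host-1": "deprovisioning",
		"host-2": "ready",
	}
	fmt.Println(mayStartProvisioning(states, 2)) // budget already consumed
	fmt.Println(mayStartProvisioning(states, 3)) // one slot still free
}
```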
```go
if len(r.currentProvisioningHosts) >= maxProvisioningHosts {
	return false
}
r.currentProvisioningHosts[host.Name] = struct{}{}
```
zaneb:
This assumes that we will successfully write the state change. If writing that fails we will end up back here checking for capacity again against a map that already contains the host.
The good news is that this is a map, not a list, so we can only store one copy, and a single host repeatedly failing will not fill up our quota and block everything. Nevertheless, this exposes us to systemic risk: e.g. if we try to provision 20 hosts and then the k8s API becomes briefly unavailable for some (presumably not unrelated) reason, the BMO is now deadlocked and cannot provision anything ever again until it is restarted.
We have a mechanism in the info struct for deferring function calls until after the write succeeds, and I think we need to use that here.
andfasano:
Nice catch, I'll check how to use that
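The mechanism zaneb refers to can be illustrated with a simplified model: the slot reservation is registered as a callback that only runs once the status write succeeds, so a failed write can never leak a slot. The `reconcileInfo`/`publishAfterSave`/`save` names below are a hypothetical simplification, not the actual `info` struct API.

```go
package main

import "fmt"

// reconcileInfo models deferring side effects until after the
// Kubernetes write succeeds (simplified, hypothetical API).
type reconcileInfo struct {
	postSaveCallbacks []func()
}

// publishAfterSave queues a side effect to run only after a
// successful save.
func (info *reconcileInfo) publishAfterSave(fn func()) {
	info.postSaveCallbacks = append(info.postSaveCallbacks, fn)
}

// save stands in for the status write; callbacks run only if it
// succeeds, so a failed write leaves the tracker untouched.
func (info *reconcileInfo) save() error {
	for _, fn := range info.postSaveCallbacks {
		fn()
	}
	return nil
}

func main() {
	inFlight := map[string]struct{}{}
	info := &reconcileInfo{}
	// Reserve the slot only once the state change is persisted.
	info.publishAfterSave(func() { inFlight["host-0"] = struct{}{} })
	fmt.Println(len(inFlight)) // nothing reserved before the save
	_ = info.save()
	fmt.Println(len(inFlight)) // slot reserved after a successful save
}
```

With this ordering, a retry after a failed write re-checks capacity against a set that does not yet contain the host, avoiding the deadlock scenario described above.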
```go
// SetupWithManager registers the reconciler to be run by the manager
func (r *BareMetalHostReconciler) SetupWithManager(mgr ctrl.Manager) error {

	r.syncProvisioningHosts(mgr)
```
zaneb:
I believe there is a race here; we haven't yet started watching resources or obtained the leader lock, so Hosts can be modified after we build our initial state without us seeing the changes (potentially ever, since our updates are edge-triggered).
andfasano:
Is there any other initialization point that could be used (before reconciling)? Otherwise, the alternative would be to resync the set at the beginning of every reconcile loop, in line with @dhellmann's suggestion.
```go
// Resync the set at startup
func (r *BareMetalHostReconciler) syncProvisioningHosts(mgr ctrl.Manager) {
	hosts := &metal3v1alpha1.BareMetalHostList{}
```
zaneb:
The controller runtime uses a caching client, which means when we ask for resources that we're subscribed to we look at the cache and get a list back quickly. If we eliminate the mutex management and just write a function that returns a number, it should be plenty fast enough to call every time we start to reconcile.
andfasano:
I get your point and will give it a try, but I'm not sure it will work, since the cache is not updated immediately, so a surge is still possible (whereas with a synchronized set every worker always has an effectively up-to-date view). Anyhow, I'll check.
zaneb:
This PR uses the same cache to initialize the data structure, right?
andfasano:
Not really; I use the APIReader() returned by the manager to access the API server directly.
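zaneb's alternative above, recomputing the count from a list on every reconcile, can be sketched as follows. In the real code the list would come from the controller-runtime cached client's `List()`; here a plain slice with a hypothetical `host` type stands in so the counting logic is self-contained.

```go
package main

import "fmt"

// host is a simplified stand-in for a BareMetalHost.
type host struct {
	Name  string
	State string
}

// countProvisioningHosts recomputes the in-flight count from a fresh
// list instead of maintaining a mutex-guarded set; with a cached
// client this is cheap enough to call at the start of every
// reconcile (sketch of zaneb's suggestion).
func countProvisioningHosts(hosts []host) int {
	n := 0
	for _, h := range hosts {
		switch h.State {
		case "inspecting", "provisioning":
			n++
		}
	}
	return n
}

func main() {
	hosts := []host{
		{"host-0", "ready"},
		{"host-1", "provisioning"},
		{"host-2", "inspecting"},
	}
	fmt.Println(countProvisioningHosts(hosts))
}
```

The trade-off discussed in the thread applies: the cache may lag slightly behind the latest writes, which is exactly the surge risk andfasano raises above.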
andfasano:
I understand your concerns and the idea of using Ironic as much as possible as the only source of truth. There still remains the problem of the requests on the wire, for which Ironic couldn't be used effectively, with the risk of a complete overflow, whereas the cache worked properly. I'll try to work out the best of the two approaches, in order to guarantee that the configured limit is respected and to address the comments mentioned so far.
Feedback addressed in #725
While testing #725 with a high number of nodes I found it wasn't working as expected, so I implemented an alternative that adopts a different approach from the one originally discussed.
Adding some notes here to provide as much additional detail as possible.
The problem encountered was an immediate surge when simultaneously provisioning a higher number of nodes, resulting in a breach of the configured threshold: the BMO reconcile loop processed the provisioning requests very rapidly, well before the Ironic subsystem was able to update its status and communicate it back. Things get worse when the BMO concurrency level is increased.
So this PR bypasses that limitation by observing the BMO states directly instead of the Ironic ones, and by preventing a host from transitioning to a provisionable state (Inspecting or Provisioning) if there isn't a free slot available. This results in a simpler and more reliable check.
A very small critical section has been introduced to cope with the BMO concurrency level, protecting a synchronized counter (a set) that tracks the hosts currently being provisioned.
When a host cannot transition to a provisionable state due to the lack of free slots, the action is rescheduled with an exponential backoff (reusing the existing ErrorCount mechanism). A new ErrorType has been introduced to keep track of this situation. The current host state is left unchanged.
A preliminary check is applied when processing a host that has already been delayed: if there are no free slots, the request is immediately rescheduled, to avoid putting unnecessary additional stress on the Ironic subsystem.