Limit the number of hosts simultaneously provisioned #725

Merged

metal3-io-bot merged 2 commits into metal3-io:master from andfasano:limit-hosts on Jan 29, 2021

Conversation


@andfasano (Member) commented Nov 19, 2020

This PR introduces a soft limit on the number of hosts being simultaneously provisioned.

Design doc: https://github.com/metal3-io/metal3-docs/blob/master/design/baremetal-operator/limit-hosts-provisioning.md.

A new configuration variable, PROVISIONING_LIMIT (default 20), specifies the desired batch size that the operator will try to honor while provisioning new hosts. If there are no available slots, the current action for a host will be rescheduled using a fixed backoff with jitter, to spread the delayed requests evenly.

Due to the asynchronous behaviour of the Ironic subsystem and the configurable concurrency level of the operator, it is not possible to guarantee that the limit will always be respected, and occasional overflows can happen, especially during the initial burst. Nevertheless, the operator will try to enforce the limit as much as possible, and for this reason it is recommended to keep PROVISIONING_LIMIT greater than BMO_CONCURRENCY.
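As a rough illustration of the mechanism described above, here is a minimal Go sketch of the limit check and the jittered backoff. The helper names are hypothetical; only the PROVISIONING_LIMIT variable, its default of 20, and the jitter idea come from this PR (the [1m, 2m] window is confirmed later in this thread), so this is a sketch, not the merged code.

```go
package provisioning

import (
	"math/rand"
	"os"
	"strconv"
	"time"
)

// provisioningLimit reads PROVISIONING_LIMIT from the environment,
// falling back to the default of 20 described above.
func provisioningLimit() int {
	if v, err := strconv.Atoi(os.Getenv("PROVISIONING_LIMIT")); err == nil && v > 0 {
		return v
	}
	return 20
}

// delayIfBusy returns a jittered backoff when all slots are taken, or
// zero when the host may proceed. Helper names are hypothetical.
func delayIfBusy(hostsInProvisioning int) time.Duration {
	if hostsInProvisioning < provisioningLimit() {
		return 0
	}
	// A fixed backoff with jitter in [1m, 2m) spreads the delayed
	// requests evenly instead of retrying them all at once.
	return time.Minute + time.Duration(rand.Int63n(int64(time.Minute)))
}
```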


Demo

Here's a quick demo of the current implementation, using a limit set to 8 and a BMO concurrency level of 3 (demo sped up). A batch of 15 additional nodes is added to the cluster. Note the overflow during the initial burst; after that, the limit is maintained until provisioning completes.

asciicast

@metal3-io-bot metal3-io-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 19, 2020
Comment thread pkg/provisioner/ironic/ironic.go Outdated
continue
}

switch node.ProvisionState {
Member

Would looking at the target provisioning state instead/as well provide a faster response to changes?

@andfasano (Member Author)

Unfortunately it didn't :/ In my tests, TargetState also usually points just to the manageable or active states.

Member

At the start of provisioning we are usually in manageable state (not always after #716, but always for the first deployment). So if we're seeing e.g. current state is manageable but target state is available, that machine is probably about to be provisioned. Ditto for current state is available and target state is active.
I assume the target state must be updated synchronously, so it really should be available whenever we look.

I suspect the real problem here is that there's a large delay between the point where you're making a go/no-go decision (on entry to the Provisioning state) and when we actually ask Ironic to go to active (not until after cleaning). And now it's even more fun because a node in the available state may or may not represent a Host in the provisioning state.

I think the solution to this is the Preparing state I proposed on the mailing list. That way all nodes will be in available before starting provisioning, so we'd only have to look for those with target state active.

@andfasano (Member Author) commented Dec 14, 2020

@dtantsur can you please confirm this behavior? From my local tests the state doesn't seem to be updated synchronously, nor does the target_provision_state (just after a provision change request).

Member

> current state is manageable but target state is available

To nitpick a bit: this is not possible. So-called stable states always have a target state of None. Stable states are states where a node simply stays without external input (BMO is external input in this context). This includes manageable, available, and active. A non-empty target state appears in so-called transient states: wait call-back and all states ending with ...ing or ... wait.

Now let me check the code to provide a more targeted answer.
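To illustrate the distinction drawn above, here is a small hedged Go sketch. The TargetProvisionState field matches gophercloud's baremetal nodes.Node struct; the helper itself is hypothetical.

```go
package provisioning

import "github.com/gophercloud/gophercloud/openstack/baremetal/v1/nodes"

// isTransient reports whether a node is in a transient provision
// state. Per the comment above, stable states (manageable, available,
// active) always have an empty target state, while transient states
// (wait call-back, ...ing, ... wait) carry a non-empty
// target_provision_state.
func isTransient(node nodes.Node) bool {
	return node.TargetProvisionState != ""
}
```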

@andfasano andfasano force-pushed the limit-hosts branch 3 times, most recently from 63bf033 to cac0197 on December 4, 2020 13:40
Comment thread controllers/metal3.io/host_state_machine.go
Comment thread pkg/provisioner/ironic/ironic.go Outdated
@andfasano andfasano force-pushed the limit-hosts branch 3 times, most recently from 3c4bcc7 to e2cdf65 on December 7, 2020 11:34
Comment thread controllers/metal3.io/baremetalhost_controller.go Outdated
Comment thread controllers/metal3.io/baremetalhost_controller.go Outdated
Comment thread controllers/metal3.io/action_result.go Outdated
Comment thread controllers/metal3.io/action_result.go Outdated
Comment thread controllers/metal3.io/action_result.go Outdated
Comment thread pkg/provisioner/provisioner.go Outdated
Comment thread controllers/metal3.io/host_state_machine.go Outdated
Comment thread pkg/provisioner/ironic/ironic.go Outdated
Comment thread pkg/provisioner/ironic/ironic.go Outdated
Comment thread pkg/provisioner/ironic/ironic.go Outdated

@zaneb (Member) left a comment

OK, so if I'm following correctly here, the main thing you've changed is to check for capacity and block reconciling continuously, rather than just preventing the Host from entering the Provisioning state. (In fact, if we do it this way I think we should stop blocking the host from changing state - better to consistently block only in Provisioning than sometimes in Provisioning and sometimes in Ready.)

This is suboptimal, because it may mean that we e.g. let a node start cleaning when another node that has already finished cleaning is waiting to start provisioning, but it is possibly necessary because of the very large delay between entering the Provisioning state and doing anything that looks like provisioning in Ironic. However, if we find that only a subset of states during provisioning actually place strain on Ironic (sitting there waiting for IPA to clean a node might be easy, for example), then this could still come out as a win.

The following combinations of current and target node provisioning states correspond to Hosts in the Provisioning state with what I'd expect (with one exception) to be a fairly high degree of consistency:

  • `Manageable` -> `TargetProvide`
  • `Cleaning` -> `TargetProvide`
  • `CleanWait` -> `TargetProvide`
  • `CleanFail` (could actually be deprovisioning, but it's brief enough that we can live with it)
  • `Available` -> `TargetActive`
  • `Deploying`
  • `DeployWait`

However, there is a large hole there - once the state becomes Available but before we request TargetActive, we cannot tell from the ironic states alone whether the node is Provisioning or Ready after deprovisioning. (Unless, of course, we were to store this information somewhere in ironic - which is totally possible.) While setting up the data and requesting a move to active is relatively quick, it may take some time for the operator to notice that the node has reached Available and in that window it may start allowing other Hosts to start provisioning.

If we implement my proposal to add a Preparing state before Ready, so that the request for TargetActive always happens right at the start of Provisioning, then I think that gating on state changes only would become viable again. Since we want that state anyway, it's worth at least considering whether our path would look easier if we did that first or merged this and then reworked it later, assuming we prefer to gate only on state changes.

Use of the error message and error type fields is effective (operational status would be another option, but this is more visible). However, after having written #749 I now think it's a mistake to increment the error count even once because it mixes these routine delays with other possible causes for error. I think all we really need here is a random backoff from a fixed interval. It's fine to add a new actionResult type for this - we actually need one for retry-after-delay anyway.

Comment thread pkg/provisioner/ironic/ironic.go Outdated
for _, node := range allNodes {

switch node.ProvisionState {
case "cleaning", "clean wait", "inspecting", "inspect wait",
Member

It may be that clean wait and inspect wait are not states that are particularly taxing on ironic and could be left off the list. It would be worth checking with the Ironic team.

Comment thread pkg/provisioner/ironic/ironic.go
@dtantsur (Member) left a comment

LGTM

@andfasano (Member Author)

> The following combinations of current and target node provisioning states correspond to Hosts in the Provisioning state with what I'd expect (with one exception) to be a fairly high degree of consistency:
>
> * `Manageable` -> `TargetProvide`
> * `Cleaning` -> `TargetProvide`
> * `CleanWait` -> `TargetProvide`
> * `CleanFail` (could actually be deprovisioning, but it's brief enough that we can live with it)
> * `Available` -> `TargetActive`
> * `Deploying`
> * `DeployWait`

Apart from the manageable and available states, which cannot be used as per @dtantsur's comment, the others are already in use - the overall goal was to roughly measure how busy Ironic currently is (in the initial proposal the subset was slightly larger, including the rescuing/unrescuing/deleting states, which it could probably make sense to reintroduce now).
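A hedged sketch of the "how busy is Ironic" measure described here: count the nodes in the provision states treated as busy. The first four states come from the truncated diff hunk quoted above; the deploying states are an assumption based on the review discussion, and the names are hypothetical.

```go
package provisioning

import "github.com/gophercloud/gophercloud/openstack/baremetal/v1/nodes"

// busyStates approximates the set of provision states counted against
// the limit; "cleaning", "clean wait", "inspecting" and "inspect wait"
// come from the diff hunk above, the deploying states from the review.
var busyStates = map[string]bool{
	"cleaning":     true,
	"clean wait":   true,
	"inspecting":   true,
	"inspect wait": true,
	"deploying":    true,
	"deploy wait":  true,
}

// usedProvisioningSlots counts the nodes currently occupying a slot.
func usedProvisioningSlots(allNodes []nodes.Node) int {
	used := 0
	for _, node := range allNodes {
		if busyStates[node.ProvisionState] {
			used++
		}
	}
	return used
}
```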

> However, there is a large hole there - once the state becomes Available but before we request TargetActive, we cannot tell from the ironic states alone whether the node is Provisioning or Ready after deprovisioning. (Unless, of course, we were to store this information somewhere in ironic - which is totally possible.) While setting up the data and requesting a move to active is relatively quick, it may take some time for the operator to notice that the node has reached Available and in that window it may start allowing other Hosts to start provisioning.

> If we implement my proposal to add a Preparing state before Ready, so that the request for TargetActive always happens right at the start of Provisioning, then I think that gating on state changes only would become viable again. Since we want that state anyway, it's worth at least considering whether our path would look easier if we did that first or merged this and then reworked it later, assuming we prefer to gate only on state changes.

Currently, gating on state changes seems to work reasonably well, without affecting the convergence of the overall provisioning, so I'd prefer to proceed and merge with this approach and revisit it later, once the new Preparing state becomes available.

> Use of the error message and error type fields is effective (operational status would be another option, but this is more visible). However, after having written #749 I now think it's a mistake to increment the error count even once because it mixes these routine delays with other possible causes for error. I think all we really need here is a random backoff from a fixed interval. It's fine to add a new actionResult type for this - we actually need one for retry-after-delay anyway.

The ProvisionerBusy error is now used to detect that a host has been previously delayed, so an initial check can be performed even before applying the state machine. If we want to move to a plain action result for the backoff - and I also see the usefulness of that - then I suppose operational status could be used to track this situation.

@andfasano (Member Author)

> Use of the error message and error type fields is effective (operational status would be another option, but this is more visible). However, after having written #749 I now think it's a mistake to increment the error count even once because it mixes these routine delays with other possible causes for error. I think all we really need here is a random backoff from a fixed interval. It's fine to add a new actionResult type for this - we actually need one for retry-after-delay anyway.

> The ProvisionerBusy error is now used to detect that a host has been previously delayed, so an initial check can be performed even before applying the state machine. If we want to move to a plain action result for the backoff - and I also see the usefulness of that - then I suppose operational status could be used to track this situation.

Update: I've added a new actionDelayed that randomly backs off in [1m, 2m], and switched to using OperationalStatus to track delayed hosts instead of the error type.
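A minimal sketch of what such an actionDelayed result might look like, assuming controller-runtime's reconcile result type; the Result/Dirty shape is an assumption modeled on the actionResult discussion in this thread, not the exact merged code.

```go
package provisioning

import (
	"math/rand"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// actionDelayed requeues the host after a random delay in [1m, 2m),
// as in the update above, without incrementing any error count.
type actionDelayed struct{}

func (actionDelayed) Result() (ctrl.Result, error) {
	delay := time.Minute + time.Duration(rand.Int63n(int64(time.Minute)))
	return ctrl.Result{RequeueAfter: delay}, nil
}

// Dirty reports whether the host status needs to be saved; a plain
// delay leaves it untouched.
func (actionDelayed) Dirty() bool {
	return false
}
```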

@andfasano (Member Author)

/test-integration

2 similar comments
@andfasano (Member Author)

/test-integration

@andfasano (Member Author)

/test-integration

@andfasano andfasano force-pushed the limit-hosts branch 2 times, most recently from 6196aa5 to 4bd080b on December 21, 2020 16:57
@andfasano (Member Author)

/test-integration

@andfasano andfasano changed the title from "[WIP] Limit the number of hosts simultaneously provisioned" to "Limit the number of hosts simultaneously provisioned" on Jan 5, 2021
@metal3-io-bot metal3-io-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 5, 2021
@zaneb (Member) left a comment

/approve

Comment thread controllers/metal3.io/action_result.go
Comment thread controllers/metal3.io/host_state_machine.go Outdated
Comment thread pkg/provisioner/ironic/ironic.go Outdated
Comment thread pkg/provisioner/ironic/ironic.go
Comment thread controllers/metal3.io/action_result.go Outdated
// has any sort of error.
OperationalStatusError OperationalStatus = "error"

OperationalStatusDelayed = "delayed"
Member

I'm wondering if there is a better name for this, but I can't think of one.

@metal3-io-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano, dtantsur, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 13, 2021
@andfasano (Member Author)

Commits squashed and demo cast updated

@zaneb (Member) left a comment

Rebase required, but otherwise looks good to me. Love the demo!

Comment thread controllers/metal3.io/action_result.go Outdated
@metal3-io-bot metal3-io-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 18, 2021
@andfasano (Member Author)

/test-integration

@andfasano (Member Author)

/test-integration

@andfasano (Member Author)

/test-integration

@andfasano andfasano requested a review from zaneb January 25, 2021 10:09

zaneb commented Jan 26, 2021

I don't see any issues with the rebase. Who wants to LGTM?


honza commented Jan 29, 2021

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 29, 2021
@metal3-io-bot metal3-io-bot merged commit 3f79aae into metal3-io:master Jan 29, 2021