Handle re-registration in all states #716
Conversation
Create an error type to indicate that the Host needs to be registered (i.e. there is no corresponding Node in ironic), so that this situation can be identified outside of the provisioner.
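Judging by the err = provisioner.NeedsRegistration hunk further down, this is a package-level sentinel error. A minimal sketch of that shape (the message text is an assumption):

package provisioner

import "errors"

// NeedsRegistration indicates that the Host has no corresponding Node in
// ironic; the caller should (re-)register it and retry the operation.
var NeedsRegistration = errors.New("host not registered")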
Previously we called either Adopt() (in the Provisioned or ExternallyProvisioned states) or ValidateManagementAccess() (in the Ready state). If the node had not been registered in Ironic when calling Adopt(), it would call ValidateManagementAccess() for us. This change separates the two so that Adopt() always expects the node to be registered already. This way we always call ValidateManagementAccess() first and then additionally call Adopt() in the provisioned states. If Adopt() finds that the host is not registered, it will return an error, and the call to ValidateManagementAccess() on the subsequent reconcile will register it.

Previously, ValidateManagementAccess() was only called if the node did not exist, which meant there was only one chance to create the node and successfully validate the credentials. Since ValidateManagementAccess() is now always called, we can keep trying until it succeeds.

Fixes metal3-io#698
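Roughly, the resulting control flow looks like this (a sketch; the helper names and state constants are assumptions, not the PR's exact code):

// Always verify (and if necessary create) the ironic Node first; this is
// retried on every reconcile until it succeeds.
if result, err := prov.ValidateManagementAccess(credentialsChanged); err != nil || result.Dirty {
    return result, err
}

// In the provisioned states, additionally adopt the already-registered Node.
switch host.Status.Provisioning.State {
case metal3v1alpha1.StateProvisioned, metal3v1alpha1.StateExternallyProvisioned:
    if result, err := prov.Adopt(); err != nil {
        // May be provisioner.NeedsRegistration if the Node vanished; the
        // next reconcile's ValidateManagementAccess() re-creates it.
        return result, err
    }
}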
Previously we only verified that the Node was registered in Ironic when the credentials changed or in a steady state (Ready, Provisioned, ExternallyProvisioned). Simplify the code by always calling actionRegistering() prior to running the state machine. This ensures that the Node is always created in the ironic database after a move of the pod, regardless of what state the host is in at the time.

This allows a BMC password to be changed in any order. Previously, if the credentials were updated in the Secret prior to changing them on the BMC itself, we would never retry after the new credentials failed.

It also ensures that retries still happen after an error caused by a temporary inability to contact the BMC. Previously, if there had been a registration error, we just gave up because the credentials hadn't changed, on the incorrect assumption that bad credentials were the only way of failing to register. (Although the Host would still be reconciled every 1-2 minutes, it had no way of getting out of this state.)

Fixes metal3-io#708
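A minimal sketch of registration hoisted in front of the state machine (the method names here are assumed):

// ReconcileState checks registration unconditionally, so a Node lost to a
// pod move is re-created whatever state the Host is in, and registration
// errors keep being retried on subsequent reconciles.
func (hsm *hostStateMachine) ReconcileState(info *reconcileInfo) actionResult {
    if registerResult := hsm.ensureRegistered(info); registerResult != nil {
        return registerResult
    }
    return hsm.handlerForState(info.host.Status.Provisioning.State)(info)
}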
We may want to retrieve the checksum of the image from either the Spec (what we have used to date) or the Status (for use during adoption). Therefore, move the implementation to the Image struct so we can get it for either.
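One plausible shape for the moved implementation (field names follow the BareMetalHost API; the md5 default is an assumption):

// GetChecksum returns the checksum value, the algorithm it uses, and
// whether a usable checksum is present, for either a Spec or Status image.
func (image *Image) GetChecksum() (checksum, checksumType string, ok bool) {
    if image == nil || image.Checksum == "" {
        return
    }
    checksumType = string(image.ChecksumType)
    if checksumType == "" {
        checksumType = "md5" // assumed historical default
    }
    return image.Checksum, checksumType, true
}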
When registering a Host as a Node in ironic and setting the image information, prefer the current image data from the Status to the requested image from the Spec. Previously we looked only at the Spec. The only time this is likely to make any difference is during deprovisioning, when there may be no image specified in the Spec.
When we deprovision a node in Ironic by setting the target state to deleted (which has nothing to do with deleting the node itself), it goes through automated cleaning and returns to the available state. Previously we would then move the node from available to manageable before the deprovisioning was considered complete. The consequence of this was that to provision the host again, we would first have to move it back to the available state, which involves a second automated cleaning.

We now leave the node in the available state after deprovisioning, so the host goes through automated cleaning only before the first time it is provisioned and once between each provisioning. To make this work, when provisioning we have to set the instance info in the available state rather than in the manageable state. The first time we provision, the host will still be in the manageable state. Performing manual cleaning (which is required to change RAID settings) also requires returning to the manageable state.
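In terms of ironic's state machine, the provisioning entry point might now look roughly like this (a sketch using the gophercloud constants seen elsewhere in this PR; the control flow is an illustration, not the merged code):

switch nodes.ProvisionState(ironicNode.ProvisionState) {
case nodes.Available:
    // Normal case: automated cleaning already ran on deprovisioning, so
    // set the instance info here and deploy directly.
    return p.startProvisioning(ironicNode, hostConf)
case nodes.Manageable:
    // First-time provision (or after manual cleaning, e.g. for RAID):
    // "provide" moves the node to available via automated cleaning.
    return p.changeNodeProvisionState(ironicNode,
        nodes.ProvisionStateOpts{Target: nodes.TargetProvide})
}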
If the ironic database is dropped during deprovisioning, the nodes will be re-registered with the fresh ironic. Since we can't guarantee that deprovisioning was completed before the database was lost, adopt the node so that ironic will regard it as active and we can attempt deprovisioning again.
Previously, if the Host had been marked for deletion and no corresponding Node was registered in Ironic (as would happen e.g. after the ironic pod moves to another k8s node), we wouldn't re-register it in order to complete the deprovisioning. Instead we would simply continue and allow the Host to be deleted. This was to ensure that if there were no longer credentials available, we would not block deletion of the Host.

This change means that we will only skip deprovisioning and go straight to deletion if there are, in fact, no credentials available. If credentials are available but wrong, then the Host will be stuck deprovisioning. A user finding themselves in this situation can either correct the credentials (to permit deprovisioning) or remove them altogether (e.g. by deleting the Secret) to skip to deletion.
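A sketch of the resulting deletion logic (haveCreds and the action names are assumptions):

// Skip deprovisioning only when no BMC credentials exist at all; present
// but wrong credentials keep the Host deprovisioning until the user fixes
// or removes the Secret.
if !hsm.haveCreds {
    return actionComplete{} // proceed straight to deletion
}
return hsm.Reconciler.actionDeprovisioning(hsm.Provisioner, info)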
/cc @dhellmann @furkatgofurov7

Those issues are referenced in the individual commit messages, so they will get automatically closed when this merges.
    imageData = &p.host.Status.Provisioning.Image
case p.host.Spec.Image != nil && p.host.Spec.Image.URL != "":
    imageData = p.host.Spec.Image
}
Since this image extraction logic is not trivial, it could be useful to refactor it into a BareMetalHost.GetProvisioningImage().
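For reference, the suggested helper might look like this (grounded in the diff above, but not part of the PR as merged):

// GetProvisioningImage prefers the image recorded in the Status (what is
// actually on the host) over the one requested in the Spec.
func (host *BareMetalHost) GetProvisioningImage() *Image {
    if host.Status.Provisioning.Image.URL != "" {
        return &host.Status.Provisioning.Image
    }
    if host.Spec.Image != nil && host.Spec.Image.URL != "" {
        return host.Spec.Image
    }
    return nil
}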
// credentials, just in case.
p.log.Info("re-registering host")
return p.ValidateManagementAccess(true)
err = provisioner.NeedsRegistration
A log trace could be useful here
Returning an error all the way up through Reconcile() (which this does) always results in a log trace.
}

stateMachine := newHostStateMachine(host, r, prov)
stateMachine := newHostStateMachine(host, r, prov, haveCreds)
Nit: just to avoid passing another param to the host state machine, couldn't this check be done via the Provisioner interface (verifying that bmcCreds is not empty)?
We don't know what type of Provisioner we have, so we don't know where (or even if) that data is stored. Also I don't want to have to look at the contents of the creds (which are not passed as a pointer) to see if they're empty or not.
bmCreds are already passed to the ProvisionerFactory; a plain wrapper with the same check in the Provisioner interface could avoid storing it directly in the state machine. But it's a minor point.
Amazing commit messages! The only thing I'd note is the absence of unit tests.
case nodes.Error, nodes.CleanFail:
    if !ironicNode.Maintenance {
        p.log.Info("setting host maintenance flag to force image delete")
        return p.setMaintenanceFlag(ironicNode, true)
Clean failure may set maintenance mode; you need to clear it here (instead of setting it to true, which we used to do for some reason).
Thanks, I added another commit to do this.
dhellmann left a comment
This looks good. I appreciate the effort you took to split it into reviewable commits with detailed commit messages.
Now that we have a framework for adding them, I hope we can prioritize adding tests in follow-up PRs.
/lgtm
@@ -849,27 +849,31 @@ func (host *BareMetalHost) OperationMetricForState(operation ProvisioningState)

// GetImageChecksum returns the hash value and its algo.
func (host *BareMetalHost) GetImageChecksum() (string, string, bool) {
I would probably go ahead and remove this method, but that's not a reason to hold up this PR.
if provResult, err := p.startProvisioning(ironicNode, hostConf); err != nil || provResult.Dirty || provResult.ErrorMessage != "" {
    return provResult, err
}

return p.changeNodeProvisionState(ironicNode,
    nodes.ProvisionStateOpts{Target: nodes.TargetActive})
This part of the change is surprising. When will startProvisioning() return that it is done immediately after a failure to allow line 1094 to be executed? Aren't we likely to need to always reconcile again at least once to give provisioning time to complete? And at that point the node's ProvisionState field won't be DeployFailed, right?
This code used to be at the end of startProvisioning() but it has moved here because we only wanted to call it in the DeployFail state and not the Available state (which now calls startProvisioning - it used to be Manageable).
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, zaneb

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/hold I didn't notice the comment from @dtantsur before adding LGTM. |
Clean failure may set maintenance mode, so clear it before attempting to continue.
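Presumably this follow-up commit inverts the earlier hunk along these lines (a sketch; the log message is assumed):

case nodes.Error, nodes.CleanFail:
    if ironicNode.Maintenance {
        // Ironic may have set maintenance itself on clean failure; clear
        // it so deprovisioning can proceed.
        p.log.Info("clearing maintenance flag")
        return p.setMaintenanceFlag(ironicNode, false)
    }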
/test-integration

/hold cancel

/lgtm
Great to see this work completed! This PR solves quite a few issues for us. Thank you @zaneb for taking care of it.
Ensure that regardless of the current provisioning state, we always re-register the Node in ironic if the ironic database has been lost.
Ensure that if this occurs during deprovisioning, the Node is re-adopted so that deprovisioning starts again and the Host is always left in a clean state.
This also changes the steady state that we use for Ready hosts in Ironic. Previously we would try to maintain the ironic state as manageable. Now we maintain it as available if possible. This means that we don't end up doing a second automated cleaning on provisioning if we have just done an automated cleaning on deprovisioning.