
Handle re-registration in all states #716

Merged
metal3-io-bot merged 11 commits into metal3-io:master from zaneb:register-always on Dec 3, 2020

Conversation

@zaneb
Member

@zaneb zaneb commented Nov 12, 2020

Ensure that regardless of the current provisioning state, we always re-register the Node in ironic if the ironic database has been lost.
Ensure that if this occurs during deprovisioning, the Node is re-adopted so that deprovisioning starts again and the Host is always left in a clean state.

This also changes the steady state that we use for Ready hosts in Ironic. Previously we would try to maintain the ironic state as manageable. Now we maintain it in available if possible. This means that we don't end up doing a second automated cleaning on provisioning if we have just done an automated cleaning on deprovisioning.

zaneb added 10 commits November 12, 2020 12:30
Create an error type to indicate that the Host needs to be registered
(i.e. there is no corresponding Node in ironic), so that this situation
can be identified outside of the provisioner.
Previously we called either Adopt() (in Provisioned or
ExternallyProvisioned states) or ValidateManagementAccess() (in Ready
state). If the node had not been registered in Ironic when calling
Adopt(), it would call ValidateManagementAccess() for us.

This change separates the two so that Adopt() always expects the node to
be registered already. This way we always call
ValidateManagementAccess() first and then additionally call Adopt() in
the provisioned states.

If Adopt() finds that the host is not registered, it will return an
error and the call to ValidateManagementAccess() on the subsequent
reconcile will register it.

Previously, ValidateManagementAccess() was only called if the node did
not exist, which meant there was only one chance to create the node and
successfully validate the credentials. Since ValidateManagementAccess()
is now always called, we can keep trying until it succeeds.

Fixes metal3-io#698
Previously we only verified that the Node was registered in Ironic when
the credentials changed or in a steady state (Ready, Provisioned,
ExternallyProvisioned). Simplify the code by always calling
actionRegistering() prior to running the state machine.

This ensures that the Node is always created in the ironic database
after a move of the pod, regardless of what state the host is in at the
time.

This allows a BMC password to be changed in any order. Previously if the
credentials were updated in the Secret prior to changing them on the BMC
itself, it would never retry after the new credentials failed.

It also ensures that after an error caused by a temporary inability to
contact the BMC, retries still happen. Previously if there had been a
registration error, we just gave up because the credentials hadn't
changed on the incorrect assumption that bad credentials were the only
way of failing to register. (Although the Host would still be reconciled
every 1-2 minutes, it had no way of getting out of this state).

Fixes metal3-io#708
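The register-before-the-state-machine behaviour can be sketched like this (illustrative names, not the real state machine; "dirty" requeue semantics assumed from the surrounding discussion):

```go
package main

import "fmt"

// ensureRegistered sketches the idea: regardless of which provisioning
// state the host is in, (re-)create the Node in Ironic before
// dispatching to the per-state handler.
func ensureRegistered(registered *bool) bool {
	if !*registered {
		*registered = true // re-register with the fresh Ironic
		return true        // dirty: requeue before running the handler
	}
	return false
}

func reconcile(state string, registered *bool) string {
	if ensureRegistered(registered) {
		return "registered; requeue"
	}
	return "run handler for " + state
}

func main() {
	registered := false // e.g. the ironic pod moved and its DB was lost
	fmt.Println(reconcile("provisioned", &registered))
	fmt.Println(reconcile("provisioned", &registered))
}
```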
We may want to retrieve the checksum of the image from either the Spec
(what we have used to date) or the Status (for use during adoption).
Therefore, move the implementation to the Image struct so we can get it
for either.
When registering a Host as a Node in ironic and setting the image
information, prefer the current image data from the Status to the
requested image from the Spec. Previously we looked only at the Spec.

The only time this is likely to make any difference is during
deprovisioning, when there may be no image specified in the Spec.
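The Status-over-Spec preference can be shown with a self-contained sketch; `Image`, `host`, and `imageData` are stripped-down stand-ins for the real types:

```go
package main

import "fmt"

type Image struct{ URL string }

type host struct {
	specImage   *Image // requested image (Spec)
	statusImage Image  // image currently on the host (Status)
}

// imageData prefers what is actually on the host (Status) over what is
// requested (Spec), so deprovisioning still knows the image even when
// Spec.Image is nil.
func imageData(h *host) *Image {
	switch {
	case h.statusImage.URL != "":
		return &h.statusImage
	case h.specImage != nil && h.specImage.URL != "":
		return h.specImage
	}
	return nil
}

func main() {
	// During deprovisioning the Spec may carry no image at all:
	h := &host{statusImage: Image{URL: "http://example.com/last.qcow2"}}
	fmt.Println(imageData(h).URL)
}
```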
When we deprovision a node in Ironic, by setting the target state to
deleted (which has nothing to do with deleting the node itself), it goes
through automated cleaning and returns to the available state.

Previously we would then move the state from available to manageable
before the deprovisioning was considered complete. The consequence of
this was that to provision the host again, we would first have to move
it back to the available state, which involves a second automated
cleaning.

We now leave the node in the available state after deprovisioning, so
that the host will now go through automated cleaning only before the
first time it is provisioned and once between each provisioning. In
order to make this work, when provisioning we have to set the instance
info in the available state rather than in the manageable state.

The first time we provision, the host will still be in the manageable
state. To perform manual cleaning (which is required to change RAID
settings) also requires returning to the manageable state.
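The resulting flow can be sketched with a toy transition function covering only the edges discussed above (a simplified slice of Ironic's real provision-state machine, not a faithful model):

```go
package main

import "fmt"

// next models the relevant edges: setting target "deleted" runs
// automated cleaning and lands the node in "available". The operator
// now stops there, so the next provision can go available -> active
// directly; only the first provision (or manual cleaning, e.g. for
// RAID settings) still passes through "manageable".
func next(state, target string) string {
	switch {
	case state == "active" && target == "deleted":
		return "available" // automated cleaning happens on this edge
	case state == "available" && target == "active":
		return "active" // no second automated cleaning
	case state == "manageable" && target == "provide":
		return "available"
	}
	return state
}

func main() {
	s := next("active", "deleted") // deprovision
	fmt.Println(s)
	fmt.Println(next(s, "active")) // provision again, directly
}
```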
If the ironic database is dropped during deprovisioning, the nodes will
be re-registered with the fresh ironic. Since we can't guarantee that
deprovisioning was completed before the database was lost, adopt the
node so that ironic will regard it as active and we can attempt
deprovisioning again.
Previously if the Host had been marked for deletion and no corresponding
Node is registered in Ironic (as would happen e.g. after the ironic pod
moves to another k8s node), we wouldn't re-register it in order to
complete the deprovisioning. Instead we would simply continue and allow
the Host to be deleted. This was to ensure that if there were no longer
credentials available, we would not block deletion of the Host.

This change means that we will only skip deprovisioning and go straight
to deletion if there are, in fact, no credentials available.

If credentials are available but wrong, then the Host will be stuck
deprovisioning. A user finding themselves in this situation can either
correct the credentials (to permit deprovisioning) or remove them
altogether (e.g. by deleting the Secret) to skip to deletion.
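The deletion decision described above can be sketched as a hypothetical helper (not the controller's actual code; the real logic is spread across the reconciler):

```go
package main

import "fmt"

// deletionAction sketches the outcome for a Host marked for deletion.
func deletionAction(haveCreds, credsWork bool) string {
	if !haveCreds {
		// No credentials at all (e.g. the Secret was deleted):
		// skip deprovisioning and go straight to deletion.
		return "delete"
	}
	if !credsWork {
		// Credentials present but wrong: keep retrying deprovisioning
		// until the user corrects them or removes the Secret.
		return "retry deprovisioning"
	}
	return "deprovision, then delete"
}

func main() {
	fmt.Println(deletionAction(false, false))
	fmt.Println(deletionAction(true, false))
	fmt.Println(deletionAction(true, true))
}
```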
@metal3-io-bot metal3-io-bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 12, 2020
@zaneb
Member Author

zaneb commented Nov 12, 2020

/cc @dhellmann @furkatgofurov7
/test-integration

Member

@furkatgofurov7 furkatgofurov7 left a comment


Looks great! I'm very much in favor of this.
@zaneb we could maybe update the PR description, as this patch will fix at least 2 (if not 3) of the open issues (697, 698 and 708) in the repo.

@zaneb
Member Author

zaneb commented Nov 17, 2020

Those issues are referenced in the individual commit messages, so they will get automatically closed when this merges.

imageData = &p.host.Status.Provisioning.Image
case p.host.Spec.Image != nil && p.host.Spec.Image.URL != "":
imageData = p.host.Spec.Image
}
Member


Since this image extraction logic is not trivial, it could be useful to refactor it into a BareMetalHost.GetProvisioningImage()

// credentials, just in case.
p.log.Info("re-registering host")
return p.ValidateManagementAccess(true)
err = provisioner.NeedsRegistration
Member


A log trace could be useful here

Member Author


Returning an error all the way up through Reconcile() (which this does) always results in a log trace.

}

stateMachine := newHostStateMachine(host, r, prov)
stateMachine := newHostStateMachine(host, r, prov, haveCreds)
Member


Nit: just to avoid passing another param to the host machine, couldn't this check be done via the Provisioner interface (verifying that bmcCreds is not empty)?

Member Author


We don't know what type of Provisioner we have, so we don't know where (or even if) that data is stored. Also I don't want to have to look at the contents of the creds (which are not passed as a pointer) to see if they're empty or not.

Member


bmCreds are already passed to the ProvisionerFactory, so a plain wrapper performing the same check in the Provisioner interface could avoid storing it directly in the state machine. But it's a minor point.

@honza
Member

honza commented Dec 3, 2020

Amazing commit messages!

The only thing I'd note is the absence of unit tests.

case nodes.Error, nodes.CleanFail:
if !ironicNode.Maintenance {
p.log.Info("setting host maintenance flag to force image delete")
return p.setMaintenanceFlag(ironicNode, true)
Member


Clean failure may set maintenance mode, you need to clear it here (instead of setting it to true which we used to do for some reason).

Member Author


Thanks, I added another commit to do this.

Member

@dhellmann dhellmann left a comment


This looks good. I appreciate the effort you took to split it into reviewable commits with detailed commit messages.

Now that we have a framework for adding them, I hope we can prioritize adding tests in follow-up PRs.

/lgtm

@@ -849,27 +849,31 @@ func (host *BareMetalHost) OperationMetricForState(operation ProvisioningState)

// GetImageChecksum returns the hash value and its algo.
func (host *BareMetalHost) GetImageChecksum() (string, string, bool) {
Member


I would probably go ahead and remove this method, but that's not a reason to hold up this PR.

Comment on lines +1090 to +1095
if provResult, err := p.startProvisioning(ironicNode, hostConf); err != nil || provResult.Dirty || provResult.ErrorMessage != "" {
return provResult, err
}

return p.changeNodeProvisionState(ironicNode,
nodes.ProvisionStateOpts{Target: nodes.TargetActive})
Member


This part of the change is surprising. When will startProvisioning() return that it is done immediately after a failure to allow line 1094 to be executed? Aren't we likely to need to always reconcile again at least once to give provisioning time to complete? And at that point the node's ProvisionState field won't be DeployFailed, right?

Member Author


This code used to be at the end of startProvisioning() but it has moved here because we only wanted to call it in the DeployFail state and not the Available state (which now calls startProvisioning - it used to be Manageable).

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Dec 3, 2020
@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dhellmann
Member

/hold

I didn't notice the comment from @dtantsur before adding LGTM.

@metal3-io-bot metal3-io-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2020
Clean failure may set maintenance mode, so clear it before attempting to
continue.
@metal3-io-bot metal3-io-bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 3, 2020
@zaneb
Member Author

zaneb commented Dec 3, 2020

/test-integration

@dhellmann
Member

/hold cancel

@metal3-io-bot metal3-io-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2020
@dhellmann
Member

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Dec 3, 2020
@metal3-io-bot metal3-io-bot merged commit e85bcda into metal3-io:master Dec 3, 2020
@maelk
Member

maelk commented Dec 4, 2020

Great to see this work completed! This PR solves quite a few issues for us. Thank you @zaneb for taking care of it.
