Skip to content

handle externally provisioned hosts without image settings#609

Closed
dhellmann wants to merge 2 commits intometal3-io:masterfrom
dhellmann:externally-provisioned-without-image
Closed

handle externally provisioned hosts without image settings#609
dhellmann wants to merge 2 commits intometal3-io:masterfrom
dhellmann:externally-provisioned-without-image

Conversation

@dhellmann
Copy link
Copy Markdown
Member

Update the IronicProvisioner to deal with externally provisioned hosts that
have no image settings by recording an error message of our own.

When the host does have image settings, ensure they are sent to Ironic
before telling it to adopt the host, in case they were not present when the
host was registered (this is similar to what we do for provisioning).

Addresses #608

@metal3-io-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 29, 2020
@dhellmann dhellmann force-pushed the externally-provisioned-without-image branch from 145a01a to 9740000 Compare July 29, 2020 19:30
hsm.NextState = metal3v1alpha1.StateRegistering
return actionComplete{}
}
if hsm.Host.Spec.Image != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If credentials are wrong, and image exists, this code will constantly retry registration, isn't it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. We have another PR up to introduce similar retry behavior for other failures.

@stbenjam
Copy link
Copy Markdown
Member

/test govet

@dhellmann dhellmann force-pushed the externally-provisioned-without-image branch from 9740000 to 47df824 Compare July 29, 2020 19:51
@stbenjam
Copy link
Copy Markdown
Member

/test-integration

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
Update the IronicProvisioner to deal with externally provisioned hosts
that have no image settings by recording an error message of our
own instead of leaking the Ironic error.

When the host does have image settings, ensure they are sent to Ironic
before telling it to adopt the host, in case they were not present
when the host was registered (this is similar to what we do for
provisioning).

Addresses metal3-io#608

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
@dhellmann dhellmann force-pushed the externally-provisioned-without-image branch from 47df824 to 5652bb3 Compare July 30, 2020 15:43
@dhellmann
Copy link
Copy Markdown
Member Author

rebased to resolve merge conflict

@dhellmann
Copy link
Copy Markdown
Member Author

/test-integration

1 similar comment
@dhellmann
Copy link
Copy Markdown
Member Author

/test-integration

@dhellmann
Copy link
Copy Markdown
Member Author

/cc @zaneb

@metal3-io-bot metal3-io-bot requested a review from zaneb July 31, 2020 15:51
if hsm.Host.Spec.Image != nil {
info.log.Info("Image is set; will retry registration")
hsm.NextState = metal3v1alpha1.StateRegistering
return actionComplete{}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requeues with no delay. Since handleRegistration also requeues with no delay after a failure, this means if the credentials are wrong we will be constantly cycling the status in a tight loop.

I think we need to merge #610 instead of returning actionComplete here. With that patch if we continue to return actionFailed then we will retry with an appropriate backoff.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.

// error message here avoids exposing the error message
// from Ironic that talks about fields in Ironic with
// names the user may not recognize.
result.ErrorMessage = "Image details missing for externally provisioned server."
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually an error? Can't we consider getting to the Manageable state a success? It means we got past Verifying, which is the important thing.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't go any further without the image. So unless we add a new state to the BMH state machine for hosts without images, we have to do something here to keep it from progressing and ending up stuck in a failure when adoption doesn't work.

Maybe we do need another state?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're in the ExternallyProvisioned state... what further stuff do we have to do? Getting to Manageable means we can control the power, doesn't it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we have to adopt the host (see line 921).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just noticed that we allow the Host to freely switch between the ExternallyProvisioned and Ready states without deprovisioning it first. That's likely a bug.

We have to adopt the host if we are in the Provisioned state. And we might want to adopt the host if we know the image so that Host can transition directly from ExternallyProvisioned->Provisioned without cleaning/inspection (although we don't support that today... we just go from ExternallyProvisioned->Ready without cleaning lol). But I'm not aware of any reason now or in the future that we would have to adopt an externally-provisioned host when we haven't been told what image (if any) is expected to be running on it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just noticed that we allow the Host to freely switch between the ExternallyProvisioned and Ready states without deprovisioning it first. That's likely a bug.

Fun. Maybe? I could go either way.

We have to adopt the host if we are in the Provisioned state. And we might want to adopt the host if we know the image so that Host can transition directly from ExternallyProvisioned->Provisioned without cleaning/inspection (although we don't support that today... we just go from ExternallyProvisioned->Ready without cleaning lol). But I'm not aware of any reason now or in the future that we would have to adopt an externally-provisioned host when we haven't been told what image (if any) is expected to be running on it.

According to @dtantsur or @juliakreger when I was writing that code, Ironic only monitors the power state of hosts that are adopted. Maybe I'm mis-remembering that, though? In any case, adoption today requires the image, and that's why this fix is preventing adoption until the image is set.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to @dtantsur or @juliakreger when I was writing that code, Ironic only monitors the power state of hosts that are adopted.

Ah, I hadn't heard that. It's possible - adopting moves it to the active state (i.e. provisioned), so that shouldn't be required per se, but it may be we need to get to the available state before we could manage the power. If you don't have an image, the way to get there is via cleaning, which we don't want to do on an externally provisioned host (until it goes back to Ready!).

If that's the case, then I can see why we would treat this as an error, since it means we can't change the power state while externally provisioned.

It's tempting to just pass some bogus image, but there's no way to replace it with the real one if we get it later without dropping the Ironic DB.

@dhellmann
Copy link
Copy Markdown
Member Author

/hold

We think #610 may fix this issue, and it will certainly change what we need to do in this PR.

@metal3-io-bot metal3-io-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2020
@stbenjam
Copy link
Copy Markdown
Member

@dhellmann #610 has merged, what needs to happen with this PR?

@dhellmann
Copy link
Copy Markdown
Member Author

@dhellmann #610 has merged, what needs to happen with this PR?

I need to rework it. I want to merge #650 before doing anything else with this repo, though.

@metal3-io-bot
Copy link
Copy Markdown
Contributor

@dhellmann: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
gomod 5652bb3 link /test gomod
generate 5652bb3 link /test generate
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dhellmann dhellmann closed this Nov 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants