Skip to content

🐛 Skip power off BMH if registration error OR no credentials#1386

Closed
lentzi90 wants to merge 1 commit intometal3-io:mainfrom
Nordix:lentzi90/fix-bmh-deletion
Closed

🐛 Skip power off BMH if registration error OR no credentials#1386
lentzi90 wants to merge 1 commit intometal3-io:mainfrom
Nordix:lentzi90/fix-bmh-deletion

Conversation

@lentzi90
Copy link
Copy Markdown
Member

If a BMH is not registered OR we do not have proper credentials for it, we cannot power off before deleting. Therefore we should skip to deletion in either of these cases, not just when both hold.

What this PR does / why we need it:

If a BMH is not registered OR we do not have proper credentials for it, we cannot power off before deleting. Therefore we should skip to deletion in either of these cases, not just when both hold.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1385

If a BMH is not registered OR we do not have proper credentials for it,
we cannot power off before deleting. Therefore we should skip to
deletion in either of these cases, not just when both hold.
@metal3-io-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign dtantsur for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 10, 2023
@honza
Copy link
Copy Markdown
Member

honza commented Oct 10, 2023

There is already a PR for this, and a good discussion: would you mind having a look? #1356

@lentzi90
Copy link
Copy Markdown
Member Author

/hold
Looks like I'm just duplicating #1356.

@metal3-io-bot metal3-io-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 11, 2023
@lentzi90
Copy link
Copy Markdown
Member Author

/hold cancel
#1356 is merged, but it did not touch this particular line.
I think it may be impossible to get into the specific case that this would fix with the changes in #1356. However, I also think the logic is wrong and we should therefore change it.

@metal3-io-bot metal3-io-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 19, 2023
@kashifest
Copy link
Copy Markdown
Member

@honza @zaneb wdyt about this patch. This looks like a bug to me as well.

@zaneb
Copy link
Copy Markdown
Member

zaneb commented Oct 24, 2023

It was correct before. If we are getting an error that we are not registered and we don't have credentials, then we should ignore the error because it is never getting better.
If we are getting an error and it's not about needing registration, then we report the error.
(And if we are getting an error that we are not registered and we do have credentials, then we need to handle the error so we will requeue and do the registration again.)

/close

@metal3-io-bot
Copy link
Copy Markdown
Contributor

@zaneb: Closed this PR.

Details

In response to this:

It was correct before. If we are getting an error that we are not registered and we don't have credentials, then we should ignore the error because it is never getting better.
If we are getting an error and it's not about needing registration, then we report the error.
(And if we are getting an error that we are not registered and we do have credentials, then we need to handle the error so we will requeue and do the registration again.)

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lentzi90
Copy link
Copy Markdown
Member Author

@zaneb can you please explain to me why we should try to register again if we are already in handlePoweringOffBeforeDelete? It makes no sense to me.

I imagine this chain of events:

  1. BMH and BMC credentials are created
  2. The BMC credentials are bad (but they do exist) or there is something fundamentally wrong with the server
  3. Registration error.
  4. User decides to just start over, so deletes the BMH
  5. The BMH cannot be powered off, cannot be registered and is stuck retrying this until the user provides working credentials or fixes the server so it can be registered.

@zaneb
Copy link
Copy Markdown
Member

zaneb commented Oct 24, 2023

If you look at the code for ReconcileState(), registration happens in ensureRegistered(), before the state handler is called. So the only ways to get to the state handler are if registration succeeds, or we hit the exception for not having any credentials.

If we get into the state handler and the node is not registered in ironic, it means that the ironic DB has been lost between the time we did the registration and the time we tried to handle the current state. The solution to this is to requeue so that next time around ensureRegistered() will re-register the node. We do this by propagating the error, which will cause controller-runtime to requeue.

The only exception to that when we find the node not registered in ironic is if haveCreds is false, in which case we haven't attempted registration and trying again won't help because we won't attempt it next time either. In that case we proceed to deletion. This is the case that the condition on line 588 is checking for.

I think this is confusing because the 'registered' in ensureRegistered() means 'ironic is successfully talking to the BMC', but the 'registered' in ErrNeedsRegistration/NeedsRegistration() means 'a Node object exists in Ironic'. They're linked because the same code does both, but maybe this could have been more understandable. In my defense, everything was much worse before the state machine existed 😄

The case where credentials exist but are not sufficient to get us talking to the BMC is handled by #1356 - we try 3 times (note: each attempt may take a number of reconciles) and if we keep getting actionFailed (note: not actionError, which is a transient error) then we proceed to deletion.

@lentzi90 lentzi90 deleted the lentzi90/fix-bmh-deletion branch October 24, 2023 10:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BMH with good BMC credentials and registration error stuck deleting

5 participants