Bug 1855821: pkg/controller/render: log actions on machine configs #1921
Conversation
@runcom: This pull request references Bugzilla bug 1855821, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
pkg/daemon/daemon.go (outdated)
Perhaps adding a log in the else branch here would be useful, to help us debug when something has gone wrong and we're falling back to fetching the current config from disk.
Right, good point. I'll wrap the err variables to be more explicit.
Following on @sinnykumari's logging note, I think it would be helpful to also log that we've successfully read the config from disk (since dn.getCurrentConfigOnDisk doesn't log, understandably so). That way, if something was for some reason out of sync, it would be easier to trace.
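A minimal sketch of what the two suggestions combined could look like; the helper name, file path, and return type here are illustrative stand-ins, not the daemon's actual API:

```go
package main

import (
	"log"
	"os"
)

// readCurrentConfigOnDisk stands in for the daemon's getCurrentConfigOnDisk,
// which intentionally does not log; the path is illustrative.
func readCurrentConfigOnDisk(path string) (string, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	cfg, err := readCurrentConfigOnDisk("/etc/machine-config-daemon/currentconfig")
	if err != nil {
		// Wrapping the error makes the fallback-to-disk failure path explicit.
		log.Fatalf("fetching current config from disk: %v", err)
	}
	// Logging the success path too makes an out-of-sync state easier to trace.
	log.Printf("successfully read current config from disk (%d bytes)", len(cfg))
}
```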
pkg/daemon/daemon.go (outdated)
nit: In the future it may make sense to factor this check out into a function, since it happens multiple times.
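For illustration only, the factored-out check could look something like this; the type and helper name are hypothetical stand-ins, not the PR's actual code:

```go
package main

import "fmt"

// machineConfig is a stand-in for the real MachineConfig type.
type machineConfig struct{ Name string }

// configsMatch pulls the repeated nil-safe name comparison into one helper.
func configsMatch(a, b *machineConfig) bool {
	return a != nil && b != nil && a.Name == b.Name
}

func main() {
	onDisk := &machineConfig{Name: "rendered-worker-abc"}
	current := &machineConfig{Name: "rendered-worker-abc"}
	fmt.Println(configsMatch(onDisk, current)) // true
}
```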
ashcrow left a comment:
Minor updates requested by @sinnykumari and me. getCurrentConfig is a good addition, and the extra logging is 👍.
One thing I think we should recommend more often, for MAO-managed clusters, is simply deleting the machine object and letting the MAO reprovision, particularly for workers. In the future hopefully that will all work for the control plane too, and we could even consider doing it automatically. But let's at least keep it in mind as an option. (And in the future we can more cleanly support "reprovision in place" too, with coreos/fedora-coreos-tracker#399.)
Right, I think that would be good too. Does that block this PR, which is meant to ease the reconciliation today?
That comment was just an aside, not blocking this. But now that I've actually looked at this beyond just briefly skimming the PR description: are you really sure of this diagnosis?
/hold
You say:
Nothing GCs machineconfig objects today. Isn't this much more likely to be the "bootstrap rendering vs cluster rendering" drift that has happened a few times? I think the reason we didn't do this before is that it will cause an additional reboot uncoordinated by the MCO; see some related discussion in #926. Debugging this is incredibly painful for sure, and I see the temptation to just automatically reconcile. But I think we need to be very loud if this ever happens, because we really should fix the drift between bootstrap and cluster rendering. And I am not sure we should do this until we've addressed the reboot coordination problem. (I only spent a few minutes looking at this, so I may be wrong here.)
There's a reproducer that I've written here: https://bugzilla.redhat.com/show_bug.cgi?id=1855821#c3, which is the only way I'm aware of that this scenario can happen (from the BZ).
So yeah, the drift symptom is clear, though: in that case the bootstrap MC is different from the cluster MC and from what's on disk. Here, we're in the cluster with no bootstrap in the picture anymore, and that's because the user manually deleted the MC. Also, I don't think this patch makes anything reboot more than it should 🤔 Can you explain that, given we're not in the bootstrap-vs-cluster case? Even in that bad drift scenario, how can this cause a reboot? In the drift scenario, currentOnDisk != currentConfig (or desiredConfig); this case should only cover currentOnDisk == currentConfig, so it reconciles only the cases where somebody deleted an in-use rendered machine config. I think this is just a way out of:
I think I'm missing something, because I can't see how it'll cause a new reboot (bear with me, this is confusing every time 😂).
I want to re-state that this is about somebody manually deleting the rendered config. Manual deletion is the only guess that makes sense (other than what it appears to be from the first must-gather), unless etcd starts having consistency issues, which I hope it doesn't.
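To spell out the case analysis from the exchange above: a sketch of the condition being discussed, where the function and the rendered-config name strings are hypothetical, not the PR's actual code:

```go
package main

import "fmt"

// shouldRecreateDeletedMC sketches the condition discussed above: this path
// only covers the case where disk and cluster agree (currentOnDisk ==
// currentConfig) but the in-use rendered MachineConfig object was deleted.
// The bootstrap-vs-cluster drift case (currentOnDisk matching neither
// currentConfig nor desiredConfig) is deliberately left alone, since
// reconciling it could trigger a reboot uncoordinated by the MCO.
func shouldRecreateDeletedMC(currentOnDisk, currentConfig, desiredConfig string) bool {
	if currentOnDisk != currentConfig && currentOnDisk != desiredConfig {
		return false // drift: be loud, don't auto-reconcile
	}
	return currentOnDisk == currentConfig
}

func main() {
	fmt.Println(shouldRecreateDeletedMC("rendered-a", "rendered-a", "rendered-a")) // true: deleted in-use MC
	fmt.Println(shouldRecreateDeletedMC("rendered-a", "rendered-b", "rendered-b")) // false: drift
}
```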
cgwalters left a comment:
I'd just like to be really really sure this isn't ye olde "bootstrap rendering different" problem. Or I'd like to see strong evidence that somehow MCs are being deleted.
So, looking at the BZ, there's no doubt this isn't the bootstrap rendering being different: the BZ has a cluster up and running, and that was the way to reproduce the behavior in https://bugzilla.redhat.com/show_bug.cgi?id=1855821#c3. Since there is a manual workaround, I think one way to move this forward is to land a fix that logs exactly the lifecycle of any machine config, so we're able to look at that, compare, and debug if it happens again with 4.6. I've dropped the main commit in favor of just the one that does the enhanced logging, and this should be good to go.
It was a pain to support our claim in https://bugzilla.redhat.com/show_bug.cgi?id=1855821 that _somebody/something_ deleted the relevant machine configs. These logs shouldn't be spammy; I can't remember why they were behind log level 4. It makes sense to make them more visible to back up our claims about MC actions. Signed-off-by: Antonio Murdaca <[email protected]>
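As a rough sketch of what "making these logs more visible" means in practice, here is the shape of moving an action log from verbosity level 4 to the default level using glog (the helper, message, and values are illustrative, not the render controller's actual code):

```go
package main

import (
	"flag"

	"github.com/golang/glog"
)

// logMCAction illustrates the change: lifecycle actions on machine configs
// were previously only emitted at -v=4 and are now logged unconditionally.
func logMCAction(action, name string) {
	// Before: glog.V(4).Infof("%s MachineConfig %s", action, name)
	glog.Infof("%s MachineConfig %s", action, name)
}

func main() {
	flag.Parse() // glog registers its flags via init; parse before logging
	logMCAction("deleted", "rendered-worker-abc123")
	glog.Flush()
}
```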
Force-pushed from 23a54a6 to 7f1628a.
@runcom: This pull request references Bugzilla bug 1855821, which is valid. 3 validation(s) were run on this bug
Holding retesting, as this shouldn't impact upgrades.
cgwalters left a comment:
Sure, more logging is good by me. (I think it's possible to dig through the apiserver audit logs for some of this stuff, but TBH I've never done that.)
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cgwalters, runcom.
/retest
@runcom: The following tests failed.
We've closed the underlying bug, as we've determined it is not an actual bug, and we will be having more discussions on the idea of "safety" in the MCO. Correspondingly, I will close this PR, as it is not high priority. If you feel otherwise, feel free to reopen. Thanks!
@runcom: This pull request references Bugzilla bug 1855821. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.