
Conversation

@runcom (Member) commented Jul 11, 2020

It was a pain to support our claim here https://bugzilla.redhat.com/show_bug.cgi?id=1855821
regarding the fact that somebody/something deleted the relevant machine configs.
These logs shouldn't be spammy; I can't remember why they were behind log level 4.
It makes sense to make them more visible to back up our claims on MC actions.

Signed-off-by: Antonio Murdaca [email protected]
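
As a rough illustration of the intent (not the actual pkg/controller/render diff), here is a minimal, self-contained Go sketch of what moving these logs out from behind verbosity 4 looks like with a glog-style logger; the machineConfig type and handler names below are made up for the example:

```go
package main

import (
	"flag"

	"github.com/golang/glog"
)

// machineConfig is a toy stand-in for the real MachineConfig type
// (machineconfiguration.openshift.io/v1); it exists only for this sketch.
type machineConfig struct{ Name string }

// Before: lifecycle events were logged behind verbosity 4, so they only
// showed up when the component ran with -v=4 or higher.
func onMachineConfigAddOld(mc machineConfig) {
	glog.V(4).Infof("MachineConfig %s added", mc.Name)
}

// After: the same events log at the default level, so "who created or deleted
// this rendered config, and when" is visible in ordinary logs and must-gathers.
func onMachineConfigAddNew(mc machineConfig) {
	glog.Infof("MachineConfig %s added", mc.Name)
}

func main() {
	flag.Parse() // glog reads -v and -logtostderr from the flag set
	mc := machineConfig{Name: "rendered-worker-0123abcd"}
	onMachineConfigAddOld(mc) // suppressed unless running with -v=4
	onMachineConfigAddNew(mc) // always logged
	glog.Flush()
}
```

With glog, V(4).Infof output only appears when the process runs with -v=4 or higher, which is why these events were effectively invisible in ordinary logs and must-gathers.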

@openshift-ci-robot added the bugzilla/severity-high (Referenced Bugzilla bug's severity is high for the branch this PR is targeting) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) labels on Jul 11, 2020.
@openshift-ci-robot (Contributor):

@runcom: This pull request references Bugzilla bug 1855821, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1855821: pkg/daemon: fallback to currentConfig on disk

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label on Jul 11, 2020.
Review comment (Contributor):

Perhaps adding a log here in the else branch would be useful, to help us debug whatever may have gone wrong that caused us to fetch the currentconfig from disk.

Review comment (Member Author @runcom):

right, good point, I'll wrap the err variables to be more explicit

Review comment (Member):

Following on @sinnykumari's logging note, I think it would be helpful to log that we've successfully gotten the config from disk (since dn.getCurrentConfigOnDisk doesn't log, understandably so). That way, if something was out of sync for some reason, it would be easier to trace.

Review comment (Member @ashcrow, Jul 13, 2020):

nit: In the future it may make sense to functionize this check as it happens multiple times.

@ashcrow (Member) left a review comment:

Minor updates requested by @sinnykumari and myself. getCurrentConfig is a good addition and the extra logging is 👍.
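
To make the review suggestions above concrete, here is a hedged sketch of what a getCurrentConfig helper with the requested logging and wrapped errors could look like; the currentConfigPath constant, the fetchFromCluster callback, and the trimmed machineConfig type are assumptions for illustration, not the MCO's actual daemon API:

```go
// Package daemon: sketch only; not the MCO's actual pkg/daemon code.
package daemon

import (
	"encoding/json"
	"fmt"
	"os"

	"github.com/golang/glog"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// currentConfigPath is assumed for this sketch; the daemon keeps a copy of the
// config it last applied somewhere on the node.
const currentConfigPath = "/etc/machine-config-daemon/currentconfig"

// machineConfig is a minimal stand-in for the real MachineConfig type.
type machineConfig struct {
	Name string `json:"name"`
}

// fetchFromCluster stands in for a client-go Get against the cluster API.
type fetchFromCluster func(name string) (*machineConfig, error)

// getCurrentConfig returns the named MachineConfig from the cluster and, if it
// has been deleted (NotFound), falls back to the copy on disk. Both branches
// log and wrap errors explicitly so the fallback is easy to trace later.
func getCurrentConfig(name string, fromCluster fetchFromCluster) (*machineConfig, error) {
	mc, err := fromCluster(name)
	if err == nil {
		return mc, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, fmt.Errorf("getting MachineConfig %s from the cluster: %w", name, err)
	}
	// Nothing in the MCO garbage-collects rendered configs, so a NotFound here
	// means something else deleted it; be loud about the fallback.
	glog.Warningf("MachineConfig %s not found in the cluster, falling back to the on-disk copy: %v", name, err)
	raw, err := os.ReadFile(currentConfigPath)
	if err != nil {
		return nil, fmt.Errorf("reading current config from %s: %w", currentConfigPath, err)
	}
	var onDisk machineConfig
	if err := json.Unmarshal(raw, &onDisk); err != nil {
		return nil, fmt.Errorf("parsing current config from %s: %w", currentConfigPath, err)
	}
	glog.Infof("successfully loaded current config %s from disk", onDisk.Name)
	return &onDisk, nil
}
```

The explicit Warningf/Infof pair covers both review asks: the fallback path is loud, and the successful read from disk leaves a trace too.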

@cgwalters (Member) commented:

otherwise, we're stuck.

One thing I think we should recommend more often is - for MAO managed clusters, simply deleting the machine object and letting the MAO reprovision. Particularly for workers. In the future hopefully that will all work for the control plane too.

We could even consider doing that automatically. But let's just at least keep it in mind as an option. (And in the future we can more cleanly support "reprovision in place" too with coreos/fedora-coreos-tracker#399 )

@runcom (Member Author) commented Jul 13, 2020

One thing I think we should recommend more often is - for MAO managed clusters, simply deleting the machine object and letting the MAO reprovision. Particularly for workers. In the future hopefully that will all work for the control plane too.

right, I think that would be good too - does that block this PR, which is meant to ease the reconciliation today?

@cgwalters (Member) commented:

That comment was just an aside, not blocking this.

But now that I did actually look at this beyond just skimming the PR description briefly - are you really sure of this diagnosis?

/hold
for a reply to this (but feel free to lift if you disagree).

You say:

... it means that MC isn't in the cluster anymore and we know MCO doesn't delete it. It also means by the time you've got the must-gather, something else happened in the cluster (likely a machine config cleanup...and that's why the MC isn't found anymore).

Nothing GCs machineconfig objects today - isn't this much more likely to be "bootstrap rendering vs cluster" drift that's happened a few times?

I think the reason we didn't do this before is it will cause an additional reboot uncoordinated by the MCO. See some related discussion in #926

Debugging this is incredibly painful for sure, and I see the temptation to just automatically reconcile. I think we need to be very loud if this ever happens because we really should fix the drift between bootstrap and cluster rendering.

And also I am not sure we should do this until we've addressed the reboot coordination problem.

(I only spent a few minutes on looking at this, so I may be wrong here)

@openshift-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command) label on Jul 13, 2020.
@runcom (Member Author) commented Jul 13, 2020

But now that I did actually look at this beyond just skimming the PR description briefly - are you really sure of this diagnosis?

There's a reproducer here that I've written: https://bugzilla.redhat.com/show_bug.cgi?id=1855821#c3, which is the only way I'm aware of that this scenario can happen (from the BZ).
Also, I'm pretty sure we're not even in the bootstrap vs cluster drift case, since, based on the BZ, we're already in-cluster.

Nothing GCs machineconfig objects today - isn't this much more likely to be "bootstrap rendering vs cluster" drift that's happened a few times?

So yeah, the drift symptom is clear though: the bootstrap MC is different from the cluster MC and from what's on disk. Here, we're in the cluster, no bootstrap anymore, and the MC is gone because the user manually deleted it - bootstrap isn't in the picture at all.

Also, I don't think this patch is making anything reboot more than it should 🤔 can you explain that, given we're not in the bootstrap vs cluster case? Even in that bad drift scenario, how can this cause a reboot? In the drift scenario currentOnDisk != currentConfig|desiredConfig; this case should only cover currentOnDisk == currentConfig, so that we only reconcile the case where somebody deletes an in-use rendered machine config (sketched just after this comment).

I think this is just a way out of:

grab the current config when it can't be found in the cluster because somebody manually deleted it since the MCO doesn't GC them

I think I'm missing something cause I can't see how it'll cause a new reboot (bear with me, this is confusing every time 😂 )
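
A small sketch of the guard being described above, under the assumption that the daemon can compare the on-disk rendered config name with the node's current/desired config; all identifiers here are illustrative, not the MCO's actual code:

```go
// Sketch only; names and the nodeState shape are illustrative.
package daemon

import "github.com/golang/glog"

// nodeState gathers the three config names the discussion above refers to.
type nodeState struct {
	CurrentOnDisk string // rendered config name the daemon last wrote to disk
	CurrentConfig string // the node's currentConfig annotation
	DesiredConfig string // the node's desiredConfig annotation
}

// canReconcileFromDisk reports whether it is safe to reuse the on-disk config
// when the in-use rendered MachineConfig is missing from the cluster: only the
// currentOnDisk == currentConfig case, so nothing new gets applied and no
// reboot is triggered. The bootstrap-vs-cluster drift shape (currentOnDisk
// differing from currentConfig/desiredConfig) is rejected loudly instead.
func canReconcileFromDisk(s nodeState) bool {
	if s.CurrentOnDisk != s.CurrentConfig {
		glog.Errorf("on-disk config %s does not match currentConfig %s (desiredConfig %s); refusing to reconcile from disk",
			s.CurrentOnDisk, s.CurrentConfig, s.DesiredConfig)
		return false
	}
	glog.Infof("on-disk config %s matches currentConfig; reconciling from disk without a reboot", s.CurrentOnDisk)
	return true
}
```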

@runcom (Member Author) commented Jul 13, 2020

Nothing GCs machineconfig objects today

I wanna re-state that this is about somebody manually deleting the rendered config. Manual deletion is the only guess that makes sense (other than what it appears to be from the first must-gather), unless etcd starts having consistency issues, which I hope it doesn't.

@cgwalters (Member) left a review comment:

I'd just like to be really really sure this isn't ye olde "bootstrap rendering different" problem. Or I'd like to see strong evidence that somehow MCs are being deleted.

@runcom (Member Author) commented Sep 8, 2020

I'd just like to be really really sure this isn't ye olde "bootstrap rendering different" problem. Or I'd like to see strong evidence that somehow MCs are being deleted.

So, looking at the BZ, there's no doubt this isn't bootstrap rendering being different. The BZ has a cluster up and running, and that was the way to reproduce the behavior in https://bugzilla.redhat.com/show_bug.cgi?id=1855821#c3

I think one way to move this forward, since there should be a manual workaround, is to land a fix that logs exactly the lifecycle of any machine config, so we're able to look at that, compare, and debug if it happens again with 4.6.

I've dropped the main commit in favor of just the one that does the enhanced logging, and this should be good to go.

It was a pain to support our claim here https://bugzilla.redhat.com/show_bug.cgi?id=1855821
regarding the fact that _somebody/something_ deleted the relevant machine configs.
These logs shouldn't be spammy; I can't remember why they were behind log level 4.
It makes sense to make them more visible to back up our claims on MC actions.

Signed-off-by: Antonio Murdaca <[email protected]>
@runcom changed the title from "Bug 1855821: pkg/daemon: fallback to currentConfig on disk" to "Bug 1855821: pkg/controller/render: log actions on machine configs" on Sep 8, 2020.
@openshift-ci-robot (Contributor):

@runcom: This pull request references Bugzilla bug 1855821, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1855821: pkg/controller/render: log actions on machine configs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom (Member Author) commented Sep 8, 2020

Holding off on retesting, as this shouldn't impact upgrades.

@cgwalters (Member) left a review comment:

Sure, more logging is good by me. (I think it's possible to dig through the apiserver audit logs for some of this stuff, but TBH I've never done that)

@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sinnykumari (Contributor) commented:

/retest

@openshift-ci-robot (Contributor):

@runcom: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-gcp-upgrade | 7f1628a | link | /test e2e-gcp-upgrade |
| ci/prow/e2e-ovn-step-registry | 7f1628a | link | /test e2e-ovn-step-registry |
| ci/prow/e2e-upgrade | 7f1628a | link | /test e2e-upgrade |
| ci/prow/okd-e2e-aws | 7f1628a | link | /test okd-e2e-aws |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot (Contributor):

@runcom: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-serial | 7f1628a | link | /test e2e-aws-serial |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yuqi-zhang (Contributor) commented:

We've closed the underlying bug as we've determined it is not an actual bug, and we will be having more discussions on the idea of "safety" in the MCO. Correspondingly I will close this PR as it is not high priority. If you feel otherwise feel free to reopen. Thanks!

@yuqi-zhang closed this on Dec 8, 2020.
@openshift-ci-robot (Contributor):

@runcom: This pull request references Bugzilla bug 1855821. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.


In response to this:

Bug 1855821: pkg/controller/render: log actions on machine configs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • bugzilla/severity-high: Referenced Bugzilla bug's severity is high for the branch this PR is targeting.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • team-mco

8 participants