
Conversation

@runcom (Member) commented Jul 11, 2020

It was a pain to support our claim here https://bugzilla.redhat.com/show_bug.cgi?id=1855821
regarding the fact that somebody/something deleted the relevant machine configs.
These logs shouldn't be spammy; I can't remember why they were behind log level 4.
It makes sense to make them more visible to back up our claims on MC actions.

Signed-off-by: Antonio Murdaca [email protected]
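
As a rough illustration of the intent (not the actual pkg/controller/render diff), here is a minimal, self-contained Go sketch of what moving these logs out from behind verbosity 4 looks like with a glog-style logger; the machineConfig type and handler names below are made up for the example:

```go
package main

import (
	"flag"

	"github.com/golang/glog"
)

// machineConfig is a toy stand-in for the real MachineConfig type
// (machineconfiguration.openshift.io/v1); it exists only for this sketch.
type machineConfig struct{ Name string }

// Before: lifecycle events were logged behind verbosity 4, so they only
// showed up when the component ran with -v=4 or higher.
func onMachineConfigAddOld(mc machineConfig) {
	glog.V(4).Infof("MachineConfig %s added", mc.Name)
}

// After: the same events log at the default level, so "who created or deleted
// this rendered config, and when" is visible in ordinary logs and must-gathers.
func onMachineConfigAddNew(mc machineConfig) {
	glog.Infof("MachineConfig %s added", mc.Name)
}

func main() {
	flag.Parse() // glog reads -v and -logtostderr from the flag set
	mc := machineConfig{Name: "rendered-worker-0123abcd"}
	onMachineConfigAddOld(mc) // suppressed unless running with -v=4
	onMachineConfigAddNew(mc) // always logged
	glog.Flush()
}
```

With glog, V(4).Infof output only appears when the process runs with -v=4 or higher, which is why these events were effectively invisible in ordinary logs and must-gathers.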

@openshift-ci-robot added the bugzilla/severity-high (Referenced Bugzilla bug's severity is high for the branch this PR is targeting) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) labels on Jul 11, 2020.
@openshift-ci-robot (Contributor):

@runcom: This pull request references Bugzilla bug 1855821, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1855821: pkg/daemon: fallback to currentConfig on disk

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label on Jul 11, 2020.
Review comment (Contributor):

Perhaps adding a log here in the else branch would be useful, to help us debug whatever may have gone wrong that caused us to fetch the currentconfig from disk.

Review comment (Member Author @runcom):

right, good point, I'll wrap the err variables to be more explicit

Review comment (Member):

Following on @sinnykumari's logging note, I think it would be helpful to log that we've successfully gotten the config from disk (since dn.getCurrentConfigOnDisk doesn't log, understandably so). That way, if something was out of sync for some reason, it would be easier to trace.

Review comment (Member @ashcrow, Jul 13, 2020):

nit: In the future it may make sense to functionize this check as it happens multiple times.

@ashcrow (Member) left a review comment:

Minor updates requested by @sinnykumari and myself. getCurrentConfig is a good addition and the extra logging is 👍.
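
To make the review suggestions above concrete, here is a hedged sketch of what a getCurrentConfig helper with the requested logging and wrapped errors could look like; the currentConfigPath constant, the fetchFromCluster callback, and the trimmed machineConfig type are assumptions for illustration, not the MCO's actual daemon API:

```go
// Package daemon: sketch only; not the MCO's actual pkg/daemon code.
package daemon

import (
	"encoding/json"
	"fmt"
	"os"

	"github.com/golang/glog"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// currentConfigPath is assumed for this sketch; the daemon keeps a copy of the
// config it last applied somewhere on the node.
const currentConfigPath = "/etc/machine-config-daemon/currentconfig"

// machineConfig is a minimal stand-in for the real MachineConfig type.
type machineConfig struct {
	Name string `json:"name"`
}

// fetchFromCluster stands in for a client-go Get against the cluster API.
type fetchFromCluster func(name string) (*machineConfig, error)

// getCurrentConfig returns the named MachineConfig from the cluster and, if it
// has been deleted (NotFound), falls back to the copy on disk. Both branches
// log and wrap errors explicitly so the fallback is easy to trace later.
func getCurrentConfig(name string, fromCluster fetchFromCluster) (*machineConfig, error) {
	mc, err := fromCluster(name)
	if err == nil {
		return mc, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, fmt.Errorf("getting MachineConfig %s from the cluster: %w", name, err)
	}
	// Nothing in the MCO garbage-collects rendered configs, so a NotFound here
	// means something else deleted it; be loud about the fallback.
	glog.Warningf("MachineConfig %s not found in the cluster, falling back to the on-disk copy: %v", name, err)
	raw, err := os.ReadFile(currentConfigPath)
	if err != nil {
		return nil, fmt.Errorf("reading current config from %s: %w", currentConfigPath, err)
	}
	var onDisk machineConfig
	if err := json.Unmarshal(raw, &onDisk); err != nil {
		return nil, fmt.Errorf("parsing current config from %s: %w", currentConfigPath, err)
	}
	glog.Infof("successfully loaded current config %s from disk", onDisk.Name)
	return &onDisk, nil
}
```

The explicit Warningf/Infof pair covers both review asks: the fallback path is loud, and the successful read from disk leaves a trace too.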

@cgwalters (Member) commented:

otherwise, we're stuck.

One thing I think we should recommend more often is - for MAO managed clusters, simply deleting the machine object and letting the MAO reprovision. Particularly for workers. In the future hopefully that will all work for the control plane too.

We could even consider doing that automatically. But let's just at least keep it in mind as an option. (And in the future we can more cleanly support "reprovision in place" too with coreos/fedora-coreos-tracker#399 )

@runcom (Member Author) commented Jul 13, 2020

One thing I think we should recommend more often is - for MAO managed clusters, simply deleting the machine object and letting the MAO reprovision. Particularly for workers. In the future hopefully that will all work for the control plane too.

right, I think that would be good too - does that block this PR, which is meant to ease the reconciliation today?

@cgwalters (Member) commented:

That comment was just an aside, not blocking this.

But now that I did actually look at this beyond just skimming the PR description briefly - are you really sure of this diagnosis?

/hold
for a reply to this (but feel free to lift if you disagree).

You say:

... it means that MC isn't in the cluster anymore and we know MCO doesn't delete it. It also means by the time you've got the must-gather, something else happened in the cluster (likely a machine config cleanup...and that's why the MC isn't found anymore).

Nothing GCs machineconfig objects today - isn't this much more likely to be "bootstrap rendering vs cluster" drift that's happened a few times?

I think the reason we didn't do this before is it will cause an additional reboot uncoordinated by the MCO. See some related discussion in #926

Debugging this is incredibly painful for sure, and I see the temptation to just automatically reconcile. I think we need to be very loud if this ever happens because we really should fix the drift between bootstrap and cluster rendering.

And also I am not sure we should do this until we've addressed the reboot coordination problem.

(I only spent a few minutes on looking at this, so I may be wrong here)

@openshift-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command) label on Jul 13, 2020.
@runcom (Member Author) commented Jul 13, 2020

But now that I did actually look at this beyond just skimming the PR description briefly - are you really sure of this diagnosis?

There's a reproducer here that I've written: https://bugzilla.redhat.com/show_bug.cgi?id=1855821#c3, which is the only way I'm aware of that this scenario can happen (from the BZ).
Also, I'm pretty sure we're not even in the bootstrap vs cluster drift case, since, based on the BZ, we're already in-cluster.

Nothing GCs machineconfig objects today - isn't this much more likely to be "bootstrap rendering vs cluster" drift that's happened a few times?

So yeah, the drift symptom is clear though: the bootstrap MC is different from the cluster MC and from what's on disk. Here, we're in the cluster, no bootstrap anymore, and the MC is gone because the user manually deleted it - bootstrap isn't in the picture at all.

Also, I don't think this patch is making anything reboot more than it should 🤔 can you explain that, given we're not in the bootstrap vs cluster case? Even in that bad drift scenario, how can this cause a reboot? In the drift scenario currentOnDisk != currentConfig|desiredConfig; this case should only cover currentOnDisk == currentConfig, so that we only reconcile the case where somebody deletes an in-use rendered machine config (sketched just after this comment).

I think this is just a way out of:

grab the current config when it can't be found in the cluster because somebody manually deleted it since the MCO doesn't GC them

I think I'm missing something cause I can't see how it'll cause a new reboot (bear with me, this is confusing every time 😂 )
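
A small sketch of the guard being described above, under the assumption that the daemon can compare the on-disk rendered config name with the node's current/desired config; all identifiers here are illustrative, not the MCO's actual code:

```go
// Sketch only; names and the nodeState shape are illustrative.
package daemon

import "github.com/golang/glog"

// nodeState gathers the three config names the discussion above refers to.
type nodeState struct {
	CurrentOnDisk string // rendered config name the daemon last wrote to disk
	CurrentConfig string // the node's currentConfig annotation
	DesiredConfig string // the node's desiredConfig annotation
}

// canReconcileFromDisk reports whether it is safe to reuse the on-disk config
// when the in-use rendered MachineConfig is missing from the cluster: only the
// currentOnDisk == currentConfig case, so nothing new gets applied and no
// reboot is triggered. The bootstrap-vs-cluster drift shape (currentOnDisk
// differing from currentConfig/desiredConfig) is rejected loudly instead.
func canReconcileFromDisk(s nodeState) bool {
	if s.CurrentOnDisk != s.CurrentConfig {
		glog.Errorf("on-disk config %s does not match currentConfig %s (desiredConfig %s); refusing to reconcile from disk",
			s.CurrentOnDisk, s.CurrentConfig, s.DesiredConfig)
		return false
	}
	glog.Infof("on-disk config %s matches currentConfig; reconciling from disk without a reboot", s.CurrentOnDisk)
	return true
}
```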

@runcom (Member Author) commented Jul 13, 2020

Nothing GCs machineconfig objects today

I wanna re-state that this is about somebody manually deleting the rendered config. Manual deletion is the only guess that makes sense (other than what it appears to be from the first must-gather), unless etcd starts having consistency issues, which I hope it doesn't.

@cgwalters (Member) left a review comment:

I'd just like to be really really sure this isn't ye olde "bootstrap rendering different" problem. Or I'd like to see strong evidence that somehow MCs are being deleted.

@runcom (Member Author) commented Sep 8, 2020

I'd just like to be really really sure this isn't ye olde "bootstrap rendering different" problem. Or I'd like to see strong evidence that somehow MCs are being deleted.

So, looking at the BZ, there's no doubt this isn't bootstrap rendering being different. The BZ has a cluster up and running, and that was the way to reproduce the behavior in https://bugzilla.redhat.com/show_bug.cgi?id=1855821#c3

I think one way to move this forward, since there should be a manual workaround, is to land a fix that logs exactly the lifecycle of any machine config, so we're able to look at that, compare, and debug if it happens again with 4.6.

I've dropped the main commit in favor of just the one that does the enhanced logging, and this should be good to go.

It was a pain to support our claim here https://bugzilla.redhat.com/show_bug.cgi?id=1855821
regarding the fact that _somebody/something_ deleted the relevant machine configs.
These logs shouldn't be spammy; I can't remember why they were behind log level 4.
It makes sense to make them more visible to back up our claims on MC actions.

Signed-off-by: Antonio Murdaca <[email protected]>
@runcom changed the title from "Bug 1855821: pkg/daemon: fallback to currentConfig on disk" to "Bug 1855821: pkg/controller/render: log actions on machine configs" on Sep 8, 2020.
@openshift-ci-robot (Contributor):

@runcom: This pull request references Bugzilla bug 1855821, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1855821: pkg/controller/render: log actions on machine configs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom (Member Author) commented Sep 8, 2020

Holding off on retesting, as this shouldn't impact upgrades.

@cgwalters (Member) left a review comment:

Sure, more logging is good by me. (I think it's possible to dig through the apiserver audit logs for some of this stuff, but TBH I've never done that)

@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sinnykumari (Contributor) commented:

/retest

@openshift-ci-robot (Contributor):

@runcom: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-gcp-upgrade | 7f1628a | link | /test e2e-gcp-upgrade |
| ci/prow/e2e-ovn-step-registry | 7f1628a | link | /test e2e-ovn-step-registry |
| ci/prow/e2e-upgrade | 7f1628a | link | /test e2e-upgrade |
| ci/prow/okd-e2e-aws | 7f1628a | link | /test okd-e2e-aws |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot (Contributor):

@runcom: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-serial | 7f1628a | link | /test e2e-aws-serial |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yuqi-zhang (Contributor) commented:

We've closed the underlying bug as we've determined it is not an actual bug, and we will be having more discussions on the idea of "safety" in the MCO. Correspondingly I will close this PR as it is not high priority. If you feel otherwise feel free to reopen. Thanks!

@yuqi-zhang closed this on Dec 8, 2020.
@openshift-ci-robot (Contributor):

@runcom: This pull request references Bugzilla bug 1855821. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.


In response to this:

Bug 1855821: pkg/controller/render: log actions on machine configs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • bugzilla/severity-high: Referenced Bugzilla bug's severity is high for the branch this PR is targeting.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • team-mco

8 participants