
Conversation

@runcom runcom commented Feb 24, 2019

The CRC (container runtime config) controller recently added a check to
avoid resyncing and recreating the very same registries config if
nothing has changed on the Image CRD side [1]. While that check is
correct, during an upgrade the controllers need to regenerate the MC
fragments with the controller version they are running. Since we
weren't checking the version of the controller that generated the
registries config, we wrongly assumed the configurations were equal and
never generated a new one with the new controller.
This patch fixes that by adding a version check before skipping
regeneration when the registries config content is unchanged.
Fixes: #487

[1] #461

Signed-off-by: Antonio Murdaca [email protected]
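
For context, a minimal Go sketch of the shape of the check the patch describes; the annotation key follows the MCO's generated-by-controller-version convention, while the import path and helper name are illustrative assumptions, not the actual diff:

```go
// Sketch only: not the exact MCO code. The mcfgv1 import path and the
// helper name are assumptions for illustration.
package containerruntimeconfig

import (
	"reflect"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

const generatedByVersionAnnotation = "machineconfiguration.openshift.io/generated-by-controller-version"

// needsRegeneration reports whether the registries MachineConfig must be
// rebuilt. Before this patch the decision stopped at content equality, so
// an upgraded controller never re-stamped its version annotation.
func needsRegeneration(existing, desired *mcfgv1.MachineConfig, controllerVersion string) bool {
	if existing == nil {
		return true
	}
	// Content changed on the Image CRD side: regenerate as before.
	if !reflect.DeepEqual(existing.Spec, desired.Spec) {
		return true
	}
	// The fix: even with identical content, regenerate when the config
	// was produced by a different (older) controller version.
	return existing.Annotations[generatedByVersionAnnotation] != controllerVersion
}
```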

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 24, 2019
@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 24, 2019
@runcom runcom Feb 24, 2019

I guess we could also just update that annotation to the new version and avoid a whole new generation with the same content, but let's keep it this way for now

A reviewer (Member) commented:

What's the difference here? It looks like we use mc.ObjectMeta.Annotations elsewhere in this code.

@runcom

None, but ObjectMeta is an embedded field, and Go lets you access an embedded field's fields directly on the top-level struct... nothing changes functionality-wise
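
A self-contained toy example of the field promotion being discussed (the type names here are illustrative stand-ins, not the MCO types):

```go
package main

import "fmt"

// ObjectMeta plays the role of metav1.ObjectMeta in this toy example.
type ObjectMeta struct {
	Annotations map[string]string
}

// MachineConfig embeds ObjectMeta, so ObjectMeta's fields are promoted
// to MachineConfig itself.
type MachineConfig struct {
	ObjectMeta
}

func main() {
	mc := MachineConfig{ObjectMeta{Annotations: map[string]string{"version": "new"}}}
	// Both expressions reach the same map; promotion is purely syntactic.
	fmt.Println(mc.Annotations["version"] == mc.ObjectMeta.Annotations["version"]) // true
}
```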

@runcom

I can, of course, change it back... it's just OCD kicking in lol

@runcom

ok. moved this back for consistency...

@cgwalters

This looks sane to me but - today we don't have any tests that cover this. Did you test this by hand?

I think we could add something to e2e-aws-op that builds another release payload from the same code and at least verify that a no-op upgrade works. (Don't have to do it in this PR, just brainstorming)

@runcom runcom commented Feb 24, 2019

> This looks sane to me but - today we don't have any tests that cover this. Did you test this by hand?

I'm testing this right now on a cluster; I'm unlikely to finish the whole round of testing today, but that's the plan.

> I think we could add something to e2e-aws-op that builds another release payload from the same code and at least verify that a no-op upgrade works. (Don't have to do it in this PR, just brainstorming)

Do we have oc around in the CI env? If so, then yeah, let's do that; I'll follow up with something along those lines.

@runcom runcom force-pushed the fix-crc-registries branch from 7bc4f6d to c2ef2c1 on February 24, 2019 at 19:11
@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 24, 2019
@runcom runcom commented Feb 24, 2019

@umohnani8 ptal

@runcom runcom commented Feb 24, 2019

Good to see that CI is now cooperating again...

@cgwalters

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 24, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom


@runcom runcom commented Feb 24, 2019

> /lgtm

just FYI, I'm spinning up two clusters: one to reproduce using a CI payload (broken) and another to upgrade to my own payload built from this PR. The first should fail with #487; the second should hopefully go through (unless other errors, like the daemon SIGTERM issue, kick in).

@openshift-merge-robot openshift-merge-robot merged commit 32f17fd into openshift:master Feb 24, 2019
@runcom runcom deleted the fix-crc-registries branch February 24, 2019 20:22
@runcom runcom commented Feb 24, 2019

oh gosh, this was fast lol. Will report back anyway, I guess.

@runcom runcom commented Feb 24, 2019

as a side note, the bug in the linked issue isn't 100% reproducible in my testing; I upgraded to the very same version as Colin's and did not get the error

@runcom runcom commented Feb 24, 2019

never mind, I was actually using the very same version, so it was a no-op

@runcom runcom commented Feb 24, 2019

I was able to upgrade to my custom payload just fine 👍 it works

@runcom runcom commented Feb 24, 2019

ok, properly tested this one on 2 clusters: the first cluster correctly fails on upgrade w/o this patch, while the second one just goes through and successfully upgrades.

First cluster, MCs broken (notice the registries configs are still at the -dirty version, and notice I got the very same error Colin had):

23:10:40 [github.com/openshift/machine-config-operator] ‹master› oc get machineconfigs                     
NAME                                                        GENERATEDBYCONTROLLER        IGNITIONVERSION   CREATED
00-master                                                   3.11.0-723-g617d35e9         2.2.0             24m
00-master-ssh                                               3.11.0-723-g617d35e9                           24m
00-worker                                                   3.11.0-723-g617d35e9         2.2.0             24m
00-worker-ssh                                               3.11.0-723-g617d35e9                           24m
01-master-container-runtime                                 3.11.0-723-g617d35e9         2.2.0             24m
01-master-kubelet                                           3.11.0-723-g617d35e9         2.2.0             24m
01-worker-container-runtime                                 3.11.0-723-g617d35e9         2.2.0             24m
01-worker-kubelet                                           3.11.0-723-g617d35e9         2.2.0             24m
99-master-8bd634e9-387d-11e9-a758-06a3c1aa817c-registries   3.11.0-723-g617d35e9-dirty                     22m
99-worker-8bd751d1-387d-11e9-a758-06a3c1aa817c-registries   3.11.0-723-g617d35e9-dirty                     22m
master-2cb481673ed0b572d9a8fa9d2612d31c                     3.11.0-723-g617d35e9-dirty   2.2.0             24m
master-31bbbd1cb081eb6182c0f415f887ae99                     3.11.0-723-g617d35e9         2.2.0             5m4s
master-5597f9e68d6a65e7a7ace765139d93d8                     3.11.0-723-g617d35e9-dirty   2.2.0             24m
master-f06730041e1db83b72af95824d4d48b3                     3.11.0-723-g617d35e9         2.2.0             4m56s
worker-05e89433151d256339179b1f5703fd15                     3.11.0-723-g617d35e9         2.2.0             4m56s
worker-db82c7e62e8082acaedf117785f005d6                     3.11.0-723-g617d35e9-dirty   2.2.0             24m
worker-f375d3645f8c66d61c6fed104af20521                     3.11.0-723-g617d35e9         2.2.0             24m

23:24:07 [github.com/openshift/machine-config-operator] ‹master› oc describe clusteroperator machine-config                                                     
...
    Last Transition Time:  2019-02-24T22:23:39Z
    Message:               Failed to resync 3.11.0-723-g617d35e9 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for 99-master-8bd634e9-387d-11e9-a758-06a3c1aa817c-registries expected 3.11.0-723-g617d35e9 has 3.11.0-723-g617d35e9-dirty
    Reason:                timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for 99-master-8bd634e9-387d-11e9-a758-06a3c1aa817c-registries expected 3.11.0-723-g617d35e9 has 3.11.0-723-g617d35e9-dirty
    Status:                True
    Type:                  Failing
...

The second cluster deployed just fine as expected. Notice the MCs for registries got generated by the new controller (without -dirty):

23:38:46 [github.com/openshift/installer] ‹master*› oc get machineconfigs
NAME                                                        GENERATEDBYCONTROLLER        IGNITIONVERSION   CREATED
00-master                                                   3.11.0-725-g32f17fd5         2.2.0             33m
00-master-ssh                                               3.11.0-725-g32f17fd5                           33m
00-worker                                                   3.11.0-725-g32f17fd5         2.2.0             33m
00-worker-ssh                                               3.11.0-725-g32f17fd5                           33m
01-master-container-runtime                                 3.11.0-725-g32f17fd5         2.2.0             33m
01-master-kubelet                                           3.11.0-725-g32f17fd5         2.2.0             33m
01-worker-container-runtime                                 3.11.0-725-g32f17fd5         2.2.0             33m
01-worker-kubelet                                           3.11.0-725-g32f17fd5         2.2.0             33m
99-master-641d0c31-3880-11e9-8413-06a748e98488-registries   3.11.0-725-g32f17fd5                           30m
99-worker-641e58ec-3880-11e9-8413-06a748e98488-registries   3.11.0-725-g32f17fd5                           30m
master-b32b5de1656bb7df110ab66a73820eeb                     3.11.0-723-g617d35e9-dirty   2.2.0             33m
master-fb58b9adb318a1eb88b3a49741853f08                     3.11.0-725-g32f17fd5         2.2.0             72s
worker-029f3350172b1f39c8aeb3c7914c0757                     3.11.0-723-g617d35e9-dirty   2.2.0             33m
worker-146885daab3e2bb4943bd18b927b53bd                     3.11.0-725-g32f17fd5         2.2.0             72s

23:52:24 [github.com/openshift/installer] ‹master*› oc get clusterversion version
NAME      VERSION                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   0.0.1-2019-02-24-222941   True        False         1s      Cluster version is 0.0.1-2019-02-24-222941

And the cluster finishes the upgrade normally.

@runcom runcom commented Feb 24, 2019

@cgwalters I completely agree our CI needs smoke upgrade testing (be it with oc adm upgrade in our e2e, or an official CI job that openshift provides) to catch silly regressions like this one. Ideally, a basic smoke test would 1) upgrade to the same version of the operator and 2) upgrade to a different version. These two scenarios should work every time. A sketch of what such a check could assert is below.
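
A hedged Go sketch of the assertion such a smoke test could make once an upgrade settles; how the MachineConfig list is fetched is deliberately left out, and the annotation key mirrors the MCO convention rather than quoting existing test code:

```go
package e2e

import (
	"testing"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

const generatedByVersionAnnotation = "machineconfiguration.openshift.io/generated-by-controller-version"

// assertMachineConfigsAtVersion fails the test if any MachineConfig still
// carries a stale generated-by annotation after the upgrade, which is
// exactly the symptom this PR fixed.
func assertMachineConfigsAtVersion(t *testing.T, mcs []mcfgv1.MachineConfig, want string) {
	for _, mc := range mcs {
		if got := mc.Annotations[generatedByVersionAnnotation]; got != want {
			t.Errorf("MachineConfig %s: generated by %q, want %q", mc.Name, got, want)
		}
	}
}
```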
