
Conversation

@runcom runcom commented Feb 24, 2019

The CRC (container runtime config) controller recently added a check to
avoid resyncing and recreating the very same registries config if
nothing has changed on the Image CRD side [1]. While that check is
correct, during an upgrade the controllers need to regenerate the MC
fragments with the controller version they are running. Since we
weren't checking the version of the controller that generated the
registries config, we wrongly assumed the configurations were equal and
never generated a new one with the new controller.
This patch fixes that by adding a version check before skipping
regeneration when the registries config content is unchanged.
Fixes: #487

[1] #461

Signed-off-by: Antonio Murdaca [email protected]
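
For context, a minimal Go sketch of the shape of the check the patch describes; the annotation key follows the MCO's generated-by-controller-version convention, while the import path and helper name are illustrative assumptions, not the actual diff:

```go
// Sketch only: not the exact MCO code. The mcfgv1 import path and the
// helper name are assumptions for illustration.
package containerruntimeconfig

import (
	"reflect"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

const generatedByVersionAnnotation = "machineconfiguration.openshift.io/generated-by-controller-version"

// needsRegeneration reports whether the registries MachineConfig must be
// rebuilt. Before this patch the decision stopped at content equality, so
// an upgraded controller never re-stamped its version annotation.
func needsRegeneration(existing, desired *mcfgv1.MachineConfig, controllerVersion string) bool {
	if existing == nil {
		return true
	}
	// Content changed on the Image CRD side: regenerate as before.
	if !reflect.DeepEqual(existing.Spec, desired.Spec) {
		return true
	}
	// The fix: even with identical content, regenerate when the config
	// was produced by a different (older) controller version.
	return existing.Annotations[generatedByVersionAnnotation] != controllerVersion
}
```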

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 24, 2019
@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 24, 2019
@runcom runcom Feb 24, 2019

I guess we could also just update that annotation to the new version and avoid a whole new generation with the same content, but let's keep it this way for now

A reviewer (Member) commented:

What's the difference here? It looks like we use mc.ObjectMeta.Annotations elsewhere in this code.

@runcom

None, but ObjectMeta is an embedded field, and Go lets you access an embedded field's fields directly on the top-level struct... nothing changes functionality-wise
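
A self-contained toy example of the field promotion being discussed (the type names here are illustrative stand-ins, not the MCO types):

```go
package main

import "fmt"

// ObjectMeta plays the role of metav1.ObjectMeta in this toy example.
type ObjectMeta struct {
	Annotations map[string]string
}

// MachineConfig embeds ObjectMeta, so ObjectMeta's fields are promoted
// to MachineConfig itself.
type MachineConfig struct {
	ObjectMeta
}

func main() {
	mc := MachineConfig{ObjectMeta{Annotations: map[string]string{"version": "new"}}}
	// Both expressions reach the same map; promotion is purely syntactic.
	fmt.Println(mc.Annotations["version"] == mc.ObjectMeta.Annotations["version"]) // true
}
```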

@runcom

I can, of course, change it back... it's just OCD kicking in lol

@runcom

ok. moved this back for consistency...

@cgwalters

This looks sane to me but - today we don't have any tests that cover this. Did you test this by hand?

I think we could add something to e2e-aws-op that builds another release payload from the same code and at least verify that a no-op upgrade works. (Don't have to do it in this PR, just brainstorming)

@runcom runcom commented Feb 24, 2019

> This looks sane to me but - today we don't have any tests that cover this. Did you test this by hand?

I'm testing this right now on a cluster; I'm unlikely to finish the whole round of testing today, but that's the plan.

> I think we could add something to e2e-aws-op that builds another release payload from the same code and at least verify that a no-op upgrade works. (Don't have to do it in this PR, just brainstorming)

Do we have oc around in the CI env? If so, then yeah, let's do that; I'll follow up with something along those lines.

@runcom runcom force-pushed the fix-crc-registries branch from 7bc4f6d to c2ef2c1 on February 24, 2019 at 19:11
@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 24, 2019
@runcom runcom commented Feb 24, 2019

@umohnani8 ptal

@runcom runcom commented Feb 24, 2019

Good to see that CI is now cooperating again...

@cgwalters

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 24, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom


@runcom runcom commented Feb 24, 2019

> /lgtm

just FYI, I'm spinning up two clusters: one to reproduce using a CI payload (broken) and another to upgrade to my own payload built from this PR. The first should fail with #487; the second should hopefully go through (unless other errors, like the daemon SIGTERM issue, kick in).

@openshift-merge-robot openshift-merge-robot merged commit 32f17fd into openshift:master Feb 24, 2019
@runcom runcom deleted the fix-crc-registries branch February 24, 2019 20:22
@runcom runcom commented Feb 24, 2019

oh gosh, this was fast lol. Will report back anyway, I guess.

@runcom runcom commented Feb 24, 2019

as a side note, the bug in the linked issue isn't 100% reproducible in my testing; I upgraded to the very same version as Colin's and did not get the error

@runcom runcom commented Feb 24, 2019

never mind, I was actually using the very same version, so it was a no-op

@runcom runcom commented Feb 24, 2019

I was able to upgrade to my custom payload just fine 👍 it works

@runcom runcom commented Feb 24, 2019

ok, properly tested this one on 2 clusters: the first cluster correctly fails on upgrade w/o this patch, while the second one just goes through and successfully upgrades.

First cluster, MCs broken (notice the registries configs are still at the -dirty version, and notice I got the very same error Colin had):

23:10:40 [github.com/openshift/machine-config-operator] ‹master› oc get machineconfigs                     
NAME                                                        GENERATEDBYCONTROLLER        IGNITIONVERSION   CREATED
00-master                                                   3.11.0-723-g617d35e9         2.2.0             24m
00-master-ssh                                               3.11.0-723-g617d35e9                           24m
00-worker                                                   3.11.0-723-g617d35e9         2.2.0             24m
00-worker-ssh                                               3.11.0-723-g617d35e9                           24m
01-master-container-runtime                                 3.11.0-723-g617d35e9         2.2.0             24m
01-master-kubelet                                           3.11.0-723-g617d35e9         2.2.0             24m
01-worker-container-runtime                                 3.11.0-723-g617d35e9         2.2.0             24m
01-worker-kubelet                                           3.11.0-723-g617d35e9         2.2.0             24m
99-master-8bd634e9-387d-11e9-a758-06a3c1aa817c-registries   3.11.0-723-g617d35e9-dirty                     22m
99-worker-8bd751d1-387d-11e9-a758-06a3c1aa817c-registries   3.11.0-723-g617d35e9-dirty                     22m
master-2cb481673ed0b572d9a8fa9d2612d31c                     3.11.0-723-g617d35e9-dirty   2.2.0             24m
master-31bbbd1cb081eb6182c0f415f887ae99                     3.11.0-723-g617d35e9         2.2.0             5m4s
master-5597f9e68d6a65e7a7ace765139d93d8                     3.11.0-723-g617d35e9-dirty   2.2.0             24m
master-f06730041e1db83b72af95824d4d48b3                     3.11.0-723-g617d35e9         2.2.0             4m56s
worker-05e89433151d256339179b1f5703fd15                     3.11.0-723-g617d35e9         2.2.0             4m56s
worker-db82c7e62e8082acaedf117785f005d6                     3.11.0-723-g617d35e9-dirty   2.2.0             24m
worker-f375d3645f8c66d61c6fed104af20521                     3.11.0-723-g617d35e9         2.2.0             24m

23:24:07 [github.com/openshift/machine-config-operator] ‹master› oc describe clusteroperator machine-config                                                     
...
    Last Transition Time:  2019-02-24T22:23:39Z
    Message:               Failed to resync 3.11.0-723-g617d35e9 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for 99-master-8bd634e9-387d-11e9-a758-06a3c1aa817c-registries expected 3.11.0-723-g617d35e9 has 3.11.0-723-g617d35e9-dirty
    Reason:                timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for 99-master-8bd634e9-387d-11e9-a758-06a3c1aa817c-registries expected 3.11.0-723-g617d35e9 has 3.11.0-723-g617d35e9-dirty
    Status:                True
    Type:                  Failing
...

The second cluster deployed just fine as expected. Notice the MCs for registries got generated by the new controller (without -dirty):

23:38:46 [github.com/openshift/installer] ‹master*› oc get machineconfigs
NAME                                                        GENERATEDBYCONTROLLER        IGNITIONVERSION   CREATED
00-master                                                   3.11.0-725-g32f17fd5         2.2.0             33m
00-master-ssh                                               3.11.0-725-g32f17fd5                           33m
00-worker                                                   3.11.0-725-g32f17fd5         2.2.0             33m
00-worker-ssh                                               3.11.0-725-g32f17fd5                           33m
01-master-container-runtime                                 3.11.0-725-g32f17fd5         2.2.0             33m
01-master-kubelet                                           3.11.0-725-g32f17fd5         2.2.0             33m
01-worker-container-runtime                                 3.11.0-725-g32f17fd5         2.2.0             33m
01-worker-kubelet                                           3.11.0-725-g32f17fd5         2.2.0             33m
99-master-641d0c31-3880-11e9-8413-06a748e98488-registries   3.11.0-725-g32f17fd5                           30m
99-worker-641e58ec-3880-11e9-8413-06a748e98488-registries   3.11.0-725-g32f17fd5                           30m
master-b32b5de1656bb7df110ab66a73820eeb                     3.11.0-723-g617d35e9-dirty   2.2.0             33m
master-fb58b9adb318a1eb88b3a49741853f08                     3.11.0-725-g32f17fd5         2.2.0             72s
worker-029f3350172b1f39c8aeb3c7914c0757                     3.11.0-723-g617d35e9-dirty   2.2.0             33m
worker-146885daab3e2bb4943bd18b927b53bd                     3.11.0-725-g32f17fd5         2.2.0             72s

23:52:24 [github.com/openshift/installer] ‹master*› oc get clusterversion version
NAME      VERSION                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   0.0.1-2019-02-24-222941   True        False         1s      Cluster version is 0.0.1-2019-02-24-222941

And the cluster finishes the upgrade normally.

@runcom runcom commented Feb 24, 2019

@cgwalters I completely agree our CI needs smoke upgrade testing (be it with oc adm upgrade in our e2e, or an official CI job that openshift provides) to catch silly regressions like this one. Ideally, a basic smoke test would 1) upgrade to the same version of the operator and 2) upgrade to a different version. These two scenarios should work every time. A sketch of what such a check could assert is below.
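
A hedged Go sketch of the assertion such a smoke test could make once an upgrade settles; how the MachineConfig list is fetched is deliberately left out, and the annotation key mirrors the MCO convention rather than quoting existing test code:

```go
package e2e

import (
	"testing"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

const generatedByVersionAnnotation = "machineconfiguration.openshift.io/generated-by-controller-version"

// assertMachineConfigsAtVersion fails the test if any MachineConfig still
// carries a stale generated-by annotation after the upgrade, which is
// exactly the symptom this PR fixed.
func assertMachineConfigsAtVersion(t *testing.T, mcs []mcfgv1.MachineConfig, want string) {
	for _, mc := range mcs {
		if got := mc.Annotations[generatedByVersionAnnotation]; got != want {
			t.Errorf("MachineConfig %s: generated by %q, want %q", mc.Name, got, want)
		}
	}
}
```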
