
Conversation

@runcom
Member

@runcom runcom commented Feb 11, 2019

Based on:

- https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions
- a Slack conversation with Clayton

We should report Progressing only when the operator is really progressing towards something. Re-syncing, reconciling, and drift detection (verifying everything is still where you left it) do not change anything, and thus are not progressing.
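
To make the intent concrete, here is a minimal Go sketch (illustrative only; the function and version names are placeholders, not the operator's actual status code): Progressing flips to true only when the desired version differs from the one already applied, while a plain re-sync of the same version leaves it false.

```go
// Minimal sketch, not the operator's real code: Progressing should only be
// reported when we are moving towards something new.
package main

import "fmt"

// isProgressing is a hypothetical helper: a re-sync/reconcile of the version
// we already applied is not progress.
func isProgressing(appliedVersion, desiredVersion string) bool {
	return appliedVersion != desiredVersion
}

func main() {
	fmt.Println(isProgressing("4.0.0", "4.0.0")) // false: just reconciling
	fmt.Println(isProgressing("4.0.0", "4.0.1")) // true: actually upgrading
}
```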

closes #346

/cc @smarterclayton @abhinavdahiya ptal

Signed-off-by: Antonio Murdaca <runcom@linux.com>

@openshift-ci-robot
Contributor

@runcom: GitHub didn't allow me to request PR reviews from the following users: ptal.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.


In response to this:

/cc @smarterclayton @abhinavdahiya ptal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 11, 2019
@runcom
Member Author

runcom commented Feb 11, 2019

I'm not entirely sure how to test this out on a running cluster: disabling the CVO isn't going to work, and setting the operator Deployment to unmanaged isn't going to work either. Any ideas?

@cgwalters
Member

If you want to test things like this cleanly it basically requires building a custom payload and passing it to the installer.

@runcom
Member Author

runcom commented Feb 11, 2019

If you want to test things like this cleanly it basically requires building a custom payload and passing it to the installer.

it's probably time I start doing this

Contributor

Why this drop?

Member Author

setting Available doesn't necessarily clear Progressing (taken from the docs at https://github.com/openshift/installer/blob/master/docs/user/overview.md):

Failing is true with a detailed message Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml because the resource type NetworkConfig has not been installed on the server.
Available is true with message Cluster has deployed 4.0.0
Progressing is true with message Unable to apply 4.0.1: a required object is missing

It also looks like something other operators are doing (Available can still be true while Failing and Progressing are also true).
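
As a rough illustration of that point (local placeholder types, not the real openshift/api ones), here is a sketch of how Available can be set without touching Progressing or Failing:

```go
// Illustrative sketch with local types: writing the Available condition does
// not implicitly clear Progressing or Failing; each condition is set on its own.
package main

import "fmt"

type condition struct {
	Type    string
	Status  bool
	Message string
}

// setCondition replaces (or appends) one condition and leaves the rest untouched.
func setCondition(conds []condition, c condition) []condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	conds := []condition{
		{Type: "Failing", Status: true, Message: "Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml"},
		{Type: "Progressing", Status: true, Message: "Unable to apply 4.0.1: a required object is missing"},
	}
	// Available is set to true; Failing and Progressing stay exactly as they were.
	conds = setCondition(conds, condition{Type: "Available", Status: true, Message: "Cluster has deployed 4.0.0"})
	fmt.Printf("%+v\n", conds)
}
```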

Member Author

The other scenarios I've seen in other operators set each Condition independently of the other conditions as well (which makes sense, I believe).

Contributor

@abhinavdahiya abhinavdahiya Feb 11, 2019

setting Available doesn't necessarily clear Progressing (taken from the docs at https://github.com/openshift/installer/blob/master/docs/user/overview.md):

Failing is true with a detailed message Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml because the resource type NetworkConfig has not been installed on the server.
Available is true with message Cluster has deployed 4.0.0
Progressing is true with message Unable to apply 4.0.1: a required object is missing

It also looks like something other operators are doing (Available can still be true while Failing and Progressing are also true).

You are quoting an example where the operator is progressing or failing. When the operator is available, from https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions:

    Failing is false with no message
    Available is true with message Cluster has deployed 4.0.1
    Progressing is false with message Cluster version is 4.0.1

You cannot be failing, progressing, and available for the same version.
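
For comparison, a small sketch of the steady state described in that doc (again with placeholder types and a hypothetical helper, not the actual CVO/MCO code): once a version is fully reconciled, Failing and Progressing are false and only Available is true for that same version.

```go
// Hypothetical helper illustrating the converged state for a fully applied
// version: Failing=false, Available=true, Progressing=false.
package main

import "fmt"

type condition struct {
	Type    string
	Status  bool
	Message string
}

func steadyStateConditions(version string) []condition {
	return []condition{
		{Type: "Failing", Status: false},
		{Type: "Available", Status: true, Message: "Cluster has deployed " + version},
		{Type: "Progressing", Status: false, Message: "Cluster version is " + version},
	}
}

func main() {
	fmt.Printf("%+v\n", steadyStateConditions("4.0.1"))
}
```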

Member Author

I should check the other Conditions too, I guess, right?

Member Author

updated

Member Author

@runcom runcom Feb 11, 2019

Anyway, the status you quoted can already be converged to even without my latest additions (though they're indeed needed). Progressing and Failing are set before Available anyway, so by the time we're at Available we know where we are (I think).


@runcom runcom force-pushed the correct-sync-operator branch from 20a3b47 to b363f85 Compare February 11, 2019 23:17
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 11, 2019
@runcom
Member Author

runcom commented Feb 12, 2019

/retest

1 similar comment
@runcom
Member Author

runcom commented Feb 12, 2019

/retest

Member

Why this change? The intention was to stay in "initialization" stage until we'd done a successful sync.

(Not saying it's wrong, I just want to understand why you're making the change)

Member Author

just golang linting: the previous if branch ends with a return, so the else branch is superfluous
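
For reference, this is the shape of the lint in question (a generic example, not the actual diff): when the if branch ends in a return, golint flags the following else as superfluous, and dropping it does not change behavior.

```go
// Generic example of the "superfluous else" lint: the else block can be
// dropped because the if branch already returns.
package main

import "fmt"

func describe(bringup bool) string {
	if bringup {
		return "initializing"
	}
	// This line previously sat inside an `else { ... }` block; since the
	// branch above returns, the else adds nothing.
	return "reconciling"
}

func main() {
	fmt.Println(describe(true), describe(false))
}
```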

Member

One thing I'm wondering here... we want progressing=true when optr.inClusterBringup too, right?
IOW this code is handling the upgrade case, but we also want to be progressing during the initial install.
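
Something along these lines, perhaps (a rough sketch with placeholder names; inClusterBringup is the only field taken from the comment above, the rest is hypothetical): the initial install counts as progressing too, not just the version-change case.

```go
// Rough sketch: treat initial cluster bringup as progressing, in addition to
// the "moving to a different version" case. Field and method names are
// illustrative, not the operator's real ones.
package main

import "fmt"

type operator struct {
	inClusterBringup bool
}

func (o *operator) isProgressing(appliedVersion, desiredVersion string) bool {
	return o.inClusterBringup || appliedVersion != desiredVersion
}

func main() {
	o := &operator{inClusterBringup: true}
	fmt.Println(o.isProgressing("4.0.1", "4.0.1")) // true: still installing
	o.inClusterBringup = false
	fmt.Println(o.isProgressing("4.0.1", "4.0.1")) // false: steady state
}
```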

Member Author

good call, I'm gonna work that out

Member

@cgwalters cgwalters Feb 14, 2019

One thing I've noticed related to this, though, is that we seem to stay in inClusterBringup for multiple iterations. Though maybe that's only on failing PRs? Glancing at the logs in a recently merged PR, things look fine.

Member Author

mmm, what do you mean by "stay in inClusterBringup for multiple iterations"?

Member

If the sync fails, you'll see that log multiple times. But I think that's the right thing?

Member Author

oh yeah, it's correct we keep re-syncing and logging that

@runcom runcom force-pushed the correct-sync-operator branch 2 times, most recently from 143c1ac to 6d57092 Compare February 13, 2019 23:07
@runcom
Member Author

runcom commented Feb 13, 2019

This is good to review and test now. I was able to test it out using a custom payload built using #421 (comment), and I can no longer see the statuses bouncing.

@runcom
Member Author

runcom commented Feb 13, 2019

@abhinavdahiya could you also take another look at this now?

@runcom
Member Author

runcom commented Feb 14, 2019

/retest

Based on:

- https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions
- Slack conversation with Clayton

We should report Progressing only when really progressing towards
something. Re-syncing, reconciling, and drift detection (verifying everything
is still where you left it) do not _change_ anything, and thus are not progressing.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
@runcom runcom force-pushed the correct-sync-operator branch from 6d57092 to b8fdc27 Compare February 14, 2019 16:08
@cgwalters
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom
Member Author

runcom commented Feb 14, 2019

/retest

@runcom
Member Author

runcom commented Feb 15, 2019

/retest

4 similar comments
@runcom
Member Author

runcom commented Feb 15, 2019

/retest

@runcom
Member Author

runcom commented Feb 15, 2019

/retest

@runcom
Member Author

runcom commented Feb 15, 2019

/retest

@runcom
Member Author

runcom commented Feb 15, 2019

/retest

@openshift-merge-robot openshift-merge-robot merged commit 813142b into openshift:master Feb 15, 2019
@runcom runcom deleted the correct-sync-operator branch February 15, 2019 16:00
@abhinavdahiya
Contributor

@runcom @cgwalters
With this PR, if MCO ever fails it will never go available again, because of
https://github.com/openshift/machine-config-operator/pull/406/files#diff-bee1f8f36240cb95db10244fdf335146R37

example:
https://gubernator.k8s.io/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/4246
MCO is reporting failing...
but if you look at https://storage.cloud.google.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/4246/artifacts/release-e2e-aws/pods/openshift-machine-config-operator_machine-config-operator-7888cd6-c8wp4_machine-config-operator.log.gz

MCO loses the connection to the apiserver intermittently when reconciling, and then reports machine-config as failing.

Once MCO has reported failing, even on a later successful reconcile, calling syncAvailableStatus here will still report failing because of that if condition I mentioned above.
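
In pseudocode terms, the failure mode looks roughly like this (simplified, with hypothetical names; the real guard is the diff line linked above): if the available path bails out whenever Failing was previously set, a later successful sync can never clear it, whereas the fix is to reset Failing on success.

```go
// Simplified sketch of the bug: once failing has been set, the buggy sync
// path never recovers; the fixed path clears Failing on a successful sync.
package main

import "fmt"

type status struct {
	failing   bool
	available bool
}

// Buggy shape: early-returns while Failing is set, so it never goes available again.
func syncAvailableStatusBuggy(s *status) {
	if s.failing {
		return
	}
	s.available = true
}

// Fixed shape: a successful sync clears Failing and marks the operator available.
func syncAvailableStatusFixed(s *status) {
	s.failing = false
	s.available = true
}

func main() {
	buggy := &status{failing: true}
	syncAvailableStatusBuggy(buggy)
	fmt.Printf("buggy: %+v\n", *buggy) // stays failing, never available

	fixed := &status{failing: true}
	syncAvailableStatusFixed(fixed)
	fmt.Printf("fixed: %+v\n", *fixed) // recovers
}
```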

Please look into this.

@runcom
Member Author

runcom commented Feb 16, 2019

Thanks @abhinavdahiya, I'm fixing that (I was also thinking about making all of this simpler, but later).

runcom added a commit to runcom/machine-config-operator that referenced this pull request Feb 16, 2019
Fix a bug reported in openshift#406 (comment)
where we weren't properly resetting the failing status.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
@runcom
Member Author

runcom commented Feb 16, 2019

@abhinavdahiya please take a look at #442

runcom added a commit to runcom/machine-config-operator that referenced this pull request Feb 16, 2019
Fix a bug reported in openshift#406 (comment)
where we weren't properly resetting the failing status.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
runcom added a commit to runcom/machine-config-operator that referenced this pull request Feb 16, 2019
Specifically, add a test to verify that we're clearing out the failing
condition on subsequent sync operations (covering a bug reported here
openshift#406 (comment))

Signed-off-by: Antonio Murdaca <runcom@linux.com>
runcom added a commit to runcom/machine-config-operator that referenced this pull request Feb 17, 2019
Fix a bug reported in openshift#406 (comment)
where we weren't properly resetting the failing status.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
runcom added a commit to runcom/machine-config-operator that referenced this pull request Feb 17, 2019
Specifically, add a test to verify that we're clearing out the failing
condition on subsequent sync operations (covering a bug reported here
openshift#406 (comment))

Signed-off-by: Antonio Murdaca <runcom@linux.com>
Successfully merging this pull request may close these issues: MCO Cluster Operator statuses frequently changing.