
Conversation

@runcom
Member

@runcom runcom commented Feb 11, 2019

Based on:

- https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions
- a Slack conversation with Clayton

We should report Progressing only when the operator is really progressing towards something. Re-syncing, reconciling, and drift detection (verifying everything is still where you left it) do not change anything, and thus are not progressing.
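
To make the intent concrete, here is a minimal Go sketch (illustrative only; the function and version names are placeholders, not the operator's actual status code): Progressing flips to true only when the desired version differs from the one already applied, while a plain re-sync of the same version leaves it false.

```go
// Minimal sketch, not the operator's real code: Progressing should only be
// reported when we are moving towards something new.
package main

import "fmt"

// isProgressing is a hypothetical helper: a re-sync/reconcile of the version
// we already applied is not progress.
func isProgressing(appliedVersion, desiredVersion string) bool {
	return appliedVersion != desiredVersion
}

func main() {
	fmt.Println(isProgressing("4.0.0", "4.0.0")) // false: just reconciling
	fmt.Println(isProgressing("4.0.0", "4.0.1")) // true: actually upgrading
}
```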

closes #346

/cc @smarterclayton @abhinavdahiya ptal

Signed-off-by: Antonio Murdaca <runcom@linux.com>

@openshift-ci-robot
Contributor

@runcom: GitHub didn't allow me to request PR reviews from the following users: ptal.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.


In response to this:

/cc @smarterclayton @abhinavdahiya ptal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 11, 2019
@runcom
Member Author

runcom commented Feb 11, 2019

I'm not entirely sure how to test this out on a running cluster: disabling the CVO isn't going to work, and setting the operator Deployment to unmanaged isn't going to work either. Any ideas?

@cgwalters
Member

If you want to test things like this cleanly it basically requires building a custom payload and passing it to the installer.

@runcom
Member Author

runcom commented Feb 11, 2019

If you want to test things like this cleanly it basically requires building a custom payload and passing it to the installer.

it's probably time I start doing this

Contributor

Why this drop?

Member Author

setting Available doesn't necessarily clear Progressing (taken from the docs at https://github.com/openshift/installer/blob/master/docs/user/overview.md):

Failing is true with a detailed message Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml because the resource type NetworkConfig has not been installed on the server.
Available is true with message Cluster has deployed 4.0.0
Progressing is true with message Unable to apply 4.0.1: a required object is missing

It also looks like something other operators are doing (Available can still be true while Failing and Progressing are also true).
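
As a rough illustration of that point (local placeholder types, not the real openshift/api ones), here is a sketch of how Available can be set without touching Progressing or Failing:

```go
// Illustrative sketch with local types: writing the Available condition does
// not implicitly clear Progressing or Failing; each condition is set on its own.
package main

import "fmt"

type condition struct {
	Type    string
	Status  bool
	Message string
}

// setCondition replaces (or appends) one condition and leaves the rest untouched.
func setCondition(conds []condition, c condition) []condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	conds := []condition{
		{Type: "Failing", Status: true, Message: "Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml"},
		{Type: "Progressing", Status: true, Message: "Unable to apply 4.0.1: a required object is missing"},
	}
	// Available is set to true; Failing and Progressing stay exactly as they were.
	conds = setCondition(conds, condition{Type: "Available", Status: true, Message: "Cluster has deployed 4.0.0"})
	fmt.Printf("%+v\n", conds)
}
```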

Member Author

The other scenarios I've seen in other operators set each Condition independently of the other conditions as well (which makes sense, I believe).

Contributor

@abhinavdahiya abhinavdahiya Feb 11, 2019

setting Available doesn't necessarily clear Progressing (taken from the docs at https://github.com/openshift/installer/blob/master/docs/user/overview.md):

Failing is true with a detailed message Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml because the resource type NetworkConfig has not been installed on the server.
Available is true with message Cluster has deployed 4.0.0
Progressing is true with message Unable to apply 4.0.1: a required object is missing

It also looks like something other operators are doing (Available can still be true while Failing and Progressing are also true).

You are quoting an example where the operator is progressing or failing. When the operator is available, from https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions:

    Failing is false with no message
    Available is true with message Cluster has deployed 4.0.1
    Progressing is false with message Cluster version is 4.0.1

You cannot be failing, progressing, and available for the same version.
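
For comparison, a small sketch of the steady state described in that doc (again with placeholder types and a hypothetical helper, not the actual CVO/MCO code): once a version is fully reconciled, Failing and Progressing are false and only Available is true for that same version.

```go
// Hypothetical helper illustrating the converged state for a fully applied
// version: Failing=false, Available=true, Progressing=false.
package main

import "fmt"

type condition struct {
	Type    string
	Status  bool
	Message string
}

func steadyStateConditions(version string) []condition {
	return []condition{
		{Type: "Failing", Status: false},
		{Type: "Available", Status: true, Message: "Cluster has deployed " + version},
		{Type: "Progressing", Status: false, Message: "Cluster version is " + version},
	}
}

func main() {
	fmt.Printf("%+v\n", steadyStateConditions("4.0.1"))
}
```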

Member Author

I should check the other Conditions too, I guess, right?

Member Author

updated

Member Author

@runcom runcom Feb 11, 2019

Anyway, the status you quoted can already be converged to even without my latest additions (though they're indeed needed). Progressing and Failing are set before Available anyway, so by the time we're at Available we know where we are (I think).


@runcom runcom force-pushed the correct-sync-operator branch from 20a3b47 to b363f85 Compare February 11, 2019 23:17
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 11, 2019
@runcom
Member Author

runcom commented Feb 12, 2019

/retest

1 similar comment
@runcom
Member Author

runcom commented Feb 12, 2019

/retest

Member

Why this change? The intention was to stay in "initialization" stage until we'd done a successful sync.

(Not saying it's wrong, I just want to understand why you're making the change)

Member Author

just golang linting: the previous if branch ends with a return, so the else branch is superfluous
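
For reference, this is the shape of the lint in question (a generic example, not the actual diff): when the if branch ends in a return, golint flags the following else as superfluous, and dropping it does not change behavior.

```go
// Generic example of the "superfluous else" lint: the else block can be
// dropped because the if branch already returns.
package main

import "fmt"

func describe(bringup bool) string {
	if bringup {
		return "initializing"
	}
	// This line previously sat inside an `else { ... }` block; since the
	// branch above returns, the else adds nothing.
	return "reconciling"
}

func main() {
	fmt.Println(describe(true), describe(false))
}
```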

Member

One thing I'm wondering here... we want progressing=true when optr.inClusterBringup too, right?
IOW this code is handling the upgrade case, but we also want to be progressing during the initial install.
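
Something along these lines, perhaps (a rough sketch with placeholder names; inClusterBringup is the only field taken from the comment above, the rest is hypothetical): the initial install counts as progressing too, not just the version-change case.

```go
// Rough sketch: treat initial cluster bringup as progressing, in addition to
// the "moving to a different version" case. Field and method names are
// illustrative, not the operator's real ones.
package main

import "fmt"

type operator struct {
	inClusterBringup bool
}

func (o *operator) isProgressing(appliedVersion, desiredVersion string) bool {
	return o.inClusterBringup || appliedVersion != desiredVersion
}

func main() {
	o := &operator{inClusterBringup: true}
	fmt.Println(o.isProgressing("4.0.1", "4.0.1")) // true: still installing
	o.inClusterBringup = false
	fmt.Println(o.isProgressing("4.0.1", "4.0.1")) // false: steady state
}
```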

Member Author

good call, I'm gonna work that out

Member

@cgwalters cgwalters Feb 14, 2019

One thing I've noticed related to this, though, is that we seem to stay in inClusterBringup for multiple iterations. Though maybe that's only on failing PRs? Glancing at the logs in a recently merged PR, things look fine.

Member Author

mmm, what do you mean by "stay in inClusterBringup for multiple iterations"?

Member

If the sync fails, you'll see that log multiple times. But I think that's the right thing?

Member Author

oh yeah, it's correct we keep re-syncing and logging that

@runcom runcom force-pushed the correct-sync-operator branch 2 times, most recently from 143c1ac to 6d57092 Compare February 13, 2019 23:07
@runcom
Member Author

runcom commented Feb 13, 2019

This is good to review and test now. I was able to test it out using a custom payload built using #421 (comment), and I can no longer see the statuses bouncing.

@runcom
Member Author

runcom commented Feb 13, 2019

@abhinavdahiya could you also take another look at this now?

@runcom
Member Author

runcom commented Feb 14, 2019

/retest

Based on:

- https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions
- Slack conversation with Clayton

We should report Progressing only when really progressing towards
something. Re-syncing, reconciling, and drift detection (verifying everything
is still where you left it) do not _change_ anything, and thus are not progressing.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
@runcom runcom force-pushed the correct-sync-operator branch from 6d57092 to b8fdc27 Compare February 14, 2019 16:08
@cgwalters
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom
Member Author

runcom commented Feb 14, 2019

/retest

@runcom
Member Author

runcom commented Feb 15, 2019

/retest

4 similar comments
@runcom
Member Author

runcom commented Feb 15, 2019

/retest

@runcom
Member Author

runcom commented Feb 15, 2019

/retest

@runcom
Member Author

runcom commented Feb 15, 2019

/retest

@runcom
Member Author

runcom commented Feb 15, 2019

/retest

@openshift-merge-robot openshift-merge-robot merged commit 813142b into openshift:master Feb 15, 2019
@runcom runcom deleted the correct-sync-operator branch February 15, 2019 16:00
@abhinavdahiya
Contributor

@runcom @cgwalters
With this PR, if MCO ever fails it will never go available again, because of
https://github.com/openshift/machine-config-operator/pull/406/files#diff-bee1f8f36240cb95db10244fdf335146R37

example:
https://gubernator.k8s.io/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/4246
MCO is reporting failing...
but if you look at https://storage.cloud.google.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/4246/artifacts/release-e2e-aws/pods/openshift-machine-config-operator_machine-config-operator-7888cd6-c8wp4_machine-config-operator.log.gz

MCO loses the connection to the apiserver intermittently when reconciling, and then reports machine-config as failing.

Once MCO has reported failing, even on a later successful reconcile, calling syncAvailableStatus here will still report failing because of that if condition I mentioned above.
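
In pseudocode terms, the failure mode looks roughly like this (simplified, with hypothetical names; the real guard is the diff line linked above): if the available path bails out whenever Failing was previously set, a later successful sync can never clear it, whereas the fix is to reset Failing on success.

```go
// Simplified sketch of the bug: once failing has been set, the buggy sync
// path never recovers; the fixed path clears Failing on a successful sync.
package main

import "fmt"

type status struct {
	failing   bool
	available bool
}

// Buggy shape: early-returns while Failing is set, so it never goes available again.
func syncAvailableStatusBuggy(s *status) {
	if s.failing {
		return
	}
	s.available = true
}

// Fixed shape: a successful sync clears Failing and marks the operator available.
func syncAvailableStatusFixed(s *status) {
	s.failing = false
	s.available = true
}

func main() {
	buggy := &status{failing: true}
	syncAvailableStatusBuggy(buggy)
	fmt.Printf("buggy: %+v\n", *buggy) // stays failing, never available

	fixed := &status{failing: true}
	syncAvailableStatusFixed(fixed)
	fmt.Printf("fixed: %+v\n", *fixed) // recovers
}
```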

Please look into this.

@runcom
Member Author

runcom commented Feb 16, 2019

Thanks @abhinavdahiya, I'm fixing that (I was also thinking about making all of this simpler, but later).

runcom added a commit to runcom/machine-config-operator that referenced this pull request Feb 16, 2019
Fix a bug reported in openshift#406 (comment)
where we weren't properly resetting the failing status.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
@runcom
Member Author

runcom commented Feb 16, 2019

@abhinavdahiya please take a look at #442

runcom added a commit to runcom/machine-config-operator that referenced this pull request Feb 16, 2019
Fix a bug reported in openshift#406 (comment)
where we weren't properly resetting the failing status.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
runcom added a commit to runcom/machine-config-operator that referenced this pull request Feb 16, 2019
Specifically, add a test to verify that we're clearing out the failing
condition on subsequent sync operations (covering a bug reported here
openshift#406 (comment))

Signed-off-by: Antonio Murdaca <runcom@linux.com>
runcom added a commit to runcom/machine-config-operator that referenced this pull request Feb 17, 2019
Fix a bug reported in openshift#406 (comment)
where we weren't properly resetting the failing status.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
runcom added a commit to runcom/machine-config-operator that referenced this pull request Feb 17, 2019
Specifically, add a test to verify that we're clearing out the failing
condition on subsequent sync operations (covering a bug reported here
openshift#406 (comment))

Signed-off-by: Antonio Murdaca <runcom@linux.com>
Successfully merging this pull request may close these issues: MCO Cluster Operator statuses frequently changing.