pkg/operator: correctly sync status for the CVO #406
Conversation
@runcom: GitHub didn't allow me to request PR reviews from the following users: ptal. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Not entirely sure how to test this out on a running cluster: disabling the CVO isn't going to work, and setting the operator Deployment to unmanaged isn't going to work either. Any ideas?
If you want to test things like this cleanly it basically requires building a custom payload and passing it to the installer. |
it's probably time I start doing this |
pkg/operator/status.go
Outdated
Why this drop?
Setting Available doesn't necessarily clear Progressing (taken from the docs: https://github.com/openshift/installer/blob/master/docs/user/overview.md):
Failing is true with a detailed message: Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml because the resource type NetworkConfig has not been installed on the server.
Available is true with message: Cluster has deployed 4.0.0
Progressing is true with message: Unable to apply 4.0.1: a required object is missing
It also looks like something other operators are doing (Available can still be true while Failing and Progressing are also true).
The other operators I've seen set each Condition independently of the other conditions as well (which makes sense, I believe).
setting Available doesn't necessarily clear progressing (taken from docs https://github.com/openshift/installer/blob/master/docs/user/overview.md)
Failing is true with a detailed message: Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml because the resource type NetworkConfig has not been installed on the server.
Available is true with message: Cluster has deployed 4.0.0
Progressing is true with message: Unable to apply 4.0.1: a required object is missing
It also looks like something other operators are doing (Available can still be true while Failing and Progressing are also true)
You are quoting an example of when the operator is progressing or failing...
When the operator is available,
from https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions
Failing is false with no message
Available is true with message Cluster has deployed 4.0.1
Progressing is false with message Cluster version is 4.0.1
you cannot be failing, progressing and available for the same version.
I should check the other Conditions too then, I guess, right?
updated
Anyway, the status you quoted can already be converged to even without my latest additions (though they're indeed needed). Progressing and Failing are set before Available anyway, so by the time we're at Available, we know where we are (I think).
Force-pushed 20a3b47 to b363f85
/retest
1 similar comment
/retest
pkg/operator/sync.go
Outdated
Why this change? The intention was to stay in "initialization" stage until we'd done a successful sync.
(Not saying it's wrong, I just want to understand why you're making the change)
Just Go linting: the previous if branch ends with a return, so the else branch is superfluous.
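For illustration, the lint pattern in question, shown with a hypothetical clamp function rather than the actual sync.go code:

```go
package main

import "fmt"

// clamp illustrates the lint fix: when the if branch ends in a return,
// the code that follows needs no else branch (golint flags an else here
// as superfluous).
func clamp(n, max int) int {
	if n > max {
		return max
	}
	// No else needed: this line is only reached when n <= max.
	return n
}

func main() {
	fmt.Println(clamp(10, 5)) // prints 5
}
```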
pkg/operator/status.go
Outdated
One thing I'm wondering here... we want progressing=true when optr.inClusterBringup too, right?
IOW, this code is handling the upgrade case, but we also want to be progressing during the initial install.
good call, I'm gonna work that out
Fixed here: https://github.com/openshift/machine-config-operator/pull/406/files#diff-bee1f8f36240cb95db10244fdf335146R68 - we now set progressing = true on bringup as well.
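A minimal sketch of the resulting logic, assuming a simplified operator struct (the field names, such as inClusterBringup, mirror the discussion, not the real MCO types):

```go
package main

import "fmt"

// operator is an illustrative stand-in for the real operator state.
type operator struct {
	inClusterBringup bool   // true during initial install
	currentVersion   string // version currently deployed
	targetVersion    string // version we are asked to reach
}

// isProgressing reports Progressing=true during initial bringup as well as
// during an upgrade, matching the fix discussed above.
func (o *operator) isProgressing() bool {
	if o.inClusterBringup {
		// Initial install: nothing fully deployed yet.
		return true
	}
	// Upgrade: progressing only while the target differs from what's deployed.
	return o.currentVersion != o.targetVersion
}

func main() {
	bringup := &operator{inClusterBringup: true}
	fmt.Println(bringup.isProgressing()) // prints true
}
```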
One thing I've noticed related to this, though, is that we seem to stay in inClusterBringup for multiple times. Though maybe that's only on failing PRs? Glancing at the logs in a recently merged PR, things look fine.
mmm what do you mean "stay in inClusterBringup for multiple times"?
If the sync fails, you'll see that log multiple times. But I think that's the right thing?
Oh yeah, that's correct: we keep re-syncing and logging that.
Force-pushed 143c1ac to 6d57092
This is good to review and test now. I was able to test it out using a custom payload built using #421 (comment), and I can't see the statuses bouncing anymore.
@abhinavdahiya could you also take another look at this now?
/retest
Based on:
- https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions
- Slack conversation with Clayton
We should correctly report Progressing only when really progressing towards something. Re-syncing, reconciling, and drift detection (verifying everything is still where you left it) are not _changing_ anything, and thus are not progressing.
Signed-off-by: Antonio Murdaca <runcom@linux.com>
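The commit's rule can be sketched like this; the generation comparison is a hypothetical stand-in for whatever the operator actually compares (rendered configs, versions) to decide whether a sync changes anything:

```go
package main

import "fmt"

// shouldReportProgressing applies the rule from the commit message: a
// re-sync that finds everything already in the desired state is
// reconciliation / drift detection, not progress. Only a sync that has to
// change something reports Progressing. The generation comparison is
// illustrative, not the MCO's real implementation.
func shouldReportProgressing(appliedGeneration, desiredGeneration int64) bool {
	return appliedGeneration != desiredGeneration
}

func main() {
	fmt.Println(shouldReportProgressing(3, 3)) // reconcile only: prints false
	fmt.Println(shouldReportProgressing(3, 4)) // real change: prints true
}
```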
Force-pushed 6d57092 to b8fdc27
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cgwalters, runcom. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
/retest
/retest
4 similar comments
/retest
/retest
/retest
/retest
@runcom @cgwalters example: the MCO loses the connection to the apiserver intermittently while reconciling and then reports machine-config as failing. Once the MCO has reported failing, even on a later successful reconcile, calling syncAvailableStatus here will still report failing because of that. Please look into this.
I also suspect this was a similar error: https://gubernator.k8s.io/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/4197
Thanks @abhinavdahiya, I'm fixing that (I was also thinking about making all of this simpler as well, but later).
Fix a bug reported in openshift#406 (comment) where we weren't properly resetting the failing status.
Signed-off-by: Antonio Murdaca <runcom@linux.com>
@abhinavdahiya please take a look at #442 |
Specifically, add a test to verify that we're clearing out the failing condition on subsequent sync operations (covering a bug reported here: openshift#406 (comment))
Signed-off-by: Antonio Murdaca <runcom@linux.com>
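The kind of regression test the commit describes could look roughly like this sketch (fakeOperator and its map-based conditions are assumptions for illustration, not the actual test code):

```go
package main

import "fmt"

// fakeOperator mimics an operator whose sync sets status conditions.
type fakeOperator struct {
	conditions map[string]bool
}

func (o *fakeOperator) sync(err error) {
	if o.conditions == nil {
		o.conditions = map[string]bool{}
	}
	if err != nil {
		o.conditions["Failing"] = true
		return
	}
	o.conditions["Available"] = true
	o.conditions["Failing"] = false // the behavior under test: reset on success
}

// failingClearedAfterSuccess drives the scenario from the bug report: a
// failed sync followed by a successful one must leave Failing=false.
func failingClearedAfterSuccess() bool {
	o := &fakeOperator{}
	o.sync(fmt.Errorf("intermittent apiserver error"))
	o.sync(nil)
	return !o.conditions["Failing"] && o.conditions["Available"]
}

func main() {
	fmt.Println(failingClearedAfterSuccess()) // prints true
}
```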
Based on:
- https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions
- Slack conversation with Clayton
We should correctly report Progressing only when really progressing towards something. Re-syncing, reconciling, and drift detection (verifying everything is still where you left it) are not changing anything, and thus are not progressing.
closes #346
/cc @smarterclayton @abhinavdahiya ptal
Signed-off-by: Antonio Murdaca <runcom@linux.com>