-
Notifications
You must be signed in to change notification settings - Fork 462
operator: fix status updates on ClusterOperator #351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@abhinavdahiya Thanks for this! This is for #346 right? I will pull and test tomorrow 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pulled this and am testing but wanted to clear up a few questions before it merges:
-
Was the bug seen in #346 was because the Available status was tracking the Progressing status so that when progressing status was True Available went False?
-
Are the MCC/MCD/MCO operands of the MCO clusteroperator?
-
The status will be available so long as the "operand is functional" is this the MCO or the MCO+MCC+MCD? And does this mean that it is in the version it is supposed to be in (not considering any upgrades that may or may not be progressing)?
-
I notice that Progressing alternates between True and False - and the status says:
Running resync for 3.11.0-521-g377bde3bso the MCO resyncs every ~15 seconds? And we will expect Progressing to change all the time regardless of whether an upgrade happened or not?
cgwalters
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only skimmed this...still reading and digesting the CVO docs. Some superficial comments.
| type versionStore struct { | ||
| *sync.Mutex | ||
|
|
||
| versions map[string]string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note, in Rust this seems like it'd just be type Versions = Arc<Mutex<HashMap<String, String>>>. (Although if I were making an operator I'm not sure it'd be multi-threaded to begin with but that's a larger topic)
pkg/operator/status.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Cluster/MCO/? (or s/Cluster/Operator/, or...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MCO is reporting that its view (in terms of ownership) of the cluster is at that version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- lastTransitionTime: 2019-01-29T20:53:23Z
message: Cluster is available at 3.11.0-521-g377bde3b
status: "True"
type: Available
- lastTransitionTime: 2019-01-30T20:56:32Z
message: Running resync for 3.11.0-521-g377bde3b
status: "True"
type: Progressing
- lastTransitionTime: 2019-01-30T06:21:34Z
status: "False"
type: Failing
@cgwalters it ends up reading this way in the yaml ^^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, sounds fine to me!
Yep, I believe that's the core issue.
Yes, as I understand it.
Why it changes every 15 seconds is an interesting question...I think we'll try to resynchronize on any status change from a lot of objects, including e.g. the MCD daemonset. It feels like we should be doing more "diff detection" here or so, there are presumably good patterns for this but I am not familiar enough with the ecosystem yet. |
- ClusterOperator now reports list of versions, each for an operand and the operator itself. - The versions reported for only available conditions, as per defined behavior in [1] - Failing and Progressing do not change last Available, as per defined behavor in [2] [1] https://github.com/openshift/api/blob/677153041bd22c1aa01fef67b68a8502ab1333d5/config/v1/types_cluster_operator.go#L59-L61 [2] https://github.com/openshift/cluster-version-operator/blob/4c2d5e0e3662f9643ed4f0992496539ef6dfdaac/docs/dev/clusteroperator.md#conditions
based on machine-config-operator/pkg/operator/operator.go Lines 124 to 132 in 96228eb
it is reacting to changes to these objects. |
|
@abhinavdahiya just to clarify: the MCO/Progressing status is reacting to changes in any Informers in lines 124-132 that you linked above? |
|
Since there were some changes, I'm going to repull this and test again. |
|
/retest |
kikisdeliveryservice
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version column values seem to be missing?:
$ oc get clusteroperators machine-config-operator
NAME VERSION AVAILABLE PROGRESSING FAILING SINCE
machine-config-operator True False False 12m
but in oc get clusteroperator machine-config-operator -o yaml -w :
- name: operator
version: 3.11.0-521-gcf207959
We do not control the definition CRD of cluster operator. CVO does. MCO only creates the CR. |
kikisdeliveryservice
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested this a few times and no longer see Available changing every few seconds as described in #346 . Progressing does change back and forth depending on the state of mco informers are doing, which is expected, but they no longer have an effect on Available, for ex:
machine-config-operator True False False 15h
machine-config-operator True True False 15h
I'll leave this to @cgwalters to LGTM.
|
/lgtm |
hmm, looks like effect of the Prow service degradation. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhinavdahiya, cgwalters, kikisdeliveryservice The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
2 similar comments
|
/retest |
|
/retest |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
Bug 1914723: SamplesTBRInaccessibleOnBoot Alert has a misspelling
[1] https://github.com/openshift/api/blob/677153041bd22c1aa01fef67b68a8502ab1333d5/config/v1/types_cluster_operator.go#L59-L61
[2] https://github.com/openshift/cluster-version-operator/blob/4c2d5e0e3662f9643ed4f0992496539ef6dfdaac/docs/dev/clusteroperator.md#conditions
/cc @kikisdeliveryservice @cgwalters