operator: fix status updates on ClusterOperator #351

abhinavdahiya · 2019-01-29T00:38:01Z

ClusterOperator now reports a list of versions, one for each operand and one for the operator itself.
Versions are reported only for the Available condition, as per the defined behavior in [1].
Failing and Progressing do not change the last Available condition, as per the defined behavior in [2].

[1] https://github.com/openshift/api/blob/677153041bd22c1aa01fef67b68a8502ab1333d5/config/v1/types_cluster_operator.go#L59-L61
[2] https://github.com/openshift/cluster-version-operator/blob/4c2d5e0e3662f9643ed4f0992496539ef6dfdaac/docs/dev/clusteroperator.md#conditions

/cc @kikisdeliveryservice @cgwalters
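
The rule described above (writing Progressing or Failing must not disturb the last Available condition) can be sketched as follows. This is only an illustration under stated assumptions: the `condition` type and the `setCondition` and `find` helpers are hypothetical stand-ins, not the openshift/api types or the code in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// condition is a simplified, hypothetical stand-in for a ClusterOperator
// status condition; it is not the openshift/api definition.
type condition struct {
	Type               string
	Status             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition updates or appends the condition of the given type. It touches
// only that one entry and bumps LastTransitionTime only when Status flips.
func setCondition(conds []condition, typ, status, msg string, now time.Time) []condition {
	for i := range conds {
		if conds[i].Type != typ {
			continue
		}
		if conds[i].Status != status {
			conds[i].Status = status
			conds[i].LastTransitionTime = now
		}
		conds[i].Message = msg
		return conds
	}
	return append(conds, condition{Type: typ, Status: status, Message: msg, LastTransitionTime: now})
}

// find returns the condition of the given type, or nil if it is absent.
func find(conds []condition, typ string) *condition {
	for i := range conds {
		if conds[i].Type == typ {
			return &conds[i]
		}
	}
	return nil
}

// demo sets Available once, then toggles Progressing twice, and reports
// whether Available kept its status and transition time.
func demo() (string, bool) {
	t0 := time.Date(2019, 1, 29, 20, 53, 23, 0, time.UTC)
	conds := setCondition(nil, "Available", "True", "Cluster is available", t0)
	conds = setCondition(conds, "Progressing", "True", "Running resync", t0.Add(time.Hour))
	conds = setCondition(conds, "Progressing", "False", "", t0.Add(2*time.Hour))
	avail := find(conds, "Available")
	return avail.Status, avail.LastTransitionTime.Equal(t0)
}

func main() {
	status, unchanged := demo()
	fmt.Println(status, unchanged)
}
```

Keeping each condition in its own entry, and only moving the transition time on a real status flip, is what lets a resync toggle Progressing while Available stays put.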

kikisdeliveryservice · 2019-01-29T01:53:27Z

@abhinavdahiya Thanks for this! This is for #346 right?

I will pull and test tomorrow 😄

kikisdeliveryservice

I pulled this and am testing but wanted to clear up a few questions before it merges:

Was the bug seen in #346 caused by the Available status tracking the Progressing status, so that when Progressing was True, Available went False?
Are the MCC/MCD/MCO operands of the MCO clusteroperator?
The status will be Available so long as the "operand is functional": is this the MCO or the MCO+MCC+MCD? And does this mean that it is at the version it is supposed to be at (not considering any upgrades that may or may not be progressing)?
I notice that Progressing alternates between True and False, and the status says: Running resync for 3.11.0-521-g377bde3b. So the MCO resyncs every ~15 seconds? And should we expect Progressing to change all the time regardless of whether an upgrade happened or not?

cgwalters

I only skimmed this...still reading and digesting the CVO docs. Some superficial comments.

cgwalters · 2019-01-29T20:54:17Z

pkg/operator/version.go

+type versionStore struct {
+	*sync.Mutex
+
+	versions map[string]string


Side note, in Rust this seems like it'd just be type Versions = Arc<Mutex<HashMap<String, String>>>. (Although if I were making an operator I'm not sure it'd be multi-threaded to begin with but that's a larger topic)
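
For readers following the diff, here is a minimal sketch of how a mutex-guarded store like this might be used from several goroutines. Only the two struct fields come from the diff; `newVersionStore`, `Set`, `Get`, and the operand names other than `operator` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// versionStore has the same fields as the struct in the diff above.
// Everything below it is hypothetical and only shows the locking pattern.
type versionStore struct {
	*sync.Mutex

	versions map[string]string
}

// newVersionStore returns an empty store with an initialized mutex and map.
func newVersionStore() *versionStore {
	return &versionStore{Mutex: &sync.Mutex{}, versions: make(map[string]string)}
}

// Set records the version for one component under the lock.
func (v *versionStore) Set(name, version string) {
	v.Lock()
	defer v.Unlock()
	v.versions[name] = version
}

// Get returns the recorded version for a component, if any.
func (v *versionStore) Get(name string) (string, bool) {
	v.Lock()
	defer v.Unlock()
	ver, ok := v.versions[name]
	return ver, ok
}

// demo writes three entries concurrently and reads one back.
func demo() (string, bool) {
	s := newVersionStore()
	var wg sync.WaitGroup
	// Operand names here are illustrative, not taken from the PR.
	for _, name := range []string{"operator", "controller", "daemon"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			s.Set(n, "3.11.0-521-g377bde3b")
		}(name)
	}
	wg.Wait()
	return s.Get("operator")
}

func main() {
	ver, ok := demo()
	fmt.Println(ver, ok)
}
```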

cgwalters · 2019-01-30T15:29:42Z

pkg/operator/status.go

s/Cluster/MCO/? (or s/Cluster/Operator/, or...)

MCO is reporting that its view (in terms of ownership) of the cluster is at that version.

- lastTransitionTime: 2019-01-29T20:53:23Z
  message: Cluster is available at 3.11.0-521-g377bde3b
  status: "True"
  type: Available
- lastTransitionTime: 2019-01-30T20:56:32Z
  message: Running resync for 3.11.0-521-g377bde3b
  status: "True"
  type: Progressing
- lastTransitionTime: 2019-01-30T06:21:34Z
  status: "False"
  type: Failing

@cgwalters it ends up reading this way in the yaml ^^

OK, sounds fine to me!

pkg/operator/status.go

cgwalters · 2019-01-30T16:02:49Z

Was the bug seen in #346 caused by the Available status tracking the Progressing status, so that when Progressing was True, Available went False?

Yep, I believe that's the core issue.

Are the MCC/MCD/MCO operands of the MCO clusteroperator?

Yes, as I understand it.

I notice that Progressing alternates between True and False, and the status says: Running resync for 3.11.0-521-g377bde3b. So the MCO resyncs every ~15 seconds?

Why it changes every 15 seconds is an interesting question...I think we'll try to resynchronize on any status change from a lot of objects, including e.g. the MCD daemonset. It feels like we should be doing more "diff detection" here or so, there are presumably good patterns for this but I am not familiar enough with the ecosystem yet.
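
One possible reading of "diff detection" is to remember the last status that was written and skip the write when nothing changed. The sketch below illustrates that idea only; it is not what this PR implements, and the `status` and `statusWriter` types are hypothetical.

```go
package main

import "fmt"

// status is a hypothetical, flattened view of what the operator reports.
type status struct {
	Available   string
	Progressing string
	Failing     string
	Version     string
}

// statusWriter remembers the last status it wrote and counts real writes.
type statusWriter struct {
	last   *status
	writes int
}

// update reports whether a write was needed. Identical statuses are skipped,
// so repeated resyncs with no change cause no extra writes.
func (w *statusWriter) update(s status) bool {
	if w.last != nil && *w.last == s {
		return false
	}
	w.last = &s
	w.writes++
	return true
}

// demo runs three identical syncs followed by one that flips Progressing,
// and returns how many writes actually happened.
func demo() int {
	w := &statusWriter{}
	steady := status{Available: "True", Progressing: "False", Failing: "False", Version: "v1"}
	for i := 0; i < 3; i++ {
		w.update(steady)
	}
	busy := steady
	busy.Progressing = "True"
	w.update(busy)
	return w.writes
}

func main() {
	fmt.Println("writes:", demo())
}
```

Comparing whole structs with `==` works here because every field is a plain string; a status that carries maps or slices would need a deeper comparison.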

abhinavdahiya · 2019-01-30T19:11:48Z

Why it changes every 15 seconds is an interesting question...I think we'll try to resynchronize on any status change from a lot of objects, including e.g. the MCD daemonset. It feels like we should be doing more "diff detection" here or so, there are presumably good patterns for this but I am not familiar enough with the ecosystem yet.

based on

machine-config-operator/pkg/operator/operator.go

Lines 124 to 132 in 96228eb

    
    mcoconfigInformer.Informer().AddEventHandler(optr.eventHandler())
    controllerConfigInformer.Informer().AddEventHandler(optr.eventHandler())
    serviceAccountInfomer.Informer().AddEventHandler(optr.eventHandler())
    crdInformer.Informer().AddEventHandler(optr.eventHandler())
    deployInformer.Informer().AddEventHandler(optr.eventHandler())
    daemonsetInformer.Informer().AddEventHandler(optr.eventHandler())
    clusterRoleInformer.Informer().AddEventHandler(optr.eventHandler())
    clusterRoleBindingInformer.Informer().AddEventHandler(optr.eventHandler())
    cmInformer.Informer().AddEventHandler(optr.eventHandler())

it is reacting to changes to these objects.
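
To make the effect of those registrations concrete, here is a simplified sketch of many informers sharing one handler that funnels every event into a single sync. The `informer` and `operator` types are hypothetical stand-ins written for this note; real client-go informers take a `cache.ResourceEventHandler` rather than a plain function, and the buffered channel only approximates a deduplicating work queue.

```go
package main

import "fmt"

// informer is a hypothetical stand-in for a shared informer. It only stores
// handlers and replays events to them.
type informer struct {
	name     string
	handlers []func(obj string)
}

// AddEventHandler registers a callback for every change this informer sees.
func (i *informer) AddEventHandler(h func(obj string)) {
	i.handlers = append(i.handlers, h)
}

// notify simulates a change to an object watched by this informer.
func (i *informer) notify(obj string) {
	for _, h := range i.handlers {
		h(obj)
	}
}

// operator funnels every event into one pending sync request.
type operator struct {
	queue chan string
	syncs int
}

// eventHandler ignores which object changed and just requests a sync.
func (o *operator) eventHandler() func(obj string) {
	return func(_ string) {
		select {
		case o.queue <- "sync":
		default: // a sync is already pending, so coalesce
		}
	}
}

// drain runs one sync per pending request until the queue is empty.
func (o *operator) drain() {
	for {
		select {
		case <-o.queue:
			o.syncs++
		default:
			return
		}
	}
}

// demo wires two informers to one operator and returns the sync count.
func demo() int {
	o := &operator{queue: make(chan string, 1)}
	ds := &informer{name: "daemonset"}
	cm := &informer{name: "configmap"}
	ds.AddEventHandler(o.eventHandler())
	cm.AddEventHandler(o.eventHandler())

	ds.notify("daemonset status changed")
	o.drain()
	cm.notify("configmap changed")
	o.drain()
	// Two events before the next drain collapse into one sync.
	ds.notify("daemonset status changed again")
	cm.notify("configmap changed again")
	o.drain()
	return o.syncs
}

func main() {
	fmt.Println("syncs:", demo())
}
```

Because any change to any watched object requests a sync, a frequently updated object (for example a daemonset whose status changes often) is enough to make the sync, and with it Progressing, fire repeatedly.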

kikisdeliveryservice · 2019-01-30T19:26:04Z

@abhinavdahiya just to clarify: the MCO/Progressing status is reacting to changes in any Informers in lines 124-132 that you linked above?

kikisdeliveryservice · 2019-01-30T21:05:55Z

Since there were some changes, I'm going to repull this and test again.

abhinavdahiya · 2019-01-31T00:13:18Z

/retest

kikisdeliveryservice

The version column values seem to be missing?:

$ oc get clusteroperators machine-config-operator
NAME                      VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config-operator             True        False          False     12m

but in oc get clusteroperator machine-config-operator -o yaml -w :

  - name: operator
    version: 3.11.0-521-gcf207959

abhinavdahiya · 2019-01-31T03:40:04Z

The version column values seem to be missing?:

$ oc get clusteroperators machine-config-operator
NAME                      VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config-operator             True        False          False     12m

but in oc get clusteroperator machine-config-operator -o yaml -w :

  - name: operator
    version: 3.11.0-521-gcf207959

We do not control the CRD definition of ClusterOperator; the CVO does. The MCO only creates the CR.

kikisdeliveryservice

I've tested this a few times and no longer see Available changing every few seconds as described in #346. Progressing does change back and forth depending on what the MCO informers are doing, which is expected, but it no longer has an effect on Available, for example:

machine-config-operator             True      False     False     15h
machine-config-operator             True      True      False     15h

I'll leave this to @cgwalters to LGTM.

cgwalters · 2019-01-31T18:52:09Z

/lgtm

abhinavdahiya · 2019-01-31T21:58:41Z

/lgtm

hmm, looks like effect of the Prow service degradation.

cgwalters · 2019-01-31T21:59:39Z

/lgtm

openshift-ci-robot · 2019-01-31T21:59:52Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, cgwalters, kikisdeliveryservice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [abhinavdahiya,cgwalters]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

abhinavdahiya · 2019-01-31T22:14:58Z

/retest

abhinavdahiya · 2019-01-31T22:18:27Z

/retest

abhinavdahiya · 2019-01-31T22:37:24Z

/retest

openshift-bot · 2019-02-01T01:05:47Z

/retest

Please review the full test history for this PR and help us cut down flakes.

abhinavdahiya added 2 commits January 28, 2019 16:32

vendor: bump openshift/{api,client-go}

70ad2d0

operator: track versions for each major component

eb30ef6

openshift-ci-robot requested review from cgwalters and kikisdeliveryservice January 29, 2019 00:38

openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 29, 2019

kikisdeliveryservice reviewed Jan 29, 2019

cgwalters reviewed Jan 30, 2019

abhinavdahiya force-pushed the co branch from 377bde3 to cf20795 Compare January 30, 2019 19:09

abhinavdahiya mentioned this pull request Jan 31, 2019

operator: use infra and network manifests to create controllerconfigspec #357

Merged

kikisdeliveryservice reviewed Jan 31, 2019

kikisdeliveryservice approved these changes Jan 31, 2019

openshift-ci-robot assigned cgwalters Jan 31, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 31, 2019

openshift-merge-robot merged commit f31d48f into openshift:master Feb 1, 2019

kikisdeliveryservice mentioned this pull request Feb 1, 2019

MCO Cluster Operator statuses frequently changing #346

Closed

runcom mentioned this pull request Feb 6, 2019

install ClusterOperator CRD as part of release payload #383

Closed

enxebre mentioned this pull request Feb 14, 2019

Stop reporting unavailable when progressing and add support for multiple operands status openshift/machine-api-operator#209

Merged

osherdp pushed a commit to osherdp/machine-config-operator that referenced this pull request Apr 13, 2021

Merge pull request openshift#351 from jhadvig/bz1914723

f69923a

Bug 1914723: SamplesTBRInaccessibleOnBoot Alert has a misspelling
