Conversation

@runcom (Member) commented Feb 17, 2019

- What I did

This PR builds on top of #442 and it is the result of the conversation here: #442 (comment)

The CVO docs tell us that we may still report Available=True when we fail to progress to a newer version, and this patch accounts for that case. The unit test change fixes my earlier assumption that, if we fail to progress, we must not report Available, because the state may already have changed and the MCO's components may no longer be healthy.
The patch still reports Available=False when we fail to bootstrap.
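The behavior described above can be sketched as follows. This is a minimal, hypothetical Go sketch, not the MCO's actual code: `operatorState` and `availableCondition` are invented names, and the real operator works with ClusterOperator status conditions from the openshift/api types.

```go
package main

import "fmt"

// operatorState is a hypothetical, simplified stand-in for the MCO's view of
// itself; the real operator manipulates ClusterOperator status conditions.
type operatorState struct {
	bootstrapped bool // did the initial bootstrap succeed?
	syncFailed   bool // did the latest sync (progress to a new version) fail?
}

// availableCondition mirrors the behavior this patch describes: a failure to
// progress does not by itself clear Available, but a bootstrap failure does.
func availableCondition(s operatorState) bool {
	if !s.bootstrapped {
		return false // never bootstrapped: nothing is available
	}
	// Even if the latest sync failed, the previously rolled-out version may
	// still be serving, so Available can remain True.
	return true
}

func main() {
	fmt.Println(availableCondition(operatorState{bootstrapped: true, syncFailed: true}))  // true
	fmt.Println(availableCondition(operatorState{bootstrapped: false, syncFailed: true})) // false
}
```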

@abhinavdahiya @smarterclayton ptal

- How to verify it

- Description for the changelog

@openshift-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 17, 2019
@runcom (Member, Author) commented Feb 17, 2019

/hold

until #442 goes in; #442 is also fixing a bug where we fail to clear out the Failing status on success

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 17, 2019
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 17, 2019
@runcom (Member, Author) commented Feb 17, 2019

/hold cancel

rebased

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 17, 2019
@runcom (Member, Author) commented Feb 17, 2019

unit test flake #417

/retest

@runcom (Member, Author) commented Feb 17, 2019

new unit test flake #451

/retest

@runcom (Member, Author) commented Feb 18, 2019

#449

/retest

Signed-off-by: Antonio Murdaca <runcom@linux.com>
@runcom (Member, Author) commented Feb 18, 2019

unit flake tracked here #449

/retest

@runcom (Member, Author) commented Feb 18, 2019

time="2019-02-18T10:54:18Z" level=info msg="Waiting up to 30m0s for the cluster to initialize..."
time="2019-02-18T10:54:18Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:54:42Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:54:57Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:56:27Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:57:27Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:57:42Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:59:18Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T11:00:42Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Grafana failed: reconciling Grafana Deployment failed: creating deployment object failed: Get https://172.30.0.1:443/apis/apps/v1beta2/namespaces/openshift-monitoring/deployments/grafana: dial tcp 172.30.0.1:443: connect: connection refused"
time="2019-02-18T11:01:19Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Grafana failed: reconciling Grafana Deployment failed: creating deployment object failed: Get https://172.30.0.1:443/apis/apps/v1beta2/namespaces/openshift-monitoring/deployments/grafana: dial tcp 172.30.0.1:443: connect: connection refused"
time="2019-02-18T11:04:39Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T11:05:42Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T11:08:57Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T11:10:57Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T11:14:12Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T11:17:42Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T11:21:12Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T11:24:18Z" level=fatal msg="failed to initialize the cluster: Cluster operator network has not yet reported success"

/retest

@runcom (Member, Author) commented Feb 19, 2019

I still think we shouldn't report Available=True if we failed progressing, as we may have touched some objects during syncFuncs (daemonsets and controllers) that could misbehave if they're at different versions...

@runcom (Member, Author) commented Feb 19, 2019

/hold

Holding because of my previous comment. Seeking confirmation, but my thoughts are:

  • t0: available=true, progressing=false, failing=false
  • t1: available=true, progressing=true, failing=true <--- the sync functions failed at this point; they may have deployed a new controller in the meantime while the daemonsets are still at the previous version, so why do we need to report available=true?
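The t0/t1 timeline above can be encoded as a small sketch. This is hypothetical illustration only: `conditions` and `mixedVersionRisk` are invented names used to flag the questionable state, not anything in the MCO codebase.

```go
package main

import "fmt"

// conditions is a hypothetical, flattened view of the three operator
// conditions being discussed.
type conditions struct {
	Available, Progressing, Failing bool
}

// mixedVersionRisk flags the state the comment above questions: reporting
// Available while a failed, partial rollout may have left components
// (controller vs. daemonsets) at different versions.
func mixedVersionRisk(c conditions) bool {
	return c.Available && c.Progressing && c.Failing
}

func main() {
	t0 := conditions{Available: true}                                   // steady state
	t1 := conditions{Available: true, Progressing: true, Failing: true} // failed mid-sync
	fmt.Println(mixedVersionRisk(t0), mixedVersionRisk(t1))             // false true
}
```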

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 19, 2019
@kikisdeliveryservice (Contributor) commented

@runcom I read the CVO docs as Available meaning the operator is at a specific version and functioning properly for that version, somewhat divorced from the Progressing and Failing statuses. So you could have a failure to progress but, say, a rollback that still leaves it available at a previous version, or a failed progression that doesn't degrade anything, etc. However, if progressing failed, some elements were updated and some weren't, and this will cause improper functioning, then Available=False.

@runcom (Member, Author) commented Feb 20, 2019

This PR doesn't make sense for the MCO. When an upgrade of the MCO happens, the flow is roughly:

  • the old operator goes away
  • the new, updated operator rolls out
  • Progressing is set to true (which means any of the operator, daemon, server, or controller changed)
  • the rollout of the components starts, and fails
  • there is no reason to report Available for the current version, as the current version is no longer current in any way
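The conclusion of that flow can be sketched as a tiny decision function. Again a hypothetical illustration, not MCO code: `availableAfterUpgrade` is an invented name encoding the reasoning that a failed, partial rollout leaves no version fully "current".

```go
package main

import "fmt"

// availableAfterUpgrade sketches the flow above: once the new operator has
// begun rolling out components, a failure leaves the cluster with no version
// that is fully "current", so Available should not be reported as true.
func availableAfterUpgrade(rolloutStarted, rolloutFailed bool) bool {
	if rolloutStarted && rolloutFailed {
		return false // partial, failed rollout: the old version is no longer current
	}
	return true
}

func main() {
	fmt.Println(availableAfterUpgrade(true, true))  // false
	fmt.Println(availableAfterUpgrade(true, false)) // true
}
```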

Feel free to reopen or comment if my flow isn't right.

@runcom runcom closed this Feb 20, 2019