Conversation

@runcom (Member) commented Feb 17, 2019

- What I did

This PR builds on top of #442 and it is the result of the conversation here: #442 (comment)

The CVO docs tell us that we may still report Available=True when we fail to progress to a newer version, and this patch accounts for that case. The unit test change fixes my earlier assumption that, if we fail to progress, we must not report Available, because the state may already have changed and the MCO's components may no longer be healthy.
The patch still reports Available=False when we fail to bootstrap.
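The behavior described above can be sketched as follows. This is a minimal, hypothetical Go sketch, not the MCO's actual code: `operatorState` and `availableCondition` are invented names, and the real operator works with ClusterOperator status conditions from the openshift/api types.

```go
package main

import "fmt"

// operatorState is a hypothetical, simplified stand-in for the MCO's view of
// itself; the real operator manipulates ClusterOperator status conditions.
type operatorState struct {
	bootstrapped bool // did the initial bootstrap succeed?
	syncFailed   bool // did the latest sync (progress to a new version) fail?
}

// availableCondition mirrors the behavior this patch describes: a failure to
// progress does not by itself clear Available, but a bootstrap failure does.
func availableCondition(s operatorState) bool {
	if !s.bootstrapped {
		return false // never bootstrapped: nothing is available
	}
	// Even if the latest sync failed, the previously rolled-out version may
	// still be serving, so Available can remain True.
	return true
}

func main() {
	fmt.Println(availableCondition(operatorState{bootstrapped: true, syncFailed: true}))  // true
	fmt.Println(availableCondition(operatorState{bootstrapped: false, syncFailed: true})) // false
}
```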

@abhinavdahiya @smarterclayton ptal

- How to verify it

- Description for the changelog

@openshift-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 17, 2019
@runcom (Member, Author) commented Feb 17, 2019

/hold

until #442 goes in; #442 is also fixing a bug where we fail to clear out the Failing status on success

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 17, 2019
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 17, 2019
@runcom (Member, Author) commented Feb 17, 2019

/hold cancel

rebased

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 17, 2019
@runcom (Member, Author) commented Feb 17, 2019

unit test flake #417

/retest

@runcom (Member, Author) commented Feb 17, 2019

new unit test flake #451

/retest

@runcom (Member, Author) commented Feb 18, 2019

#449

/retest

Signed-off-by: Antonio Murdaca <runcom@linux.com>
@runcom (Member, Author) commented Feb 18, 2019

unit flake tracked here #449

/retest

@runcom (Member, Author) commented Feb 18, 2019

time="2019-02-18T10:54:18Z" level=info msg="Waiting up to 30m0s for the cluster to initialize..."
time="2019-02-18T10:54:18Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:54:42Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:54:57Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:56:27Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:57:27Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:57:42Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T10:59:18Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T11:00:42Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Grafana failed: reconciling Grafana Deployment failed: creating deployment object failed: Get https://172.30.0.1:443/apis/apps/v1beta2/namespaces/openshift-monitoring/deployments/grafana: dial tcp 172.30.0.1:443: connect: connection refused"
time="2019-02-18T11:01:19Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Grafana failed: reconciling Grafana Deployment failed: creating deployment object failed: Get https://172.30.0.1:443/apis/apps/v1beta2/namespaces/openshift-monitoring/deployments/grafana: dial tcp 172.30.0.1:443: connect: connection refused"
time="2019-02-18T11:04:39Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T11:05:42Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T11:08:57Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T11:10:57Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T11:14:12Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T11:17:42Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T11:21:12Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T11:24:18Z" level=fatal msg="failed to initialize the cluster: Cluster operator network has not yet reported success"

/retest

@runcom (Member, Author) commented Feb 19, 2019

I still think we shouldn't report Available=True if we failed progressing, as we may have touched some objects during syncFuncs (daemonsets and controllers) that could misbehave if they're at different versions...

@runcom (Member, Author) commented Feb 19, 2019

/hold

Holding because of my previous comment. Seeking confirmation, but my thoughts are:

  • t0: available=true, progressing=false, failing=false
  • t1: available=true, progressing=true, failing=true <--- the sync functions failed at this point; they may have deployed a new controller in the meantime while the daemonsets are still at the previous version, so why do we need to report available=true?
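The t0/t1 timeline above can be encoded as a small sketch. This is hypothetical illustration only: `conditions` and `mixedVersionRisk` are invented names used to flag the questionable state, not anything in the MCO codebase.

```go
package main

import "fmt"

// conditions is a hypothetical, flattened view of the three operator
// conditions being discussed.
type conditions struct {
	Available, Progressing, Failing bool
}

// mixedVersionRisk flags the state the comment above questions: reporting
// Available while a failed, partial rollout may have left components
// (controller vs. daemonsets) at different versions.
func mixedVersionRisk(c conditions) bool {
	return c.Available && c.Progressing && c.Failing
}

func main() {
	t0 := conditions{Available: true}                                   // steady state
	t1 := conditions{Available: true, Progressing: true, Failing: true} // failed mid-sync
	fmt.Println(mixedVersionRisk(t0), mixedVersionRisk(t1))             // false true
}
```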

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 19, 2019
@kikisdeliveryservice (Contributor) commented

@runcom I read the CVO docs as Available meaning the operator is at a specific version and functioning properly for that version, somewhat divorced from the Progressing and Failing statuses. So you could have a failure to progress but, say, a rollback that still leaves it available at a previous version, or a failed progression that doesn't degrade anything, etc. However, if progressing failed, some elements were updated and some weren't, and this will cause improper functioning, then Available=False.

@runcom (Member, Author) commented Feb 20, 2019

This PR doesn't make sense for the MCO. When an upgrade of the MCO happens, the flow is roughly:

  • the old operator goes away
  • the new, updated operator rolls out
  • Progressing is set to true (which means any of the operator, daemon, server, or controller changed)
  • the rollout of the components starts, and fails
  • there is no reason to report Available for the current version, as the current version is no longer current in any way
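The conclusion of that flow can be sketched as a tiny decision function. Again a hypothetical illustration, not MCO code: `availableAfterUpgrade` is an invented name encoding the reasoning that a failed, partial rollout leaves no version fully "current".

```go
package main

import "fmt"

// availableAfterUpgrade sketches the flow above: once the new operator has
// begun rolling out components, a failure leaves the cluster with no version
// that is fully "current", so Available should not be reported as true.
func availableAfterUpgrade(rolloutStarted, rolloutFailed bool) bool {
	if rolloutStarted && rolloutFailed {
		return false // partial, failed rollout: the old version is no longer current
	}
	return true
}

func main() {
	fmt.Println(availableAfterUpgrade(true, true))  // false
	fmt.Println(availableAfterUpgrade(true, false)) // true
}
```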

Feel free to reopen or comment if my flow isn't right.

@runcom runcom closed this Feb 20, 2019