Installer: check operators for stability #1189
Conversation
PoC implementation in openshift/installer#6124
Cluster admin will begin installing cluster as usual. The Installation workflow will be:

1. Cluster initializes as usual
2. As usual, installer checks that cluster version is Available=true Progressing=False Degraded=False
clusterversion doesn't assert a degraded condition, should this be failing?
Probably; even though that's not yet wired up, it was agreed that it would be.
Need to chase https://bugzilla.redhat.com/show_bug.cgi?id=1951835 and openshift/cluster-version-operator#662
@patrickdillon @jottofar let's make sure we sync up on this.
clusterversion doesn't assert a degraded condition, should this be failing?
Yes, you are correct: the installer code checks against the Failing condition.
A fundamental design principle of 4.x and the operator framework has been
delegating responsibility away from the Installer to allow clean separation
of concerns and better maintainbility. Without this enhancement, it is the
Suggested change:
- of concerns and better maintainbility. Without this enhancement, it is the
+ of concerns and better maintainability. Without this enhancement, it is the
### Open Questions [optional]

1. What is the correct definition of a stable operator? More importantly, how can we
refine this definition?
For purposes of this EP and the installer's perspective, it sounds like you're defining it as "progressing=false for 5 minutes"?
Is the installer going to look at any other conditions on the operator, e.g. available=false, or it will continue to rely on the CVO to proxy that?
I think it's Progressing=False for the last 30 seconds, but only waiting up to 5 minutes for all COs to achieve that. We only wait up to five minutes because these aren't failed installs: in all cases I'm aware of, the cluster is a viable cluster that just differs from the exact spec defined in install-config.
yeah sorry i described it badly. but i'm still interested in what the full set of conditions that determine "install is complete (and healthy or unhealthy)" are, beyond this "30s window of progressing=false"
Is the installer going to look at any other conditions on the operator, e.g. available=false, or it will continue to rely on the CVO to proxy that?
No, it would continue to rely on the CVO as a proxy, which is a good way of putting it.
<edited to copy/paste/fix here: https://github.com/openshift/enhancements/pull/1189#discussion_r921698993>
I'm happy with the definition as put forward by this enhancement plus the existing logic that requires A=t and D=f.
responsibility of the Cluster-Version Operator to determine whether given
Cluster Operators constitute a successful version. The idea of keeping the
cluster-version operator as the single responsible party is discussed in the
alternatives section.
Another aspect to this is that it opens the door for differing behavior between which clusteroperators the CVO cares about vs. the installer.
Today my understanding is the CVO only watches clusteroperator objects that it created (came out of manifests in the payload). If another component creates its own CO for some reason, that will have no bearing on CVO's observation of the cluster and reporting of available/progressing/failing conditions. I imagine the installer is not going to make that distinction (though perhaps it could, via looking at annotations on the COs), so we could potentially end up with a situation where the installer has one view of "what matters" and the CVO has a different one.
@bparees Which of those, in the context of installation, makes the most sense given where we're going in the future?
Given that the installer has nothing to do w/ upgrades, and i'd expect similar semantics to function around upgrades as they do around install (in terms of when we think it is "done"), i'd be inclined towards the CVO owning this aspect over the installer.
As for future, i assume you mean w/ respect to platform operators? So far the PO plan is to have the platform operator manager proxy the individual PO status, via a single CO. So PO doesn't really care whether it's CVO or installer that is watching the COs. But part of why the POM has to proxy the status to a single CO is because the CVO doesn't watch non-payload COs. I could see us changing that in the future, which is another reason i'd hate to see us have two different components watching potentially different sets of COs to make similar decisions.
1. Cluster initializes as usual
2. As usual, installer checks that cluster version is Available=true Progressing=False Degraded=False
3. Installer checks status of each cluster operator for stability
4. If a cluster operator is not stable the installer logs a message and throws an error
Can we get a summary of all the things the installer looks at today to determine that the installation is done (or failed), so we can better understand the delta+implications of that delta being proposed here?
Serially, as I posted above, I believe it is:
1. Checks for bootstrap-complete configmap. If that's good:
2. Checks for API. If that's good:
3. Checks cluster version. If that's good:
4. Enhancement: sets a 5-minute context deadline, then watches each CO to ensure it achieves Progressing=False with lastTransitionTime more than 30 seconds ago. If that's good:
5. Checks console availability
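For readers less familiar with the installer internals, here is a hypothetical Go outline of that serial order. The helper names are illustrative stand-ins rather than the installer's actual functions, and only the stability step is new with this enhancement.

```go
// Hypothetical outline of the installer's serial post-provisioning waits.
// The helper names are stand-ins and their bodies are elided; only
// waitForStableOperators (step 4) is added by this enhancement.
package installwait

import "context"

var (
	waitForBootstrapConfigMap func(context.Context) error // bootstrap-complete configmap
	waitForAPI                func(context.Context) error // API server reachable
	waitForClusterVersion     func(context.Context) error // CVO Available=True, Progressing=False, Failing=False
	waitForStableOperators    func(context.Context) error // NEW: each CO Progressing=False for >= 30s, up to 5 min
	logConsoleAvailability    func(context.Context) error // warn-only when the console is disabled
)

// waitForInstallComplete runs the checks in the order listed above; the first
// failing step aborts the wait.
func waitForInstallComplete(ctx context.Context) error {
	for _, step := range []func(context.Context) error{
		waitForBootstrapConfigMap,
		waitForAPI,
		waitForClusterVersion,
		waitForStableOperators,
		logConsoleAvailability,
	} {
		if err := step(ctx); err != nil {
			return err
		}
	}
	return nil
}
```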
Can we fold the console check into the proposed CO checks?
What do we do in a cluster where the console is disabled?
Checks console availability
is this checked by looking at the console CO conditions, or by explicitly trying to access the console url?
What do we do in a cluster where the console is disabled?
Since openshift/installer#5336, the installer will warning-log Cluster does not have a console available... and move on, without failing. For example, here is output from a no-caps run:
level=info msg=Checking to see if there is a route at openshift-console/console...
level=warning msg=Cluster does not have a console available: could not get openshift-console URL
level=info msg=Install complete!
level=info msg=To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/tmp/installer/auth/kubeconfig'
level=info msg=Time elapsed: 29m45s
If a cluster operator does not maintain Progressing=False for at least 30 seconds,
during a five minute period it is unstable.
can we get some more detail on how this will work in practice?
At what point will the installer start the "5 minute window" for example?
And presumably it will be looking for a 30 second period during which all operators are progressing=false, meaning you could have a situation where:
- installer starts watching
- all operators are progressing=false
- 25 seconds go by
- operatorA goes progressing=true
- some time passes (less than 5 mins total have expired)
- operator A goes progressing=false
- 25 seconds go by
- operator B goes progressing=true
- some time passes (less than 5 mins total have expired)
- operator B goes progressing=false
- 25 seconds go by
- operatorA (or C) goes progressing=true
etc.
right?
Not saying that's a problem, just trying to understand how the detection/determination will work in practice.
Do we trust lastTransitionTime?
- Wait for current install-complete signal
- Start a five minute timer
- Every 5s, check each ClusterOperator for Progressing=False and lastTransitionTime < now - 30s?
- Exit non-zero at end of five minute timer
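A minimal Go sketch of the polling variant described in this list, assuming the openshift/client-go config clientset. Per the follow-up below, the actual PoC uses library-go watch events rather than a poll loop, so treat this purely as an illustration of the stability level (Progressing=False held for 30 seconds, checked for up to five minutes).

```go
// Minimal polling sketch of the stability check described above; not the
// installer's actual implementation, which relies on library-go watch events.
package stability

import (
	"context"
	"fmt"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// WaitForStableOperators polls every 5s for up to 5 minutes and returns an
// error if any ClusterOperator has not held Progressing=False for at least 30s.
func WaitForStableOperators(ctx context.Context, client configclient.Interface) error {
	return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		cos, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, nil // treat API hiccups as "not yet"; keep polling
		}
		for _, co := range cos.Items {
			if !heldFor(co, configv1.OperatorProgressing, configv1.ConditionFalse, 30*time.Second) {
				fmt.Printf("cluster operator %s is not yet stable\n", co.Name)
				return false, nil
			}
		}
		return true, nil // all COs stable; this can succeed on the very first check
	})
}

// heldFor reports whether the given condition has had the wanted status for at
// least d, i.e. lastTransitionTime < now - d.
func heldFor(co configv1.ClusterOperator, t configv1.ClusterStatusConditionType, want configv1.ConditionStatus, d time.Duration) bool {
	for _, c := range co.Status.Conditions {
		if c.Type == t {
			return c.Status == want && time.Since(c.LastTransitionTime.Time) >= d
		}
	}
	return false // condition not reported yet: not stable
}
```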
The implementation is here:
https://github.com/openshift/installer/pull/6124/files#diff-56276d5381d618d46ec8d35d93210662c8fdd4c9bcd90fd36afbc6a59227eb0bR713-R738
The intent is pretty much what Scott said, except we don't necessarily wait the whole 5 minutes. lastTransitionTime < now - 30s is the level for stability; once that is cleared, the CO is cleared. If all COs are "stable" on the first check, we exit immediately and the whole check takes seconds at most.
The 5s detail in point 3 is abstracted away into library-go & watch events.
I will update these crucial details in the enhancement.
Scott's explanation in https://github.com/openshift/enhancements/pull/1189/files#r921528565 makes sense to me. I think the logic in the PR may have a bug. I commented.
This enhancement allows the Installer to identify and catch a class of errors
which are currently going unchecked. Without this enhancement, the Installer can
(and does) declare clusters successfully installed when the cluster has failing components.
Suggested change:
- (and does) declare clusters successfully installed when the cluster has failing components.
+ (and does) declare clusters successfully installed when the cluster still has progressing components.
Just because failing here is not always the case. MAPI doesn't consider it a failure that you only have 4 of 5 worker instances because AWS is out of your instance type at the moment.
I didn’t realize this enhancement would cause installs to fail if a machine is not properly provisioned, but that makes sense. Is that correct?
regardless of MAPI behavior i agree w/ the sentiment behind the suggested edit. All we know about the components is they are still progressing, we don't actually know that they are failing (and if they were truly failing, they should probably be setting available=false or at least degraded=true).
In my mind this is more about ensuring that when we tell the user their cluster is ready, it's actually ready and not still tidying up some things, than it is about reporting a failure that we currently ignore (not that that isn't also useful).
Do we actually have any data points about how often clusters never "settle", such that these changes to the installer would result in reported failures that today are ignored?
I think the more interesting aspect is that w/ this EP we'll no longer report install complete "prematurely"
Yeah, that's a good summary of why I suggested this change. I don't agree that still progressing is the same as failing.
WRT timing, the only thing I know is that service delivery waits up to 20 minutes before handing things off to the customer, no matter the result of their supplemental readiness checks. I'll ask if they've collected data on how long they generally wait and how often they hit the 20-minute cap.
(and if they were truly failing, they should probably be setting available=false or at least degraded=false).
@bparees I guess you meant degraded=true
@bparees I guess you meant degraded=true
whoops, yes, of course. will edit original comment.
### Goals

* Installer correctly identifies a failed cluster install where cluster operators are not stable
Suggested change:
- * Installer correctly identifies a failed cluster install where cluster operators are not stable
+ * Installer correctly identifies a progressing cluster install where cluster operators are not stable
### Non-Goals

* Installer handling other classes of errors, such as failure to provision machines by MAO
Suggested change:
- * Installer handling other classes of errors, such as failure to provision machines by MAO
+ * Installer handling of specific errors, such as failure to provision machines by MAO. The Installer only reports what ClusterOperators convey.
Should operators themselves be setting Degraded=True when they don't meet this stability criterion?

As we have seen with other timeouts in the Installer, developers and users will want to change these.
We should define a process for refining our stability criteria.
Via another five minutes of openshift-install wait-for stable-operators? Given all of our wait-for conditions, I'm not too worried about knobs for the timeout.
### Test Plan

This code would go directly into the installer and be exercised by e2e tests.
By a synthetic CO that immediately becomes Available=True, fulfilling the historic requirement for install-complete, but maintains Progressing=True for 30 seconds after? Do we have tests that watch the cluster during the install, or is this new ground? I think @deads2k has called these observers?
I think @deads2k has called these observers?
We don't have e2e observers yet, but if they existed it could work. I built a backstop into our pre-test CI step that we can change from "wait five minutes to settle" to "fail if anything is not settled". If it trips, this feature has a bug.
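For illustration only, here is a rough Go sketch of the synthetic ClusterOperator idea floated above: Available=True from the start (satisfying the historic install-complete gate) while holding Progressing=True, so the new stability check has to wait it out. The object name and test wiring are hypothetical.

```go
// Sketch of a synthetic ClusterOperator for exercising the stability check.
// The name and surrounding test harness are hypothetical.
package stability

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CreateSyntheticOperator creates the test ClusterOperator. A companion test
// controller (not shown) would flip Progressing to False after ~30 seconds and
// the test would assert the installer only then reports install-complete.
func CreateSyntheticOperator(ctx context.Context, client configclient.Interface) error {
	co, err := client.ConfigV1().ClusterOperators().Create(ctx,
		&configv1.ClusterOperator{ObjectMeta: metav1.ObjectMeta{Name: "synthetic-stability-test"}},
		metav1.CreateOptions{})
	if err != nil {
		return err
	}
	now := metav1.Now()
	co.Status.Conditions = []configv1.ClusterOperatorStatusCondition{
		{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue, LastTransitionTime: now},
		{Type: configv1.OperatorDegraded, Status: configv1.ConditionFalse, LastTransitionTime: now},
		// Held True on purpose; satisfies the old install-complete gate but not the new stability check.
		{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, LastTransitionTime: now},
	}
	_, err = client.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{})
	return err
}
```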
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
- none
see-also:
- "/enhancements/this-other-neat-thing.md"
You could remove this
And all the comments above
I now have a desire to place something awesome at this location :)
will also help the Technical Release Team identify this class of problem in CI and triage
issues to the appropriate operator development team.

### User Stories
can we think about providing a library with such functionality - so it can be reused in AI as well?
What functionality specifically? Scanning operator conditions and ensuring they've all settled for at least 30s?
Thanks to everyone for feedback on this enhancement. There are definitely suggestions that have been made that I could incorporate to improve this enhancement; on the other hand, I think it is best to avoid sinking further effort into this particular version of the enhancement until we resolve the general decision of whether this is better suited for the installer or the CVO. @LalatenduMohanty or @wking do you have thoughts in this regard?
cblecker left a comment
Extremely happy to see this discussion from an SRE point of view. Having a clearer picture for the cluster administrator of "does the thing the installer produces match my intent when invoking it" is extremely beneficial.
1. Cluster initializes as usual
2. As usual, installer checks that cluster version is Available=true Progressing=False Degraded=False
3. Installer checks status of each cluster operator for stability
For reference, here's the things we are currently looking at today with OSD/ROSA to try and determine when a cluster install has actually finished:
https://github.com/openshift/osde2e/blob/71f4013df9420fc961104a9388e42bc7c6af9b2f/pkg/common/cluster/clusterutil.go#L430-L470
- The worst case failure mode is that the Installer throws an error when there is not an actual
problem with a cluster. In this case, an admin would need to investigate an error or automation would
need to rerun an install. We would hope to eliminate these failures through monitoring CI.
Would this failed state cause a full TF destroy/retry in its interactions with hive?
/test markdownlint

The current linter failure is real
One risk would be a false positive: the Installer identifies that a cluster
operator is unstable, but it turns out the operator is perfectly healthy;
the install was declared a failure but was actually successful. This risk
seems low and a risk that could be managed by monitoring these specific failures
Agreed. It would also be an operator bug, not an installer bug.
and introduce the potential for false positives or other failures.

Does implementing this enhancement address symptoms of issues with operator status definitions?
Should operators themselves be setting Degraded=True when they don't meet this stability criterion?
I don't think so. Progressing is about moving from one steady state to another. Degraded is about "something is broken". They serve different purposes and some progressing states easily last longer than 5 minutes. For instance, rolling out a new kube-apiserver level takes 15-20 minutes.
1. What is the correct definition of a stable operator? More importantly, how can we
refine this definition?
2. Should this logic belong in the Installer, CVO, or another component?
I don't have a strong opinion about this one. The logic described here seems fine. I'd also be fine seeing the CVO do it.
determining whether an install is a success (exit 0) or failure (exit != 0).
Specifically, the Installer should check whether cluster operators have stopped
progressing for a certain amount of time. If the Installer sees that an operator
is Available=true but fails to enter a stable Progressing=false state, the Installer
Wondering if you have considered when an operator is degraded for a certain amount of time.
progressing so that I can check whether the operator has an issue.

As a member of the technical release team, I want the Installer to exit non-zero when
an operator never stops progressing so that I can triage operator bugs.
You can add another use case which SRE has around the installer. They cannot deploy workloads immediately after the installation because the cluster is not ready, so they had to develop extra code which checks the cluster to see if the operators are no longer progressing, which is inconvenient (same for our customers). cc @cblecker
@patrickdillon I think there are just a few outstanding change recommendations which need to get applied before we move forward.

I'll be working that PR again during this sprint. Currently in that PR, if CVO sees an operator Degraded=True during Initializing mode, as opposed to Reconciling or Upgrading modes, it reports it but does not fail.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close. /lifecycle stale
/remove-lifecycle stale
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close. /lifecycle stale
creation-date: 2021-07-14
last-updated: yyyy-mm-dd
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
- none
Could possibly link openshift/installer#6124 if there are no Jira trackers? That got reverted in openshift/installer#6503, but presumably whichever pull request restores it with an adjusted threshold will also link 6124.
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle rotten. If this proposal is safe to close now please do so with /close. /lifecycle rotten
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting /reopen. Mark the proposal as fresh by commenting /remove-lifecycle rotten. /close
@openshift-bot: Closed this PR.
/reopen
@sdodson: Reopened this PR.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sdodson
/override ci/prow/markdownlint
@sdodson: Overrode contexts on behalf of sdodson: ci/prow/markdownlint
@patrickdillon: all tests passed!
The linter job for openshift#1189 was overridden, so invalid markdown landed. This fixes the markup to make the job work for other PRs. Signed-off-by: Doug Hellmann <[email protected]>
openshift/installer#7289 is in flight with the implementation.
Following up on Cluster Lifecycle Arch call from late February, this is my attempt to reframe the discussion in an enhancement.
I have written this enhancement from the point of view of implementing it in the installer, but I do think it is still an open question whether this should go in the installer or the CVO.
/assign @deads2k
/assign @sdodson
/assign @bparees
/assign @LalatenduMohanty
/assign @wking