174 changes: 174 additions & 0 deletions enhancements/installer/operators-check.md
@@ -0,0 +1,174 @@
---
title: cluster-operator-installation-check
authors:
- @patrickdillon
reviewers: # Include a comment about what domain expertise a reviewer is expected to bring and what area of the enhancement you expect them to focus on. For example: - "@networkguru, for networking aspects, please look at IP bootstrapping aspect"
- @wking
- @LalatenduMohanty
approvers:
- @sdodson
- @deads2k
- @bparees
api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None"
- None
creation-date: 2021-07-14
last-updated: yyyy-mm-dd
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
- none
Member

Could possibly link openshift/installer#6124 if there are no Jira trackers? That got reverted in openshift/installer#6503, but presumably whichever pull request restores it with an adjusted threshold will also link 6124.

see-also:
- "/enhancements/this-other-neat-thing.md"
Contributor

You could remove this


And all the comments above

Contributor

I now have a desire to place something awesome at this location :)

replaces:
- n/a
superseded-by:
- n/a
---

# Cluster Operator Installation Check

## Summary

This enhancement proposes that the OpenShift Installer should check the status
of individual cluster operators and use those statuses as a criterion for
determining whether an install is a success (exit 0) or failure (exit != 0).
Specifically, the Installer should check whether cluster operators have stopped
progressing for a certain amount of time. If the Installer sees that an operator
is Available=true but fails to enter a stable Progressing=false state, the Installer
Member

Wondering if you have considered when an operator is degraded for a certain amount of time.

will exit non-zero and log an error.

## Motivation

This enhancement allows the Installer to identify and catch a class of errors
which are currently going unchecked. Without this enhancement, the Installer can
(and does) declare clusters successfully installed when the cluster has failing components.
Member

Suggested change:
- (and does) declare clusters successfully installed when the cluster has failing components.
+ (and does) declare clusters successfully installed when the cluster still has progressing components.

Just because failing here is not always the case. MAPI doesn't consider it a failure that you only have 4 of 5 worker instances because AWS is out of your instance type at the moment.

Contributor Author

I didn’t realize this enhancement would cause installs to fail if a machine is not properly provisioned, but that makes sense. Is that correct?

Contributor
@bparees, Jul 15, 2022

regardless of MAPI behavior i agree w/ the sentiment behind the suggested edit. All we know about the components is they are still progressing, we don't actually know that they are failing (and if they were truly failing, they should probably be setting available=false or at least degraded=true).

In my mind this is more about ensuring that when we tell the user their cluster is ready, it's actually ready and not still tidying up some things, than it is about reporting a failure that we currently ignore (not that that isn't also useful).

Do we actually have any data points about how often clusters never "settle", such that these changes to the installer would result in reported failures that today are ignored?

I think the more interesting aspect is that w/ this EP we'll no longer report install complete "prematurely"

Member
@sdodson, Jul 21, 2022

Yeah, that's a good summary of why I suggested this change. I don't agree that still progressing is the same as failing.

WRT timing, the only thing that I know is service delivery waits up to 20 minutes before handing things off to the customer no matter what the result of their supplemental readiness checks. I'll ask if they've collected data on how long they generally wait and how often they hit the 20 minute cap.

Member

(and if they were truly failing, they should probably be setting available=false or at least degraded=false).

@bparees I guess you meant degraded=true

Contributor

@bparees I guess you meant degraded=true

whoops, yes, of course. will edit original comment.

Adding a check for cluster operator stability will allow normal users and managed services teams
to more easily identify and troubleshoot issues when they encounter this class of failure. The change
will also help the Technical Release Team identify this class of problem in CI and triage
issues to the appropriate operator development team.

### User Stories


can we think about providing a library with such functionality - so it can be reused in AI as well?

Member

What functionality specifically? Scanning operator conditions and ensuring they've all settled for at least 30s?


As an openshift-install user, I want the Installer to tell me when an operator never stops
progressing so that I can check whether the operator has an issue.

As a member of the technical release team, I want the Installer to exit non-zero when
an operator never stops progressing so that I can triage operator bugs.
Member

You can add another use case which SRE has around the installer. They cannot deploy workloads immediately after the installation because the cluster is not ready, so they had to develop extra code that checks the cluster to see if the operators are no longer progressing, which is inconvenient (same for our customers). cc @cblecker


### Goals

* Installer correctly identifies a failed cluster install where cluster operators are not stable
Member

Suggested change:
- * Installer correctly identifies a failed cluster install where cluster operators are not stable
+ * Installer correctly identifies a progressing cluster install where cluster operators are not stable

* Installer exits non-zero in those situations and delivers a meaningful error message

### Non-Goals

* Installer handling other classes of errors, such as failure to provision machines by MAO
Member

Suggested change:
- * Installer handling other classes of errors, such as failure to provision machines by MAO
+ * Installer handling of specific errors, such as failure to provision machines by MAO. The Installer only reports what ClusterOperators convey.


## Proposal

### Workflow Description

The cluster admin will begin installing the cluster as usual. The installation workflow will be:

1. Cluster initializes as usual
2. As usual, the installer checks that the cluster version is Available=True, Progressing=False, Degraded=False
Contributor

clusterversion doesn't assert a degraded condition, should this be failing?

Member

Probably; even though that's not yet wired up, it was agreed to be.
Need to chase https://bugzilla.redhat.com/show_bug.cgi?id=1951835 and openshift/cluster-version-operator#662

@patrickdillon @jottofar let's make sure we sync up on this.

Contributor Author

clusterversion doesn't assert a degraded condition, should this be failing?

Yes, you are correct; the installer code checks against the Failing condition.

3. Installer checks status of each cluster operator for stability
Member

For reference, here are the things we are currently looking at today with OSD/ROSA to try to determine when a cluster install has actually finished:
https://github.com/openshift/osde2e/blob/71f4013df9420fc961104a9388e42bc7c6af9b2f/pkg/common/cluster/clusterutil.go#L430-L470

4. If a cluster operator is not stable, the installer logs an error and exits non-zero (see the sketch at the end of this workflow discussion)
Contributor

Can we get a summary of all the things the installer looks at today to determine that the installation is done (or failed), so we can better understand the delta+implications of that delta being proposed here?

Contributor Author

Serially, as I posted above, I believe it is:

Checks for bootstrap complete configmap. If that's good:
Checks for API. If that's good:
Checks cluster version. If that's good:
Enhancement: Sets a 5-minute context deadline, then watches each CO to ensure they achieve Progressing=False with lastTransitionTime more than 30 seconds ago. If that's good:
Checks console availability

Contributor

Can we fold the console check into the proposed CO checks?

What do we do in a cluster where the console is disabled?

Contributor

Checks console availability

is this checked by looking at the console CO conditions, or by explicitly trying to access the console url?

Member

What do we do in a cluster where the console is disabled?

Since openshift/installer#5336, the installer will warning-log Cluster does not have a console available... and move on, without failing. For example, here is output from a no-caps run:

level=info msg=Checking to see if there is a route at openshift-console/console...
level=warning msg=Cluster does not have a console available: could not get openshift-console URL
level=info msg=Install complete!
level=info msg=To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/tmp/installer/auth/kubeconfig'
level=info msg=Time elapsed: 29m45s
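
To tie the serial sequence enumerated in the thread above together, here is a minimal, purely illustrative Go sketch. The package, type, and function names are invented for this example and are not the installer's actual code; it only shows the shape of running the checks in order, with the proposed operator-stability check slotting in between the ClusterVersion check and the warn-only console lookup.

```
// Purely illustrative outline; none of these names come from the installer.
// It only shows the shape of the serial sequence discussed above: bootstrap
// ConfigMap, API reachability, ClusterVersion, the proposed ClusterOperator
// stability check, and finally the warn-only console-route lookup.
package installwait

import (
	"context"
	"fmt"
)

// step is one phase of the install-complete sequence.
type step struct {
	name     string
	warnOnly bool // e.g. the console-route lookup, which warns instead of failing
	check    func(ctx context.Context) error
}

// run executes the steps in order and stops at the first hard failure, which
// is what would let openshift-install exit non-zero naming the failed phase.
func run(ctx context.Context, steps []step) error {
	for _, s := range steps {
		if err := s.check(ctx); err != nil {
			if s.warnOnly {
				fmt.Printf("warning: %s: %v\n", s.name, err)
				continue
			}
			return fmt.Errorf("install-complete check %q failed: %w", s.name, err)
		}
	}
	return nil
}
```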


### API Extensions

None

### Implementation Details/Notes/Constraints [optional]

The important detail is how we define _stable_. The definition proposed here is:

```
If a cluster operator does not maintain Progressing=False for at least 30 seconds
during a five-minute period, it is unstable.
```
Contributor

can we get some more detail on how this will work in practice?

At what point will the installer start the "5 minute window" for example?

And presumably it will be looking for a 30 second period during which all operators are progressing=false, meaning you could have a situation where:

  1. installer starts watching
  2. all operators are progressing=false
  3. 25 seconds go by
  4. operatorA goes progressing=true
  5. some time passes (less than 5 mins total have expired)
  6. operator A goes progressing=false
  7. 25 seconds go by
  8. operator B goes progressing=true
  9. some time passes (less than 5 mins total have expired)
  10. operator B goes progressing=false
  11. 25 seconds go by
  12. operatorA (or C) goes progressing=true
    etc.

right?

Not saying that's a problem, just trying to understand how the detection/determination will work in practice.

Member

Do we trust lastTransitionTime?

  1. Wait for the current install-complete signal
  2. Start a five-minute timer
  3. Every 5s, check each ClusterOperator for Progressing=False and lastTransitionTime < now - 30s?
  4. Exit non-zero at the end of the five-minute timer

Contributor Author
@patrickdillon, Jul 15, 2022

The implementation is here:
https://github.com/openshift/installer/pull/6124/files#diff-56276d5381d618d46ec8d35d93210662c8fdd4c9bcd90fd36afbc6a59227eb0bR713-R738

The intent is pretty much what Scott said, except we don't necessarily wait the whole 5 minutes. lastTransitionTime < now - 30s is the level for stability; once that is cleared, the CO is cleared. If all COs are "stable" on the first check, we exit immediately and the whole check takes seconds at most.

The 5s detail in point 3 is abstracted away into library-go & watch events.

I will update these crucial details in the enhancement.

Contributor

Scott's explanation in https://github.com/openshift/enhancements/pull/1189/files#r921528565 makes sense to me. I think the logic in the PR may have a bug. I commented.
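
For illustration, here is a rough sketch of the stability check described in this thread. It polls with a ticker instead of the library-go watch events the actual PR (openshift/installer#6124) uses, and the package name, function names, and client wiring are assumptions for the example rather than the installer's real code; only the openshift/api condition types and the 30-second / five-minute values come from the discussion above.

```
// Sketch only: the package name, function names, and wiring below are
// assumptions for this example, not the installer's actual code. The real
// implementation uses library-go watch events rather than this polling loop.
package operatorcheck

import (
	"context"
	"fmt"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	stabilityWindow = 30 * time.Second // Progressing=False must have held at least this long
	overallTimeout  = 5 * time.Minute  // stop waiting and report instability after this
)

// waitForStableOperators returns nil once every ClusterOperator has reported
// Progressing=False for at least stabilityWindow; otherwise it returns an
// error naming the operators that had not settled when overallTimeout expired.
func waitForStableOperators(ctx context.Context, client configclient.Interface) error {
	ctx, cancel := context.WithTimeout(ctx, overallTimeout)
	defer cancel()

	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		cos, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
		if err != nil {
			return err
		}
		var unstable []string
		for _, co := range cos.Items {
			if !stableFor(co, stabilityWindow) {
				unstable = append(unstable, co.Name)
			}
		}
		if len(unstable) == 0 {
			// All operators have settled; on a healthy install this returns on
			// the first pass, so the check adds only seconds.
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("cluster operators %v did not hold Progressing=False for %s", unstable, stabilityWindow)
		case <-ticker.C:
		}
	}
}

// stableFor reports whether the operator's Progressing condition has been
// False for at least window, trusting the condition's lastTransitionTime.
func stableFor(co configv1.ClusterOperator, window time.Duration) bool {
	for _, cond := range co.Status.Conditions {
		if cond.Type == configv1.OperatorProgressing {
			return cond.Status == configv1.ConditionFalse &&
				time.Since(cond.LastTransitionTime.Time) >= window
		}
	}
	return false
}
```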


### Risks and Mitigations

One risk would be a false positive: the Installer identifies that a cluster
operator is unstable, but it turns out the operator is perfectly healthy;
the install was declared a failure but was actually successful. This risk
seems low and could be managed by monitoring these specific failures
Contributor

Agreed. It would also be an operator bug, not an installer bug.

in CI.

### Drawbacks

This enhancement has some significant potential drawbacks in terms of design:

A fundamental design principle of 4.x and the operator framework has been
delegating responsibility away from the Installer to allow clean separation
of concerns and better maintainability. Without this enhancement, it is the
Contributor

Suggested change:
- of concerns and better maintainbility. Without this enhancement, it is the
+ of concerns and better maintainability. Without this enhancement, it is the

responsibility of the Cluster-Version Operator to determine whether given
Cluster Operators constitute a successful version. The idea of keeping the
cluster-version operator as the single responsible party is discussed in the
alternatives section.
Contributor

another aspect to this is it opens the door for differing behavior between which clusteroperators CVO cares about, vs installer.

Today my understanding is the CVO only watches clusteroperator objects that it created (came out of manifests in the payload). If another component creates its own CO for some reason, that will have no bearing on CVO's observation of the cluster and reporting of available/progressing/failing conditions. I imagine the installer is not going to make that distinction (though perhaps it could, via looking at annotations on the COs), so we could potentially end up with a situation where the installer has one view of "what matters" and the CVO has a different one.

Member

@bparees Which of those, in the context of installation, makes the most sense given where we're going in the future?

Contributor

Given that the installer has nothing to do w/ upgrades, and i'd expect similar semantics to function around upgrades as they do around install (in terms of when we think it is "done"), i'd be inclined towards the CVO owning this aspect over the installer.

As for future, i assume you mean w/ respect to platform operators? So far the PO plan is to have the platform operator manager proxy the individual PO status, via a single CO. So PO doesn't really care whether it's CVO or installer that is watching the COs. But part of why the POM has to proxy the status to a single CO is because the CVO doesn't watch non-payload COs. I could see us changing that in the future, which is another reason i'd hate to see us have two different components watching potentially different sets of COs to make similar decisions.


Following the previous point, this enhancement could introduce a slippery slope.
Before this enhancement, the Installer would simply query the CVO; now we
are adding additional conditional logic. Introducing this conditional logic leads
to the risk of further conditions, which could make the install process more brittle
and introduce the potential for false positives or other failures.

Does implementing this enhancement address symptoms of issues with operator status definitions?
Should operators themselves be setting Degraded=True when they don't meet this stability criterion?
Contributor

I don't think so. Progressing is about moving from one steady state to another. Degraded is about "something is broken". They serve different purposes and some progressing states easily last longer than 5 minutes. For instance, rolling out a new kube-apiserver level takes 15-20 minutes.


As we have seen with other timeouts in the Installer, developers and users will want to change these.
We should define a process for refining our stability criteria.
Member

Via another five minutes of openshift-install wait-for stable-operators? Given all of our wait-for conditions, I'm not too worried about knobs for the timeout.


## Design Details

### Open Questions [optional]

1. What is the correct definition of a stable operator? More importantly, how can we
refine this definition?
Contributor

For purposes of this EP and the installer's perspective, it sounds like you're defining it as "progressing=false for 5 minutes"?

Is the installer going to look at any other conditions on the operator, e.g. available=false, or will it continue to rely on the CVO to proxy that?

Member

I think it's Progressing=False for the last 30 seconds, but only waiting up to 5 minutes for all COs to achieve that. We only wait up to five minutes because these aren't failed installs; the cluster is, in all cases I'm aware of, a viable cluster that just differs from the exact spec defined in install-config.

Contributor

yeah sorry i described it badly. but i'm still interested in what the full set of conditions that determine "install is complete (and healthy or unhealthy)" are, beyond this "30s window of progressing=false"

Contributor Author
@patrickdillon, Jul 14, 2022

Is the installer going to look at any other conditions on the operator, e.g. available=false, or it will continue to rely on the CVO to proxy that?

No, it would continue to rely on CVO as a proxy which is a good way of putting it.

<edited to copy/paste/fix here: https://github.com/openshift/enhancements/pull/1189#discussion_r921698993>

Contributor

I'm happy with the definition as put forward by this enhancement plus the existing logic that requires A=t and D=f.

2. Should this logic belong in the Installer, CVO, or another component?
Contributor

I don't have a strong opinion about this one. The logic described here seems fine. I'd also be fine seeing the CVO do it.


### Test Plan

This code would go directly into the installer and be exercised by e2e tests.
Member

By a synthetic CO that immediately becomes Available=True fulfilling the historic requirement for install-complete but maintains Progressing=True for 30 seconds after? Do we have tests that watch the cluster during the install or is this new ground? I think @deads2k has called these observers?

Contributor

I think @deads2k has called these observers?

We don't have e2e observers yet, but if they existed it could work. I built a backstop into our pre-test CI step that we can change from "wait five minutes to settle" to "fail if anything is not settled". If it trips, this feature has a bug.
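
If such a synthetic ClusterOperator were ever built for a test, a minimal helper to construct it might look like the following sketch. The helper name and package are hypothetical and no such fixture is implied to exist; only the openshift/api types are real.

```
// Hypothetical test helper; no such fixture exists as far as this enhancement
// states. It builds the ClusterOperator described above: Available=True right
// away (satisfying the historic install-complete requirement) while still
// Progressing=True, which the proposed stability check should refuse to treat
// as settled.
package operatorcheck

import (
	"time"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func newSyntheticClusterOperator(name string, since time.Time) *configv1.ClusterOperator {
	ts := metav1.NewTime(since)
	return &configv1.ClusterOperator{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Status: configv1.ClusterOperatorStatus{
			Conditions: []configv1.ClusterOperatorStatusCondition{
				{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue, LastTransitionTime: ts},
				{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, LastTransitionTime: ts},
				{Type: configv1.OperatorDegraded, Status: configv1.ConditionFalse, LastTransitionTime: ts},
			},
		},
	}
}
```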


### Graduation Criteria



#### Removing a deprecated feature

N/A

### Upgrade / Downgrade Strategy

The cluster-version operator can institute similar logic and trigger alerts when
operator stability is in question during an upgrade. See alternatives section.

### Version Skew Strategy

N/A

### Operational Aspects of API Extensions

N/A

#### Failure Modes

- The worst case failure mode is that the Installer throws an error when there is not an actual
problem with a cluster. In this case, an admin would need to investigate the error or automation would
need to rerun the install. We would hope to eliminate these failures by monitoring CI.
Member

Would this failed state cause a full TF destroy/retry in its interactions with hive?


#### Support Procedures

We should instruct users on how to debug or review their operators when this error occurs.

## Implementation History

N/A

## Alternatives

We should seriously consider whether this logic should go into the cluster-version
operator rather than the installer. (Or if it should be in both, perhaps just the CVO
for upgrades.) Management of Cluster Operators has been the responsibility of the CVO
and the Installer should defer responsibility where possible to maintain separation
of concerns.