NE-2066: Set degraded=true when OSSM 3 can't be installed#1268
Conversation
|
@rfredette: This pull request references NE-2066 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@rfredette: This pull request references NE-2066 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
61af22a to
797c13d
Compare
| case versionDiff < 0: | ||
| // Installed version is newer than expected. Gateway API install may still work if the correct Istio | ||
| // version is supported. | ||
| log.Info("found newer OSSM version than expected. Gateway API install may not work as intended", "installed", subscription.Status.InstalledCSV, "expected", state.expectedGatewayAPIOperatorVerison) |
There was a problem hiding this comment.
Better to put something in the status condition message rather than logging it on every status reconciliation. (This might require changes to joinConditions to preserve the message even if the status is False.)
There was a problem hiding this comment.
So, it will be Degraded=False but with Found newer OSSM version than expected message. Will this sound contradictory?
There was a problem hiding this comment.
I intend to keep the Gateway API install may not work as intended part of the message, which hopefully at least gives warning that something may be wrong.
In the future, we can be more certain if the ingress operator needs to be degraded by checking if the version of Istio that it needs is still supported by the newer OSSM version. With that, we can do away with the vague warning about gateway API possibly working or possibly not. I don't think I have the time to include that in this PR, but I created NE-2119 to track that work.
There was a problem hiding this comment.
I think it is fine to have Degraded with status False and a message that says it isn't necessarily degraded but the subscription isn't exactly what we expected. I think that's good enough.
If we get ambitious, maybe it would make sense to add a metric and an alerting rule? That would make the communication a little more explicit and difficult to miss, but it could also be annoying, and it might be more work than we have time to do right now.
There was a problem hiding this comment.
I intend to keep the
Gateway API install may not work as intendedpart of the message, which hopefully at least gives warning that something may be wrong.
Right, I meant the whole message may look contradictory because "may not work as intended" part look serious enough to justify Degraded=True.
In the future, we can be more certain if the ingress operator needs to be degraded by checking if the version of Istio that it needs is still supported by the newer OSSM version. With that, we can do away with the vague warning about gateway API possibly working or possibly not.
Yes, checking the Istio compatibility in combination with the OSSM operator's N-X guarantees (by the way, is X==3?) is the best way to determine the degraded condition. Until we have the Istio compatibility check implemented, staying with a log message seems like a reasonable compromise: tt doesn't alert the end user but gives engineering a hint. I just feel like the current message may look a little indecisive, as if we don't really know whether the OSSM operator can support older versions.
There was a problem hiding this comment.
It's intentionally at least a bit indecisive; The idea is to say that the ingress operator isn't necessarily degraded, but you're in uncharted waters. I'm open to changing the message if we can find a way to be clearer about that, though
|
/assign |
|
/assign @alebedev87 |
|
@rfredette : I forgot to mention that |
ac0ef6c to
fc62e72
Compare
|
/retest required |
|
@candita: The The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
alebedev87
left a comment
There was a problem hiding this comment.
Last round, about the wording of the Degraded messages.
| Type: configv1.OperatorDegraded, | ||
| Status: configv1.ConditionTrue, | ||
| Reason: "GatewayAPIInstallConflict", | ||
| Message: "Package sailoperator from subscription foo/sailoperator prevents enabling operator-managed Gateway API. Uninstall foo/sailoperator to enable.", |
There was a problem hiding this comment.
This message still somewhat assumes that the sailoperator is installed before the user enables the platform-managed Gateway API (i.e. before the GatewayClass is created). However, another scenario is when the CIO-managed OSSM is already installed (and therefore enabled), in which case the sailoperator’s installation may conflict with the currently running OSSM but does not prevent its enablement, since it has already occurred.
There was a problem hiding this comment.
I could add a check for if InstalledCSV is set (with the understanding that we also add a check of the CSV phase in a later iteration, as discussed elsewhere). If the subscription exists but hasn't installed anything, there shouldn't be anything that's actually conflicting, but if something was installed, we do need the user to remove it.
There was a problem hiding this comment.
I didn't mean to check the installedCVS. What I meant is the wording: "enabling" and "to enable" work only in the case when the CIO managed subscription tries to be created after a conflicting subscription. There is still a case when, for instance, a sailoperator is trying t be installed after the CIO managed OSSM was already installed. In that case the wording "to enable" is inaccurate because the CIO is already enabled.
| Type: configv1.OperatorDegraded, | ||
| Status: configv1.ConditionTrue, | ||
| Reason: "GatewayAPIInstallConflict", | ||
| Message: "Installed version servicemeshoperator3.v3.0.0 is too old to support operator-managed Gateway API. Install version servicemeshoperator3.v3.1.0 or uninstall foo/servicemeshoperator3 to enable.\nPackage sailoperator from subscription foo/sailoperator prevents enabling operator-managed Gateway API. Uninstall foo/sailoperator to enable.", |
There was a problem hiding this comment.
The message for an older subscription says "Install version servicemeshoperator3.v3.1.0 or uninstall foo/servicemeshoperator3 to enable". However the test scenario already has v3.1.0 subscription owned by the operator:
ossmSubscriptions: []operatorsv1alpha1.Subscription{
sub("servicemeshoperator3", "servicemeshoperator3", "servicemeshoperator3.v3.1.0", true),
sub("servicemeshoperator3", "servicemeshoperator3", "servicemeshoperator3.v3.0.0", false),
sub("sailoperator", "sailoperator", "sailoperator.v1.0.0", false),
},Should we add something like "Install version servicemeshoperator3.v3.1.0 (if doesn't exist)" or try to detect such a situation and say just "Uninstall foo/servicemeshoperator3"?
There was a problem hiding this comment.
Technically I don't believe they can both be installed simultaneously; ossm3 and sail both claim the same CRDs, and OLM prevents installing any operator that claims CRDs already claimed by an installed operator. I should probably adjust this test scenario to only have one of the installs successful, but I think the actual message is fine
| var aName, bName string | ||
| var aX, aY, aZ, bX, bY, bZ int | ||
| aSplit := strings.Split(a, ".") | ||
| if len(aSplit) != 4 { |
There was a problem hiding this comment.
Should we be concerned that we don't really have any control over the version formatting here? See https://github.com/openshift/cluster-ingress-operator/pull/1268/files#r2289636868.
There was a problem hiding this comment.
It's worth looking into, but I'm not sure I have the time to do so before this deadline. Can we agree that this is good enough for now, and I will create a followup task to future proof this in the next iteration?
There was a problem hiding this comment.
That's fine with me if you don't set Degraded=true when there is a formatting error, i.e. add the message to warnings rather than conflicts.
| case err != nil: | ||
| conflicts = append(conflicts, fmt.Sprintf("failed to compare installed OSSM version to expected: %v", err)) |
There was a problem hiding this comment.
It would be better to future-proof this so we didn't start getting Degraded=true if a version formatting change happened without our knowledge (or we forgot to check this with every update). How about if there's an error, we set this to be a warning?
15bdd86 to
bc5ba2e
Compare
| if aName != bName { | ||
| return 0, fmt.Errorf("%q and %q are different packages. cannot compare version numbers", a, b) | ||
| } |
There was a problem hiding this comment.
It would be better to check for the same package name before we even call this function.
There was a problem hiding this comment.
This is more of a safety check. We shouldn't hit this in actual use.
There was a problem hiding this comment.
You're sure we don't end up comparing "sailoperatorv1.0.0" with "servicemeshoperatorv3.1.0", for example?
There was a problem hiding this comment.
Yeah. Currently, compareVersionNums only used in the case where subscription.Spec.Package is servicemeshoperator3, and it's compared against the expected version, which will be servicemeshoperator3.v<version>. It's possible someone could change the expected version, since it's a field set in the ingress operator deployment, but that's not really supported.
bc5ba2e to
114f3e9
Compare
Detect subscriptions that would prevent the Ingress Operator from installing OSSM 3, and set the operator's degraded condition to true when any of those subscriptions are present. This is the implementation of NE-2066
114f3e9 to
1b58225
Compare
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: candita The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
e2e-gcp-operator is failing with the designated override reason, so I will override it here. /override ci/prow/e2e-gcp-operator |
|
@candita: Overrode contexts on behalf of candita: ci/prow/e2e-gcp-operator DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@rfredette: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
b966710
into
openshift:master
|
[ART PR BUILD NOTIFIER] Distgit: ose-cluster-ingress-operator |
Detect subscriptions that would prevent the Ingress Operator from installing OSSM 3, and set the operator's degraded condition to true when any of those subscriptions are present.
This is the implementation of NE-2066