NE-2066: Set degraded=true when OSSM 3 can't be installed#1268

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master from rfredette:ne-2066-conflicting-subs
Aug 22, 2025
Conversation

@rfredette
Contributor

@rfredette rfredette commented Aug 13, 2025

Detect subscriptions that would prevent the Ingress Operator from installing OSSM 3, and set the operator's degraded condition to true when any of those subscriptions are present.

This is the implementation of NE-2066

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 13, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 13, 2025

@rfredette: This pull request references NE-2066 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from knobunc and miheer August 13, 2025 19:21
@rfredette
Contributor Author

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 13, 2025

@rfredette: This pull request references NE-2066 which is a valid jira issue.


@rfredette rfredette force-pushed the ne-2066-conflicting-subs branch from 61af22a to 797c13d Compare August 13, 2025 19:35
case versionDiff < 0:
	// Installed version is newer than expected. Gateway API install may still work if the correct Istio
	// version is supported.
	log.Info("found newer OSSM version than expected. Gateway API install may not work as intended", "installed", subscription.Status.InstalledCSV, "expected", state.expectedGatewayAPIOperatorVerison)
Contributor

Better to put something in the status condition message rather than logging it on every status reconciliation. (This might require changes to joinConditions to preserve the message even if the status is False.)

Contributor

@alebedev87 alebedev87 Aug 14, 2025

So, it will be Degraded=False but with a "Found newer OSSM version than expected" message. Won't this sound contradictory?

Contributor Author

I intend to keep the "Gateway API install may not work as intended" part of the message, which hopefully at least gives a warning that something may be wrong.

In the future, we can be more certain if the ingress operator needs to be degraded by checking if the version of Istio that it needs is still supported by the newer OSSM version. With that, we can do away with the vague warning about gateway API possibly working or possibly not. I don't think I have the time to include that in this PR, but I created NE-2119 to track that work.

Contributor

I think it is fine to have Degraded with status False and a message that says it isn't necessarily degraded but the subscription isn't exactly what we expected. I think that's good enough.

If we get ambitious, maybe it would make sense to add a metric and an alerting rule? That would make the communication a little more explicit and difficult to miss, but it could also be annoying, and it might be more work than we have time to do right now.

Contributor

> I intend to keep the Gateway API install may not work as intended part of the message, which hopefully at least gives warning that something may be wrong.

Right, I meant the whole message may look contradictory because the "may not work as intended" part looks serious enough to justify Degraded=True.

> In the future, we can be more certain if the ingress operator needs to be degraded by checking if the version of Istio that it needs is still supported by the newer OSSM version. With that, we can do away with the vague warning about gateway API possibly working or possibly not.

Yes, checking the Istio compatibility in combination with the OSSM operator's N-X guarantees (by the way, is X==3?) is the best way to determine the degraded condition. Until we have the Istio compatibility check implemented, staying with a log message seems like a reasonable compromise: it doesn't alert the end user but gives engineering a hint. I just feel like the current message may look a little indecisive, as if we don't really know whether the OSSM operator can support older versions.

Contributor Author

It's intentionally at least a bit indecisive; the idea is to say that the ingress operator isn't necessarily degraded, but you're in uncharted waters. I'm open to changing the message if we can find a way to be clearer about that, though.
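
For illustration, the outcome discussed in this thread (Degraded=False that still carries a warning message) could be sketched roughly as follows. This is a minimal sketch, not the PR's code: the `condition` struct is a simplified stand-in for the operator's real condition type, and `degradedCondition` and the `GatewayAPIInstallWarning` reason are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// condition is a simplified, hypothetical stand-in for the operator's
// status condition type; only the fields discussed in the thread are kept.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// degradedCondition reports Degraded=True when at least one conflict exists.
// When there are only warnings, the status stays False but the warnings are
// preserved in the message so they remain visible to the user.
func degradedCondition(conflicts, warnings []string) condition {
	c := condition{Type: "Degraded", Status: "False"}
	switch {
	case len(conflicts) > 0:
		c.Status = "True"
		c.Reason = "GatewayAPIInstallConflict"
		// Copy before appending so the caller's slices are not modified.
		all := append(append([]string{}, conflicts...), warnings...)
		c.Message = strings.Join(all, "\n")
	case len(warnings) > 0:
		c.Reason = "GatewayAPIInstallWarning" // hypothetical reason string
		c.Message = strings.Join(warnings, "\n")
	}
	return c
}

func main() {
	c := degradedCondition(nil, []string{
		"Found newer OSSM version than expected. Gateway API install may not work as intended.",
	})
	fmt.Printf("%s=%s: %s\n", c.Type, c.Status, c.Message)
}
```

Keeping conflicts and warnings as separate lists means the Degraded status depends only on conflicts, while the message can still surface the "uncharted waters" case.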

@candita
Contributor

candita commented Aug 13, 2025

/assign
/assign @Miciah

@candita
Contributor

candita commented Aug 13, 2025

/assign @alebedev87

@alebedev87
Contributor

@rfredette: I forgot to mention that the Test_computeOperatorDegradedCondition unit test should be updated to cover the changes.

@rfredette rfredette force-pushed the ne-2066-conflicting-subs branch 3 times, most recently from ac0ef6c to fc62e72 Compare August 15, 2025 19:05
@candita
Contributor

candita commented Aug 15, 2025

/retest required

@openshift-ci
Contributor

openshift-ci bot commented Aug 15, 2025

@candita: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test e2e-aws-operator
/test e2e-aws-ovn
/test e2e-aws-ovn-hypershift-conformance
/test e2e-aws-ovn-serial
/test e2e-aws-ovn-upgrade
/test e2e-azure-operator
/test e2e-gcp-operator
/test e2e-hypershift
/test hypershift-e2e-aks
/test images
/test okd-scos-images
/test unit
/test verify
/test verify-deps

The following commands are available to trigger optional jobs:

/test e2e-aws-gatewayapi-conformance
/test e2e-aws-operator-techpreview
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-techpreview
/test e2e-azure-manual-oidc
/test e2e-azure-ovn
/test e2e-gcp-ovn
/test e2e-ibmcloud-operator
/test e2e-openstack-operator
/test okd-scos-e2e-aws-ovn

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator-techpreview
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-hypershift-conformance
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-serial
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-single-node
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-techpreview
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-upgrade
pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-operator
pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-ovn
pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator
pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-ovn
pull-ci-openshift-cluster-ingress-operator-master-e2e-hypershift
pull-ci-openshift-cluster-ingress-operator-master-hypershift-e2e-aks
pull-ci-openshift-cluster-ingress-operator-master-images
pull-ci-openshift-cluster-ingress-operator-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-cluster-ingress-operator-master-okd-scos-images
pull-ci-openshift-cluster-ingress-operator-master-unit
pull-ci-openshift-cluster-ingress-operator-master-verify
pull-ci-openshift-cluster-ingress-operator-master-verify-deps
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

@alebedev87 alebedev87 left a comment

Last round, about the wording of the Degraded messages.

Type: configv1.OperatorDegraded,
Status: configv1.ConditionTrue,
Reason: "GatewayAPIInstallConflict",
Message: "Package sailoperator from subscription foo/sailoperator prevents enabling operator-managed Gateway API. Uninstall foo/sailoperator to enable.",
Contributor

This message still somewhat assumes that the sailoperator is installed before the user enables the platform-managed Gateway API (i.e. before the GatewayClass is created). However, another scenario is when the CIO-managed OSSM is already installed (and therefore enabled), in which case the sailoperator’s installation may conflict with the currently running OSSM but does not prevent its enablement, since it has already occurred.

Contributor Author

I could add a check for whether InstalledCSV is set (with the understanding that we also add a check of the CSV phase in a later iteration, as discussed elsewhere). If the subscription exists but hasn't installed anything, there shouldn't be anything that's actually conflicting, but if something was installed, we do need the user to remove it.

Contributor

I didn't mean checking InstalledCSV. What I meant is the wording: "enabling" and "to enable" work only in the case where the CIO-managed subscription is created after a conflicting subscription. There is still the case where, for instance, a sailoperator is being installed after the CIO-managed OSSM was already installed. In that case the wording "to enable" is inaccurate because the CIO-managed OSSM is already enabled.

Type: configv1.OperatorDegraded,
Status: configv1.ConditionTrue,
Reason: "GatewayAPIInstallConflict",
Message: "Installed version servicemeshoperator3.v3.0.0 is too old to support operator-managed Gateway API. Install version servicemeshoperator3.v3.1.0 or uninstall foo/servicemeshoperator3 to enable.\nPackage sailoperator from subscription foo/sailoperator prevents enabling operator-managed Gateway API. Uninstall foo/sailoperator to enable.",
Contributor

@alebedev87 alebedev87 Aug 20, 2025

The message for an older subscription says "Install version servicemeshoperator3.v3.1.0 or uninstall foo/servicemeshoperator3 to enable". However the test scenario already has v3.1.0 subscription owned by the operator:

ossmSubscriptions: []operatorsv1alpha1.Subscription{
	sub("servicemeshoperator3", "servicemeshoperator3", "servicemeshoperator3.v3.1.0", true),
	sub("servicemeshoperator3", "servicemeshoperator3", "servicemeshoperator3.v3.0.0", false),
	sub("sailoperator", "sailoperator", "sailoperator.v1.0.0", false),
},

Should we add something like "Install version servicemeshoperator3.v3.1.0 (if it doesn't exist)" or try to detect such a situation and just say "Uninstall foo/servicemeshoperator3"?

Contributor Author

Technically, I don't believe they can both be installed simultaneously; ossm3 and sail both claim the same CRDs, and OLM prevents installing any operator that claims CRDs already claimed by an installed operator. I should probably adjust this test scenario to have only one of the installs succeed, but I think the actual message is fine.

var aName, bName string
var aX, aY, aZ, bX, bY, bZ int
aSplit := strings.Split(a, ".")
if len(aSplit) != 4 {
Contributor

@candita candita Aug 21, 2025

Should we be concerned that we don't really have any control over the version formatting here? See https://github.com/openshift/cluster-ingress-operator/pull/1268/files#r2289636868.

Contributor Author

It's worth looking into, but I'm not sure I have the time to do so before this deadline. Can we agree that this is good enough for now, and I will create a followup task to future proof this in the next iteration?

Contributor

That's fine with me if you don't set Degraded=true when there is a formatting error, i.e. add the message to warnings rather than conflicts.

Comment on lines 652 to 662
case err != nil:
	conflicts = append(conflicts, fmt.Sprintf("failed to compare installed OSSM version to expected: %v", err))
Contributor

It would be better to future-proof this so we don't start getting Degraded=true if a version formatting change happens without our knowledge (or we forget to check this with every update). How about making this a warning when there's an error?

@rfredette rfredette force-pushed the ne-2066-conflicting-subs branch from 15bdd86 to bc5ba2e Compare August 21, 2025 17:41
Comment on lines +1011 to +1013
if aName != bName {
	return 0, fmt.Errorf("%q and %q are different packages. cannot compare version numbers", a, b)
}
Contributor

It would be better to check for the same package name before we even call this function.

Contributor Author

This is more of a safety check. We shouldn't hit this in actual use.

Contributor

You're sure we don't end up comparing "sailoperatorv1.0.0" with "servicemeshoperatorv3.1.0", for example?

Contributor Author

Yeah. Currently, compareVersionNums is only used in the case where subscription.Spec.Package is servicemeshoperator3, and the installed version is compared against the expected version, which will be servicemeshoperator3.v<version>. It's possible someone could change the expected version, since it's a field set in the ingress operator deployment, but that's not really supported.
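
To make the format assumption concrete, here is a minimal sketch of comparing two CSV names of the form `name.vX.Y.Z`. It is not the PR's compareVersionNums: the helper names `splitCSVName` and `compareCSVVersions` are hypothetical, and the return convention (negative when the first argument is older) is chosen only for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitCSVName splits a CSV name such as "servicemeshoperator3.v3.1.0" into
// the package name and its three numeric version components. Any other
// layout is reported as an error instead of being guessed at.
func splitCSVName(s string) (string, [3]int, error) {
	var nums [3]int
	parts := strings.Split(s, ".")
	if len(parts) != 4 || !strings.HasPrefix(parts[1], "v") {
		return "", nums, fmt.Errorf("%q does not match name.vX.Y.Z", s)
	}
	raw := []string{strings.TrimPrefix(parts[1], "v"), parts[2], parts[3]}
	for i, r := range raw {
		n, err := strconv.Atoi(r)
		if err != nil {
			return "", nums, fmt.Errorf("%q has a non-numeric version component %q", s, r)
		}
		nums[i] = n
	}
	return parts[0], nums, nil
}

// compareCSVVersions returns a negative number if a is older than b, zero if
// the versions are equal, and a positive number if a is newer than b.
func compareCSVVersions(a, b string) (int, error) {
	aName, aNums, err := splitCSVName(a)
	if err != nil {
		return 0, err
	}
	bName, bNums, err := splitCSVName(b)
	if err != nil {
		return 0, err
	}
	if aName != bName {
		return 0, fmt.Errorf("%q and %q are different packages; cannot compare version numbers", a, b)
	}
	// The first differing component (major, then minor, then patch) decides.
	for i := range aNums {
		if aNums[i] != bNums[i] {
			return aNums[i] - bNums[i], nil
		}
	}
	return 0, nil
}

func main() {
	diff, err := compareCSVVersions("servicemeshoperator3.v3.0.0", "servicemeshoperator3.v3.1.0")
	fmt.Println(diff, err)
}
```

Because every unexpected layout surfaces as an error, the caller can decide separately how severe a formatting surprise should be.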

@rfredette rfredette force-pushed the ne-2066-conflicting-subs branch from bc5ba2e to 114f3e9 Compare August 21, 2025 18:42
Detect subscriptions that would prevent the Ingress Operator from
installing OSSM 3, and set the operator's degraded condition to true
when any of those subscriptions are present.

This is the implementation of NE-2066
@rfredette rfredette force-pushed the ne-2066-conflicting-subs branch from 114f3e9 to 1b58225 Compare August 21, 2025 20:17
@candita
Contributor

candita commented Aug 21, 2025

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 21, 2025
@openshift-ci
Contributor

openshift-ci bot commented Aug 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 21, 2025
@candita
Contributor

candita commented Aug 21, 2025

e2e-gcp-operator is failing with the designated override reason, so I will override it here.

/override ci/prow/e2e-gcp-operator

@openshift-ci
Contributor

openshift-ci bot commented Aug 21, 2025

@candita: Overrode contexts on behalf of candita: ci/prow/e2e-gcp-operator


@openshift-ci
Contributor

openshift-ci bot commented Aug 22, 2025

@rfredette: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-operator-techpreview | 1b58225 | link | false | /test e2e-aws-operator-techpreview |
| ci/prow/okd-scos-e2e-aws-ovn | 1b58225 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-gcp-ovn | 1b58225 | link | false | /test e2e-gcp-ovn |
| ci/prow/e2e-aws-ovn-single-node | 1b58225 | link | false | /test e2e-aws-ovn-single-node |

Full PR test history. Your PR dashboard.


@openshift-merge-bot openshift-merge-bot bot merged commit b966710 into openshift:master Aug 22, 2025
17 of 21 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.21.0-202508212332.p0.gb966710.assembly.stream.el9.
All builds following this will include this PR.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

6 participants