NE-2470: Gateway API without OLM #1933

openshift-merge-bot[bot] merged 9 commits into openshift:master
Conversation
Adds enhancements/ingress/gateway-api-without-olm.md for installing Gateway API support without OLM dependency. This enhancement transitions cluster-ingress-operator from creating an OLM Subscription to installing istiod directly using Helm charts and sail-operator libraries. This eliminates the OLM dependency, avoids conflicts with existing OSSM subscriptions, enables Gateway API on clusters without OLM/Marketplace capabilities, and allows faster Gateway API releases independent of OLM release cycles.
@gcs278: This pull request references NE-2470, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/assign
>   installation (4.22).
> - Support downgrade from Helm-based installation (4.22) to OLM-based
>   installation (4.21).
> - Enable faster Gateway API releases independent of OLM release cycles.
I've heard once that this may not be a problem, because once OSSM is released it goes to OLM directly. But maybe this is not the behavior between releases, pre-releases, etc. I may be wrong, but would be good to validate if this is a real issue
Right, I think this goal might be overstated. It's installing the same official release of the istiod charts from sail-operator, but through a different mechanism. It doesn't allow us to get the official releases into OpenShift any faster.
But I think it is a simplification of the process of getting releases and pre-releases. I see we have a script and an e2e job, e2e-aws-pre-release-ossm, to test pre-releases. This impacts QE's testing and our e2e jobs, so I should probably mention that somewhere too. CC: @rhamini3
I'll update this to reflect that it's not necessarily faster.
> This enhancement does not introduce new CRDs. It changes ownership of three
> Istio CRDs (`EnvoyFilter`, `WasmPlugin`, `DestinationRule`) and associated
> webhooks (`ValidatingWebhookConfiguration`, `MutatingWebhookConfiguration`)
not to be answered now, but I have some questions on webhooks:
- Who manages their certificates? (is this still a modern problem?)
- During the meeting, Daniel mentioned that the webhooks are created to select just the resources with revision label. We may need to confirm this is respected with the sail operator library during implementation
I will add these to the open questions section now, and folks can help me answer/discuss them there.
Yes, cert management is still a problem; istiod manages the certificates itself.
What exactly is meant here by "ownership changing to CIO"? The webhooks should be treated like any other resource created by the Helm chart.
> **Downtime Characteristics**:
> - Brief istiod control plane downtime during the transition.
> - No data plane downtime occurs during the transition. `Gateway` pods may roll
>   out due to the new istiod version, but this is a standard rolling update.
makes me wonder:
- we may need to cover that Sail Operator library/helm must deploy a PDB for the Istiod
- we may need to check about rolling strategy on the deployment generated by Helm for the Istiod
- we may need to work on the customization of Gateway/dataplane soon to be sure they are also deployed with a PDB (different subject from this EP)
Probably not related to this EP, but do we care about infraConfig.Status.InfrastructureTopology to be sure that the control plane is deployed with replicas? (ref.: https://github.com/openshift/cluster-ingress-operator/pull/1326/files)
We can make it a different subject, though, as we will just be consuming the sail-operator library.
Ack - good points I hadn't considered - though this seems like a pre-existing issue rather than one introduced by this enhancement.
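To make the first bullet concrete, a PDB for istiod of the kind suggested above might look roughly like this. This is a hedged sketch: the name, namespace, and labels are assumptions for illustration, not actual chart output.

```yaml
# Hedged sketch of a PDB the Helm chart / Sail library could deploy for istiod.
# All names and labels here are assumptions, not the real chart values.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: istiod-pdb
  namespace: openshift-ingress
spec:
  minAvailable: 1          # keep at least one istiod replica during voluntary disruptions
  selector:
    matchLabels:
      app: istiod
```

Whether this belongs in the chart or in the Sail library defaults is exactly the open point raised above.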
> 3. **Reconciliation Logic**: Verify operator correctly detects and reconciles
>    Helm objects, Istio CRDs, and istiod deployments.
>
> ## Graduation Criteria
During a discussion with @Miciah, he raised that having graduation criteria (or at least a FeatureGate during implementation) would be useful to get some test metrics, and also to allow backporting this feature if we consider that it will resolve the current OLM issues we have.
I will defer to him to give some more details here.
/label tide/merge-method-squash
- Add support procedures to satisfy markdownlint job
- Replace vague "lifecycle management" with "reduce component dependencies"
- Add note explaining "Helm-based" vs "OLM-based" terminology
Miciah left a comment:
Some details remain to be worked out, but looks good overall. Thanks!
/approve
> - Support seamless upgrade from OLM-based installation (4.21) to Helm-based
>   installation (4.22).
> - Support downgrade from Helm-based installation (4.22) to OLM-based
>   installation (4.21).
I suggest omitting supporting downgrades as a goal because (a) we don't generally support downgrading the OpenShift platform or components and (b) it is potentially a large effort to support downgrades.
> we don't generally support downgrading the OpenShift platform or components
Hmmm, I think my reference to a 4.22 -> 4.21 downgrade is misleading. I was referring to a downgrade during a failed installation of 4.22.
Is rolling back a failed installation (4.21 --> 4.22) something that is mandatory to support in all features? And, either way, it might be unnecessary to put this as an explicit goal if it's standard for new features. Removing for now.
> Is rolling back a failed installation (4.21 --> 4.22) something that is mandatory to support in all features?
It isn't mandatory or expected to support rollback.
> The OSSM team plans to provide library functions in the sail-operator to handle
> CRD management implementation, so while the cluster-ingress-operator would run
> this code, the OSSM team would own the maintenance of the implementation.
This is an interesting development. Presumably these library functions would need to have the same approach of doing nothing if the CRDs were already present, so as to avoid conflicting with OLM and OSSM if it's installed. But, again, I thought the majority opinion was that CIO shouldn't manage or create these CRDs at all.
I suppose the key question is: does installing CRDs mean we own them? I don't know. By using vendored OSSM code to install CRDs, does that just mean the CIO is acting as a pass-through/middleman while OSSM remains responsible for the CRD definitions and maintenance?
Maybe it's worth framing the discussion whether ownership and installation can be separated at our next meeting.
- Enhance motivation section and shorten summary
- Remove downgrade from goals, move day 0 installation to non-goals
- Add goal to preserve user-facing Gateway API experience
- Clarify upgrade migration, drift, and image sourcing wording
- Add Hypershift and OKE notes
- Update maintenance/testing burden in drawbacks
- Remove "Loss of OLM Benefits" drawback
- Move Istio CRD management question from openshift#5 to openshift#2
- Update feature gate and downgrade section wording
> gateway-api configurations.
>
> 2. **Upgrade Migration**: Detect when upgrading from an OLM-based
>    installation (4.21) to Helm-based (4.22), delete the `Istio` CR in order to
Just as a follow-up: let's make it explicit that it will delete the Istio CR owned by CIO and present in the openshift-ingress namespace, so there is no fear we will be deleting random Istio CRs.
The Istio CR is cluster-scoped. The operator always uses the name "openshift-gateway" for its Istio CR, so it is unlikely that someone would accidentally create a conflicting Istio CR.
oooh true, good catch! I was sure it was namespaced!
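For illustration, the cluster-scoped Istio CR discussed above might look roughly like this. This is a hedged sketch: only the fixed name "openshift-gateway" is stated in the thread; the API group/version and spec fields are assumptions.

```yaml
# Hedged sketch, not the actual manifest: only the name "openshift-gateway"
# is confirmed above; apiVersion and spec fields are assumptions.
apiVersion: sailoperator.io/v1
kind: Istio
metadata:
  name: openshift-gateway    # cluster-scoped; fixed name used by cluster-ingress-operator
spec:
  namespace: openshift-ingress   # namespace where istiod is deployed (assumption)
```

Because the resource is cluster-scoped and the name is fixed, a conflicting user-created Istio CR is unlikely, as noted above.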
/hold
> 2. Wait for the sail-operator to delete all Helm chart resources (istiod
>    deployment, services, etc.).
> 3. Install istiod using the operator's Helm-based approach.
Originally I thought this wouldn't work because the CA for the Sail Istio and the CA for the helm-installed Istio wouldn't match and the existing gateway proxies wouldn't trust the new CA. But it turns out that istiod creates this self signed CA and stores it in a secret called istio-ca-secret. Because istiod creates it, the secret is not included in the charts and therefore does not have an ownerref like the other resources. So when the Istio resource gets deleted the CA secret stays around and when the helm-installed Istio comes online, it sees an existing istio-ca-secret and reuses that rather than creating a new one.
All that to say, I'm unsure if this is a bug or feature. If the Sail Operator were to add an ownerref for this secret in the future so that it gets cleaned up when you delete Istio that would break this migration process. @dgn thoughts?
This is a feature; without it, in-place upgrades would be impossible because the root cert would change.
I will consider this comment as resolved
> This approach ensures charts are version controlled and synchronized via Go
> vendoring, eliminating drift between the Helm charts and the Istio version they
> deploy. When the sail-operator library version is updated in `go.mod`, the Helm
> charts are automatically synchronized with the corresponding Istio version.
Sorry for the last-minute comment, but we probably need to address what happens when the Helm chart changes, especially if items that we watch/reconcile are deleted, or new items are added and we don't watch/reconcile those items. We'd need to somehow keep on top of this unless we did not allow automatic synchronization of the Helm chart.
The Sail library handles that currently by having an API like:

```go
reconciler := installer.Install(ctx, options)
reconciler.Start() // or reconciler.Stop()
// ... an update happens; repeat the Install/Start cycle
```

Start registers informers on the types found in the charts. That way we always watch what we installed.
> products (RHOAI, Kuadrant, MCP Gateway) rely on Istio CRDs that are provided
> when the cluster-ingress-operator installs OSSM via OLM. Without CRD management
> in 4.22, these features would break unless layered products install OSSM
> themselves (recreating the subscription conflict problem this EP solves).
How does this create a subscription conflict? Install OSSM for the Istio CRDs, then create a GatewayClass and let c-i-o install istiod without a subscription.
Installing OSSM is the subscription-conflict problem that we are trying to get rid of. In this case, you are assuming that someone will install OSSM, and this "someone" does that via a subscription, which would make us also enforce the existence of this subscription.
I think what we actually want is to move away from any "relation" with the OSSM install for Gateway API. We still use OSSM, we still use parts of sail-operator, but we are not installing OSSM to have Gateway API working on the cluster.
> - Require OSSM installation for these features (adds resource overhead when service
>   mesh capabilities aren't needed)
Not after the "gateway api only" flag is implemented.
Right, so what we are saying here is that we can install Istio without installing the Istio CRDs, disable its service mesh mode, and it should work.
The Gateway API-only mode is a safety measure so that, even if the CRDs are installed by another product (like OSSM), the north/south controller will ignore them.
> - Layered products manage CRDs themselves (coordination challenge, duplicated work)
> - Require OSSM installation for these features (adds resource overhead when service
>   mesh capabilities aren't needed)
> - Defer to future enhancement (causes regression from 4.21 to 4.22)
Suggested change:

> - Defer to future enhancement (causes regression from 4.21 to 4.22) for a small
>   number of people using Istio CRDs. However, if they already had Istio CRDs
>   installed, the upgrade should not remove those CRDs, so there would be no
>   regression.
I am a -1 on this one. I wouldn't assume the number of people using it and add that to the enhancement; it may be misleading. If it causes impact, it should be considered whether for 1 or for 10k users.
> - `DestinationRule`: Required by RHCL/Kuadrant versions not yet supporting
>   `BackendTLSPolicy`
>
> The proposed ownership model would be:
> will also benefit other OSSM projects that plan to consume them, making the
> coordination valuable beyond just the ingress operator use case.
>
> ### Alternative 3: Make OSSM a Core Operator
I want to make the Sail operator a core operator (or at least manage it directly from the CIO).
BUT I want to rev the Sail operator (and the CRDs) on the Z streams to later versions. We would not move the version of Istio we use for GWAPI forward (except for bugfixes) but this would allow customers to have a later version of Istio for OSSM as they can do today (or an older version).
CIO would need to ensure that the CRDs are backwards compatible with the version we use for GWAPI (and we will need to restrict the range supported by Sail for testing as well, but it should be no worse than the matrix we support today, and that matrix must include the version needed by GWAPI).
Note that this is my eventual goal (in a release shortly after 5.0). So we should not do things that make this harder.
From what I can tell, the proposed approach by Grant is reversible back to the Sail Operator, and in fact it was written considering that at some point the sail-operator would be back in place. This was tested as part of the "downgrade" section: https://github.com/openshift/enhancements/pull/1933/files#diff-dd1263054e56bf6e018d18643409d93fb5a0ad3451272befda08bcde91e1810dR629
So yeah, we may be good here.
> CRD management implementation, so while the cluster-ingress-operator would run
> this code, the OSSM team would own the maintenance of the implementation.
>
> **Alternatives:**
Agreed. These are bad choices.
> yields control and no longer manages them.
> - **If OLM has previously created CRDs or taken ownership**: The
>   cluster-ingress-operator does not touch them and skips installation.
> - **CRD deletion**: CRDs are never deleted when `GatewayClass` is deleted, to
Is there also the case where all OLM subscriptions are removed? We should take over again.
More precisely, you talk about creation in the bullets but ownership in the heading. I would like to make clear what happens with updates to the CRDs too.
I want a clear, identifiable owner of the CRDs for the cluster that I can walk up to, look at the CRDs, and be able to precisely say which operator owns them. (Or, that a third party owns them and it is not our problem, and we should reflect third-party ownership in CIO status somewhere so support can know easily. We may want to reflect ownership status all the time just so it's clear.)
So, when you are running (or upgrading to) a version of OpenShift where CIO installs Istio:

- If no CRDs exist: The cluster-ingress-operator creates them when a `GatewayClass` is created. CIO owns upgrades.
- If an OLM subscription is created afterwards: OLM takes ownership of the CRDs that the cluster-ingress-operator created. The cluster-ingress-operator yields control and no longer manages them. OLM owns upgrades.
- If all OLM subscriptions are removed: The cluster-ingress-operator owns them. More below.

When we take ownership, it gets messy. The CRDs are not versioned, so we need to add missing fields... and leave extra fields.

Open question: How do you handle fields that are extra but without a default?

Seem reasonable?
When you say fields that are extra, do you mean CRD fields that appear between versions? I would say that in Istio's case the control is "better" than Gateway API, because we are talking about Istio CRDs for an Istio controller. So, say we update the CRDs for Istio 1.28 but are running Istio 1.27: it won't be aware of the existence of these extra fields (would need to confirm with @dgn and @rcernich).
I think my main concern with the workflow you proposed is that we are considering now that either OSSM/OLM or CIO manages the Istio CRDs, but what about a third party? What happens if we have something like a CIO -> OLM -> 3rd party install? Does OLM yield its control as well?
I am happy to explore it more as part of the implementation, though.
There are two things that are being overlooked here. One is that NID is already in the process of getting out of CRD management for Gateway API; we do not want to add additional CRD management tasks to CIO for Istio. Secondly, anyone installing CRDs should also be upgrading them and providing user support for them. How can we explain to a support person, for example: yes, we installed `DestinationRule` and we watch it, but we don't support it if there's a problem with traffic delivery via `DestinationRule`? What happens when there is an OSSM installation on the same cluster that requires these CRDs, but OSSM provides a different version than CIO provided for RHCL or RHOAI?
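The ownership hand-off debated in this thread could be sketched as a small decision function. This is only an illustrative sketch: the `managedByAnnotation` key, its values, and the `crdOwner` helper are hypothetical names invented here, not the actual cluster-ingress-operator implementation.

```go
package main

import "fmt"

// Hypothetical annotation recording which component manages an Istio CRD.
// The key and values are assumptions for illustration only.
const managedByAnnotation = "operators.openshift.io/managed-by"

// crdOwner sketches the decision discussed above: OLM wins while a
// Subscription exists; CIO (re)takes ownership of absent or CIO-created
// CRDs, including when all subscriptions have been removed; anything
// else is treated as third-party-owned.
func crdOwner(annotations map[string]string, olmSubscriptionExists bool) string {
	if olmSubscriptionExists {
		return "olm" // OLM takes ownership; CIO yields control
	}
	owner, ok := annotations[managedByAnnotation]
	if !ok || owner == "cluster-ingress-operator" {
		return "cluster-ingress-operator"
	}
	return "third-party" // surface in CIO status rather than manage
}

func main() {
	fmt.Println(crdOwner(nil, false))
	fmt.Println(crdOwner(map[string]string{managedByAnnotation: "olm"}, true))
	fmt.Println(crdOwner(map[string]string{managedByAnnotation: "acme-mesh"}, false))
}
```

The third-party branch reflects the suggestion above to reflect non-CIO, non-OLM ownership in CIO status so support can identify the owner, rather than having CIO manage such CRDs.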
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dgn, Miciah, rikatz. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
I have added the comments from this PR at rikatz@53d92cd so we can keep track in my follow-up PR.
/hold cancel
@gcs278: all tests passed! Full PR test history. Your PR dashboard.
This proposal was initially drafted using the /add-enhancement ai-helper command and fully reviewed and edited by the author.