
Conversation

@Miciah (Contributor) commented May 2, 2025

Bump to OSSM 3.0.1 and Istio 1.24.4

Bump from OSSM v3.0.0 to v3.0.1 and from Istio v1.24.3 to v1.24.4.

Also, explicitly set ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT to "false" on the Istio CR. For Istio 1.24.3, OSSM has a vendor override that sets this option. However, for Istio 1.24.4, the option must be explicitly set.

Enable Gateway only CA Bundles and custom CA CM name

To avoid conflicts with user-managed control-planes, set a custom name for the CA bundle configmaps for the Istio control-plane that the operator manages. Also, configure Istio to inject the configmaps only into namespaces where gateways exist in order to avoid polluting the whole cluster.

Set one new environment variable in the Istio CR:

PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY

Set the Istio CR's trustBundleName global value to match the custom configmap name. This change requires bumping the sail-operator API:

go get github.com/istio-ecosystem/sail-operator@30be83268d6b6bfaf6fb0562a6c3e505a17422ea

This change is related to OSSM-9076.

This change incorporates #1209.
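
For illustration, a minimal sketch of what these settings amount to on the Istio CR, using an unstructured object; the operator actually sets them through the typed sail-operator API in pkg/operator/controller/gatewayclass/istio.go, the configmap name below is a placeholder, and the env values are assumed from the description above:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func main() {
    istio := &unstructured.Unstructured{Object: map[string]interface{}{}}
    istio.SetAPIVersion("sailoperator.io/v1")
    istio.SetKind("Istio")
    istio.SetName("openshift-gateway")

    // Custom CA bundle configmap name, so the operator-managed control-plane
    // does not collide with a user-managed control-plane's configmaps.
    // "openshift-gateway-ca-root-cert" is a placeholder, not the real name.
    _ = unstructured.SetNestedField(istio.Object,
        "openshift-gateway-ca-root-cert",
        "spec", "values", "global", "trustBundleName")

    // Inject the CA bundle configmap only into namespaces that contain
    // gateways, and explicitly disable manual deployment (values assumed).
    _ = unstructured.SetNestedMap(istio.Object, map[string]interface{}{
        "PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY": "true",
        "ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT":  "false",
    }, "spec", "values", "pilot", "env")

    fmt.Printf("%+v\n", istio.Object)
}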

Don't copy labels or annotations

Configure Istiod not to copy annotations or labels from gateways onto associated resources, such as the proxy deployment and load-balancer service for a gateway.

This copying behavior is Istio-specific, not part of the Gateway API spec, and could be used to inject unsupported configuration. For example, an end-user could set a service annotation on the gateway in order to configure a load-balancer. Setting annotations on the gateway to configure the load-balancer would not be portable to other Gateway API implementations and would complicate product support.

One new environment variable is set:

PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS

This change is related to OSSM-8989.
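
As a purely hypothetical illustration of the behavior being disabled: with copying enabled, Istiod would propagate an annotation like the one below from the gateway onto the Service it generates; with PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS set to "false", the annotation stays on the gateway and has no effect. The gateway name and annotation are example values only, not anything this PR creates.

package main

import (
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

func main() {
    gw := gatewayv1.Gateway{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "example-gateway",
            Namespace: "openshift-ingress",
            // With copying enabled, this cloud-provider annotation would be
            // copied onto the generated load-balancer Service, changing how
            // the load balancer is provisioned. That is Istio-specific and
            // not portable to other Gateway API implementations.
            Annotations: map[string]string{
                "service.beta.kubernetes.io/aws-load-balancer-internal": "true",
            },
        },
        Spec: gatewayv1.GatewaySpec{
            GatewayClassName: "openshift-default",
            Listeners: []gatewayv1.Listener{{
                Name:     "http",
                Port:     80,
                Protocol: gatewayv1.HTTPProtocolType,
            }},
        },
    }
    fmt.Println(gw.Name, gw.Annotations)
}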

Delete old controller-mode setting

Delete the obsolete PILOT_ENABLE_GATEWAY_CONTROLLER_MODE environment variable from the Istiod configuration. This environment variable is no longer recognized in OSSM 3, and the variable has been superseded by EnhancedResourceScoping.

assertDNSRecord: Increase timeout to 10m

Increase the timeout in assertDNSRecord for polling for the DNSRecord CR from 1 minute to 10 minutes.

The cloud provider can easily take over a minute to provision the load balancer, and the operator cannot create the DNSRecord CR before the load balancer has been provisioned and assigned a host name or address. Consequently, the polling loop could easily reach the 1-minute timeout just on account of the time that it takes to provision the load balancer.
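
A rough sketch of the kind of polling change described, assuming a controller-runtime client; the real change is in test/e2e/util_gatewayapi_test.go (assertDNSRecord), and the helper signature and client wiring here are assumptions:

package e2e

import (
    "context"
    "testing"
    "time"

    iov1 "github.com/openshift/api/operatoringress/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/apimachinery/pkg/util/wait"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// assertDNSRecordSketch polls for the DNSRecord CR for up to 10 minutes
// instead of 1, since the DNSRecord is only created after the cloud
// provider finishes provisioning the load balancer.
func assertDNSRecordSketch(t *testing.T, kclient client.Client, name types.NamespacedName) {
    t.Helper()
    err := wait.PollUntilContextTimeout(context.Background(), 5*time.Second, 10*time.Minute, true,
        func(ctx context.Context) (bool, error) {
            record := &iov1.DNSRecord{}
            if err := kclient.Get(ctx, name, record); err != nil {
                t.Logf("failed to get DNSRecord %s: %v; retrying...", name, err)
                return false, nil
            }
            return true, nil
        })
    if err != nil {
        t.Fatalf("timed out waiting for DNSRecord %s: %v", name, err)
    }
}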

testGatewayAPIManualDeployment: Increase timeout

Increase the timeout for polling the gateway, and dump the gateway if the test fails.

@openshift-ci-robot added the jira/valid-reference label May 2, 2025
@openshift-ci-robot (Contributor) commented May 2, 2025

@Miciah: This pull request references NE-2022 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.


In response to this:

Bump to OSSM 3.0.1 and Istio 1.24.4

Bump from OSSM v3.0.0 to v3.0.1 and from Istio v1.24.3 to v1.24.4.

Enable Gateway only CA Bundles and custom CA CM name

Avoid conflict with a user control plane by setting a custom CA Bundle CM name for the Gateway Control plane and enable Istio to only inject CA Bundle CMs in namespaces where Gateways exist to avoid polluting the whole cluster.

Two new environment variables are set for the Istio control plane deployment CR:

PILOT_CA_CERT_CONFIGMAP
PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY

This change is related to OSSM-9076.

Don't copy labels or annotations

Configure Istiod not to copy annotations or labels from gateways onto associated resources, such as the proxy deployment and load-balancer service for a gateway.

This copying behavior is Istio-specific, not part of the Gateway API spec, and could be used to inject unsupported configuration. For example, an end-user could set a service annotation on the gateway in order to configure a load-balancer. Setting annotations on the gateway to configure the load-balancer would not be portable to other Gateway API implementations and would complicate product support.

One new environment variable is set:

PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS

This change is related to OSSM-8989.

Delete old controller-mode setting

Delete the obsolete PILOT_ENABLE_GATEWAY_CONTROLLER_MODE environment variable from the Istiod configuration. This environment variable is no longer recognized in OSSM 3, and the variable has been superseded by EnhancedResourceScoping.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci bot requested review from frobware and gcs278 May 2, 2025 17:01
@Miciah (Contributor Author) commented May 2, 2025

e2e-hypershift failed because TestCreateClusterV2/Main/break-glass-credentials/independent_signers failed. The failure appears to be the same as the one tracked in OCPBUGS-44582.

@Miciah (Contributor Author) commented May 2, 2025

e2e-aws-gatewayapi failed because TestGatewayAPI/testGatewayAPIObjects and TestGatewayAPI/testGatewayAPIManualDeployment failed.

Once the httproute is accepted, the testGatewayAPIObjects test only polls for 1 minute, which is not necessarily enough time to provision an ELB. In this case, it took over 1 minute, from T18:31:56 to T18:33:16:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1227/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1918349792623202304/artifacts/e2e-aws-gatewayapi/gather-extra/artifacts/events.json' | jq -c '.items.[]|select(.involvedObject.name == "test-gateway-openshift-default" and .source.component == "service-controller")|[.firstTimestamp, .message]'
["2025-05-02T18:31:56Z","Ensuring load balancer"]
["2025-05-02T18:31:57Z","Error syncing load balancer: failed to ensure load balancer: TooManyLoadBalancers: Exceeded quota of account 130757279292\n\tstatus code: 400, request id: 22c489e5-6ce3-42d2-b054-873e041491dd"]
["2025-05-02T18:32:03Z","Error syncing load balancer: failed to ensure load balancer: TooManyLoadBalancers: Exceeded quota of account 130757279292\n\tstatus code: 400, request id: 9c395cac-7a89-4c05-9a48-981dddff62c6"]
["2025-05-02T18:32:13Z","Error syncing load balancer: failed to ensure load balancer: TooManyLoadBalancers: Exceeded quota of account 130757279292\n\tstatus code: 400, request id: ce39636a-cc44-4964-9e9c-230eb61f0274"]
["2025-05-02T18:32:34Z","Error syncing load balancer: failed to ensure load balancer: TooManyLoadBalancers: Exceeded quota of account 130757279292\n\tstatus code: 400, request id: f8817ecb-fc85-4ee7-85ee-9f1fb6ac037e"]
["2025-05-02T18:33:16Z","Ensured load balancer"]
["2025-05-02T18:43:05Z","Deleting load balancer"]
["2025-05-02T18:43:26Z","Deleted load balancer"]
% 

I will push a commit to increase the timeout to 10 minutes.

It is less clear why testGatewayAPIManualDeployment failed. Istiod logs show that it observed the gateway at T18:32:59:

2025-05-02T18:32:59.423291Z	info	ads	Push debounce stable[7] 2 for config Gateway/openshift-ingress/manual-deployment: 101.0015ms since last change, 110.068219ms since last push, full=true
2025-05-02T18:32:59.540520Z	info	ads	Push debounce stable[8] 1 for config Gateway/openshift-ingress/manual-deployment: 100.231743ms since last change, 100.231574ms since last push, full=true
{"metadata":{"name":"manual-deployment.183bc973b47c7b74","namespace":"openshift-ingress","uid":"9eaf145b-968c-40d6-8c0c-896cbdfafb36","resourceVersion":"35647","creationTimestamp":"2025-05-02T18:32:59Z"},"reason":"AddedLabel","message":"Added label istio.io/rev=openshift-gateway to gateway manual-deployment","source":{"component":"gateway_labeler_controller"},"firstTimestamp":"2025-05-02T18:32:59Z","lastTimestamp":"2025-05-02T18:32:59Z","count":1,"type":"Normal","eventTime":null,"reportingComponent":"gateway_labeler_controller","reportingInstance":"","involvedObject":{"kind":"Gateway","namespace":"openshift-ingress","name":"manual-deployment","uid":"2bc7c5db-8a21-4404-90eb-dfd32daa5b68","apiVersion":"gateway.networking.k8s.io/v1","resourceVersion":"35646","labels":{"istio.io/rev":"openshift-gateway"}}}
2025-05-02T18:34:04.456589Z	info	ads	Push debounce stable[17] 1 for config Gateway/openshift-ingress/manual-deployment: 100.247504ms since last change, 100.247364ms since last push, full=true

I do see that the Istiod pod starts failing readiness probes at T18:31:55 and gets stopped at T18:34:10, with a new pod created at T18:34:11:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1227/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1918349792623202304/artifacts/e2e-aws-gatewayapi/gather-extra/artifacts/events.json' | jq -c '[.items.[]|select(.involvedObject.name//""|startswith("istiod-"))]|sort_by(.firstTimestamp//.metadata.creationTimestamp)|.[]|[.firstTimestamp//.metadata.creationTimestamp, .source.component, .message]'
["2025-05-02T18:31:46Z",null,"Successfully assigned openshift-ingress/istiod-openshift-gateway-d88579fb6-8svpf to ip-10-0-99-198.us-east-2.compute.internal"]
["2025-05-02T18:31:46Z","replicaset-controller","Created pod: istiod-openshift-gateway-d88579fb6-8svpf"]
["2025-05-02T18:31:46Z","controllermanager","No matching pods found"]
["2025-05-02T18:31:46Z","deployment-controller","Scaled up replica set istiod-openshift-gateway-d88579fb6 from 0 to 1"]
["2025-05-02T18:31:47Z","multus","Add eth0 [10.128.2.15/23] from ovn-kubernetes"]
["2025-05-02T18:31:47Z","kubelet","Pulling image \"registry.redhat.io/openshift-service-mesh/istio-pilot-rhel9@sha256:36557fa4817c3d0bac499dc65a4a9d6673500681641943d2f8cec5bfec4355be\""]
["2025-05-02T18:31:54Z","kubelet","Successfully pulled image \"registry.redhat.io/openshift-service-mesh/istio-pilot-rhel9@sha256:36557fa4817c3d0bac499dc65a4a9d6673500681641943d2f8cec5bfec4355be\" in 6.862s (6.862s including waiting). Image size: 197874371 bytes."]
["2025-05-02T18:31:54Z","kubelet","Created container: discovery"]
["2025-05-02T18:31:54Z","kubelet","Started container discovery"]
["2025-05-02T18:31:55Z","kubelet","Readiness probe error: HTTP probe failed with statuscode: 503\nbody: \n"]
["2025-05-02T18:31:55Z","kubelet","Readiness probe failed: HTTP probe failed with statuscode: 503"]
["2025-05-02T18:32:01Z","horizontal-pod-autoscaler","failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API"]
["2025-05-02T18:32:01Z","horizontal-pod-autoscaler","invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API"]
["2025-05-02T18:34:10Z","kubelet","Stopping container discovery"]
["2025-05-02T18:34:11Z",null,"Successfully assigned openshift-ingress/istiod-openshift-gateway-d88579fb6-s4rh7 to ip-10-0-99-198.us-east-2.compute.internal"]
["2025-05-02T18:34:11Z","replicaset-controller","Created pod: istiod-openshift-gateway-d88579fb6-s4rh7"]
["2025-05-02T18:34:11Z","deployment-controller","Scaled up replica set istiod-openshift-gateway-d88579fb6 from 0 to 1"]
["2025-05-02T18:34:11Z","controllermanager","No matching pods found"]
["2025-05-02T18:34:12Z","multus","Add eth0 [10.128.2.16/23] from ovn-kubernetes"]
["2025-05-02T18:34:12Z","kubelet","Container image \"registry.redhat.io/openshift-service-mesh/istio-pilot-rhel9@sha256:36557fa4817c3d0bac499dc65a4a9d6673500681641943d2f8cec5bfec4355be\" already present on machine"]
["2025-05-02T18:34:12Z","kubelet","Created container: discovery"]
["2025-05-02T18:34:12Z","kubelet","Started container discovery"]
["2025-05-02T18:34:17Z","horizontal-pod-autoscaler","failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API"]
["2025-05-02T18:34:17Z","horizontal-pod-autoscaler","invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API"]
% 

Maybe testGatewayAPIManualDeployment just happened to catch Istiod at a bad time. I'll increase its log output and timeout.

Finally, during Istiod's shutdown, it logged some permissions errors:

2025-05-02T18:34:10.811160Z	error	klog	Failed to release lock: leases.coordination.k8s.io "istio-gateway-ca-openshift-gateway" is forbidden: User "system:serviceaccount:openshift-ingress:istiod-openshift-gateway" cannot update resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-ingress"
2025-05-02T18:34:10.811187Z	info	leader election lock lost: istio-gateway-ca-openshift-gateway
2025-05-02T18:34:10.811434Z	error	klog	Failed to release lock: leases.coordination.k8s.io "istio-gateway-deployment-openshift-gateway" is forbidden: User "system:serviceaccount:openshift-ingress:istiod-openshift-gateway" cannot update resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-ingress"
2025-05-02T18:34:10.811505Z	info	leader election lock lost: istio-gateway-deployment-openshift-gateway
2025-05-02T18:34:10.811604Z	error	klog	Failed to release lock: configmaps "istio-gateway-status-leader" is forbidden: User "system:serviceaccount:openshift-ingress:istiod-openshift-gateway" cannot update resource "configmaps" in API group "" in the namespace "openshift-ingress"
2025-05-02T18:34:10.811617Z	info	leader election lock lost: istio-gateway-status-leader

I'll check with the Service Mesh team about those errors.

@openshift-ci-robot (Contributor) commented May 2, 2025

@Miciah: This pull request references NE-2022 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.


In response to this:

Bump to OSSM 3.0.1 and Istio 1.24.4

Bump from OSSM v3.0.0 to v3.0.1 and from Istio v1.24.3 to v1.24.4.

Enable Gateway only CA Bundles and custom CA CM name

Avoid conflict with a user control plane by setting a custom CA Bundle CM name for the Gateway Control plane and enable Istio to only inject CA Bundle CMs in namespaces where Gateways exist to avoid polluting the whole cluster.

Two new environment variables are set for the Istio control plane deployment CR:

PILOT_CA_CERT_CONFIGMAP
PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY

This change is related to OSSM-9076.

Don't copy labels or annotations

Configure Istiod not to copy annotations or labels from gateways onto associated resources, such as the proxy deployment and load-balancer service for a gateway.

This copying behavior is Istio-specific, not part of the Gateway API spec, and could be used to inject unsupported configuration. For example, an end-user could set a service annotation on the gateway in order to configure a load-balancer. Setting annotations on the gateway to configure the load-balancer would not be portable to other Gateway API implementations and would complicate product support.

One new environment variable is set:

PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS

This change is related to OSSM-8989.

Delete old controller-mode setting

Delete the obsolete PILOT_ENABLE_GATEWAY_CONTROLLER_MODE environment variable from the Istiod configuration. This environment variable is no longer recognized in OSSM 3, and the variable has been superseded by EnhancedResourceScoping.

assertDNSRecord: Increase timeout to 10m

Increase the timeout in assertDNSRecord for polling for the DNSRecord CR from 1 minute to 10 minutes.

The cloud provider can easily take over a minute to provision the load balancer, and the operator cannot create the DNSRecord CR before the load balancer has been provisioned and assigned a host name or address. Consequently, the polling loop could easily reach the 1-minute timeout just on account of the time that it takes to provision the load balancer.

testGatewayAPIManualDeployment: Increase timeout

Increase the timeout for polling the gateway, and dump the gateway if the test fails.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Miciah (Contributor Author) commented May 2, 2025

e2e-aws-operator failed on testGatewayAPIManualDeployment again. The gateway has this status:

          - type: Programmed
            status: "True"
            observedgeneration: 1
            lasttransitiontime: "2025-05-02T22:28:10Z"
            reason: Programmed
            message: Resource programmed, assigned to service(s) router-internal-default.openshift-ingress.svc.cluster.local:80

This might reflect a real regression in OSSM 3.0.1.

testGatewayAPIObjects found the DNSRecord CR:

util_gatewayapi_test.go:1048: Found DNSRecord openshift-ingress/test-gateway-7d8d8f5f88-wildcard Published=True

However, it timed out waiting for the DNS record to resolve:

util_gatewayapi_test.go:903: GET test-hostname-5qzdj.gws.ci-op-1mssq3hn-43abb.origin-ci-int-aws.dev.rhcloud.com failed: GET test-hostname-5qzdj.gws.ci-op-1mssq3hn-43abb.origin-ci-int-aws.dev.rhcloud.com failed: Get "http://test-hostname-5qzdj.gws.ci-op-1mssq3hn-43abb.origin-ci-int-aws.dev.rhcloud.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers), retrying...
util_gatewayapi_test.go:918: Error connecting to test-hostname-5qzdj.gws.ci-op-1mssq3hn-43abb.origin-ci-int-aws.dev.rhcloud.com: context deadline exceeded

It is possible that DNS caching or propagation delay caused this failure.

/test e2e-aws-operator

@Miciah (Contributor Author) commented May 2, 2025

Actually, in the e2e-aws-operator job, it looks like the DNS name did eventually resolve (though it took a while), but then the connection timed out.

@openshift-ci-robot (Contributor) commented May 28, 2025

@Miciah: This pull request references NE-2022 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.


In response to this:

Bump to OSSM 3.0.1 and Istio 1.24.4

Bump from OSSM v3.0.0 to v3.0.1 and from Istio v1.24.3 to v1.24.4.

Enable Gateway only CA Bundles and custom CA CM name

Avoid conflict with a user control plane by setting a custom CA Bundle CM name for the Gateway Control plane and enable Istio to only inject CA Bundle CMs in namespaces where Gateways exist to avoid polluting the whole cluster.

Two new environment variables are set for the Istio control plane deployment CR:

PILOT_CA_CERT_CONFIGMAP
PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY

This change is related to OSSM-9076.

This change incorporates #1209.

Don't copy labels or annotations

Configure Istiod not to copy annotations or labels from gateways onto associated resources, such as the proxy deployment and load-balancer service for a gateway.

This copying behavior is Istio-specific, not part of the Gateway API spec, and could be used to inject unsupported configuration. For example, an end-user could set a service annotation on the gateway in order to configure a load-balancer. Setting annotations on the gateway to configure the load-balancer would not be portable to other Gateway API implementations and would complicate product support.

One new environment variable is set:

PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS

This change is related to OSSM-8989.

Delete old controller-mode setting

Delete the obsolete PILOT_ENABLE_GATEWAY_CONTROLLER_MODE environment variable from the Istiod configuration. This environment variable is no longer recognized in OSSM 3, and the variable has been superseded by EnhancedResourceScoping.

assertDNSRecord: Increase timeout to 10m

Increase the timeout in assertDNSRecord for polling for the DNSRecord CR from 1 minute to 10 minutes.

The cloud provider can easily take over a minute to provision the load balancer, and the operator cannot create the DNSRecord CR before the load balancer has been provisioned and assigned a host name or address. Consequently, the polling loop could easily reach the 1-minute timeout just on account of the time that it takes to provision the load balancer.

testGatewayAPIManualDeployment: Increase timeout

Increase the timeout for polling the gateway, and dump the gateway if the test fails.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Miciah (Contributor Author) commented Jun 30, 2025

/retest

@Miciah mentioned this pull request Jun 30, 2025
@Miciah (Contributor Author) commented Jul 1, 2025

e2e-aws-operator failed in testGatewayAPIObjects and testGatewayAPIManualDeployment.

testGatewayAPIManualDeployment failed for the same reason as before: Istio allowed the gateway to specify an existing service (i.e. manual deployment) when Istio should have created a new LB service for the gateway (i.e. automated deployment). The same issue occurred when bumping OSSM to 3.0.1 and Istio to 1.24.4 without specifying any of the new options in OSSM 3.0.1/Istio 1.24.4 (see #1228), but the issue did not occur when bumping OSSM to 3.0.1 while leaving Istio at 1.24.3 (see #1238). This is strong evidence of a regression in Istio 1.24.4.

testGatewayAPIObjects failed because the test gateway pod failed to start; the kubelet emitted an event with the following message: MountVolume.SetUp failed for volume "istiod-ca-cert" : configmap "istio-ca-root-cert" not found. The same issue does not occur when bumping OSSM to 3.0.1 and Istio to 1.24.4 without specifying any of the new options in OSSM 3.0.1/Istio 1.24.4 (see #1228 again). This is strong evidence of an issue with the new PILOT_CA_CERT_CONFIGMAP and PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY options.

(testGatewayAPIObjects still fails with #1228. The e2e-aws-operator job artifacts didn't have the events, so I looked at an e2e-gcp-operator job run; there, the test-gateway pod did start, but it did not respond to client HTTP requests or the kubelet's readiness probes, so while the test got further, it still failed on some other issue.)

@Miciah force-pushed the NE-2022-bump-to-OSSM-3.0.1 branch from 0478030 to d325731 on July 2, 2025 18:03
@Miciah (Contributor Author) commented Jul 2, 2025

https://github.com/openshift/cluster-ingress-operator/compare/0478030d2564a309f2514e80eb97c0dd65804fdb..d3257317c7c0c8e14c8c0704b2a40bd81f95b527 sets spec.values.global.trustBundleName. I hope that this change will resolve the MountVolume.SetUp failed for volume "istiod-ca-cert" : configmap "istio-ca-root-cert" not found errors.

@Miciah (Contributor Author) commented Jul 2, 2025

We are now blocked on #1236 for the Go 1.24 bump, which the sail-operator bump requires.

@Miciah (Contributor Author) commented Jul 3, 2025

#1236 has merged.

/retest

@Miciah force-pushed the NE-2022-bump-to-OSSM-3.0.1 branch from d325731 to ece76fd on July 7, 2025 19:25
@Miciah (Contributor Author) commented Jul 8, 2025

https://github.com/openshift/cluster-ingress-operator/compare/d3257317c7c0c8e14c8c0704b2a40bd81f95b527..ece76fddf9ce97b985fddc36aafcc73197c28f17 removes PILOT_CA_CERT_CONFIGMAP on @dgn's suggestion (setting trustBundleName should be sufficient).

e2e-aws-operator failed because testGatewayAPIObjects and testGatewayAPIManualDeployment failed. However, this time the Istiod pod started, but testGatewayAPIObjects failed because it got HTTP errors when polling the route.

e2e-gcp-operator failed because not all the nodes came up, and e2e-azure-operator failed only on testGatewayAPIManualDeployment; testGatewayAPIObjects succeeded.

I did some manual testing with an AWS cluster and could not reproduce the HTTP errors when running testGatewayAPIObjects. The test still failed, but due to DNS lookup failures related to DNS caching issues. When I curled the route that the test created, I observed HTTP 200 responses.

/test e2e-aws-operator
/test e2e-azure-operator
/test e2e-gcp-operator

Miciah and others added 6 commits July 9, 2025 08:32

Bump to OSSM 3.0.1 and Istio 1.24.4

Also, explicitly set ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT to "false" on
the Istio CR.  For Istio 1.24.3, OSSM has a vendor override that sets
this option[1].  However, for Istio 1.24.4, the option must be
explicitly set.

1. https://github.com/openshift-service-mesh/sail-operator/blob/3bf27ee3c4fb4494ffe6028c7f72034c5a7a1e60/pkg/istiovalues/vendor_defaults.yaml#L11-L14

This commit resolves NE-2022.

https://issues.redhat.com/browse/NE-2022

* cmd/ingress-operator/start.go (defaultGatewayAPIOperatorVersion):
* manifests/02-deployment-ibm-cloud-managed.yaml
(GATEWAY_API_OPERATOR_VERSION):
* manifests/02-deployment.yaml
(GATEWAY_API_OPERATOR_VERSION): Bump from OSSM v3.0.0 to v3.0.1.
* pkg/operator/controller/gatewayclass/istio.go (desiredIstio): Bump
from Istio v1.24.3 to v1.24.4.  Set ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT
to "false".

Enable Gateway only CA Bundles and custom CA CM name

To avoid conflicts with user-managed control-planes, set a custom name
for the CA bundle configmaps for the Istio control-plane that the
operator manages.  Also, configure Istio to inject the configmaps only
into namespaces where gateways exist in order to avoid polluting the
whole cluster.

Set one new environment variable in the Istio CR:

    PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY

Set the Istio CR's trustBundleName global value to match the custom
configmap name.  This change requires bumping the sail-operator API:

    go get github.com/istio-ecosystem/sail-operator@30be83268d6b6bfaf6fb0562a6c3e505a17422ea

This commit is related to OSSM-9076.

* go.mod: Bump github.com/istio-ecosystem/sail-operator.
* go.sum:
* vendor/*/: Regenerate.
* pkg/operator/controller/gatewayclass/istio.go (desiredIstio): Set the
new environment variable and the trustBundleName field.
* pkg/operator/controller/names.go (OpenShiftGatewayCARootCertName): New
const.

Modified-by: Miciah Masters <miciah.masters@gmail.com>

Don't copy labels or annotations

Configure Istiod not to copy annotations or labels from gateways onto
associated resources, such as the proxy deployment and load-balancer
service for a gateway.

This copying behavior is Istio-specific, not part of the Gateway API
spec, and could be used to inject unsupported configuration.  For
example, an end-user could set a service annotation on the gateway in
order to configure a load-balancer.  Setting annotations on the gateway
to configure the load-balancer would not be portable to other Gateway
API implementations and would complicate product support.

This commit is related to OSSM-8989.

https://issues.redhat.com/browse/OSSM-8989

* pkg/operator/controller/gatewayclass/istio.go (desiredIstio): Set the
"PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS" to "false".

Delete old controller-mode setting

Delete the obsolete PILOT_ENABLE_GATEWAY_CONTROLLER_MODE environment
variable from the Istiod configuration.  This environment variable is
no longer recognized in OSSM 3, and the variable has been superseded
by EnhancedResourceScoping.

* pkg/operator/controller/gatewayclass/istio.go (desiredIstio): Delete
PILOT_ENABLE_GATEWAY_CONTROLLER_MODE.

assertDNSRecord: Increase timeout to 10m

Increase the timeout in assertDNSRecord for polling for the DNSRecord CR
from 1 minute to 10 minutes.

The cloud provider can easily take over a minute to provision the load balancer,
and the operator cannot create the DNSRecord CR before the load balancer has
been provisioned and assigned a host name or address.  Consequently, the polling
loop could easily reach the 1-minute timeout just on account of the time that it
takes to provision the load balancer.

* test/e2e/util_gatewayapi_test.go (assertDNSRecord): Increase timeout
for the DNSRecord CR polling loop from 1m to 10m.

testGatewayAPIManualDeployment: Increase timeout

Increase the timeout for polling the gateway, and dump the gateway if
the test fails.

* test/e2e/gateway_api_test.go (testGatewayAPIManualDeployment):
Increase the timeout for polling the gateway from 1m to 5m.  Dump the
gateway if the test fails.
@Miciah (Contributor Author) commented Jul 9, 2025

e2e-azure-operator and e2e-gcp-operator failed only on testGatewayAPIManualDeployment. e2e-aws-operator failed on both testGatewayAPIManualDeployment and testGatewayAPIDNS/gatewayListenersWithOverlappingHostname; I have filed OCPBUGS-59139 for this new flake. This time, testGatewayAPIObjects did not fail in any of these jobs; 🎉!

@Miciah force-pushed the NE-2022-bump-to-OSSM-3.0.1 branch from ece76fd to f1e445d on July 9, 2025 12:50
@Miciah (Contributor Author) commented Jul 9, 2025

https://github.com/openshift/cluster-ingress-operator/compare/ece76fddf9ce97b985fddc36aafcc73197c28f17..f1e445da0c864fc0c15ed3f90b7f3f2f6483a014 sets ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT. OSSM has an override for this setting for Istio 1.24.3 but is missing the override for Istio 1.24.4. Setting ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT explicitly should resolve the testGatewayAPIManualDeployment test failures.

@Miciah (Contributor Author) commented Jul 10, 2025

/skip
to clear out the stale e2e-aws-gatewayapi job.

@Miciah (Contributor Author) commented Jul 11, 2025

e2e-aws-operator failed because the "e2e-aws-operator-ipi-deprovision-deprovision" step failed. The tests all passed.

@Miciah (Contributor Author) commented Jul 11, 2025

e2e-aws-operator-techpreview failed because the TestConnectTimeout test failed. This test failure also showed up on #1242, so I have filed OCPBUGS-59249 to track the issue.

@Miciah (Contributor Author) commented Jul 11, 2025

e2e-aws-ovn failed on the "e2e-aws-ovn-ipi-deprovision-deprovision" step, but otherwise tests were passing.

@Miciah (Contributor Author) commented Jul 11, 2025

e2e-aws-ovn-serial also failed on the "e2e-aws-ovn-serial-ipi-deprovision-deprovision" step.

@Miciah (Contributor Author) commented Jul 11, 2025

e2e-aws-ovn-single-node, e2e-aws-ovn-techpreview, and e2e-aws-ovn-upgrade all failed on the "e2e-aws-ovn-serial-ipi-deprovision-deprovision" step.

e2e-aws-ovn-techpreview also failed on [sig-api-machinery] FieldValidation should detect unknown metadata fields in both the root and embedded object of a CR:

  STEP: Creating a kubernetes client @ 07/10/25 19:52:37.714
  STEP: Building a namespace api object, basename field-validation @ 07/10/25 19:52:37.716
I0710 19:52:37.777934 20405 namespace.go:59] About to run a Kube e2e test, ensuring namespace/e2e-field-validation-7456 is privileged
  STEP: Waiting for a default service account to be provisioned in namespace @ 07/10/25 19:52:38.133
  STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 07/10/25 19:52:38.136
I0710 19:53:10.759154 20405 field_validation.go:575] Unexpected error: deleting CustomResourceDefinition: 
    <wait.errInterrupted>: 
    timed out waiting for the condition
    {
        cause: <*errors.errorString | 0xc000464ba0>{
            s: "timed out waiting for the condition",
        },
    }
  [FAILED] in [It] - k8s.io/kubernetes/test/e2e/apimachinery/field_validation.go:575 @ 07/10/25 19:53:10.759
  STEP: dump namespace information after failure @ 07/10/25 19:53:10.774
  STEP: Collecting events from namespace "e2e-field-validation-7456". @ 07/10/25 19:53:10.774
  STEP: Found 0 events. @ 07/10/25 19:53:10.784
I0710 19:53:10.786828 20405 resource.go:168] POD  NODE  PHASE  GRACE  CONDITIONS
I0710 19:53:10.786860 20405 resource.go:178] 
I0710 19:53:10.796368 20405 dump.go:81] skipping dumping cluster info - cluster too large
  STEP: Destroying namespace "e2e-field-validation-7456" for this suite. @ 07/10/25 19:53:10.796

fail [k8s.io/kubernetes/test/e2e/apimachinery/field_validation.go:575]: deleting CustomResourceDefinition: timed out waiting for the condition

Using search.ci, I found a few similar "deleting CustomResourceDefinition: timed out waiting for the condition" errors; I filed OCPBUGS-59257 to track the issue.
/test e2e-aws-ovn-techpreview

@Miciah (Contributor Author) commented Jul 12, 2025

/test e2e-aws-operator

@Miciah (Contributor Author) commented Jul 15, 2025

e2e-aws-operator failed because no worker nodes came up.

/test e2e-aws-operator

@Miciah (Contributor Author) commented Jul 16, 2025

e2e-aws-operator failed because the job timed out:

INFO[2025-07-15T18:42:45Z] Running step e2e-aws-operator-test.
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2025-07-15T19:46:04Z"}
INFO[2025-07-15T19:46:04Z] Received signal.  signal=interrupt
INFO[2025-07-15T19:46:04Z] error: Process interrupted with signal interrupt, cancelling execution...
INFO[2025-07-15T19:46:04Z] cleanup: Deleting release pod release-latest
INFO[2025-07-15T19:46:04Z] Step e2e-aws-operator-test failed after 1h3m18s.

The error message is misleading. From the timestamps, it is clear that the tests did not run for even close to 4 hours. First, building the images took ~40 minutes:

INFO[2025-07-15T15:46:09Z] Created build "src-amd64"
INFO[2025-07-15T15:49:21Z] Imported tags on imagestream (after taking snapshot) ci-op-wldt11m3/stable-initial
INFO[2025-07-15T15:49:21Z] Imported tags on imagestream (after taking snapshot) ci-op-wldt11m3/stable
INFO[2025-07-15T16:06:16Z] Build src-amd64 succeeded after 20m7s
INFO[2025-07-15T16:06:16Z] Retrieving digests of member images
INFO[2025-07-15T16:06:17Z] Image ci-op-wldt11m3/pipeline:src created  digest=sha256:aaacbe11fcc6d7652b72003670c6510b278a01f53613e9f6dbc81d7fa100862a for-build=src
INFO[2025-07-15T16:06:17Z] Building cluster-ingress-operator
INFO[2025-07-15T16:06:17Z] Created build "cluster-ingress-operator-amd64"
INFO[2025-07-15T16:25:25Z] Build cluster-ingress-operator-amd64 succeeded after 19m8s

Then, getting a lease for the infrastructure took ~85 minutes:

INFO[2025-07-15T16:26:44Z] Acquiring leases for test e2e-aws-operator: [aws-quota-slice]
INFO[2025-07-15T17:51:27Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-1--aws-quota-slice-30]

And then, installing the cluster took ~49 minutes:

INFO[2025-07-15T17:52:57Z] Running step e2e-aws-operator-ipi-install-install.
INFO[2025-07-15T18:42:19Z] Step e2e-aws-operator-ipi-install-install succeeded after 49m22s.

So the entire CI job appears to be constrained to 4 hours, the setup took almost 3 hours, and the tests themselves had just over 1 hour to run before the 4 hours elapsed and the job was terminated.

Getting the lease should not have taken so long. I hope that the issue that caused the delay has been resolved.
/test e2e-aws-operator

@Miciah (Contributor Author) commented Jul 16, 2025

e2e-aws-operator looks good. Let's try the other AWS jobs now.

/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-aws-operator-techpreview
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-serial
/test e2e-aws-ovn-techpreview

@lihongan (Contributor)

/test e2e-aws-gatewayapi-conformance

@lihongan (Contributor)

No issues found in the pre-merge test:

$ oc -n openshift-operators get csv
NAME                          DISPLAY                            VERSION   REPLACES                      PHASE
servicemeshoperator3.v3.0.3   Red Hat OpenShift Service Mesh 3   3.0.3     servicemeshoperator3.v3.0.2   Succeeded

$ oc get istio
NAME                REVISIONS   READY   IN USE   ACTIVE REVISION     STATUS    VERSION   AGE
openshift-gateway   1           1       1        openshift-gateway   Healthy   v1.24.6   61m

$ oc -n openshift-ingress get gateway
NAME    CLASS               ADDRESS                                                                  PROGRAMMED   AGE
gwapi   openshift-default   a0e2b5f38e6ae44cc8227004bdcf50ed-210799710.us-west-1.elb.amazonaws.com   True         38m

$ oc get httproute
NAME      HOSTNAMES                                                   AGE
myroute   ["test.gwapi.ci-ln-9igmm2k-76ef8.aws-2.ci.openshift.org"]   37m

$ curl http://test.gwapi.ci-ln-9igmm2k-76ef8.aws-2.ci.openshift.org 
Hello-OpenShift web-server-6d6cfb97fc-c6bgb http-8080

@Miciah force-pushed the NE-2022-bump-to-OSSM-3.0.1 branch from 186e633 to f1e445d on July 21, 2025 18:16
@Miciah (Contributor Author) commented Jul 21, 2025

https://github.com/openshift/cluster-ingress-operator/compare/186e633124c8ad819ef20c429df3a93be7d4987e..f1e445da0c864fc0c15ed3f90b7f3f2f6483a014 drops the OSSM 3.0.3 bump so that we can backport a single-version bump, which should enable automatic updates.

@alebedev87 (Contributor)

/assign

@alebedev87 (Contributor)

/lgtm
/approve

@openshift-ci bot added the lgtm label Jul 22, 2025
@openshift-ci bot commented Jul 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label Jul 22, 2025
@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD cbc0b21 and 2 for PR HEAD f1e445d in total

@lihongan (Contributor)

Redid the pre-merge test with the PR; the CSV and Istio look good, but I can see two install plans, and both have Manual approval. Is that expected?
Does "enable automatic updates" mean that "Approval" should be set to "Automatic"?

$ oc -n openshift-operators get csv
NAME                          DISPLAY                            VERSION   REPLACES                      PHASE
servicemeshoperator3.v3.0.1   Red Hat OpenShift Service Mesh 3   3.0.1     servicemeshoperator3.v3.0.0   Succeeded

$ oc get istio
NAME                REVISIONS   READY   IN USE   ACTIVE REVISION     STATUS    VERSION   AGE
openshift-gateway   1           1       0        openshift-gateway   Healthy   v1.24.4   2m35s

$ oc -n openshift-operators get installplan
NAME            CSV                           APPROVAL   APPROVED
install-2knxb   servicemeshoperator3.v3.0.2   Manual     false
install-fdnff   servicemeshoperator3.v3.0.1   Manual     true

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD cbc0b21 and 2 for PR HEAD f1e445d in total

@alebedev87 (Contributor)

@lihongan:

but I can see two installplans and both are Manual approval, is that expected?

Yes. OLM creates the "next in upgrade graph" install plan after the current install plan has been applied. This allows subscriptions with Manual approval to keep upgrading.

Does "enable automatic update" mean the "Approval" should be set as "Automatic" ?

No. It means that if the OSSM operator bump is the next step in the upgrade graph and no special upgrade logic is needed, the cluster-ingress-operator's current install-plan approval logic works. For example, in this PR we have v3.0.0 in master and bump to v3.0.1, which is next in the upgrade graph. Note, though, that if this PR bumped to v3.0.3, the OSSM operator would not be upgraded by the cluster-ingress-operator's current install-plan approval logic. https://issues.redhat.com/browse/NE-2097 is the Jira ticket for implementing upgrade logic that would allow version jumps (from the current version to any version).
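
For reference, a minimal sketch of the kind of install-plan approval logic being described, assuming a controller-runtime client; this is not the cluster-ingress-operator's actual implementation, and the function and parameter names are made up for the example:

package main

import (
    "context"
    "fmt"
    "slices"

    operatorsv1alpha1 "github.com/operator-framework/api/pkg/operators/v1alpha1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// approveInstallPlanFor approves the pending InstallPlan that installs the
// expected CSV (e.g. "servicemeshoperator3.v3.0.1"). A plan for any other
// version, such as a multi-version jump to v3.0.3, is left unapproved.
func approveInstallPlanFor(ctx context.Context, c client.Client, namespace, wantCSV string) error {
    plans := &operatorsv1alpha1.InstallPlanList{}
    if err := c.List(ctx, plans, client.InNamespace(namespace)); err != nil {
        return err
    }
    for i := range plans.Items {
        plan := &plans.Items[i]
        if !plan.Spec.Approved && slices.Contains(plan.Spec.ClusterServiceVersionNames, wantCSV) {
            plan.Spec.Approved = true
            return c.Update(ctx, plan)
        }
    }
    return fmt.Errorf("no unapproved InstallPlan for CSV %q in namespace %s", wantCSV, namespace)
}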

@openshift-ci bot commented Jul 23, 2025

@Miciah: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-gatewayapi 0478030 link false /test e2e-aws-gatewayapi
ci/prow/okd-scos-e2e-aws-ovn f1e445d link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn-techpreview f1e445d link false /test e2e-aws-ovn-techpreview

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD cbc0b21 and 2 for PR HEAD f1e445d in total

@openshift-merge-bot bot merged commit 4f1ed7f into openshift:master Jul 24, 2025 (20 of 22 checks passed)
@openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.20.0-202507242147.p0.g4f1ed7f.assembly.stream.el9.
All builds following this will include this PR.
