
Conversation

@Miciah (Contributor) commented May 2, 2025

Bump to OSSM 3.0.1 and Istio 1.24.4

Bump from OSSM v3.0.0 to v3.0.1 and from Istio v1.24.3 to v1.24.4.

Also, explicitly set ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT to "false" on the Istio CR. For Istio 1.24.3, OSSM has a vendor override that sets this option. However, for Istio 1.24.4, the option must be explicitly set.

Enable Gateway only CA Bundles and custom CA CM name

To avoid conflicts with user-managed control-planes, set a custom name for the CA bundle configmaps for the Istio control-plane that the operator manages. Also, configure Istio to inject the configmaps only into namespaces where gateways exist in order to avoid polluting the whole cluster.

Set one new environment variable in the Istio CR:

PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY

Set the Istio CR's trustBundleName global value to match the custom configmap name. This change requires bumping the sail-operator API:

go get github.com/istio-ecosystem/sail-operator@30be83268d6b6bfaf6fb0562a6c3e505a17422ea

This change is related to OSSM-9076.

This change incorporates #1209.
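
For illustration, a minimal sketch of what these settings amount to on the Istio CR, using an unstructured object; the operator actually sets them through the typed sail-operator API in pkg/operator/controller/gatewayclass/istio.go, the configmap name below is a placeholder, and the env values are assumed from the description above:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func main() {
    istio := &unstructured.Unstructured{Object: map[string]interface{}{}}
    istio.SetAPIVersion("sailoperator.io/v1")
    istio.SetKind("Istio")
    istio.SetName("openshift-gateway")

    // Custom CA bundle configmap name, so the operator-managed control-plane
    // does not collide with a user-managed control-plane's configmaps.
    // "openshift-gateway-ca-root-cert" is a placeholder, not the real name.
    _ = unstructured.SetNestedField(istio.Object,
        "openshift-gateway-ca-root-cert",
        "spec", "values", "global", "trustBundleName")

    // Inject the CA bundle configmap only into namespaces that contain
    // gateways, and explicitly disable manual deployment (values assumed).
    _ = unstructured.SetNestedMap(istio.Object, map[string]interface{}{
        "PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY": "true",
        "ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT":  "false",
    }, "spec", "values", "pilot", "env")

    fmt.Printf("%+v\n", istio.Object)
}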

Don't copy labels or annotations

Configure Istiod not to copy annotations or labels from gateways onto associated resources, such as the proxy deployment and load-balancer service for a gateway.

This copying behavior is Istio-specific, not part of the Gateway API spec, and could be used to inject unsupported configuration. For example, an end-user could set a service annotation on the gateway in order to configure a load-balancer. Setting annotations on the gateway to configure the load-balancer would not be portable to other Gateway API implementations and would complicate product support.

One new environment variable is set:

PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS

This change is related to OSSM-8989.
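
As a purely hypothetical illustration of the behavior being disabled: with copying enabled, Istiod would propagate an annotation like the one below from the gateway onto the Service it generates; with PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS set to "false", the annotation stays on the gateway and has no effect. The gateway name and annotation are example values only, not anything this PR creates.

package main

import (
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

func main() {
    gw := gatewayv1.Gateway{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "example-gateway",
            Namespace: "openshift-ingress",
            // With copying enabled, this cloud-provider annotation would be
            // copied onto the generated load-balancer Service, changing how
            // the load balancer is provisioned. That is Istio-specific and
            // not portable to other Gateway API implementations.
            Annotations: map[string]string{
                "service.beta.kubernetes.io/aws-load-balancer-internal": "true",
            },
        },
        Spec: gatewayv1.GatewaySpec{
            GatewayClassName: "openshift-default",
            Listeners: []gatewayv1.Listener{{
                Name:     "http",
                Port:     80,
                Protocol: gatewayv1.HTTPProtocolType,
            }},
        },
    }
    fmt.Println(gw.Name, gw.Annotations)
}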

Delete old controller-mode setting

Delete the obsolete PILOT_ENABLE_GATEWAY_CONTROLLER_MODE environment variable from the Istiod configuration. This environment variable is no longer recognized in OSSM 3, and the variable has been superseded by EnhancedResourceScoping.

assertDNSRecord: Increase timeout to 10m

Increase the timeout in assertDNSRecord for polling for the DNSRecord CR from 1 minute to 10 minutes.

The cloud provider can easily take over a minute to provision the load balancer, and the operator cannot create the DNSRecord CR before the load balancer has been provisioned and assigned a host name or address. Consequently, the polling loop could easily reach the 1-minute timeout just on account of the time that it takes to provision the load balancer.
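
A rough sketch of the kind of polling change described, assuming a controller-runtime client; the real change is in test/e2e/util_gatewayapi_test.go (assertDNSRecord), and the helper signature and client wiring here are assumptions:

package e2e

import (
    "context"
    "testing"
    "time"

    iov1 "github.com/openshift/api/operatoringress/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/apimachinery/pkg/util/wait"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// assertDNSRecordSketch polls for the DNSRecord CR for up to 10 minutes
// instead of 1, since the DNSRecord is only created after the cloud
// provider finishes provisioning the load balancer.
func assertDNSRecordSketch(t *testing.T, kclient client.Client, name types.NamespacedName) {
    t.Helper()
    err := wait.PollUntilContextTimeout(context.Background(), 5*time.Second, 10*time.Minute, true,
        func(ctx context.Context) (bool, error) {
            record := &iov1.DNSRecord{}
            if err := kclient.Get(ctx, name, record); err != nil {
                t.Logf("failed to get DNSRecord %s: %v; retrying...", name, err)
                return false, nil
            }
            return true, nil
        })
    if err != nil {
        t.Fatalf("timed out waiting for DNSRecord %s: %v", name, err)
    }
}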

testGatewayAPIManualDeployment: Increase timeout

Increase the timeout for polling the gateway, and dump the gateway if the test fails.

@openshift-ci-robot added the jira/valid-reference label May 2, 2025
@openshift-ci-robot (Contributor) commented May 2, 2025

@Miciah: This pull request references NE-2022 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.


In response to this:

Bump to OSSM 3.0.1 and Istio 1.24.4

Bump from OSSM v3.0.0 to v3.0.1 and from Istio v1.24.3 to v1.24.4.

Enable Gateway only CA Bundles and custom CA CM name

Avoid conflict with a user control plane by setting a custom CA Bundle CM name for the Gateway Control plane and enable Istio to only inject CA Bundle CMs in namespaces where Gateways exist to avoid polluting the whole cluster.

Two new environment variables are set for the Istio control plane deployment CR:

PILOT_CA_CERT_CONFIGMAP
PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY

This change is related to OSSM-9076.

Don't copy labels or annotations

Configure Istiod not to copy annotations or labels from gateways onto associated resources, such as the proxy deployment and load-balancer service for a gateway.

This copying behavior is Istio-specific, not part of the Gateway API spec, and could be used to inject unsupported configuration. For example, an end-user could set a service annotation on the gateway in order to configure a load-balancer. Setting annotations on the gateway to configure the load-balancer would not be portable to other Gateway API implementations and would complicate product support.

One new environment variable is set:

PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS

This change is related to OSSM-8989.

Delete old controller-mode setting

Delete the obsolete PILOT_ENABLE_GATEWAY_CONTROLLER_MODE environment variable from the Istiod configuration. This environment variable is no longer recognized in OSSM 3, and the variable has been superseded by EnhancedResourceScoping.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci bot requested review from frobware and gcs278 May 2, 2025 17:01
@Miciah (Contributor Author) commented May 2, 2025

e2e-hypershift failed because TestCreateClusterV2/Main/break-glass-credentials/independent_signers failed. The failure appears to be the same as the one tracked in OCPBUGS-44582.

@Miciah (Contributor Author) commented May 2, 2025

e2e-aws-gatewayapi failed because TestGatewayAPI/testGatewayAPIObjects and TestGatewayAPI/testGatewayAPIManualDeployment failed.

Once the httproute is accepted, the testGatewayAPIObjects test only polls for 1 minute, which is not necessarily enough time to provision an ELB. In this case, it took over 1 minute, from T18:31:56 to T18:33:16:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1227/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1918349792623202304/artifacts/e2e-aws-gatewayapi/gather-extra/artifacts/events.json' | jq -c '.items.[]|select(.involvedObject.name == "test-gateway-openshift-default" and .source.component == "service-controller")|[.firstTimestamp, .message]'
["2025-05-02T18:31:56Z","Ensuring load balancer"]
["2025-05-02T18:31:57Z","Error syncing load balancer: failed to ensure load balancer: TooManyLoadBalancers: Exceeded quota of account 130757279292\n\tstatus code: 400, request id: 22c489e5-6ce3-42d2-b054-873e041491dd"]
["2025-05-02T18:32:03Z","Error syncing load balancer: failed to ensure load balancer: TooManyLoadBalancers: Exceeded quota of account 130757279292\n\tstatus code: 400, request id: 9c395cac-7a89-4c05-9a48-981dddff62c6"]
["2025-05-02T18:32:13Z","Error syncing load balancer: failed to ensure load balancer: TooManyLoadBalancers: Exceeded quota of account 130757279292\n\tstatus code: 400, request id: ce39636a-cc44-4964-9e9c-230eb61f0274"]
["2025-05-02T18:32:34Z","Error syncing load balancer: failed to ensure load balancer: TooManyLoadBalancers: Exceeded quota of account 130757279292\n\tstatus code: 400, request id: f8817ecb-fc85-4ee7-85ee-9f1fb6ac037e"]
["2025-05-02T18:33:16Z","Ensured load balancer"]
["2025-05-02T18:43:05Z","Deleting load balancer"]
["2025-05-02T18:43:26Z","Deleted load balancer"]
% 

I will push a commit to increase the timeout to 10 minutes.

It is less clear why testGatewayAPIManualDeployment failed. Istiod logs show that it observed the gateway at T18:32:59:

2025-05-02T18:32:59.423291Z	info	ads	Push debounce stable[7] 2 for config Gateway/openshift-ingress/manual-deployment: 101.0015ms since last change, 110.068219ms since last push, full=true
2025-05-02T18:32:59.540520Z	info	ads	Push debounce stable[8] 1 for config Gateway/openshift-ingress/manual-deployment: 100.231743ms since last change, 100.231574ms since last push, full=true
{"metadata":{"name":"manual-deployment.183bc973b47c7b74","namespace":"openshift-ingress","uid":"9eaf145b-968c-40d6-8c0c-896cbdfafb36","resourceVersion":"35647","creationTimestamp":"2025-05-02T18:32:59Z"},"reason":"AddedLabel","message":"Added label istio.io/rev=openshift-gateway to gateway manual-deployment","source":{"component":"gateway_labeler_controller"},"firstTimestamp":"2025-05-02T18:32:59Z","lastTimestamp":"2025-05-02T18:32:59Z","count":1,"type":"Normal","eventTime":null,"reportingComponent":"gateway_labeler_controller","reportingInstance":"","involvedObject":{"kind":"Gateway","namespace":"openshift-ingress","name":"manual-deployment","uid":"2bc7c5db-8a21-4404-90eb-dfd32daa5b68","apiVersion":"gateway.networking.k8s.io/v1","resourceVersion":"35646","labels":{"istio.io/rev":"openshift-gateway"}}}
2025-05-02T18:34:04.456589Z	info	ads	Push debounce stable[17] 1 for config Gateway/openshift-ingress/manual-deployment: 100.247504ms since last change, 100.247364ms since last push, full=true

I do see that the Istiod pod starts failing readiness probes at T18:31:55 and gets stopped at T18:34:10, with a new pod created at T18:34:11:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1227/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1918349792623202304/artifacts/e2e-aws-gatewayapi/gather-extra/artifacts/events.json' | jq -c '[.items.[]|select(.involvedObject.name//""|startswith("istiod-"))]|sort_by(.firstTimestamp//.metadata.creationTimestamp)|.[]|[.firstTimestamp//.metadata.creationTimestamp, .source.component, .message]'
["2025-05-02T18:31:46Z",null,"Successfully assigned openshift-ingress/istiod-openshift-gateway-d88579fb6-8svpf to ip-10-0-99-198.us-east-2.compute.internal"]
["2025-05-02T18:31:46Z","replicaset-controller","Created pod: istiod-openshift-gateway-d88579fb6-8svpf"]
["2025-05-02T18:31:46Z","controllermanager","No matching pods found"]
["2025-05-02T18:31:46Z","deployment-controller","Scaled up replica set istiod-openshift-gateway-d88579fb6 from 0 to 1"]
["2025-05-02T18:31:47Z","multus","Add eth0 [10.128.2.15/23] from ovn-kubernetes"]
["2025-05-02T18:31:47Z","kubelet","Pulling image \"registry.redhat.io/openshift-service-mesh/istio-pilot-rhel9@sha256:36557fa4817c3d0bac499dc65a4a9d6673500681641943d2f8cec5bfec4355be\""]
["2025-05-02T18:31:54Z","kubelet","Successfully pulled image \"registry.redhat.io/openshift-service-mesh/istio-pilot-rhel9@sha256:36557fa4817c3d0bac499dc65a4a9d6673500681641943d2f8cec5bfec4355be\" in 6.862s (6.862s including waiting). Image size: 197874371 bytes."]
["2025-05-02T18:31:54Z","kubelet","Created container: discovery"]
["2025-05-02T18:31:54Z","kubelet","Started container discovery"]
["2025-05-02T18:31:55Z","kubelet","Readiness probe error: HTTP probe failed with statuscode: 503\nbody: \n"]
["2025-05-02T18:31:55Z","kubelet","Readiness probe failed: HTTP probe failed with statuscode: 503"]
["2025-05-02T18:32:01Z","horizontal-pod-autoscaler","failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API"]
["2025-05-02T18:32:01Z","horizontal-pod-autoscaler","invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API"]
["2025-05-02T18:34:10Z","kubelet","Stopping container discovery"]
["2025-05-02T18:34:11Z",null,"Successfully assigned openshift-ingress/istiod-openshift-gateway-d88579fb6-s4rh7 to ip-10-0-99-198.us-east-2.compute.internal"]
["2025-05-02T18:34:11Z","replicaset-controller","Created pod: istiod-openshift-gateway-d88579fb6-s4rh7"]
["2025-05-02T18:34:11Z","deployment-controller","Scaled up replica set istiod-openshift-gateway-d88579fb6 from 0 to 1"]
["2025-05-02T18:34:11Z","controllermanager","No matching pods found"]
["2025-05-02T18:34:12Z","multus","Add eth0 [10.128.2.16/23] from ovn-kubernetes"]
["2025-05-02T18:34:12Z","kubelet","Container image \"registry.redhat.io/openshift-service-mesh/istio-pilot-rhel9@sha256:36557fa4817c3d0bac499dc65a4a9d6673500681641943d2f8cec5bfec4355be\" already present on machine"]
["2025-05-02T18:34:12Z","kubelet","Created container: discovery"]
["2025-05-02T18:34:12Z","kubelet","Started container discovery"]
["2025-05-02T18:34:17Z","horizontal-pod-autoscaler","failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API"]
["2025-05-02T18:34:17Z","horizontal-pod-autoscaler","invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API"]
% 

Maybe testGatewayAPIManualDeployment just happened to catch Istiod at a bad time. I'll increase its log output and timeout.

Finally, during Istiod's shutdown, it logged some permissions errors:

2025-05-02T18:34:10.811160Z	error	klog	Failed to release lock: leases.coordination.k8s.io "istio-gateway-ca-openshift-gateway" is forbidden: User "system:serviceaccount:openshift-ingress:istiod-openshift-gateway" cannot update resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-ingress"
2025-05-02T18:34:10.811187Z	info	leader election lock lost: istio-gateway-ca-openshift-gateway
2025-05-02T18:34:10.811434Z	error	klog	Failed to release lock: leases.coordination.k8s.io "istio-gateway-deployment-openshift-gateway" is forbidden: User "system:serviceaccount:openshift-ingress:istiod-openshift-gateway" cannot update resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-ingress"
2025-05-02T18:34:10.811505Z	info	leader election lock lost: istio-gateway-deployment-openshift-gateway
2025-05-02T18:34:10.811604Z	error	klog	Failed to release lock: configmaps "istio-gateway-status-leader" is forbidden: User "system:serviceaccount:openshift-ingress:istiod-openshift-gateway" cannot update resource "configmaps" in API group "" in the namespace "openshift-ingress"
2025-05-02T18:34:10.811617Z	info	leader election lock lost: istio-gateway-status-leader

I'll check with the Service Mesh team about those errors.

@openshift-ci-robot (Contributor) commented May 2, 2025

@Miciah: This pull request references NE-2022 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.


In response to this:

Bump to OSSM 3.0.1 and Istio 1.24.4

Bump from OSSM v3.0.0 to v3.0.1 and from Istio v1.24.3 to v1.24.4.

Enable Gateway only CA Bundles and custom CA CM name

Avoid conflict with a user control plane by setting a custom CA Bundle CM name for the Gateway Control plane and enable Istio to only inject CA Bundle CMs in namespaces where Gateways exist to avoid polluting the whole cluster.

Two new environment variables are set for the Istio control plane deployment CR:

PILOT_CA_CERT_CONFIGMAP
PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY

This change is related to OSSM-9076.

Don't copy labels or annotations

Configure Istiod not to copy annotations or labels from gateways onto associated resources, such as the proxy deployment and load-balancer service for a gateway.

This copying behavior is Istio-specific, not part of the Gateway API spec, and could be used to inject unsupported configuration. For example, an end-user could set a service annotation on the gateway in order to configure a load-balancer. Setting annotations on the gateway to configure the load-balancer would not be portable to other Gateway API implementations and would complicate product support.

One new environment variable is set:

PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS

This change is related to OSSM-8989.

Delete old controller-mode setting

Delete the obsolete PILOT_ENABLE_GATEWAY_CONTROLLER_MODE environment variable from the Istiod configuration. This environment variable is no longer recognized in OSSM 3, and the variable has been superseded by EnhancedResourceScoping.

assertDNSRecord: Increase timeout to 10m

Increase the timeout in assertDNSRecord for polling for the DNSRecord CR from 1 minute to 10 minutes.

The cloud provider can easily take over a minute to provision the load balancer, and the operator cannot create the DNSRecord CR before the load balancer has been provisioned and assigned a host name or address. Consequently, the polling loop could easily reach the 1-minute timeout just on account of the time that it takes to provision the load balancer.

testGatewayAPIManualDeployment: Increase timeout

Increase the timeout for polling the gateway, and dump the gateway if the test fails.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Miciah (Contributor Author) commented May 2, 2025

e2e-aws-operator failed on testGatewayAPIManualDeployment again. The gateway has this status:

          - type: Programmed
            status: "True"
            observedgeneration: 1
            lasttransitiontime: "2025-05-02T22:28:10Z"
            reason: Programmed
            message: Resource programmed, assigned to service(s) router-internal-default.openshift-ingress.svc.cluster.local:80

This might reflect a real regression in OSSM 3.0.1.

testGatewayAPIObjects found the DNSRecord CR:

util_gatewayapi_test.go:1048: Found DNSRecord openshift-ingress/test-gateway-7d8d8f5f88-wildcard Published=True

However, it timed out waiting for the DNS record to resolve:

util_gatewayapi_test.go:903: GET test-hostname-5qzdj.gws.ci-op-1mssq3hn-43abb.origin-ci-int-aws.dev.rhcloud.com failed: GET test-hostname-5qzdj.gws.ci-op-1mssq3hn-43abb.origin-ci-int-aws.dev.rhcloud.com failed: Get "http://test-hostname-5qzdj.gws.ci-op-1mssq3hn-43abb.origin-ci-int-aws.dev.rhcloud.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers), retrying...
util_gatewayapi_test.go:918: Error connecting to test-hostname-5qzdj.gws.ci-op-1mssq3hn-43abb.origin-ci-int-aws.dev.rhcloud.com: context deadline exceeded

It is possible that DNS caching or propagation delay caused this failure.

/test e2e-aws-operator

@Miciah (Contributor Author) commented May 2, 2025

Actually, in the e2e-aws-operator job, it looks like the DNS name did eventually resolve (though it took a while), but then the connection timed out.

@openshift-ci-robot (Contributor) commented May 28, 2025

@Miciah: This pull request references NE-2022 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.


In response to this:

Bump to OSSM 3.0.1 and Istio 1.24.4

Bump from OSSM v3.0.0 to v3.0.1 and from Istio v1.24.3 to v1.24.4.

Enable Gateway only CA Bundles and custom CA CM name

Avoid conflict with a user control plane by setting a custom CA Bundle CM name for the Gateway Control plane and enable Istio to only inject CA Bundle CMs in namespaces where Gateways exist to avoid polluting the whole cluster.

Two new environment variables are set for the Istio control plane deployment CR:

PILOT_CA_CERT_CONFIGMAP
PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY

This change is related to OSSM-9076.

This change incorporates #1209.

Don't copy labels or annotations

Configure Istiod not to copy annotations or labels from gateways onto associated resources, such as the proxy deployment and load-balancer service for a gateway.

This copying behavior is Istio-specific, not part of the Gateway API spec, and could be used to inject unsupported configuration. For example, an end-user could set a service annotation on the gateway in order to configure a load-balancer. Setting annotations on the gateway to configure the load-balancer would not be portable to other Gateway API implementations and would complicate product support.

One new environment variable is set:

PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS

This change is related to OSSM-8989.

Delete old controller-mode setting

Delete the obsolete PILOT_ENABLE_GATEWAY_CONTROLLER_MODE environment variable from the Istiod configuration. This environment variable is no longer recognized in OSSM 3, and the variable has been superseded by EnhancedResourceScoping.

assertDNSRecord: Increase timeout to 10m

Increase the timeout in assertDNSRecord for polling for the DNSRecord CR from 1 minute to 10 minutes.

The cloud provider can easily take over a minute to provision the load balancer, and the operator cannot create the DNSRecord CR before the load balancer has been provisioned and assigned a host name or address. Consequently, the polling loop could easily reach the 1-minute timeout just on account of the time that it takes to provision the load balancer.

testGatewayAPIManualDeployment: Increase timeout

Increase the timeout for polling the gateway, and dump the gateway if the test fails.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Miciah (Contributor Author) commented Jun 30, 2025

/retest

@Miciah mentioned this pull request Jun 30, 2025
@Miciah (Contributor Author) commented Jul 1, 2025

e2e-aws-operator failed in testGatewayAPIObjects and testGatewayAPIManualDeployment.

testGatewayAPIManualDeployment failed for the same reason as before: Istio allowed the gateway to specify an existing service (i.e. manual deployment) when Istio should have created a new LB service for the gateway (i.e. automated deployment). The same issue occurred when bumping OSSM to 3.0.1 and Istio to 1.24.4 without specifying any of the new options in OSSM 3.0.1/Istio 1.24.4 (see #1228), but the issue did not occur when bumping OSSM to 3.0.1 while leaving Istio at 1.24.3 (see #1238). This is strong evidence of a regression in Istio 1.24.4.

testGatewayAPIObjects failed because the test gateway pod failed to start; the kubelet emitted an event with the following message: MountVolume.SetUp failed for volume "istiod-ca-cert" : configmap "istio-ca-root-cert" not found. The same issue does not occur when bumping OSSM to 3.0.1 and Istio to 1.24.4 without specifying any of the new options in OSSM 3.0.1/Istio 1.24.4 (see #1228 again). This is strong evidence of an issue with the new PILOT_CA_CERT_CONFIGMAP and PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY options.

(testGatewayAPIObjects still fails with #1228. The e2e-aws-operator job artifacts didn't have the events, so I looked at an e2e-gcp-operator job run; there, the test-gateway pod did start, but it did not respond to client HTTP requests or the kubelet's readiness probes, so while the test got further, it still failed on some other issue.)

@Miciah force-pushed the NE-2022-bump-to-OSSM-3.0.1 branch from 0478030 to d325731 on July 2, 2025 18:03
@Miciah (Contributor Author) commented Jul 2, 2025

https://github.com/openshift/cluster-ingress-operator/compare/0478030d2564a309f2514e80eb97c0dd65804fdb..d3257317c7c0c8e14c8c0704b2a40bd81f95b527 sets spec.values.global.trustBundleName. I hope that this change will resolve the MountVolume.SetUp failed for volume "istiod-ca-cert" : configmap "istio-ca-root-cert" not found errors.

@Miciah (Contributor Author) commented Jul 2, 2025

We are now blocked on #1236 for the Go 1.24 bump, which the sail-operator bump requires.

@Miciah (Contributor Author) commented Jul 3, 2025

#1236 has merged.

/retest

@Miciah force-pushed the NE-2022-bump-to-OSSM-3.0.1 branch from d325731 to ece76fd on July 7, 2025 19:25
@Miciah (Contributor Author) commented Jul 8, 2025

https://github.com/openshift/cluster-ingress-operator/compare/d3257317c7c0c8e14c8c0704b2a40bd81f95b527..ece76fddf9ce97b985fddc36aafcc73197c28f17 removes PILOT_CA_CERT_CONFIGMAP on @dgn's suggestion (setting trustBundleName should be sufficient).

e2e-aws-operator failed because testGatewayAPIObjects and testGatewayAPIManualDeployment failed. However, this time the Istiod pod started, but testGatewayAPIObjects failed because it got HTTP errors when polling the route.

e2e-gcp-operator failed because not all the nodes came up, and e2e-azure-operator failed only on testGatewayAPIManualDeployment; testGatewayAPIObjects succeeded.

I did some manual testing with an AWS cluster and could not reproduce the HTTP errors when running testGatewayAPIObjects. The test still failed, but due to DNS lookup failures related to DNS caching issues. When I curled the route that the test created, I observed HTTP 200 responses.

/test e2e-aws-operator
/test e2e-azure-operator
/test e2e-gcp-operator

Miciah and others added 6 commits July 9, 2025 08:32

Bump to OSSM 3.0.1 and Istio 1.24.4

Also, explicitly set ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT to "false" on
the Istio CR.  For Istio 1.24.3, OSSM has a vendor override that sets
this option[1].  However, for Istio 1.24.4, the option must be
explicitly set.

1. https://github.com/openshift-service-mesh/sail-operator/blob/3bf27ee3c4fb4494ffe6028c7f72034c5a7a1e60/pkg/istiovalues/vendor_defaults.yaml#L11-L14

This commit resolves NE-2022.

https://issues.redhat.com/browse/NE-2022

* cmd/ingress-operator/start.go (defaultGatewayAPIOperatorVersion):
* manifests/02-deployment-ibm-cloud-managed.yaml
(GATEWAY_API_OPERATOR_VERSION):
* manifests/02-deployment.yaml
(GATEWAY_API_OPERATOR_VERSION): Bump from OSSM v3.0.0 to v3.0.1.
* pkg/operator/controller/gatewayclass/istio.go (desiredIstio): Bump
from Istio v1.24.3 to v1.24.4.  Set ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT
to "false".

Enable Gateway only CA Bundles and custom CA CM name

To avoid conflicts with user-managed control-planes, set a custom name
for the CA bundle configmaps for the Istio control-plane that the
operator manages.  Also, configure Istio to inject the configmaps only
into namespaces where gateways exist in order to avoid polluting the
whole cluster.

Set one new environment variable in the Istio CR:

    PILOT_ENABLE_GATEWAY_API_CA_CERT_ONLY

Set the Istio CR's trustBundleName global value to match the custom
configmap name.  This change requires bumping the sail-operator API:

    go get github.com/istio-ecosystem/sail-operator@30be83268d6b6bfaf6fb0562a6c3e505a17422ea

This commit is related to OSSM-9076.

* go.mod: Bump github.com/istio-ecosystem/sail-operator.
* go.sum:
* vendor/*/: Regenerate.
* pkg/operator/controller/gatewayclass/istio.go (desiredIstio): Set the
new environment variable and the trustBundleName field.
* pkg/operator/controller/names.go (OpenShiftGatewayCARootCertName): New
const.

Modified-by: Miciah Masters <miciah.masters@gmail.com>

Don't copy labels or annotations

Configure Istiod not to copy annotations or labels from gateways onto
associated resources, such as the proxy deployment and load-balancer
service for a gateway.

This copying behavior is Istio-specific, not part of the Gateway API
spec, and could be used to inject unsupported configuration.  For
example, an end-user could set a service annotation on the gateway in
order to configure a load-balancer.  Setting annotations on the gateway
to configure the load-balancer would not be portable to other Gateway
API implementations and would complicate product support.

This commit is related to OSSM-8989.

https://issues.redhat.com/browse/OSSM-8989

* pkg/operator/controller/gatewayclass/istio.go (desiredIstio): Set the
"PILOT_ENABLE_GATEWAY_API_COPY_LABELS_ANNOTATIONS" to "false".

Delete old controller-mode setting

Delete the obsolete PILOT_ENABLE_GATEWAY_CONTROLLER_MODE environment
variable from the Istiod configuration.  This environment variable is
no longer recognized in OSSM 3, and the variable has been superseded
by EnhancedResourceScoping.

* pkg/operator/controller/gatewayclass/istio.go (desiredIstio): Delete
PILOT_ENABLE_GATEWAY_CONTROLLER_MODE.

assertDNSRecord: Increase timeout to 10m

Increase the timeout in assertDNSRecord for polling for the DNSRecord CR
from 1 minute to 10 minutes.

The cloud provider can easily take over a minute to provision the load balancer,
and the operator cannot create the DNSRecord CR before the load balancer has
been provisioned and assigned a host name or address.  Consequently, the polling
loop could easily reach the 1-minute timeout just on account of the time that it
takes to provision the load balancer.

* test/e2e/util_gatewayapi_test.go (assertDNSRecord): Increase timeout
for the DNSRecord CR polling loop from 1m to 10m.

testGatewayAPIManualDeployment: Increase timeout

Increase the timeout for polling the gateway, and dump the gateway if
the test fails.

* test/e2e/gateway_api_test.go (testGatewayAPIManualDeployment):
Increase the timeout for polling the gateway from 1m to 5m.  Dump the
gateway if the test fails.
@Miciah (Contributor Author) commented Jul 9, 2025

e2e-azure-operator and e2e-gcp-operator failed only on testGatewayAPIManualDeployment. e2e-aws-operator failed on both testGatewayAPIManualDeployment and testGatewayAPIDNS/gatewayListenersWithOverlappingHostname; I have filed OCPBUGS-59139 for this new flake. This time, testGatewayAPIObjects did not fail in any of these jobs; 🎉!

@Miciah force-pushed the NE-2022-bump-to-OSSM-3.0.1 branch from ece76fd to f1e445d on July 9, 2025 12:50
@Miciah (Contributor Author) commented Jul 9, 2025

https://github.com/openshift/cluster-ingress-operator/compare/ece76fddf9ce97b985fddc36aafcc73197c28f17..f1e445da0c864fc0c15ed3f90b7f3f2f6483a014 sets ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT. OSSM has an override for this setting for Istio 1.24.3 but is missing the override for Istio 1.24.4. Setting ENABLE_GATEWAY_API_MANUAL_DEPLOYMENT explicitly should resolve the testGatewayAPIManualDeployment test failures.

@Miciah (Contributor Author) commented Jul 10, 2025

/skip
to clear out the stale e2e-aws-gatewayapi job.

@Miciah (Contributor Author) commented Jul 11, 2025

e2e-aws-operator failed because the "e2e-aws-operator-ipi-deprovision-deprovision" step failed. The tests all passed.

@Miciah (Contributor Author) commented Jul 11, 2025

e2e-aws-operator-techpreview failed because the TestConnectTimeout test failed. This test failure also showed up on #1242, so I have filed OCPBUGS-59249 to track the issue.

@Miciah (Contributor Author) commented Jul 11, 2025

e2e-aws-ovn failed on the "e2e-aws-ovn-ipi-deprovision-deprovision" step, but otherwise tests were passing.

@Miciah (Contributor Author) commented Jul 11, 2025

e2e-aws-ovn-serial also failed on the "e2e-aws-ovn-serial-ipi-deprovision-deprovision" step.

@Miciah (Contributor Author) commented Jul 11, 2025

e2e-aws-ovn-single-node, e2e-aws-ovn-techpreview, and e2e-aws-ovn-upgrade all failed on the "e2e-aws-ovn-serial-ipi-deprovision-deprovision" step.

e2e-aws-ovn-techpreview also failed on [sig-api-machinery] FieldValidation should detect unknown metadata fields in both the root and embedded object of a CR:

  STEP: Creating a kubernetes client @ 07/10/25 19:52:37.714
  STEP: Building a namespace api object, basename field-validation @ 07/10/25 19:52:37.716
I0710 19:52:37.777934 20405 namespace.go:59] About to run a Kube e2e test, ensuring namespace/e2e-field-validation-7456 is privileged
  STEP: Waiting for a default service account to be provisioned in namespace @ 07/10/25 19:52:38.133
  STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 07/10/25 19:52:38.136
I0710 19:53:10.759154 20405 field_validation.go:575] Unexpected error: deleting CustomResourceDefinition: 
    <wait.errInterrupted>: 
    timed out waiting for the condition
    {
        cause: <*errors.errorString | 0xc000464ba0>{
            s: "timed out waiting for the condition",
        },
    }
  [FAILED] in [It] - k8s.io/kubernetes/test/e2e/apimachinery/field_validation.go:575 @ 07/10/25 19:53:10.759
  STEP: dump namespace information after failure @ 07/10/25 19:53:10.774
  STEP: Collecting events from namespace "e2e-field-validation-7456". @ 07/10/25 19:53:10.774
  STEP: Found 0 events. @ 07/10/25 19:53:10.784
I0710 19:53:10.786828 20405 resource.go:168] POD  NODE  PHASE  GRACE  CONDITIONS
I0710 19:53:10.786860 20405 resource.go:178] 
I0710 19:53:10.796368 20405 dump.go:81] skipping dumping cluster info - cluster too large
  STEP: Destroying namespace "e2e-field-validation-7456" for this suite. @ 07/10/25 19:53:10.796

fail [k8s.io/kubernetes/test/e2e/apimachinery/field_validation.go:575]: deleting CustomResourceDefinition: timed out waiting for the condition

Using search.ci, I found a few similar "deleting CustomResourceDefinition: timed out waiting for the condition" errors; I filed OCPBUGS-59257 to track the issue.
/test e2e-aws-ovn-techpreview

@Miciah (Contributor Author) commented Jul 12, 2025

/test e2e-aws-operator

@Miciah (Contributor Author) commented Jul 15, 2025

e2e-aws-operator failed because no worker nodes came up.

/test e2e-aws-operator

@Miciah (Contributor Author) commented Jul 16, 2025

e2e-aws-operator failed because the job timed out:

INFO[2025-07-15T18:42:45Z] Running step e2e-aws-operator-test.
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2025-07-15T19:46:04Z"}
INFO[2025-07-15T19:46:04Z] Received signal.  signal=interrupt
INFO[2025-07-15T19:46:04Z] error: Process interrupted with signal interrupt, cancelling execution...
INFO[2025-07-15T19:46:04Z] cleanup: Deleting release pod release-latest
INFO[2025-07-15T19:46:04Z] Step e2e-aws-operator-test failed after 1h3m18s.

The error message is misleading. From the timestamps, it is clear that the tests did not run for even close to 4 hours. First, building the images took ~40 minutes:

INFO[2025-07-15T15:46:09Z] Created build "src-amd64"
INFO[2025-07-15T15:49:21Z] Imported tags on imagestream (after taking snapshot) ci-op-wldt11m3/stable-initial
INFO[2025-07-15T15:49:21Z] Imported tags on imagestream (after taking snapshot) ci-op-wldt11m3/stable
INFO[2025-07-15T16:06:16Z] Build src-amd64 succeeded after 20m7s
INFO[2025-07-15T16:06:16Z] Retrieving digests of member images
INFO[2025-07-15T16:06:17Z] Image ci-op-wldt11m3/pipeline:src created  digest=sha256:aaacbe11fcc6d7652b72003670c6510b278a01f53613e9f6dbc81d7fa100862a for-build=src
INFO[2025-07-15T16:06:17Z] Building cluster-ingress-operator
INFO[2025-07-15T16:06:17Z] Created build "cluster-ingress-operator-amd64"
INFO[2025-07-15T16:25:25Z] Build cluster-ingress-operator-amd64 succeeded after 19m8s

Then, getting a lease for the infrastructure took ~85 minutes:

INFO[2025-07-15T16:26:44Z] Acquiring leases for test e2e-aws-operator: [aws-quota-slice]
INFO[2025-07-15T17:51:27Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-1--aws-quota-slice-30]

And then, installing the cluster took ~49 minutes:

INFO[2025-07-15T17:52:57Z] Running step e2e-aws-operator-ipi-install-install.
INFO[2025-07-15T18:42:19Z] Step e2e-aws-operator-ipi-install-install succeeded after 49m22s.

So the entire CI job appears to be constrained to 4 hours, the setup took almost 3 hours, and the tests themselves had just over 1 hour to run before the 4 hours elapsed and the job was terminated.

Getting the lease should not have taken so long. I hope that the issue that caused the delay has been resolved.
/test e2e-aws-operator

@Miciah (Contributor Author) commented Jul 16, 2025

e2e-aws-operator looks good. Let's try the other AWS jobs now.

/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-aws-operator-techpreview
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-serial
/test e2e-aws-ovn-techpreview

@lihongan (Contributor)

/test e2e-aws-gatewayapi-conformance

@lihongan (Contributor)

No issues found in the pre-merge test:

$ oc -n openshift-operators get csv
NAME                          DISPLAY                            VERSION   REPLACES                      PHASE
servicemeshoperator3.v3.0.3   Red Hat OpenShift Service Mesh 3   3.0.3     servicemeshoperator3.v3.0.2   Succeeded

$ oc get istio
NAME                REVISIONS   READY   IN USE   ACTIVE REVISION     STATUS    VERSION   AGE
openshift-gateway   1           1       1        openshift-gateway   Healthy   v1.24.6   61m

$ oc -n openshift-ingress get gateway
NAME    CLASS               ADDRESS                                                                  PROGRAMMED   AGE
gwapi   openshift-default   a0e2b5f38e6ae44cc8227004bdcf50ed-210799710.us-west-1.elb.amazonaws.com   True         38m

$ oc get httproute
NAME      HOSTNAMES                                                   AGE
myroute   ["test.gwapi.ci-ln-9igmm2k-76ef8.aws-2.ci.openshift.org"]   37m

$ curl http://test.gwapi.ci-ln-9igmm2k-76ef8.aws-2.ci.openshift.org 
Hello-OpenShift web-server-6d6cfb97fc-c6bgb http-8080

@Miciah force-pushed the NE-2022-bump-to-OSSM-3.0.1 branch from 186e633 to f1e445d on July 21, 2025 18:16
@Miciah (Contributor Author) commented Jul 21, 2025

https://github.com/openshift/cluster-ingress-operator/compare/186e633124c8ad819ef20c429df3a93be7d4987e..f1e445da0c864fc0c15ed3f90b7f3f2f6483a014 drops the OSSM 3.0.3 bump so that we can backport a single-version bump, which should enable automatic updates.

@alebedev87 (Contributor)

/assign

@alebedev87 (Contributor)

/lgtm
/approve

@openshift-ci bot added the lgtm label Jul 22, 2025
@openshift-ci bot commented Jul 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label Jul 22, 2025
@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD cbc0b21 and 2 for PR HEAD f1e445d in total

@lihongan (Contributor)

Redid the pre-merge test with the PR; the CSV and Istio look good, but I can see two install plans, and both have Manual approval. Is that expected?
Does "enable automatic updates" mean that "Approval" should be set to "Automatic"?

$ oc -n openshift-operators get csv
NAME                          DISPLAY                            VERSION   REPLACES                      PHASE
servicemeshoperator3.v3.0.1   Red Hat OpenShift Service Mesh 3   3.0.1     servicemeshoperator3.v3.0.0   Succeeded

$ oc get istio
NAME                REVISIONS   READY   IN USE   ACTIVE REVISION     STATUS    VERSION   AGE
openshift-gateway   1           1       0        openshift-gateway   Healthy   v1.24.4   2m35s

$ oc -n openshift-operators get installplan
NAME            CSV                           APPROVAL   APPROVED
install-2knxb   servicemeshoperator3.v3.0.2   Manual     false
install-fdnff   servicemeshoperator3.v3.0.1   Manual     true

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD cbc0b21 and 2 for PR HEAD f1e445d in total

@alebedev87 (Contributor)

@lihongan:

but I can see two installplans and both are Manual approval, is that expected?

Yes. OLM creates the "next in upgrade graph" install plan after the current install plan has been applied. This allows subscriptions with Manual approval to keep upgrading.

Does "enable automatic update" mean the "Approval" should be set as "Automatic" ?

No. It means that if the OSSM operator bump is the next step in the upgrade graph and no special upgrade logic is needed, the cluster-ingress-operator's current install-plan approval logic works. For example, in this PR we have v3.0.0 in master and bump to v3.0.1, which is next in the upgrade graph. Note, though, that if this PR bumped to v3.0.3, the OSSM operator would not be upgraded by the cluster-ingress-operator's current install-plan approval logic. https://issues.redhat.com/browse/NE-2097 is the Jira ticket for implementing upgrade logic that would allow version jumps (from the current version to any version).
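
For reference, a minimal sketch of the kind of install-plan approval logic being described, assuming a controller-runtime client; this is not the cluster-ingress-operator's actual implementation, and the function and parameter names are made up for the example:

package main

import (
    "context"
    "fmt"
    "slices"

    operatorsv1alpha1 "github.com/operator-framework/api/pkg/operators/v1alpha1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// approveInstallPlanFor approves the pending InstallPlan that installs the
// expected CSV (e.g. "servicemeshoperator3.v3.0.1"). A plan for any other
// version, such as a multi-version jump to v3.0.3, is left unapproved.
func approveInstallPlanFor(ctx context.Context, c client.Client, namespace, wantCSV string) error {
    plans := &operatorsv1alpha1.InstallPlanList{}
    if err := c.List(ctx, plans, client.InNamespace(namespace)); err != nil {
        return err
    }
    for i := range plans.Items {
        plan := &plans.Items[i]
        if !plan.Spec.Approved && slices.Contains(plan.Spec.ClusterServiceVersionNames, wantCSV) {
            plan.Spec.Approved = true
            return c.Update(ctx, plan)
        }
    }
    return fmt.Errorf("no unapproved InstallPlan for CSV %q in namespace %s", wantCSV, namespace)
}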

@openshift-ci bot commented Jul 23, 2025

@Miciah: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-gatewayapi 0478030 link false /test e2e-aws-gatewayapi
ci/prow/okd-scos-e2e-aws-ovn f1e445d link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn-techpreview f1e445d link false /test e2e-aws-ovn-techpreview

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD cbc0b21 and 2 for PR HEAD f1e445d in total

@openshift-merge-bot bot merged commit 4f1ed7f into openshift:master Jul 24, 2025 (20 of 22 checks passed)
@openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.20.0-202507242147.p0.g4f1ed7f.assembly.stream.el9.
All builds following this will include this PR.
