[NE-2183] Implement GatewayAPI status controller #1294
/retest
Add a test with 2 gateways that conflict on the DNSRecord to see how they behave.
/assign @bentito @davidesalerno @alebedev87
[APPROVALNOTIFIER] This PR is NOT APPROVED.
Tests weren't failing before; the change made for the empty service name or empty DNSRecord probably broke something. Will check.
Refactor the ingress controller status functions and move them to a new package, so they can be reused by other controllers such as GatewayAPI. Additionally, move the unit test, but keep the original one in the Ingress package to show that this move introduces no breaking change.
This change adds a capability to the ingress status package to generate conditions for Gateway API as well. It converts OperatorCondition to meta.Condition, setting the right observed generation and using the upstream utilities to properly set the condition on an already existing array of conditions.
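As an illustration, here is a minimal sketch of that conversion, assuming the OperatorCondition type from github.com/openshift/api/operator/v1 and the upstream meta.SetStatusCondition helper; the function name setGatewayCondition is hypothetical:

```go
import (
	operatorv1 "github.com/openshift/api/operator/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setGatewayCondition (hypothetical name) converts an OperatorCondition into a
// metav1.Condition and merges it into an existing slice. meta.SetStatusCondition
// preserves LastTransitionTime when the status value has not changed.
func setGatewayCondition(conditions *[]metav1.Condition, c operatorv1.OperatorCondition, observedGeneration int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               c.Type,
		Status:             metav1.ConditionStatus(c.Status), // "True"/"False"/"Unknown" map one-to-one
		Reason:             c.Reason,
		Message:            c.Message,
		ObservedGeneration: observedGeneration,
	})
}
```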
This commit moves common components and constants out of the Gateway controller, preparing them to be reused by additional Gateway API controllers and components instead of repeating constants and common functions.
This commit introduces the new Gateway Status controller. This controller is responsible for adding OpenShift-specific conditions to a Gateway resource managed by CIO in the 'openshift-ingress' namespace. The conditions added are similar to the ones on Ingress resources, allowing admins to detect when a LoadBalancer or a DNSRecord was provisioned and, in case of failure, what went wrong.
The failing tests are not related to the change; I saw the same HyperShift job failing on other PRs.
/retest
@rikatz: The following tests failed.
/assign @Miciah
    return err
    }
    ...
    *t = *list.Items[0].DeepCopy()
Here, and in the case below: should we at least order this list before taking the first item, so it's deterministic which one we take?
AFAIR the Kubernetes list is already deterministic, ordered alphabetically. As this function is used to get "at least one" resource, filtering by a label with the gateway name and in the openshift-ingress namespace (which is limited), I can see a long-term case where 2 resources will be returned.
Otherwise, what's your suggestion for this ordering, other than by name? creationTimestamp?
WDYT?
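For reference, a deterministic pick along the lines discussed here could sort by creationTimestamp with name as a tie-breaker. This is only a sketch of the idea, not the PR's code; the helper name firstService is hypothetical:

```go
import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// firstService (hypothetical helper) returns a stable "first" Service from a
// list: oldest creationTimestamp wins, with name as the tie-breaker.
func firstService(list *corev1.ServiceList) *corev1.Service {
	if len(list.Items) == 0 {
		return nil // caller treats "nothing yet" separately
	}
	sort.Slice(list.Items, func(i, j int) bool {
		a, b := &list.Items[i], &list.Items[j]
		if !a.CreationTimestamp.Equal(&b.CreationTimestamp) {
			return a.CreationTimestamp.Before(&b.CreationTimestamp)
		}
		return a.Name < b.Name
	})
	return &list.Items[0]
}
```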
    var errs []error
    ...
    childSvc := &corev1.Service{}
    err := fetchFirstMatchingFromGateway(ctx, r.cache, childSvc, gateway)
We log the error when the child Service lookup fails, but we still fall through, compute conditions with a nil child, and return nil from Reconcile. That treats the reconcile as a success even if the list/get call failed or the Service simply has not been created yet, so we never requeue and may publish temporary “ServiceNotFound” status from a transient read glitch. Please consider appending the error (or returning a short requeue) when we miss either child so we retry once the operands exist, and update the missing loadbalancer and dnsrecord unit test accordingly.
Right, this IMO is expected, and let me explain why:
- Both calls will try to fetch the first matching resource. As an example, when you create a Gateway, you may or may not have the Service and DNSRecord created yet. If they are not created, this is an error, but the Gateway creation is still in progress; these conditions are "desired". The only other error that can happen here is if you pass an unsupported type (like &corev1.Secret), but this is a private and controlled function, so it is more of a defensive measure added to the util function.
- The resource instantiated here (childSvc) will indeed be passed to the computation as nil, but in that case the approach is the same as the existing one in the "ingress/status" controller; here we are keeping consistency. Re-adding it to the queue (e.g. returning an error) may end up in a loop.
- We do watch the Service and DNSRecord as part of "what should reconcile this" (in controller.go line 120), so any change to these resources (e.g. they now exist) will trigger a new reconciliation.

One thing that can go wrong here, and that I can probably take care of, is:
- If the list is empty, it is not a logic error, and we should just wait for the watcher to notice the resource appearing in the cache.
- If the error from the client in fetchFirstMatchingFromGateway is something else, then it is a real error: add it to the errs array, so a reconciliation will happen in case of a problem here.

So an empty list is not an error; a problem with the Kubernetes client is. We continue the reconciliation to keep the same logic as the ingress resource (keep computing the status), but return an error so it would be reconciled immediately. (See the sketch below.)
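A sketch of that split, assuming the fetchFirstMatchingFromGateway shape visible in the diff; the errNotFound sentinel and the gateway-name label key are assumptions for illustration:

```go
import (
	"context"
	"errors"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// errNotFound distinguishes "no operand yet" (wait for the watch) from a real
// client failure (append to errs so the reconcile is retried immediately).
var errNotFound = errors.New("no matching resource found yet")

func fetchFirstMatchingFromGateway(ctx context.Context, reader client.Reader, obj client.Object, gw *gatewayapiv1.Gateway) error {
	switch t := obj.(type) {
	case *corev1.Service:
		list := &corev1.ServiceList{}
		if err := reader.List(ctx, list,
			client.InNamespace(gw.Namespace),
			client.MatchingLabels{"gateway.networking.k8s.io/gateway-name": gw.Name}, // assumed label
		); err != nil {
			return err // real client error: caller adds it to errs
		}
		if len(list.Items) == 0 {
			return errNotFound // not a logic error: the Service watch will retrigger us
		}
		*t = *list.Items[0].DeepCopy()
		return nil
	default:
		// Defensive: this is a private function, so this only guards programmer error.
		return fmt.Errorf("unsupported type %T", obj)
	}
}
```

The caller would then append the error to errs only when !errors.Is(err, errNotFound), and still compute conditions with the possibly-empty child either way.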
    gateway.Status.Conditions = make([]metav1.Condition, 0)
    }
    ...
    // WARNING: one thing to be aware of is that conditions on the Gateway resource are limited to 8: https://github.com/kubernetes-sigs/gateway-api/blob/a8fe5c8732a37ef471d86afaf570ff8ad0ef0221/apis/v1/gateway_types.go#L691
SetStatusCondition happily appends new condition types, so if the Gateway already has eight conditions from other controllers, adding our four will push the list over the API's limit and the status patch will be rejected. Could we guard against this before patching? Also, we could save on condition count by de-duping by type and dropping the least useful of ours if we'd exceed eight.
"de-duping by type and dropping the least useful of ours if we'd exceed eight?"
The idea is to keep consistency with the existing conditions on Ingress status. The big issue here is deciding what is important or not.
As an example, Istio may decide that some newer conditions should be added (ListenerSet? Policies? etc.), and in that case limiting to 8 conditions here and getting an error would be problematic.
What I am doing right now instead is having an e2e test that ensures we will always have 6 conditions (https://github.com/openshift/cluster-ingress-operator/pull/1294/files#diff-0477a883800b78b4c8704dd53de74ed35c6a97a2c809d1581ceae78e10e9094fR786).
I would defer the decision about dropping conditions from the ingress controller to @Miciah.
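For reference, the guard the reviewer suggests could be as small as the following sketch; the helper name mergeGatewayConditions is hypothetical, and maxGatewayConditions mirrors the MaxItems=8 validation linked in the diff:

```go
import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// mergeGatewayConditions (hypothetical helper) merges the desired conditions
// into a copy and refuses to exceed the CRD's 8-condition limit, instead of
// letting the API server reject the status patch.
func mergeGatewayConditions(existing, desired []metav1.Condition) ([]metav1.Condition, error) {
	const maxGatewayConditions = 8 // MaxItems validation on Gateway status conditions
	merged := append([]metav1.Condition{}, existing...)
	for _, c := range desired {
		meta.SetStatusCondition(&merged, c)
	}
	if len(merged) > maxGatewayConditions {
		return nil, fmt.Errorf("%d conditions would exceed the API limit of %d", len(merged), maxGatewayConditions)
	}
	return merged, nil
}
```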
    return reconcile.Result{}, fmt.Errorf("failed to get infrastructure 'cluster': %v", err)
    }
    ...
    operandEvents := &corev1.EventList{}
This could be a long list; can we filter by the involved object UID(s) for the Service/DNSRecord so we only examine events that we might surface in the status?
Yeah, makes sense. IIRC I did this but removed it for some reason (I think the unit test suite doesn't support field selectors...).
I will try once more and see how I can adjust the unit test for it. (Something like the sketch below.)
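A sketch of that narrowing, assuming a controller-runtime cache: a cache-backed List only honors a field selector if the field was indexed first, which is also why the fake client used in unit tests needs the same index (registered via fake.NewClientBuilder().WithIndex(...)). The helper names are hypothetical:

```go
import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// registerEventIndex indexes Events by the UID of their involved object, so a
// cache-backed List can filter on that field.
func registerEventIndex(mgr ctrl.Manager) error {
	return mgr.GetFieldIndexer().IndexField(context.Background(), &corev1.Event{},
		"involvedObject.uid",
		func(o client.Object) []string {
			return []string{string(o.(*corev1.Event).InvolvedObject.UID)}
		})
}

// eventsFor lists only the events that reference the given child Service,
// instead of every event in the operand namespace.
func eventsFor(ctx context.Context, reader client.Reader, svc *corev1.Service) (*corev1.EventList, error) {
	operandEvents := &corev1.EventList{}
	if err := reader.List(ctx, operandEvents,
		client.InNamespace(svc.Namespace),
		client.MatchingFields{"involvedObject.uid": string(svc.UID)},
	); err != nil {
		return nil, fmt.Errorf("failed to list events: %w", err)
	}
	return operandEvents, nil
}
```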
    // GatewayHasOurController returns a function that will use the provided logger and
    // clients, receive an object, and return a boolean that represents whether the provided
    // object is a Gateway managed by our Gateway Class.
    func GatewayHasOurController(logger logr.Logger, crclient client.Reader) func(o client.Object) bool {
I think it could be useful to have some basic unit tests, WDYT?
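A minimal sketch of such a test, assuming the predicate looks up the Gateway's GatewayClass and compares controller names; the class name, controller name, and scheme wiring are all assumptions:

```go
import (
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

func TestGatewayHasOurController(t *testing.T) {
	scheme := runtime.NewScheme()
	if err := gatewayapiv1.AddToScheme(scheme); err != nil {
		t.Fatal(err)
	}

	gc := &gatewayapiv1.GatewayClass{
		ObjectMeta: metav1.ObjectMeta{Name: "openshift-default"}, // assumed class name
		Spec: gatewayapiv1.GatewayClassSpec{
			ControllerName: "openshift.io/gateway-controller", // assumed controller name
		},
	}
	gw := &gatewayapiv1.Gateway{
		ObjectMeta: metav1.ObjectMeta{Name: "example", Namespace: "openshift-ingress"},
		Spec:       gatewayapiv1.GatewaySpec{GatewayClassName: "openshift-default"},
	}

	c := fake.NewClientBuilder().WithScheme(scheme).WithObjects(gc).Build()
	pred := GatewayHasOurController(zap.New(), c)
	if !pred(gw) {
		t.Errorf("expected gateway %q to be recognized as ours", gw.Name)
	}
}
```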
This change implements a new controller that watches Gateway resources (and related resources) in the openshift-ingress namespace and kicks off a new reconciliation whenever any of the resources of interest change.
The reconciliation process then adds additional status conditions to the Gateway resource, reflecting the current state of infrastructure resources like the DNSRecord and load balancer, allowing the owner of a Gateway resource to understand (or at least get initial insights into) why a load balancer or a DNSRecord is not working correctly.
The conditions are the same as those added to the ingresscontroller status. (A sketch of the watch wiring follows.)
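To make the watch relationship concrete, here is a hedged sketch of how such wiring typically looks with controller-runtime's builder; the reconciler type, r.log, and the gatewayForOperand mapping function are assumptions, not the PR's exact code:

```go
import (
	iov1 "github.com/openshift/api/operatoringress/v1"
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

func (r *reconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Only reconcile Gateways that belong to our GatewayClass.
	ourGateway := predicate.NewPredicateFuncs(GatewayHasOurController(r.log, mgr.GetClient()))
	return ctrl.NewControllerManagedBy(mgr).
		For(&gatewayapiv1.Gateway{}, builder.WithPredicates(ourGateway)).
		// Re-enqueue the owning Gateway when its child Service or DNSRecord
		// changes, so status converges once the operands exist.
		// r.gatewayForOperand is an assumed handler.MapFunc that maps an
		// operand back to reconcile.Requests for its Gateway.
		Watches(&corev1.Service{}, handler.EnqueueRequestsFromMapFunc(r.gatewayForOperand)).
		Watches(&iov1.DNSRecord{}, handler.EnqueueRequestsFromMapFunc(r.gatewayForOperand)).
		Complete(r)
}
```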