@rikatz
Member

@rikatz rikatz commented Oct 21, 2025

This enhancement proposal adds the Ingress Controller conditions (LoadBalancerManaged, LoadBalancerReady, DNSManaged and DNSReady) to Gateway API resources that are created with the OpenShift GatewayClass in the openshift-ingress namespace.

This proposal was partially generated with the help of Claude/AI

@openshift-ci-robot

openshift-ci-robot commented Oct 21, 2025

@rikatz: This pull request references NE-2183 which is a valid jira issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 21, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 21, 2025
@openshift-ci openshift-ci bot requested review from knobunc and rfredette October 21, 2025 18:12
@rikatz rikatz changed the title WIP: NE-2183: Initial write on Gateway API conditions NE-2183: Initial write on Gateway API conditions Oct 22, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 22, 2025
@rikatz rikatz changed the title NE-2183: Initial write on Gateway API conditions NE-2183: Openshift conditions on Gateway API status Oct 22, 2025
@Miciah
Contributor

Miciah commented Oct 30, 2025

/assign

@candita
Contributor

candita commented Oct 30, 2025

/assign

@candita
Contributor

candita commented Oct 30, 2025

/retest

@rikatz
Member Author

rikatz commented Oct 31, 2025

the error is valid, it is due to the metadata not containing real approvers and reviewers for now. Once I get some review and approval I can fix it

@alebedev87
Contributor

Forgot to add a description to the review. I was about to look at openshift/cluster-ingress-operator#1294 but then recalled that there is an EP. So I started from the EP.

@openshift-ci
Contributor

openshift-ci bot commented Nov 17, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from candita. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* Adding these conditions to user-managed Gateway resources outside the
`openshift-ingress` namespace
* Modifying or changing existing IngressController condition behavior or semantics
* Introducing custom condition types beyond DNS and LoadBalancer at this time
Contributor

So this means we don't mark IngressController as Degraded if there are problems with the Gateway?

Member Author

that's right, this proposal is just about adding conditions to the Gateway resource, not changing anything else

IngressController by:
1. Creating a shared `pkg/resources/status` package with condition computation
functions
2. Refactoring existing IngressController status code to use this shared package
Contributor

Does this mean we mark the IngressController as Degraded if DNSReady and/or LoadBalancerReady are false?

Member Author

No, this means that we are moving the condition functions previously used only by IngressController to a common place where Gateway API can reuse them. It is about the condition calculation functions (given the DNSRecord, LoadBalancer service, etc., what should the Gateway resource conditions be), but we don't touch IngressController behavior.
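The split described in this reply can be sketched roughly as follows. This is a minimal illustration with simplified stand-in types, not the operator's actual code: `computeDNSReady`, `gatewayDNSReady`, the `NoFailedZones` reason, and the messages are all assumptions; only the `DNSReady` condition type and the `RecordNotFound` reason come from the proposal text.

```go
package main

// Condition is a simplified stand-in for the operator's internal condition type.
type Condition struct {
	Type, Status, Reason, Message string
}

// GatewayCondition is a simplified stand-in for the metav1.Condition entries
// found in Gateway status.
type GatewayCondition struct {
	Type, Status, Reason, Message string
	ObservedGeneration            int64
}

// computeDNSReady is a hypothetical shared function. It depends only on its
// input, not on whether an IngressController or a Gateway carries the result.
func computeDNSReady(recordFound bool) Condition {
	if !recordFound {
		return Condition{Type: "DNSReady", Status: "False", Reason: "RecordNotFound",
			Message: "The DNSRecord resource was not found."}
	}
	// The reason used for the True case is illustrative only.
	return Condition{Type: "DNSReady", Status: "True", Reason: "NoFailedZones",
		Message: "The record is provisioned in all reported zones."}
}

// gatewayDNSReady is a hypothetical Gateway-side wrapper: it reuses the shared
// result and stamps the Gateway generation that was observed.
func gatewayDNSReady(recordFound bool, generation int64) GatewayCondition {
	c := computeDNSReady(recordFound)
	return GatewayCondition{Type: c.Type, Status: c.Status, Reason: c.Reason,
		Message: c.Message, ObservedGeneration: generation}
}
```

In this shape the IngressController status code would keep calling the shared function directly, which is why its behavior stays untouched.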

3. Cloud Provider API provisions the LoadBalancer successfully
4. LoadBalancer service status is updated with external IP/hostname
5. Cluster Ingress Operator detects the Gateway resource and begins reconciliation
6. Cluster Ingress Operator initiates DNS record provisioning through its own dns controller
Contributor

@candita candita Nov 18, 2025

nit: There is a Gateway API DNS record creator controller alongside the cluster ingress one.

Member Author

right, and it is on Cluster Ingress Operator. Am I missing something more explicit here? (like the package name?)

@openshift-ci
Contributor

openshift-ci bot commented Nov 18, 2025

@rikatz: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

zone, quota exceeded, provider API error)
4. Cluster Ingress Operator DNS controller reports failure status in the
DNSRecord resource
5. Gateway Status Controller updates Gateway condition `DNSManaged=True`
Contributor

Is there any reason DNSManaged=False? Maybe reserved for the future, if another DNS management system is selected, such as ExternalDNS?

Member Author

From CIO code:

	// In case there is no managed DNS zone configured, return a single condition
	// that DNSManaged=False because no zone is configured

Any other specific cases (like IngressController endpointpublishingstrategy) are not verified by Gateway API, so we intentionally skip them.

But yes, DNSManaged can be False when the DNSConfig doesn't specify a proper public or private zone, even on Gateway API.
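A minimal sketch of that rule, assuming a simplified `DNSConfig` with two zone identifiers. The type, field names, reasons, and messages here are hypothetical and only illustrate the "no zone configured means DNSManaged=False" case quoted above:

```go
package main

// DNSConfig is a simplified stand-in for the cluster DNS configuration.
// An empty ID means the corresponding zone is not configured.
type DNSConfig struct {
	PublicZoneID  string
	PrivateZoneID string
}

// Condition is a simplified stand-in for a status condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// computeDNSManaged is a hypothetical helper: with no zone configured there is
// nothing to manage, so it returns a single DNSManaged=False condition.
func computeDNSManaged(cfg DNSConfig) Condition {
	if cfg.PublicZoneID == "" && cfg.PrivateZoneID == "" {
		// Reason and message are illustrative only.
		return Condition{Type: "DNSManaged", Status: "False", Reason: "NoDNSZones",
			Message: "No DNS zones are defined in the cluster DNS config."}
	}
	return Condition{Type: "DNSManaged", Status: "True", Reason: "Normal",
		Message: "DNS management is supported and zones are configured."}
}
```

A caller that gets `False` here would presumably skip the DNSReady computation, since no record is expected to exist.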

5. Gateway Status Controller updates Gateway condition `DNSManaged=True`
(DNS should be managed, configuration is correct)
6. Gateway Status Controller updates Gateway condition `DNSReady=False` with
reason `FailedZones` and detailed error message from DNS provider
Contributor

Suggested change
reason `FailedZones` and detailed error message from DNS provider
reason and error message as detailed in the section on Implementation Details.

Member Author

hum, I think it is fine to keep it here. The implementation details are more about "How" we will implement, but it does make sense IMO keeping the conditions that will be used on the failure flow.

* DNS conditions apply regardless of platform if DNS records are being managed

**MicroShift:**
* MicroShift typically does not use Gateway API or cloud LoadBalancer services
Contributor

I remember hearing about a MicroShift integration, handled by MicroShift, so maybe remove the first line?

* No impact on MicroShift resource consumption or configuration

**Resource Impact:**
* Minimal CPU/memory impact: only adds condition updates during reconciliation
Contributor

It doesn't watch DNS records? During reconciliation of what object?

Member Author

it watches, but these watches are negligible from a resource impact perspective IMO. I am not sure it is worth mentioning it here (also this is related to SNO deployments, so it is not different from other resource impacts considered above)


**Resource Impact:**
* Minimal CPU/memory impact: only adds condition updates during reconciliation
* No additional controllers or processes required
Contributor

There is a new gateway-status controller mentioned below.

Member Author

@rikatz rikatz Nov 19, 2025

changed to "* A new gateway-status controller is created on existing Cluster Ingress Operator"

namespace that have the OpenShift Gateway Class as their `.spec.gatewayClassName` controller
* Associated DNSRecord and Service resources are discovered using the
`gateway.networking.k8s.io/gateway-name` label
* Only the first matching DNSRecord and Service in the same namespace are used
Contributor

@candita candita Nov 18, 2025

What's the reasoning behind only the first matching DNSRecord being used? Why not check all DNSRecords for the Gateway and report if one or more are in a failure status?

Member Author

well:

  • For services we know that Istio will provision just one Service (of type LoadBalancer) for the Gateway
  • For DNS, it is following the same logic as CIO, which receives only the "wildcard DNS record". Maybe this assumption is wrong for Gateway API, and we should compute the DNS condition from all of the provisioned DNSRecords (we do watch all of the DNSRecords related to the Gateway).

I will fix the EP here, as we need to watch all the DNSRecords from the Gateway, good catch!
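One way the "check all DNSRecords" behavior agreed on here could look, as a sketch under assumptions: the `DNSRecord` and `ZoneStatus` types are simplified stand-ins, `computeDNSReady` is a hypothetical name, and the message format and `NoFailedZones` reason are invented for illustration. The `RecordNotFound` and `FailedZones` reasons come from the proposal text.

```go
package main

import "strings"

// ZoneStatus is a simplified stand-in for one zone entry in a DNSRecord status.
type ZoneStatus struct {
	Zone      string
	Published bool
}

// DNSRecord is a simplified stand-in for the operator's DNSRecord resource.
type DNSRecord struct {
	Name  string
	Zones []ZoneStatus
}

// Condition is a simplified stand-in for a status condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// computeDNSReady inspects every DNSRecord of a Gateway instead of only the
// first match, and reports False if any zone of any record failed.
// A record with no reported zones is not counted as failed in this sketch;
// a real implementation may want to treat that case differently.
func computeDNSReady(records []DNSRecord) Condition {
	if len(records) == 0 {
		return Condition{Type: "DNSReady", Status: "False", Reason: "RecordNotFound",
			Message: "No DNSRecord was found for the Gateway."}
	}
	var failed []string
	for _, r := range records {
		for _, z := range r.Zones {
			if !z.Published {
				failed = append(failed, r.Name+"/"+z.Zone)
			}
		}
	}
	if len(failed) > 0 {
		return Condition{Type: "DNSReady", Status: "False", Reason: "FailedZones",
			Message: "Failed zones: " + strings.Join(failed, ", ")}
	}
	return Condition{Type: "DNSReady", Status: "True", Reason: "NoFailedZones",
		Message: "All records are published."}
}
```

With this shape, a Gateway with several listeners reports `DNSReady=False` as soon as one listener's record fails, which matches the multi-listener test case suggested later in the review.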


*DNSReady Condition:*
* Set to `Unknown` when DNSManagementPolicy is `Unmanaged` (OpenShift doesn't manage DNS, so status is unknown)
* Set to `False` with reason `RecordNotFound` when the associated DNSRecord resource cannot be found
Contributor

Is this more precise?

Suggested change
* Set to `False` with reason `RecordNotFound` when the associated DNSRecord resource cannot be found
* Set to `False` with reason `RecordNotFound` when one or more of the associated DNSRecord resources cannot be found


**Condition Lifecycle:**
* Conditions are added when a Gateway is reconciled in the `openshift-ingress` namespace
* Conditions are updated in-place using `condutils.SetStatusCondition()` to preserve transition times
Contributor

Why preserve transition times? Transition time should be updated if the condition changes, at least if it changes from true to false or vice versa.
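For reference, the semantics usually meant by "preserve transition times" can be sketched as below: the timestamp moves only when the status value changes, while reason and message are refreshed on every call. This mirrors how the apimachinery `meta.SetStatusCondition` helper is understood to behave; whether `condutils.SetStatusCondition()` does exactly this is an assumption, and the function below is a simplified re-implementation with stand-in types, not the real helper.

```go
package main

import "time"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type, Status, Reason, Message string
	LastTransitionTime            time.Time
}

// setStatusCondition adds or updates a condition by type. LastTransitionTime
// is set when the condition is first added and when its Status changes;
// otherwise the existing timestamp is kept.
func setStatusCondition(conds []Condition, next Condition, now time.Time) []Condition {
	for i := range conds {
		if conds[i].Type != next.Type {
			continue
		}
		if conds[i].Status != next.Status {
			// A real transition (for example True to False) updates the time.
			conds[i].Status = next.Status
			conds[i].LastTransitionTime = now
		}
		// Reason and message are always refreshed.
		conds[i].Reason = next.Reason
		conds[i].Message = next.Message
		return conds
	}
	next.LastTransitionTime = now
	return append(conds, next)
}
```

Under these semantics a change from True to False does update the transition time, so "preserve" only applies to reconciles that leave the status unchanged.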

* Maximum of 8 total conditions are maintained per Gateway to prevent unbounded growth

**Permissions:**
* The cluster-ingress-operator service account is granted RBAC permissions to:
Contributor

I would like to see it say:

Suggested change
* The cluster-ingress-operator service account is granted RBAC permissions to:
* The cluster-ingress-operator service account uses existing RBAC permissions to:

* No additional controllers or processes required
* Negligible increase in etcd storage for condition status (~1KB per Gateway)

### Implementation Details/Notes/Constraints
Contributor

One more implementation detail - how about making sure the status gets added to the must-gather troubleshooting document?

**Not applicable to all environments:**
* The LoadBalancer condition is only meaningful on cloud platforms or platforms with
`LoadBalancer` support.
* Users on bare metal may see persistent `False` or `Unknown` status which could
Contributor

@candita candita Nov 18, 2025

We should make sure that either the messages are clearly indicating that status isn't supported (e.g. "Bare metal clusters don't measure Gateway status"), or that they are at worst Unknown, not False.

Member Author

So, what we discussed so far is this: load balancers are not supported on bare metal, and we also "do not provide" the LoadBalancerManaged condition on Gateway API, so we will simply feed the "LoadBalancerReady" condition. This condition will reflect the current behavior of the CCM / bare metal provisioner. So let's say you are on bare metal:

  • If you have metallb, it will work fine
  • If you don't have a Load balancer controller, the status of the LoadBalancerReady condition will be false and the reason will be the LoadBalancer is pending, which means you don't have a LoadBalancer controller on your environment, and reflects the same behavior of CIO.

IMO as we don't have a clear definition yet on bare metal loadbalancer, I think this is the most meaningful information we can provide to users without being misleading, wdyt?
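The behavior described in this reply could be sketched as follows. The `Service` type is a simplified stand-in for the Kubernetes Service, and the function name, reasons, and messages are assumptions for illustration rather than the operator's actual values:

```go
package main

// Service is a simplified stand-in for a Kubernetes Service of type
// LoadBalancer. Ingress holds the external IPs or hostnames reported in
// status.loadBalancer.ingress.
type Service struct {
	Type    string
	Ingress []string
}

// Condition is a simplified stand-in for a status condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// computeLoadBalancerReady reports True once some controller (a cloud provider
// or, on bare metal, something like MetalLB) has assigned an external address.
// Without such a controller the Service stays pending and the result is False.
func computeLoadBalancerReady(svc Service) Condition {
	if svc.Type == "LoadBalancer" && len(svc.Ingress) > 0 {
		return Condition{Type: "LoadBalancerReady", Status: "True",
			Reason: "LoadBalancerProvisioned", Message: "The LoadBalancer service is provisioned."}
	}
	// Reason and message are illustrative only.
	return Condition{Type: "LoadBalancerReady", Status: "False",
		Reason:  "LoadBalancerPending",
		Message: "The LoadBalancer service is pending; no external address has been assigned."}
}
```

The sketch makes no platform check at all: bare metal with MetalLB and a cloud platform take the same path, which is the point made in the reply.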

Member Author

(also it should be considered that this is the "Drawbacks" section, meaning this may be a known drawback)

* Test `ComputeGatewayAPIDNSStatus` wrapper correctly converts internal conditions to Gateway API conditions
* Test `ComputeGatewayAPILoadBalancerStatus` wrapper correctly converts internal conditions to Gateway API conditions
* Test condition computation with DNSManagementPolicy set to Managed vs Unmanaged
* Test ObservedGeneration is correctly set on conditions
Contributor

@candita candita Nov 18, 2025

Maybe also: Test transition times?

* On the same test, verify the condition count is consistent with Istio and Openshift
added conditions
* Create Gateway out of `openshift-ingress` and verify that no Openshift condition is added
* Create Gateway with wrong DNS Domain and verify that Openshift conditions reflect the failure
Contributor

How about:
Create Gateway with multiple listeners, only one of which can get a successful DNS record. I expect the dns ready status to be False.

* No CSI, CRI, or CNI changes are involved

**Compatibility:**
* Feature works with Gateway API v1 (both support custom conditions)
Contributor

nit: What does the "both" refer to here?

Member Author

hallucination, apparently. Let me remove the word

* Negligible impact on API throughput: condition updates happen during normal reconciliation
* No new API calls introduced; only status updates to existing Gateway resources
* Expected number of managed Gateways in `openshift-ingress` namespace: typically 1-10 per cluster
* Condition updates are rate-limited to prevent excessive writes
Contributor

The rate-limiting is automatic, no coding needed, right?

Member Author

yeah, it is part of the maximum reconciliation that we define, and client-go throttling (or it should).

I am not sure where claude got this, so I would be happy to also remove this line if we feel it may be misleading.


**Detecting Issues:**

*Symptom: Gateway conditions show `DNSManaged=False`*
Contributor

This is not necessarily an error condition. Some users may choose to not have DNS be managed.

@candita
Contributor

candita commented Nov 18, 2025

@rikatz Overall I think this looks great. Do you think we should use the generator for other EP/KEP/GEPs?

I have a few nits and questions, and one major question: can you discuss the decision not to propagate condition status up to the ingress controller status? I can see pros and cons for both, but we should document that decision. Thanks.
