[NE-2183] Implement GatewayAPI status controller #1294
/retest
Add a test with 2 gateways that conflict on the DNSRecord to see how they behave.
/assign @bentito @davidesalerno @alebedev87
[APPROVALNOTIFIER] This PR is NOT APPROVED.
Tests weren't failing before; the change made for the empty service name or empty DNSRecord probably broke something. Will check.
Refactor the ingress controller status functions and move them to a new package, so they can be reused by other controllers such as GatewayAPI. Additionally, move the unit test, but keep the original one in the Ingress package to show that this move introduces no breaking change.
This change adds a capability to the ingress status package to generate conditions for Gateway API as well. It converts OperatorCondition to meta.Condition, setting the right observed generation and using the upstream utilities to properly set the condition on an already existing array of conditions.
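As an illustration, here is a minimal sketch of that conversion, assuming the OperatorCondition type from github.com/openshift/api/operator/v1 and the upstream meta.SetStatusCondition helper; the function name setGatewayCondition is hypothetical:

```go
import (
	operatorv1 "github.com/openshift/api/operator/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setGatewayCondition (hypothetical name) converts an OperatorCondition into a
// metav1.Condition and merges it into an existing slice. meta.SetStatusCondition
// preserves LastTransitionTime when the status value has not changed.
func setGatewayCondition(conditions *[]metav1.Condition, c operatorv1.OperatorCondition, observedGeneration int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               c.Type,
		Status:             metav1.ConditionStatus(c.Status), // "True"/"False"/"Unknown" map one-to-one
		Reason:             c.Reason,
		Message:            c.Message,
		ObservedGeneration: observedGeneration,
	})
}
```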
This commit moves common components and constants out of the Gateway controller, preparing them to be reused by additional Gateway API controllers and components instead of repeating constants and common functions.
This commit introduces the new Gateway Status controller. This controller is responsible for adding OpenShift-specific conditions to a Gateway resource managed by CIO in the 'openshift-ingress' namespace. The conditions added are similar to the ones on Ingress resources, allowing admins to detect when a LoadBalancer or a DNSRecord was provisioned and, in case of failure, what went wrong.
The failing tests are not related to the change; I saw the same HyperShift job failing on other PRs.
/retest
@rikatz: The following tests failed.
/assign @Miciah
    return err
    }
    ...
    *t = *list.Items[0].DeepCopy()
Here, and in the case below: should we at least order this list before taking the first item, so it's deterministic which one we take?
AFAIR the Kubernetes list is already deterministic, ordered alphabetically. As this function is used to get "at least one" resource, filtering by a label with the gateway name and in the openshift-ingress namespace (which is limited), I can see a long-term case where 2 resources will be returned.
Otherwise, what's your suggestion for this ordering, other than by name? creationTimestamp?
WDYT?
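For reference, a deterministic pick along the lines discussed here could sort by creationTimestamp with name as a tie-breaker. This is only a sketch of the idea, not the PR's code; the helper name firstService is hypothetical:

```go
import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// firstService (hypothetical helper) returns a stable "first" Service from a
// list: oldest creationTimestamp wins, with name as the tie-breaker.
func firstService(list *corev1.ServiceList) *corev1.Service {
	if len(list.Items) == 0 {
		return nil // caller treats "nothing yet" separately
	}
	sort.Slice(list.Items, func(i, j int) bool {
		a, b := &list.Items[i], &list.Items[j]
		if !a.CreationTimestamp.Equal(&b.CreationTimestamp) {
			return a.CreationTimestamp.Before(&b.CreationTimestamp)
		}
		return a.Name < b.Name
	})
	return &list.Items[0]
}
```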
    var errs []error
    ...
    childSvc := &corev1.Service{}
    err := fetchFirstMatchingFromGateway(ctx, r.cache, childSvc, gateway)
We log the error when the child Service lookup fails, but we still fall through, compute conditions with a nil child, and return nil from Reconcile. That treats the reconcile as a success even if the list/get call failed or the Service simply has not been created yet, so we never requeue and may publish temporary “ServiceNotFound” status from a transient read glitch. Please consider appending the error (or returning a short requeue) when we miss either child so we retry once the operands exist, and update the missing loadbalancer and dnsrecord unit test accordingly.
Right, this IMO is expected, and let me explain why:
- Both calls will try to fetch the first matching resource. As an example, when you create a Gateway, you may or may not have the Service and DNSRecord created yet. If they are not created, this is an error, but the Gateway creation is still in progress; these conditions are "desired". The only other error that can happen here is if you pass an unsupported type (like &corev1.Secret), but this is a private and controlled function, so it is more of a defensive measure added to the util function.
- The resource instantiated here (childSvc) will indeed be passed to the computation as nil, but in that case the approach is the same as the existing one in the "ingress/status" controller; here we are keeping consistency. Re-adding it to the queue (e.g. returning an error) may end up in a loop.
- We do watch the Service and DNSRecord as part of "what should reconcile this" (in controller.go line 120), so any change to these resources (e.g. they now exist) will trigger a new reconciliation.

One thing that can go wrong here, and that I can probably take care of, is:
- If the list is empty, it is not a logic error, and we should just wait for the watcher to notice the resource appearing in the cache.
- If the error from the client in fetchFirstMatchingFromGateway is something else, then it is a real error: add it to the errs array, so a reconciliation will happen in case of a problem here.

So an empty list is not an error; a problem with the Kubernetes client is. We continue the reconciliation to keep the same logic as the ingress resource (keep computing the status), but return an error so it would be reconciled immediately. (See the sketch below.)
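A sketch of that split, assuming the fetchFirstMatchingFromGateway shape visible in the diff; the errNotFound sentinel and the gateway-name label key are assumptions for illustration:

```go
import (
	"context"
	"errors"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// errNotFound distinguishes "no operand yet" (wait for the watch) from a real
// client failure (append to errs so the reconcile is retried immediately).
var errNotFound = errors.New("no matching resource found yet")

func fetchFirstMatchingFromGateway(ctx context.Context, reader client.Reader, obj client.Object, gw *gatewayapiv1.Gateway) error {
	switch t := obj.(type) {
	case *corev1.Service:
		list := &corev1.ServiceList{}
		if err := reader.List(ctx, list,
			client.InNamespace(gw.Namespace),
			client.MatchingLabels{"gateway.networking.k8s.io/gateway-name": gw.Name}, // assumed label
		); err != nil {
			return err // real client error: caller adds it to errs
		}
		if len(list.Items) == 0 {
			return errNotFound // not a logic error: the Service watch will retrigger us
		}
		*t = *list.Items[0].DeepCopy()
		return nil
	default:
		// Defensive: this is a private function, so this only guards programmer error.
		return fmt.Errorf("unsupported type %T", obj)
	}
}
```

The caller would then append the error to errs only when !errors.Is(err, errNotFound), and still compute conditions with the possibly-empty child either way.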
    gateway.Status.Conditions = make([]metav1.Condition, 0)
    }
    ...
    // WARNING: one thing to be aware of is that conditions on the Gateway resource are limited to 8: https://github.com/kubernetes-sigs/gateway-api/blob/a8fe5c8732a37ef471d86afaf570ff8ad0ef0221/apis/v1/gateway_types.go#L691
SetStatusCondition happily appends new condition types, so if the Gateway already has eight conditions from other controllers, adding our four will push the list over the API's limit and the status patch will be rejected. Could we guard against this before patching? Also, we could save on condition count by de-duping by type and dropping the least useful of ours if we'd exceed eight.
"de-duping by type and dropping the least useful of ours if we'd exceed eight?"
The idea is to keep consistency with the existing conditions on Ingress status. The big issue here is deciding what is important or not.
As an example, Istio may decide that some newer conditions should be added (ListenerSet? Policies? etc.), and in that case limiting to 8 conditions here and getting an error would be problematic.
What I am doing right now instead is having an e2e test that ensures we will always have 6 conditions (https://github.com/openshift/cluster-ingress-operator/pull/1294/files#diff-0477a883800b78b4c8704dd53de74ed35c6a97a2c809d1581ceae78e10e9094fR786).
I would defer the decision about dropping conditions from the ingress controller to @Miciah.
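For reference, the guard the reviewer suggests could be as small as the following sketch; the helper name mergeGatewayConditions is hypothetical, and maxGatewayConditions mirrors the MaxItems=8 validation linked in the diff:

```go
import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// mergeGatewayConditions (hypothetical helper) merges the desired conditions
// into a copy and refuses to exceed the CRD's 8-condition limit, instead of
// letting the API server reject the status patch.
func mergeGatewayConditions(existing, desired []metav1.Condition) ([]metav1.Condition, error) {
	const maxGatewayConditions = 8 // MaxItems validation on Gateway status conditions
	merged := append([]metav1.Condition{}, existing...)
	for _, c := range desired {
		meta.SetStatusCondition(&merged, c)
	}
	if len(merged) > maxGatewayConditions {
		return nil, fmt.Errorf("%d conditions would exceed the API limit of %d", len(merged), maxGatewayConditions)
	}
	return merged, nil
}
```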
    return reconcile.Result{}, fmt.Errorf("failed to get infrastructure 'cluster': %v", err)
    }
    ...
    operandEvents := &corev1.EventList{}
This could be a long list; can we filter by the involved object UID(s) for the Service/DNSRecord so we only examine events that we might surface in the status?
Yeah, makes sense. IIRC I did this but removed it for some reason (I think the unit test suite doesn't support field selectors...).
I will try once more and see how I can adjust the unit test for it. (Something like the sketch below.)
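A sketch of that narrowing, assuming a controller-runtime cache: a cache-backed List only honors a field selector if the field was indexed first, which is also why the fake client used in unit tests needs the same index (registered via fake.NewClientBuilder().WithIndex(...)). The helper names are hypothetical:

```go
import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// registerEventIndex indexes Events by the UID of their involved object, so a
// cache-backed List can filter on that field.
func registerEventIndex(mgr ctrl.Manager) error {
	return mgr.GetFieldIndexer().IndexField(context.Background(), &corev1.Event{},
		"involvedObject.uid",
		func(o client.Object) []string {
			return []string{string(o.(*corev1.Event).InvolvedObject.UID)}
		})
}

// eventsFor lists only the events that reference the given child Service,
// instead of every event in the operand namespace.
func eventsFor(ctx context.Context, reader client.Reader, svc *corev1.Service) (*corev1.EventList, error) {
	operandEvents := &corev1.EventList{}
	if err := reader.List(ctx, operandEvents,
		client.InNamespace(svc.Namespace),
		client.MatchingFields{"involvedObject.uid": string(svc.UID)},
	); err != nil {
		return nil, fmt.Errorf("failed to list events: %w", err)
	}
	return operandEvents, nil
}
```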
    // GatewayHasOurController returns a function that will use the provided logger and
    // clients, receive an object, and return a boolean that represents whether the provided
    // object is a Gateway managed by our Gateway Class.
    func GatewayHasOurController(logger logr.Logger, crclient client.Reader) func(o client.Object) bool {
I think it could be useful to have some basic unit tests, WDYT?
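A minimal sketch of such a test, assuming the predicate looks up the Gateway's GatewayClass and compares controller names; the class name, controller name, and scheme wiring are all assumptions:

```go
import (
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

func TestGatewayHasOurController(t *testing.T) {
	scheme := runtime.NewScheme()
	if err := gatewayapiv1.AddToScheme(scheme); err != nil {
		t.Fatal(err)
	}

	gc := &gatewayapiv1.GatewayClass{
		ObjectMeta: metav1.ObjectMeta{Name: "openshift-default"}, // assumed class name
		Spec: gatewayapiv1.GatewayClassSpec{
			ControllerName: "openshift.io/gateway-controller", // assumed controller name
		},
	}
	gw := &gatewayapiv1.Gateway{
		ObjectMeta: metav1.ObjectMeta{Name: "example", Namespace: "openshift-ingress"},
		Spec:       gatewayapiv1.GatewaySpec{GatewayClassName: "openshift-default"},
	}

	c := fake.NewClientBuilder().WithScheme(scheme).WithObjects(gc).Build()
	pred := GatewayHasOurController(zap.New(), c)
	if !pred(gw) {
		t.Errorf("expected gateway %q to be recognized as ours", gw.Name)
	}
}
```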
This change implements a new controller that watches Gateway resources (and related resources) in the openshift-ingress namespace and kicks off a new reconciliation whenever any of the resources of interest change.
The reconciliation process then adds additional status conditions to the Gateway resource, reflecting the current state of infrastructure resources like the DNSRecord and load balancer, allowing the owner of a Gateway resource to understand (or at least get initial insights into) why a load balancer or a DNSRecord is not working correctly.
The conditions are the same as those added to the ingresscontroller status. (A sketch of the watch wiring follows.)
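To make the watch relationship concrete, here is a hedged sketch of how such wiring typically looks with controller-runtime's builder; the reconciler type, r.log, and the gatewayForOperand mapping function are assumptions, not the PR's exact code:

```go
import (
	iov1 "github.com/openshift/api/operatoringress/v1"
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	gatewayapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

func (r *reconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Only reconcile Gateways that belong to our GatewayClass.
	ourGateway := predicate.NewPredicateFuncs(GatewayHasOurController(r.log, mgr.GetClient()))
	return ctrl.NewControllerManagedBy(mgr).
		For(&gatewayapiv1.Gateway{}, builder.WithPredicates(ourGateway)).
		// Re-enqueue the owning Gateway when its child Service or DNSRecord
		// changes, so status converges once the operands exist.
		// r.gatewayForOperand is an assumed handler.MapFunc that maps an
		// operand back to reconcile.Requests for its Gateway.
		Watches(&corev1.Service{}, handler.EnqueueRequestsFromMapFunc(r.gatewayForOperand)).
		Watches(&iov1.DNSRecord{}, handler.EnqueueRequestsFromMapFunc(r.gatewayForOperand)).
		Complete(r)
}
```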