NE-1273: Add a watch to the ingress operator so it will recreate the gwapi crds (#1106)
Conversation
@anirudhAgniRedhat: This pull request references NE-1273, which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.
@anirudhAgniRedhat please add an e2e test to this so we can see how it works. I know you are waiting for #1023 to merge first; hopefully it will happen today or tomorrow.
035017f to 5b5ef8a (Compare)
gcs278 left a comment
Thanks for taking on the PR review comments. Just a quick review while I'm here.
// Make sure all the *.gateway.networking.k8s.io CRDs are available since the FeatureGate is enabled and if deleted
// manually.
ensureCRDs(t)
Thanks for taking on the PR review!
Since they were two different efforts/tasks, would you mind splitting the E2E fixes from Candace's PR code review apart from your new E2E changes for NE-1273? It will help with clarity, traceability in the documentation, and future reverts (if ever needed).
Also, in the commit message of the E2E-fixes commit, would you mind adding a link to #1023 to reference the PR whose code review comments are being addressed?
err = kclient.Delete(context.Background(), crd)
if err != nil {
	t.Errorf("failed to delete crd %s: %v", name, err)
	return err
}
I believe this will mark the CRD for deletion, not block, and continue on. Deletion can take a couple of seconds. A potential issue is that the CRD is still available but in the process of being deleted, and ensureCRDs finds it before it is actually gone and passes, because the CRDs still look good even though they are being deleted.
To be safe, I'd consider blocking here. One example of blocking/waiting on deletion is deleteIngressController.
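For illustration, a minimal blocking sketch of that idea, assuming the e2e test's package-level kclient, a reasonably recent apimachinery (for wait.PollUntilContextTimeout), and a hypothetical helper name and poll interval that are not the PR's actual code:

import (
	"context"
	"testing"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	kerrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForCRDDeletion blocks until the named CRD is no longer found, or the
// timeout expires. Hypothetical helper, for illustration only.
func waitForCRDDeletion(t *testing.T, name string, timeout time.Duration) error {
	t.Helper()
	return wait.PollUntilContextTimeout(context.Background(), 1*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		crd := &apiextensionsv1.CustomResourceDefinition{}
		if err := kclient.Get(ctx, types.NamespacedName{Name: name}, crd); err != nil {
			if kerrors.IsNotFound(err) {
				// The CRD is actually gone, not just marked for deletion.
				return true, nil
			}
			t.Logf("failed to get crd %s, retrying: %v", name, err)
			return false, nil
		}
		// Still present (possibly with a deletion timestamp); keep waiting.
		return false, nil
	})
}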
@gcs278 I have added the poll. The IngressController has multiple operands, which means it takes some time to delete; CRDs, however, are deleted quite quickly. When my first poll gets the CRD, it already returns a new CRD with a different UID (the operator has recreated it), so a poll that only waits for the old CRD to disappear gets stuck. To handle that I check the UID:
// if a new CRD got recreated while the poll ensures the old CRD is deleted.
if newCRD != nil && newCRD.UID != crd.UID {
	return true, nil
}
"When my first poll gets the CRD, it already returns a new CRD with a different UID"
Sorry, I might be confused: isn't this a good thing? Not sure why it'd get stuck in this loop.
Looks like the test passed; are you saying it's still not working correctly, or did you figure it out?
The problem I am trying to address here is that if I remove this section:
// if a new CRD got recreated while the poll ensures the old CRD is deleted.
if newCRD != nil && newCRD.UID != crd.UID {
	return true, nil
}
then the reconcile loop recreates the CRD again pretty quickly, and by the time I try to get the CRD in the poll it has already been recreated. In that case the kerrors.IsNotFound(err) condition is never satisfied and the test ends up failing with a context deadline exceeded.
That is why I validate against the old CRD's UID.
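Put together, a hedged sketch of that delete-and-wait flow (same assumed imports and package-level kclient as the sketch further up; the helper name and poll interval are illustrative, not necessarily the PR's exact code):

// deleteCRDAndWait deletes the given CRD and blocks until it is either gone or
// has been recreated by the operator with a different UID. Hypothetical helper.
func deleteCRDAndWait(t *testing.T, crd *apiextensionsv1.CustomResourceDefinition, timeout time.Duration) error {
	t.Helper()
	if err := kclient.Delete(context.Background(), crd); err != nil {
		t.Errorf("failed to delete crd %s: %v", crd.Name, err)
		return err
	}
	return wait.PollUntilContextTimeout(context.Background(), 1*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		newCRD := &apiextensionsv1.CustomResourceDefinition{}
		if err := kclient.Get(ctx, types.NamespacedName{Name: crd.Name}, newCRD); err != nil {
			if kerrors.IsNotFound(err) {
				// The old CRD is gone and the operator has not recreated it yet.
				return true, nil
			}
			t.Logf("failed to get crd %s, retrying: %v", crd.Name, err)
			return false, nil
		}
		// If a new CRD got recreated while the poll ensures the old CRD is
		// deleted, its UID differs from the one we deleted, so deletion clearly
		// completed and the operator has already reconciled it back.
		if newCRD.UID != crd.UID {
			return true, nil
		}
		// Same object as before; deletion has not finished yet.
		return false, nil
	})
}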
Oh yes, that sounds very reasonable, better to be precise to avoid race conditions.
5b5ef8a to a53644f (Compare)
gcs278 left a comment
Thanks for the responses, looking good.
config: config,
client: mgr.GetClient(),
config: config,
recorder: mgr.GetEventRecorderFor(controllerName),
I see we pass recorder in now, but it is never used. Does it do anything?
// watch for CRDs
if err = c.Watch(source.Kind(operatorCache, &apiextensionsv1.CustomResourceDefinition{}), &handler.EnqueueRequestForObject{}, predicate.Funcs{
	CreateFunc: func(e event.CreateEvent) bool { return false },
	DeleteFunc: func(e event.DeleteEvent) bool { return true },
So this reconciles this control loop whenever any object informed by the operatorCache is deleted. I suppose that's not too bad, but generally I think we try to be as precise as possible, to avoid the growing pains that come with unnecessary operator reconcile cycles. You never know what crazy deletes/creates/updates a cluster at scale might be doing.
Have you thought about using the managedCRDs object and filtering the DeleteEvent to only the CRDs we care about?
@gcs278 The managedCRDs function already handles the case where the CRDs are already present: it will not do anything.
However, I added it only for delete events because that is all that is required in this case; I don't want to handle any extra events that aren't required at the moment.
"@gcs278 The managedCRDs function already handles the case where the CRDs are already present: it will not do anything."
Correct, ensureGatewayAPICRDs won't do anything incorrect if it gets extra reconciles; everything will work fine, but it does more work than it needs to and logs extra reconcile loops for CRDs other than the Gateway API ones. E.g. when an IngressController is deleted in the openshift-ingress-operator namespace, this logic will run reconcile, but we know deleting an IngressController won't cause any impact.
I'm suggesting being more specific: instead of watching delete events for all CRDs, watch only the managedCRDs we care about. Maybe this suggestion will help make sense of it (I haven't tested it):
for i := range managedCRDs {
if err = c.Watch(source.Kind(operatorCache, managedCRDs[i]), &handler.EnqueueRequestForObject{}, predicate.Funcs{
CreateFunc: func(e event.CreateEvent) bool { return false },
DeleteFunc: func(e event.DeleteEvent) bool { return true },
UpdateFunc: func(e event.UpdateEvent) bool { return false },
GenericFunc: func(e event.GenericEvent) bool { return false },
}); err != nil {
return nil, err
}
}
a53644f to fb2c815 (Compare)
fb2c815 to 53c6aee (Compare)
}

// watch for CRDs
if err = c.Watch(source.Kind(operatorCache, &apiextensionsv1.CustomResourceDefinition{}), &handler.EnqueueRequestForObject{}, predicate.Funcs{
One thing I just realized, and that I've gotten pinged on in the past, is ensuring the work queue is "homogeneous", meaning you always reconcile the same type of object. Here's an example discussion: #1014 (comment)
If you look at the Go docs for Reconcile, it's expecting a FeatureGate.
You can do this easily with a pretty simple EnqueueRequestsFromMapFunc that just returns the feature gate.
I know that this doesn't affect functionality at all, but it helps maintain consistency and standards.
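For reference, a hedged sketch of that shape, reusing c and operatorCache from the quoted snippet; it assumes controller-runtime v0.15+ signatures (a MapFunc that takes a context) and that this controller reconciles the cluster-scoped FeatureGate, which is named "cluster" in OpenShift. The wiring below is illustrative, not necessarily the exact code the comment links to:

// Map every CRD delete event back onto the single FeatureGate object this
// controller reconciles, so the work queue stays homogeneous.
toFeatureGate := handler.EnqueueRequestsFromMapFunc(func(ctx context.Context, o client.Object) []reconcile.Request {
	return []reconcile.Request{{NamespacedName: types.NamespacedName{Name: "cluster"}}}
})

// watch for CRDs, but enqueue the FeatureGate rather than the CRD itself.
if err = c.Watch(source.Kind(operatorCache, &apiextensionsv1.CustomResourceDefinition{}), toFeatureGate, predicate.Funcs{
	CreateFunc:  func(e event.CreateEvent) bool { return false },
	DeleteFunc:  func(e event.DeleteEvent) bool { return true },
	UpdateFunc:  func(e event.UpdateEvent) bool { return false },
	GenericFunc: func(e event.GenericEvent) bool { return false },
}); err != nil {
	return nil, err
}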
Updated as per your suggestion!
53c6aee to 3ef4a2e (Compare)
…gwapi crds and E2E tests
3ef4a2e to e9c4200 (Compare)
@rfredette @gcs278 Added the suggested changes!
Looks good to me! Thanks for the code review responses.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: gcs278.
@anirudhAgniRedhat: all tests passed! Full PR test history. Your PR dashboard.
[ART PR BUILD NOTIFIER] Distgit: ose-cluster-ingress-operator
Added watch for Gateway API CRDs to recreate CRDs if they get deleted