
NE-1273: Add a watch to the ingress operator so it will recreate the gwapi crds #1106

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:master from anirudhAgniRedhat:gatewayCrdReconcile
Jul 31, 2024

Conversation

@anirudhAgniRedhat
Contributor

@anirudhAgniRedhat anirudhAgniRedhat commented Jul 12, 2024

Added a watch for the Gateway API CRDs so the operator recreates them if they get deleted

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2024
@openshift-ci openshift-ci bot requested review from Thealisyed and gcs278 July 12, 2024 12:11
@anirudhAgniRedhat anirudhAgniRedhat changed the title from "[WIP] Added Gateway CRD Reconcile" to "NE-1273: Add a watch to the ingress operator so it will recreate the gwapi crds" Jul 12, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 12, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jul 12, 2024

@anirudhAgniRedhat: This pull request references NE-1273 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.


@candita
Contributor

candita commented Jul 23, 2024

@anirudhAgniRedhat please add an e2e test to this so we can see how it works. I know you are waiting for #1023 to merge first; hopefully that will happen today or tomorrow.

Contributor

@gcs278 gcs278 left a comment


Thanks for taking on the PR review comments. Just a quick review while I'm here.


// Make sure all the *.gateway.networking.k8s.io CRDs are available since the FeatureGate is enabled,
// and that they are recreated if deleted manually.
ensureCRDs(t)
Contributor


Thanks for taking on the PR review!

Since they were two different efforts/tasks, would you mind breaking apart the E2E fixes from Candace's PR code review and your new additional E2E changes for NE-1273? It will help with clarity in documentation, traceability, and future reverts (if ever needed).

And in your commit message for the E2E fixes commit, would you mind adding a link to #1023 so it references the PR whose code review changes are being addressed?

Comment on lines +130 to +129
err = kclient.Delete(context.Background(), crd)
if err != nil {
t.Errorf("failed to delete crd %s: %v", name, err)
return err
}
Contributor


I believe this will mark the CRD for deletion, not block, and continue on. Deletion can take a couple of seconds. A potential issue is that the CRD is still available but in the process of being deleted, and the ensureCRDs function finds it before it is actually gone and passes because it still looks good even though it is being deleted.

To be safe, I'd consider blocking here. One example of blocking or waiting for deletion is deleteIngressController.
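
For illustration, a minimal sketch of the blocking approach being suggested, assuming the test's kclient plus the apimachinery wait, kerrors, and types packages (the names here are illustrative, not the PR's actual code); note that a bare IsNotFound check can race with the operator recreating the CRD, which comes up below:

// Sketch: block until the CRD named crdName is actually gone.
if err := wait.PollImmediate(1*time.Second, 1*time.Minute, func() (bool, error) {
	freshCRD := &apiextensionsv1.CustomResourceDefinition{}
	err := kclient.Get(context.Background(), types.NamespacedName{Name: crdName}, freshCRD)
	if kerrors.IsNotFound(err) {
		// Deletion has completed.
		return true, nil
	}
	// Still present (or already recreated), or a transient error; keep polling.
	return false, nil
}); err != nil {
	t.Fatalf("timed out waiting for crd %s to be deleted: %v", crdName, err)
}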

Contributor Author


@gcs278 I have added the poll. The IngressController takes some time to delete because it has multiple operands, but CRDs are deleted quite quickly. By the time my first poll runs, the Get already returns a new CRD with a different UID, so without the UID check below the poll would end up stuck waiting for the old CRD to disappear.

// if new CRD got recreated while the poll ensures the CRD is deleted.
if newCRD != nil && newCRD.UID != crd.UID {
	return true, nil
}

Contributor


When I receive my first poll CRD, it gives me a new CRD with a different UUID

Sorry, I might be confused, isn't this a good thing? Not sure why it'd get stuck in this loop.

Looks like the test passed, are you saying it's not working correctly still or did you figure it out?

Contributor Author


The problem I am trying to address here is that if I remove this code section:

// if new CRD got recreated while the poll ensures the CRD is deleted.
if newCRD != nil && newCRD.UID != crd.UID {
	return true, nil
}

then the reconcile loop recreates the CRD so quickly that by the time I try to get the CRD in the poll it has already been recreated. In that case the if kerrors.IsNotFound(err) condition is never satisfied and the test ends up failing with a context deadline exceeded error.
That is why I validate against the old CRD's UID.
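
Put together, a rough sketch of the delete-and-wait with the UID guard described above (illustrative only, assuming crd still holds the object that was deleted; not necessarily the exact code in this PR):

if err := kclient.Delete(context.Background(), crd); err != nil {
	t.Fatalf("failed to delete crd %s: %v", crd.Name, err)
}
if err := wait.PollImmediate(1*time.Second, 1*time.Minute, func() (bool, error) {
	newCRD := &apiextensionsv1.CustomResourceDefinition{}
	if err := kclient.Get(context.Background(), types.NamespacedName{Name: crd.Name}, newCRD); err != nil {
		if kerrors.IsNotFound(err) {
			// The old CRD is gone and has not been recreated yet.
			return true, nil
		}
		// Transient error; retry.
		return false, nil
	}
	// The operator already recreated the CRD; its UID differs from the old one.
	if newCRD.UID != crd.UID {
		return true, nil
	}
	// Same UID means the old CRD is still terminating; keep waiting.
	return false, nil
}); err != nil {
	t.Fatalf("crd %s was neither deleted nor recreated in time: %v", crd.Name, err)
}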

Contributor


Oh yes, that sounds very reasonable, better to be precise to avoid race conditions.

Contributor

@gcs278 gcs278 left a comment


Thanks for the responses, looking good.

config: config,
client: mgr.GetClient(),
config: config,
recorder: mgr.GetEventRecorderFor(controllerName),
Contributor


I see we pass recorder in now, but it's never used. Does it do anything?
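
For context, if the recorder were actually used, it would typically emit Kubernetes events from the reconcile loop along these lines (purely illustrative; the reason and message are made up, and corev1 is k8s.io/api/core/v1):

// Hypothetical use of the event recorder inside Reconcile.
r.recorder.Eventf(crd, corev1.EventTypeNormal, "EnsuredGatewayAPICRD", "ensured CRD %s exists", crd.Name)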

// watch for CRDs
if err = c.Watch(source.Kind(operatorCache, &apiextensionsv1.CustomResourceDefinition{}), &handler.EnqueueRequestForObject{}, predicate.Funcs{
CreateFunc: func(e event.CreateEvent) bool { return false },
DeleteFunc: func(e event.DeleteEvent) bool { return true },
Contributor


So this reconciles this control loop whenever any object informed by the operatorCache is deleted. I suppose that's not too bad, but generally I think we try to be as precise as possible, to avoid the growing pains that come with unnecessary Operator reconcile cycles. You never know what crazy deletes/creates/updates a cluster at scale might be doing.

Have you thought about using the managedCRDs object and filtering the DeleteEvent by only when the CRDs we care about are deleted?

Contributor Author


@gcs278 The managedCRDs function already handles the case where the CRDs are already present; it will not do anything.
However, I added it only for delete events because that is all that is required in this case. I don't want to add any extra events if they are not needed at this moment.

Contributor


@gcs278 The managedCRDs function already handles the case where the CRDs are already present; it will not do anything.

Correct, ensureGatewayAPICRDs won't do anything incorrectly if it gets extra reconciles; everything will work fine, but it's doing more work than it needs to and logging extra reconcile loops for other-than-gateway-api CRDs. For example, when an IngressController is deleted in the openshift-ingress-operator namespace, this logic will run reconcile, but we know deleting an IngressController won't cause any impact.

I'm suggesting being more specific: instead of watching delete events for all CRDs, watch only the managedCRDs we care about. Maybe this suggestion will help it make sense (I haven't tested it):

	for i := range managedCRDs {
		if err = c.Watch(source.Kind(operatorCache, managedCRDs[i]), &handler.EnqueueRequestForObject{}, predicate.Funcs{
			CreateFunc:  func(e event.CreateEvent) bool { return false },
			DeleteFunc:  func(e event.DeleteEvent) bool { return true },
			UpdateFunc:  func(e event.UpdateEvent) bool { return false },
			GenericFunc: func(e event.GenericEvent) bool { return false },
		}); err != nil {
			return nil, err
		}
	}

}

// watch for CRDs
if err = c.Watch(source.Kind(operatorCache, &apiextensionsv1.CustomResourceDefinition{}), &handler.EnqueueRequestForObject{}, predicate.Funcs{
Contributor


One thing I just realized, and I've gotten pinged on in the past, is ensuring the work queue is "homogeneous". That means you always reconcile the same type of object. Here's an example discussion: #1014 (comment)

If you see the Go docs for Reconcile, it's expecting a FeatureGate:

// Reconcile expects request to refer to a FeatureGate and creates or
// reconciles the Gateway API CRDs.

You can do this easily with a pretty simple EnqueueRequestsFromMapFunc, like the one here, that just returns the feature gate:

handler.EnqueueRequestsFromMapFunc(toDefaultIngressController),

I know that this doesn't affect functionality at all, but it helps maintain consistency and standards.
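
For illustration, a minimal sketch of such a map function, assuming the controller reconciles the cluster-scoped FeatureGate named "cluster" and a controller-runtime version whose handler.MapFunc takes a context (the function name and wiring are illustrative, not necessarily the merged code):

// Map every watched CRD event to one reconcile request for the "cluster"
// FeatureGate so the work queue stays homogeneous.
func toClusterFeatureGate(ctx context.Context, o client.Object) []reconcile.Request {
	return []reconcile.Request{{NamespacedName: types.NamespacedName{Name: "cluster"}}}
}

// Wired into the watch in place of &handler.EnqueueRequestForObject{}:
//   handler.EnqueueRequestsFromMapFunc(toClusterFeatureGate),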

Contributor Author


Updated as per your suggestion!

@anirudhAgniRedhat
Contributor Author

@rfredette @gcs278 Added the suggested changes!
Do you have any more suggestions on this? If not, can I get a tag on this?

@gcs278
Contributor

gcs278 commented Jul 30, 2024

Looks good to me! Thanks for the code review responses.
Since this is a minimal change and it's still in tech preview, I'll label myself.
/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 30, 2024
@openshift-ci
Contributor

openshift-ci bot commented Jul 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gcs278

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 30, 2024
@openshift-ci
Contributor

openshift-ci bot commented Jul 31, 2024

@anirudhAgniRedhat: all tests passed!


@openshift-merge-bot openshift-merge-bot bot merged commit f3e48bc into openshift:master Jul 31, 2024
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.17.0-202407310342.p0.gf3e48bc.assembly.stream.el9.
All builds following this will include this PR.
