
NE-1273: Add a watch to the ingress operator so it will recreate the gwapi crds #1106

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:master from anirudhAgniRedhat:gatewayCrdReconcile
Jul 31, 2024

Conversation

@anirudhAgniRedhat
Contributor

@anirudhAgniRedhat anirudhAgniRedhat commented Jul 12, 2024

Added a watch for the Gateway API CRDs so the operator recreates them if they get deleted

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2024
@openshift-ci openshift-ci bot requested review from Thealisyed and gcs278 July 12, 2024 12:11
@anirudhAgniRedhat anirudhAgniRedhat changed the title from "[WIP] Added Gateway CRD Reconcile" to "NE-1273: Add a watch to the ingress operator so it will recreate the gwapi crds" Jul 12, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 12, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jul 12, 2024

@anirudhAgniRedhat: This pull request references NE-1273 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.


@candita
Contributor

candita commented Jul 23, 2024

@anirudhAgniRedhat please add an e2e test to this so we can see how it works. I know you are waiting for #1023 to merge first; hopefully that will happen today or tomorrow.

Contributor

@gcs278 gcs278 left a comment


Thanks for taking on the PR review comments. Just a quick review while I'm here.


// Make sure all the *.gateway.networking.k8s.io CRDs are available since the FeatureGate is enabled,
// and that they are recreated if deleted manually.
ensureCRDs(t)
Contributor


Thanks for taking on the PR review!

Since they were two different efforts/tasks, would you mind breaking apart the E2E fixes from Candace's PR code review and your new additional E2E changes for NE-1273? It will help with clarity in documentation, traceability, and future reverts (if ever needed).

And in your commit message for the E2E fixes commit, would you mind adding a link to #1023 so it references the PR whose code review changes are being addressed?

Comment on lines +130 to +129
err = kclient.Delete(context.Background(), crd)
if err != nil {
t.Errorf("failed to delete crd %s: %v", name, err)
return err
}
Contributor


I believe this will mark the CRD for deletion, not block, and continue on. Deletion can take a couple of seconds. A potential issue is that the CRD is still available but in the process of being deleted, and the ensureCRDs function finds it before it is actually gone and passes because it still looks good even though it is being deleted.

To be safe, I'd consider blocking here. One example of blocking or waiting for deletion is deleteIngressController.
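
For illustration, a minimal sketch of the blocking approach being suggested, assuming the test's kclient plus the apimachinery wait, kerrors, and types packages (the names here are illustrative, not the PR's actual code); note that a bare IsNotFound check can race with the operator recreating the CRD, which comes up below:

// Sketch: block until the CRD named crdName is actually gone.
if err := wait.PollImmediate(1*time.Second, 1*time.Minute, func() (bool, error) {
	freshCRD := &apiextensionsv1.CustomResourceDefinition{}
	err := kclient.Get(context.Background(), types.NamespacedName{Name: crdName}, freshCRD)
	if kerrors.IsNotFound(err) {
		// Deletion has completed.
		return true, nil
	}
	// Still present (or already recreated), or a transient error; keep polling.
	return false, nil
}); err != nil {
	t.Fatalf("timed out waiting for crd %s to be deleted: %v", crdName, err)
}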

Contributor Author


@gcs278 I have added the poll. The IngressController takes some time to delete because it has multiple operands, but CRDs are deleted quite quickly. By the time my first poll runs, the Get already returns a new CRD with a different UID, so without the UID check below the poll would end up stuck waiting for the old CRD to disappear.

// if new CRD got recreated while the poll ensures the CRD is deleted.
if newCRD != nil && newCRD.UID != crd.UID {
	return true, nil
}

Contributor


When I receive my first poll CRD, it gives me a new CRD with a different UUID

Sorry, I might be confused, isn't this a good thing? Not sure why it'd get stuck in this loop.

Looks like the test passed, are you saying it's not working correctly still or did you figure it out?

Contributor Author


The problem I am trying to address here is that if I remove this code section:

// if new CRD got recreated while the poll ensures the CRD is deleted.
if newCRD != nil && newCRD.UID != crd.UID {
	return true, nil
}

then the reconcile loop recreates the CRD so quickly that by the time I try to get the CRD in the poll it has already been recreated. In that case the if kerrors.IsNotFound(err) condition is never satisfied and the test ends up failing with a context deadline exceeded error.
That is why I validate against the old CRD's UID.
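
Put together, a rough sketch of the delete-and-wait with the UID guard described above (illustrative only, assuming crd still holds the object that was deleted; not necessarily the exact code in this PR):

if err := kclient.Delete(context.Background(), crd); err != nil {
	t.Fatalf("failed to delete crd %s: %v", crd.Name, err)
}
if err := wait.PollImmediate(1*time.Second, 1*time.Minute, func() (bool, error) {
	newCRD := &apiextensionsv1.CustomResourceDefinition{}
	if err := kclient.Get(context.Background(), types.NamespacedName{Name: crd.Name}, newCRD); err != nil {
		if kerrors.IsNotFound(err) {
			// The old CRD is gone and has not been recreated yet.
			return true, nil
		}
		// Transient error; retry.
		return false, nil
	}
	// The operator already recreated the CRD; its UID differs from the old one.
	if newCRD.UID != crd.UID {
		return true, nil
	}
	// Same UID means the old CRD is still terminating; keep waiting.
	return false, nil
}); err != nil {
	t.Fatalf("crd %s was neither deleted nor recreated in time: %v", crd.Name, err)
}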

Contributor


Oh yes, that sounds very reasonable, better to be precise to avoid race conditions.

Contributor

@gcs278 gcs278 left a comment


Thanks for the responses, looking good.

config: config,
client: mgr.GetClient(),
config: config,
recorder: mgr.GetEventRecorderFor(controllerName),
Contributor


I see we pass recorder in now, but it's never used. Does it do anything?
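
For context, if the recorder were actually used, it would typically emit Kubernetes events from the reconcile loop along these lines (purely illustrative; the reason and message are made up, and corev1 is k8s.io/api/core/v1):

// Hypothetical use of the event recorder inside Reconcile.
r.recorder.Eventf(crd, corev1.EventTypeNormal, "EnsuredGatewayAPICRD", "ensured CRD %s exists", crd.Name)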

// watch for CRDs
if err = c.Watch(source.Kind(operatorCache, &apiextensionsv1.CustomResourceDefinition{}), &handler.EnqueueRequestForObject{}, predicate.Funcs{
CreateFunc: func(e event.CreateEvent) bool { return false },
DeleteFunc: func(e event.DeleteEvent) bool { return true },
Contributor


So this reconciles this control loop whenever any object informed by the operatorCache is deleted. I suppose that's not too bad, but generally I think we try to be as precise as possible, to avoid the growing pains that come with unnecessary Operator reconcile cycles. You never know what crazy deletes/creates/updates a cluster at scale might be doing.

Have you thought about using the managedCRDs object and filtering the DeleteEvent by only when the CRDs we care about are deleted?

Contributor Author


@gcs278 The managedCRDs function already handles the case where the CRDs are already present; it will not do anything.
However, I added it only for delete events because that is all that is required in this case. I don't want to add any extra events if they are not needed at this moment.

Contributor


@gcs278 The managedCRDs function already handles the case where the CRDs are already present; it will not do anything.

Correct, ensureGatewayAPICRDs won't do anything incorrectly if it gets extra reconciles; everything will work fine, but it's doing more work than it needs to and logging extra reconcile loops for other-than-gateway-api CRDs. For example, when an IngressController is deleted in the openshift-ingress-operator namespace, this logic will run reconcile, but we know deleting an IngressController won't cause any impact.

I'm suggesting being more specific: instead of watching delete events for all CRDs, watch only the managedCRDs we care about. Maybe this suggestion will help it make sense (I haven't tested it):

	for i := range managedCRDs {
		if err = c.Watch(source.Kind(operatorCache, managedCRDs[i]), &handler.EnqueueRequestForObject{}, predicate.Funcs{
			CreateFunc:  func(e event.CreateEvent) bool { return false },
			DeleteFunc:  func(e event.DeleteEvent) bool { return true },
			UpdateFunc:  func(e event.UpdateEvent) bool { return false },
			GenericFunc: func(e event.GenericEvent) bool { return false },
		}); err != nil {
			return nil, err
		}
	}

}

// watch for CRDs
if err = c.Watch(source.Kind(operatorCache, &apiextensionsv1.CustomResourceDefinition{}), &handler.EnqueueRequestForObject{}, predicate.Funcs{
Contributor


One thing I just realized, and I've gotten pinged on in the past, is ensuring the work queue is "homogeneous". That means you always reconcile the same type of object. Here's an example discussion: #1014 (comment)

If you see the Go docs for Reconcile, it's expecting a FeatureGate:

// Reconcile expects request to refer to a FeatureGate and creates or
// reconciles the Gateway API CRDs.

You can do this easily with a pretty simple EnqueueRequestsFromMapFunc, like the one here, that just returns the feature gate:

handler.EnqueueRequestsFromMapFunc(toDefaultIngressController),

I know that this doesn't affect functionality at all, but it helps maintain consistency and standards.
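
For illustration, a minimal sketch of such a map function, assuming the controller reconciles the cluster-scoped FeatureGate named "cluster" and a controller-runtime version whose handler.MapFunc takes a context (the function name and wiring are illustrative, not necessarily the merged code):

// Map every watched CRD event to one reconcile request for the "cluster"
// FeatureGate so the work queue stays homogeneous.
func toClusterFeatureGate(ctx context.Context, o client.Object) []reconcile.Request {
	return []reconcile.Request{{NamespacedName: types.NamespacedName{Name: "cluster"}}}
}

// Wired into the watch in place of &handler.EnqueueRequestForObject{}:
//   handler.EnqueueRequestsFromMapFunc(toClusterFeatureGate),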

Contributor Author


Updated as per your suggestion!

@anirudhAgniRedhat
Contributor Author

@rfredette @gcs278 Added the suggested changes!
Do you have any more suggestions on this? If not, can I get a tag on this?

@gcs278
Contributor

gcs278 commented Jul 30, 2024

Looks good to me! Thanks for the code review responses.
Since this is a minimal change and it's still in tech preview, I'll label myself.
/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 30, 2024
@openshift-ci
Contributor

openshift-ci bot commented Jul 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gcs278

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 30, 2024
@openshift-ci
Contributor

openshift-ci bot commented Jul 31, 2024

@anirudhAgniRedhat: all tests passed!


@openshift-merge-bot openshift-merge-bot bot merged commit f3e48bc into openshift:master Jul 31, 2024
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.17.0-202407310342.p0.gf3e48bc.assembly.stream.el9.
All builds following this will include this PR.
