OCPBUGS-78523: gatewayapi_controller: Replace sync.Once with retry for GatewayClass field indexer setup (#1382)
Conversation
Actionable comments posted: 2
📒 Files selected for processing (1)
pkg/operator/controller/gatewayapi/controller.go
```go
if err := r.ensureDependentControllers(ctx); err != nil {
	log.Error(err, "GatewayClass CRD not yet established, will retry")
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
```
Don't collapse every setup failure into the CRD-not-established retry path.
This treats all ensureDependentControllers errors as the expected transient bootstrap case. If IndexField fails for any other reason, the reconcile loop will silently spin every 10 seconds and never surface the real failure.
Possible shape:

```diff
 if err := r.ensureDependentControllers(ctx); err != nil {
-	log.Error(err, "GatewayClass CRD not yet established, will retry")
-	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
+	if isRetryableGatewayClassIndexerError(err) {
+		log.Info("GatewayClass CRD not yet established, will retry", "after", 10*time.Second)
+		return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
+	}
+	return reconcile.Result{}, err
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/controller/gatewayapi/controller.go` around lines 158 - 160, The
current reconcile collapses all errors from ensureDependentControllers into a
CRD-not-established retry path; update the logic so ensureDependentControllers
(and any calls like IndexField) return a distinguishable error (e.g., a sentinel
error var ErrCRDNotEstablished or a typed error) and then in the reconcile
caller check that error specifically: if errors.Is(err, ErrCRDNotEstablished)
then log and requeue after 10s, otherwise return the original err (not a
requeue) so controller-runtime surfaces and backoffs; alternatively use
errors.As to detect a CRD-not-established error type and treat all other errors
as fatal by returning err.
made err more neutral
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Check the current state of the Reconcile function's ensureDependentControllers error handling
rg -n "ensureDependentControllers|RequeueAfter|controllersStarted|isRetryable|ErrCRD|IndexField" pkg/operator/controller/gatewayapi/controller.go
```
@RamLavi, thanks for the update — but the core concern is about logic, not the message wording. The code still collapses every ensureDependentControllers error into the same silent 10-second retry path (return reconcile.Result{RequeueAfter: 10 * time.Second}, nil), regardless of the failure cause.
If IndexField ever fails for a reason other than the CRD not being established yet (permissions, internal indexer state, etc.), the reconciler will spin silently every 10 seconds and never report the real error to controller-runtime or its backoff machinery.
The minimal fix is to distinguish the two cases, for example using a sentinel:

```go
// package-level
var errGatewayClassCRDNotEstablished = errors.New("GatewayClass CRD not yet established")

// in ensureDependentControllers, wrap the IndexField error:
if err := r.fieldIndexer.IndexField(...); err != nil {
	if meta.IsNoMatchError(err) || isCRDNotFoundError(err) {
		return fmt.Errorf("%w: %v", errGatewayClassCRDNotEstablished, err)
	}
	return err // non-transient; surface immediately
}

// in Reconcile:
if err := r.ensureDependentControllers(ctx); err != nil {
	if errors.Is(err, errGatewayClassCRDNotEstablished) {
		log.Info("GatewayClass CRD not yet established, will retry", "after", 10*time.Second)
		return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
	}
	return reconcile.Result{}, err // surface unexpected errors
}
```

This ensures only the expected transient bootstrap case is silently retried; anything else is bubbled up properly.
@coderabbitai while I agree with the overall solution, establishing this pattern right now is not desired in this repo: an error such as a missing permission (which may still be being applied, since CVO and HyperShift apply these asynchronously) could cause a worse situation of exponential backoff. I think in this case requeuing every 10 seconds is acceptable and won't generate much noise in the logs.
@rikatz, that makes sense — given the async nature of CVO and Hypershift applying permissions, a uniform 10-second requeue avoids the risk of exponential backoff on errors that are transiently non-retryable. Understood, I'll stand down on this suggestion.
If I understand correctly, we just won't update status in the clusteroperator resource if the index isn't initialized—right? We would return here before ever updating the clusteroperator status:
(See cluster-ingress-operator/pkg/operator/controller/status/controller.go, lines 248 to 250, at f781f7e.)
Worse, after an upgrade, the clusteroperator status might have stale status conditions reporting that everything is fine. If I have that right, we really need to improve the logic to surface errors such as "Index [...] does not exist" from getOperatorState.
So that's all to say our error handling is already lacking, and I don't think it's too important whether we have a 10-second retry here or whether we have an exponential backoff (if anything, the 10-second retry might actually make the error state more obvious if the cluster-admin does think to check the logs); in any case, we really need to follow up to improve the error reporting.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```go
for i := range r.config.DependentControllers {
	c := &r.config.DependentControllers[i]
	go func() {
		if err := (*c).Start(ctx); err != nil {
			log.Error(err, "cannot start controller")
		}
	}()
}

r.controllersStarted = true
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
# First, let's understand the full context around the startup code
cat -n pkg/operator/controller/gatewayapi/controller.go | head -220 | tail -70

# Check where controllersStarted is used in the file
rg -n "controllersStarted" pkg/operator/controller/gatewayapi/controller.go

# Look for any error handling or retry logic related to controller startup
rg -A 5 -B 5 "controllersStarted\s*=" pkg/operator/controller/gatewayapi/controller.go

# Check the Start method signature and where DependentControllers is defined
rg -n -B 2 -A 2 "DependentControllers" pkg/operator/controller/gatewayapi/controller.go
```
A dependent-controller startup failure is silently swallowed and permanently blocks further initialization.
controllersStarted is set to true immediately after spawning goroutines (line 199), before any Start(ctx) call can complete or report an error. If a controller fails to start, the error is logged but never propagated—the function returns success (nil). Since controllersStarted is now true, the reconciliation loop will skip this block forever (line 172-174), and the operator remains partially initialized with no recovery path. Either wait for all controller starts to complete before setting the flag, or implement a recovery mechanism that allows retrying failed startups.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/controller/gatewayapi/controller.go` around lines 190 - 199, The
current loop launches goroutines for each r.config.DependentControllers element
and sets r.controllersStarted = true immediately, which can permanently hide
Start(ctx) failures; change this to wait for all Start(ctx) calls to complete
and surface any errors before setting r.controllersStarted: iterate over
r.config.DependentControllers, spawn goroutines that call (*c).Start(ctx) but
use a sync.WaitGroup plus an error channel (or start them serially), collect and
return the first error (or aggregate) if any Start(ctx) fails, and only set
r.controllersStarted = true after the WaitGroup completes with no errors so
failed startups can be retried by subsequent reconciliations.
The existing pattern is unchanged by this PR and is outside its scope (which is changing the sync.Once mechanism)
@RamLavi, understood — the goroutine-launch pattern for DependentControllers.Start() predates this PR and is not changed by it. That's a fair scope boundary. Would you like me to open a follow-up issue to track the silent-swallow concern for a future PR?
@RamLavi but the comment is right. If you set the bool to true after this change and a controller fails to start, this will behave like the former sync.Once.
You will end up in a situation where the indexer is added, but if a controller fails to start, you will never try again because controllersStarted is already marked as true.
I think the approach here needs to be something like:
- Establish a sync.Map inside the reconciler struct that records the started controllers; a controller that fails to start is not added.
- On every sync, compare which controllers are started and try to start the missing ones.
@Miciah I don't understand why the indexer is not set up outside of the reconciliation. IIRC the right pattern here (and given we always ship with GatewayClass now) should be: once the controller is starting, it should try to set the indexer or fail outside of the reconciliation loop.
as an example, we should be setting the indexer as part of the manager bootstrap IMO: kubernetes-sigs/cluster-api-ipam-provider-in-cluster@d19ce78/pkg/ipamutil/reconciler.go#L91-L93
@rikatz You raise a fair point about Start failures. In this path, IndexField has already succeeded (so the CRD is registered and cache sync should work), but I agree it could be more robust.
On the broader question - to my "new to this repo" understanding, the indexer is in Reconcile because the GatewayClass CRD may not exist when New() runs, so this controller creates it in ensureGatewayAPICRDs().
Moving the indexer to bootstrap would change that assumption. @Miciah what do you think about @rikatz 's suggestion?
yes, that's right. The indexer exists here because otherwise the CRD may not be available. For now, let's keep it as is; we have a task for 4.23 to fix the controllers' startup and indexer startup.
FTR, the story for the refactoring: redhat.atlassian.net/browse/NE-1986
I will try to take care of it in 4.23.
```go
// ensureDependentControllers indexes GatewayClass resources and starts
// dependent controllers exactly once.
func (r *reconciler) ensureDependentControllers(ctx context.Context) error {
	r.mu.Lock()
	defer r.mu.Unlock()

	if r.controllersStarted {
		return nil
	}

	// Index gateway classes based on their spec.controllerName
	if err := r.fieldIndexer.IndexField(
		context.Background(),
		&gatewayapiv1.GatewayClass{},
		operatorcontroller.GatewayClassIndexFieldName,
		client.IndexerFunc(func(o client.Object) []string {
			gatewayclass, ok := o.(*gatewayapiv1.GatewayClass)
			if !ok {
				return []string{}
			}
			return []string{string(gatewayclass.Spec.ControllerName)}
		})); err != nil {
		return fmt.Errorf("failed to add field indexer: %w", err)
	}

	for i := range r.config.DependentControllers {
		c := &r.config.DependentControllers[i]
		go func() {
			if err := (*c).Start(ctx); err != nil {
				log.Error(err, "cannot start controller")
			}
		}()
	}

	r.controllersStarted = true
	return nil
}
```
Just a suggestion—it would be helpful to reviewers or people looking at the change history in the future if you put the refactoring in a separate commit from the logic change.
I think this change is good. CodeRabbit had a similar suggestion here: #1326 (comment). As I mentioned in #1381, we should have an OCPBUGS ticket. Then the PR title and the commit message should both reference that ticket.

@RamLavi can you please open a Jira bug for this issue when possible?
Extract the inline sync.Once block from Reconcile into a dedicated ensureDependentControllers method. No functional change. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ram Lavi <ralavi@redhat.com>
Force-pushed 27c49cf to d8f424e.
Change: split commits into a pure refactor and the code change; made the error message more neutral.
Force-pushed d8f424e to 3c97e0f.
@Miciah PTAL

/assign
```go
r.mu.Lock()
defer r.mu.Unlock()

if r.controllersStarted {
```
nit: given that once the controller is started this will always be true, you can probably put this check before locking the mutex
good point, but keeping it by the book is tidier. I prefer it as is, if you don't object of course.
Well, why do we need the mutex at all? Reconcile doesn't need to be re-entrant, does it? I feel like I'm missing something.
Either we need the mutex for reads and writes, or we don't need the mutex at all.
@Miciah, the mutex is not strictly needed here.
controller.New is called without setting MaxConcurrentReconciles, so it defaults to 1. Controller-runtime processes reconcile requests sequentially from the work queue, meaning Reconcile is never called concurrently. Since ensureDependentControllers is only called from Reconcile, there is no concurrent access to controllersStarted, and the mutex provides no correctness guarantee that a plain bool check wouldn't already give.
The mutex is purely defensive — it guards against a hypothetical future change to MaxConcurrentReconciles > 1. If that future-proofing isn't desired, both mu and the lock/unlock calls can be removed, leaving just the controllersStarted bool check:
```go
func (r *reconciler) ensureDependentControllers(ctx context.Context) error {
	if r.controllersStarted {
		return nil
	}
	// ... setup ...
	r.controllersStarted = true
	return nil
}
```

This is simpler and matches the actual single-threaded reconciliation model. If concurrent reconciliation is ever introduced, a data race detector would catch the unsynchronized access, and the mutex could be added back at that point.
```go
// Index gateway classes based on their spec.controllerName
if err := r.fieldIndexer.IndexField(
	context.Background(),
```
you are receiving a context in the function signature, why not use it here?
(ok, it may make sense to use context.Background here for now to keep the former behavior, so ignore it for now)
Force-pushed 3c97e0f to c13b8df.
Change: use the input param ctx for IndexField().

[update] I did not notice the CodeRabbit comment on the code; will comment on that as well.
```go
}

// Index gateway classes based on their spec.controllerName
if err := r.fieldIndexer.IndexField(
```
@RamLavi sorry, I had a meeting with @Miciah and I may have misguided you: using context.Background here should be good, as it was the original behavior (I missed it, sorry). Using the context from Reconcile here can be a bad thing right now, given that the parent context can be canceled (it comes from the reconciliation process, not from the main process or the manager) and cause undesired behavior.
sure, changed the ctx back
DONE
Force-pushed c13b8df to 6a9bfe2.
Change: returned the ctx to what it was before.
♻️ Duplicate comments (1)

pkg/operator/controller/gatewayapi/controller.go (1)

Lines 125-126: ⚠️ Potential issue | 🟠 Major

One global started flag still makes controller start failures unrecoverable.

Line 200 flips `controllersStarted` as soon as the goroutines are launched, so an immediate `Start(ctx)` error only gets logged, and later reconciles never retry that controller. Split the state into "indexer ready" plus per-controller startup tracking, or clear failed controllers from the started set so this path can recover.

Also applies to: lines 191-200
📒 Files selected for processing (1)
pkg/operator/controller/gatewayapi/controller.go
On controller_test.go, please add to the test TestReconcileOnlyStartsControllerOnce an additional assertion on the expected controllersStarted state after each of the Reconcile() calls.
Force-pushed 6a9bfe2 to b72652e.
The ensureDependentControllers method uses sync.Once to add a GatewayClass field indexer and start dependent controllers. If the GatewayClass CRD is not yet registered when the first reconcile runs, IndexField fails and sync.Once prevents any subsequent retry. This would leave the status_controller unable to list GatewayClass resources, potentially preventing the ingress ClusterOperator from reporting status conditions. While this race has not been widely observed, the pattern is incorrect for fallible operations. Replace sync.Once with a mutex-guarded bool that allows retries on IndexField failure, requeueing every 10 seconds until the CRD is established. Fixes: https://redhat.atlassian.net/browse/OCPBUGS-78523 Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ram Lavi <ralavi@redhat.com>
Force-pushed b72652e to f781f7e.
Change: added controllersStarted assertions after each reconcile.
@rikatz: This PR has been marked as verified.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@rikatz: This pull request references Jira Issue OCPBUGS-78523, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@rikatz: This pull request references Jira Issue OCPBUGS-78523, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

No GitHub users were found matching the public email listed for the QA contact in Jira (iamin@redhat.com), skipping review request.
/retest-required
#1382 (comment) made me realize I don't understand why we need the mutex at all. That said, having an extra mutex shouldn't hurt anything (I don't see any risk of deadlock), so it isn't a blocking issue.

/lgtm
The e2e-aws-ovn-hypershift-conformance test is permafailing and is being verified with the Hypershift team. I am discussing opening an OCPBUG for it. In the meantime, I am overriding this test in favor of getting this fix merged.

/override ci/prow/e2e-aws-ovn-hypershift-conformance
@rikatz: Overrode contexts on behalf of rikatz: ci/prow/e2e-aws-ovn-hypershift-conformance

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retest-required

/test e2e-hypershift

/retest

/retest-required

/override ci/prow/e2e-aws-ovn-hypershift-conformance

Test is permafailing and has an OCPBUG open.
@rikatz: Overrode contexts on behalf of rikatz: ci/prow/e2e-aws-ovn-hypershift-conformance
For the override: https://redhat.atlassian.net/browse/OCPBUGS-78977
/retest-required
Since this touches some gwapi code, let's test our new Tech Preview no-OLM installation path out of an abundance of caution. I highly doubt this will impact it, but better safe than sorry:
Unrelated hypershift failures:

/test e2e-hypershift
I asked the hypershift team about the Teardown failures we are seeing quite often.

/test e2e-hypershift
@RamLavi: all tests passed! Full PR test history. Your PR dashboard.
@RamLavi: Jira Issue Verification Checks: Jira Issue OCPBUGS-78523 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
Fix included in accepted release 4.22.0-0.nightly-2026-03-25-221249
The gatewayapi_controller uses sync.Once to add a GatewayClass field indexer and start dependent controllers. If the GatewayClass CRD is not yet registered with the API server when the first reconcile runs, the IndexField call fails and sync.Once prevents any subsequent retry.
This permanently breaks the status_controller's ability to list GatewayClass resources via the indexed field, causing the ingress ClusterOperator to never report status conditions.
Replace sync.Once with a mutex-guarded bool that allows retries on IndexField failure, requeueing every 10 seconds until the CRD is established. Dependent controllers are started only after the indexer succeeds.
Fixes: #1381
Fixes: OCPBUGS-78523