OCPBUGS-78523: gatewayapi_controller: Replace sync.Once with retry for GatewayClass field indexer setup#1382

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:master from RamLavi:change_Once_to_Retry
Mar 25, 2026

Conversation

@RamLavi
Contributor

@RamLavi RamLavi commented Mar 12, 2026

The gatewayapi_controller uses sync.Once to add a GatewayClass field indexer and start dependent controllers. If the GatewayClass CRD is not yet registered with the API server when the first reconcile runs, the IndexField call fails and sync.Once prevents any subsequent retry.

This permanently breaks the status_controller's ability to list GatewayClass resources via the indexed field, causing the ingress ClusterOperator to never report status conditions.

Replace sync.Once with a mutex-guarded bool that allows retries on IndexField failure, requeueing every 10 seconds until the CRD is established. Dependent controllers are started only after the indexer succeeds.

Fixes: #1381
Fixes: OCPBUGS-78523
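
To illustrate the failure mode described above, here is a minimal, self-contained Go sketch. It is not the operator's code: `onceGuard`, `retryGuard`, and `failNTimes` are hypothetical names introduced for illustration, and `setup` merely stands in for a fallible call such as `IndexField`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// setupFunc stands in for a fallible one-time setup step (hypothetical).
type setupFunc func() error

// onceGuard mirrors the old pattern: sync.Once considers the step done
// after the first call, even if the function it ran reported an error.
type onceGuard struct {
	once sync.Once
}

func (g *onceGuard) ensure(setup setupFunc) error {
	var err error
	g.once.Do(func() { err = setup() })
	return err // nil on every call after the first, even if setup failed
}

// retryGuard mirrors the new pattern: the flag is set only after success,
// so a failed attempt is retried on the next call.
type retryGuard struct {
	mu      sync.Mutex
	started bool
}

func (g *retryGuard) ensure(setup setupFunc) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.started {
		return nil
	}
	if err := setup(); err != nil {
		return err
	}
	g.started = true
	return nil
}

// failNTimes returns a setupFunc that fails its first n calls and
// records every attempt in *attempts.
func failNTimes(n int, attempts *int) setupFunc {
	return func() error {
		*attempts++
		if *attempts <= n {
			return errors.New("CRD not established")
		}
		return nil
	}
}

func main() {
	var onceAttempts, retryAttempts int
	og, rg := &onceGuard{}, &retryGuard{}
	onceSetup := failNTimes(1, &onceAttempts)
	retrySetup := failNTimes(1, &retryAttempts)

	// Three simulated reconciles; the setup step fails on the first one.
	for i := 0; i < 3; i++ {
		_ = og.ensure(onceSetup)
		_ = rg.ensure(retrySetup)
	}
	fmt.Println("sync.Once attempts:", onceAttempts)
	fmt.Println("retry guard attempts:", retryAttempts, "started:", rg.started)
}
```

In this sketch the `sync.Once` variant attempts the setup once and never again, while the guarded bool attempts it a second time and then records success.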

@coderabbitai

coderabbitai bot commented Mar 12, 2026

Walkthrough

Reconciler startup was refactored to replace a one-shot sync.Once with an ensureDependentControllers method guarded by mu sync.Mutex and tracked by controllersStarted bool. The new method idempotently registers a GatewayClass field indexer, starts dependent controllers only after successful indexing, and sets controllersStarted. Reconcile now calls ensureDependentControllers and returns a 10-second requeue on initialization failure. The time package import was added for requeue timing.

Assessment against linked issues

Objective (grouped, issue) Addressed Explanation
Ensure GatewayClass field indexer is not subject to one-time failure and can be retried (#[1381])
Prevent dependent controllers from starting when the indexer setup fails; ensure they start only after successful indexer installation (#[1381])
Surface indexer failure to Reconcile and trigger retry (requeue) instead of silently proceeding (#[1381])

(Only objectives from linked issues were assessed.)

🚥 Pre-merge checks | ✅ 7
✅ Passed checks (7 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: replacing sync.Once with a retry mechanism for GatewayClass field indexer setup.
Description check ✅ Passed The description clearly explains the problem (sync.Once preventing retries), the impact (broken status reporting), and the solution (mutex-guarded retry with 10-second requeue).
Linked Issues check ✅ Passed The PR implementation directly addresses all requirements from issue #1381: replaces sync.Once with retry logic, allows IndexField retries, prevents dependent controller startup on failure, and restores status reporting via the field indexer.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing the sync.Once retry issue: adding mutex/flag for guarded startup, implementing ensureDependentControllers helper, and adjusting error handling for requeue logic.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Stable And Deterministic Test Names ✅ Passed PR modifies only production code in gatewayapi controller; no Ginkgo test files or test titles are introduced or modified.
Test Structure And Quality ✅ Passed The custom check is not applicable to this pull request. The test file uses standard Go testing framework with testify/assert, not Ginkgo-style tests.


@openshift-ci openshift-ci bot requested review from gcs278 and rikatz March 12, 2026 08:46

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/controller/gatewayapi/controller.go`:
- Around line 158-160: The current reconcile collapses all errors from
ensureDependentControllers into a CRD-not-established retry path; update the
logic so ensureDependentControllers (and any calls like IndexField) return a
distinguishable error (e.g., a sentinel error var ErrCRDNotEstablished or a
typed error) and then in the reconcile caller check that error specifically: if
errors.Is(err, ErrCRDNotEstablished) then log and requeue after 10s, otherwise
return the original err (not a requeue) so controller-runtime surfaces and
backoffs; alternatively use errors.As to detect a CRD-not-established error type
and treat all other errors as fatal by returning err.
- Around line 190-199: The current loop launches goroutines for each
r.config.DependentControllers element and sets r.controllersStarted = true
immediately, which can permanently hide Start(ctx) failures; change this to wait
for all Start(ctx) calls to complete and surface any errors before setting
r.controllersStarted: iterate over r.config.DependentControllers, spawn
goroutines that call (*c).Start(ctx) but use a sync.WaitGroup plus an error
channel (or start them serially), collect and return the first error (or
aggregate) if any Start(ctx) fails, and only set r.controllersStarted = true
after the WaitGroup completes with no errors so failed startups can be retried
by subsequent reconciliations.

📥 Commits

Reviewing files that changed from the base of the PR and between c2eae34 and 27c49cf.

📒 Files selected for processing (1)
  • pkg/operator/controller/gatewayapi/controller.go

Comment on lines +158 to +160
if err := r.ensureDependentControllers(ctx); err != nil {
log.Error(err, "GatewayClass CRD not yet established, will retry")
return reconcile.Result{RequeueAfter: 10 * time.Second}, nil

@coderabbitai coderabbitai bot Mar 12, 2026

⚠️ Potential issue | 🟠 Major

Don't collapse every setup failure into the CRD-not-established retry path.

This treats all ensureDependentControllers errors as the expected transient bootstrap case. If IndexField fails for any other reason, the reconcile loop will silently spin every 10 seconds and never surface the real failure.

Possible shape
 if err := r.ensureDependentControllers(ctx); err != nil {
-	log.Error(err, "GatewayClass CRD not yet established, will retry")
-	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
+	if isRetryableGatewayClassIndexerError(err) {
+		log.Info("GatewayClass CRD not yet established, will retry", "after", 10*time.Second)
+		return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
+	}
+	return reconcile.Result{}, err
 }

Contributor Author

made err more neutral

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current state of the Reconcile function's ensureDependentControllers error handling
rg -n "ensureDependentControllers|RequeueAfter|controllersStarted|isRetryable|ErrCRD|IndexField" pkg/operator/controller/gatewayapi/controller.go

Length of output: 735


@RamLavi, thanks for the update — but the core concern is about logic, not the message wording. The code still collapses every ensureDependentControllers error into the same silent 10-second retry path (return reconcile.Result{RequeueAfter: 10 * time.Second}, nil), regardless of the failure cause.

If IndexField ever fails for a reason other than the CRD not being established yet (permissions, internal indexer state, etc.), the reconciler will spin silently every 10 seconds and never report the real error to controller-runtime or its backoff machinery.

The minimal fix is to distinguish the two cases, for example using a sentinel:

// package-level
var errGatewayClassCRDNotEstablished = errors.New("GatewayClass CRD not yet established")
// in ensureDependentControllers, wrap the IndexField error:
if err := r.fieldIndexer.IndexField(...); err != nil {
    if meta.IsNoMatchError(err) || isCRDNotFoundError(err) {
        return fmt.Errorf("%w: %v", errGatewayClassCRDNotEstablished, err)
    }
    return err // non-transient — surface immediately
}
// in Reconcile:
if err := r.ensureDependentControllers(ctx); err != nil {
    if errors.Is(err, errGatewayClassCRDNotEstablished) {
        log.Info("GatewayClass CRD not yet established, will retry", "after", 10*time.Second)
        return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
    }
    return reconcile.Result{}, err  // surface unexpected errors
}

This ensures only the expected transient bootstrap case is silently retried; anything else is bubbled up properly.

Member

@coderabbitai while I agree with the overall solution, establishing this pattern right now is not desired in this repo, as an error such as a missing permission (which may still be in the process of being applied, since CVO and hypershift do it async) or others may cause a worse situation of exponential backoff. I think in this case, requeuing every 10 seconds is acceptable and won't generate much noise in the logs
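
To make the trade-off in this comment concrete, the following sketch compares a fixed 10-second requeue with a per-item exponential failure backoff. The 5 ms base and 1000 s cap are assumptions modeled on the defaults of client-go's controller rate limiter; the limiter actually configured for this controller may differ, and `exponentialDelay` is a hypothetical helper, not a library function.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Assumed backoff parameters (modeled on client-go defaults).
	assumedBase = 5 * time.Millisecond
	assumedMax  = 1000 * time.Second
	// Fixed requeue interval used in this PR.
	fixedRequeue = 10 * time.Second
)

// exponentialDelay returns base * 2^failures, capped at max, where
// failures is the number of earlier consecutive failures for the item.
func exponentialDelay(base, max time.Duration, failures int) time.Duration {
	d := base
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

func main() {
	for _, n := range []int{0, 5, 10, 11, 15, 20} {
		fmt.Printf("after %2d failures: backoff %-10v fixed %v\n",
			n, exponentialDelay(assumedBase, assumedMax, n), fixedRequeue)
	}
}
```

Under these assumptions the backoff starts far below 10 seconds, passes it after 11 consecutive failures (10.24 s), and keeps growing until it reaches the cap, which is the "worse situation" a long-lived but eventually resolved error could run into.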

@rikatz, that makes sense — given the async nature of CVO and Hypershift applying permissions, a uniform 10-second requeue avoids the risk of exponential backoff on errors that are transiently non-retryable. Understood, I'll stand down on this suggestion.


Contributor

If I understand correctly, we just won't update status in the clusteroperator resource if the index isn't initialized—right? We would return here before ever updating the clusteroperator status:

state, err := r.getOperatorState(ctx, ingressNamespace, canaryNamespace, co)
if err != nil {
return reconcile.Result{}, fmt.Errorf("failed to get operator state: %v", err)

Worse, after an upgrade, the clusteroperator status might have stale status conditions reporting that everything is fine. If I have that right, we really need to improve the logic to surface errors such as "Index [...] does not exist" from getOperatorState.

So that's all to say our error handling is already lacking, and I don't think it's too important whether we have a 10-second retry here or whether we have an exponential backoff (if anything, the 10-second retry might actually make the error state more obvious if the cluster-admin does think to check the logs); in any case, we really need to follow up to improve the error reporting.

Comment on lines +190 to +199
for i := range r.config.DependentControllers {
c := &r.config.DependentControllers[i]
go func() {
if err := (*c).Start(ctx); err != nil {
log.Error(err, "cannot start controller")
}
}()
}

r.controllersStarted = true

@coderabbitai coderabbitai bot Mar 12, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the full context around the startup code
cat -n pkg/operator/controller/gatewayapi/controller.go | head -220 | tail -70

Repository: openshift/cluster-ingress-operator

Length of output: 2887


🏁 Script executed:

# Check where controllersStarted is used in the file
rg "controllersStarted" pkg/operator/controller/gatewayapi/controller.go -n

Repository: openshift/cluster-ingress-operator

Length of output: 170


🏁 Script executed:

# Look for any error handling or retry logic related to controller startup
rg -A 5 -B 5 "controllersStarted\s*=" pkg/operator/controller/gatewayapi/controller.go

Repository: openshift/cluster-ingress-operator

Length of output: 181


🏁 Script executed:

# Check the Start method signature and where DependentControllers is defined
rg "DependentControllers" pkg/operator/controller/gatewayapi/controller.go -B 2 -A 2 -n

Repository: openshift/cluster-ingress-operator

Length of output: 1096


A dependent-controller startup failure is silently swallowed and permanently blocks further initialization.

controllersStarted is set to true immediately after spawning goroutines (line 199), before any Start(ctx) call can complete or report an error. If a controller fails to start, the error is logged but never propagated—the function returns success (nil). Since controllersStarted is now true, the reconciliation loop will skip this block forever (line 172-174), and the operator remains partially initialized with no recovery path. Either wait for all controller starts to complete before setting the flag, or implement a recovery mechanism that allows retrying failed startups.


Contributor Author

The existing pattern is unchanged by this PR, and is outside its scope (which is changing the sync.Once mechanism)

@RamLavi, understood — the goroutine-launch pattern for DependentControllers.Start() predates this PR and is not changed by it. That's a fair scope boundary. Would you like me to open a follow-up issue to track the silent-swallow concern for a future PR?

Member

@RamLavi but the comment is right. If you are setting the bool to true after this change, and you cannot start controllers this will behave as the former sync.Once.

You are going to end up in a situation where the indexer is added, but if a controller then fails to start, you will never try again because you are marking controllersStarted as true.

I think the approach here needs to be something like:

  • Establish a sync.Map inside the reconciler struct that contains the started controllers; a controller that fails to start is not added to it
  • On every sync, compare which controllers are started and try to start the missing ones

@Miciah I don't understand why the indexer is not set up outside of the reconciliation. IIRC the right pattern here (and given we always ship with GatewayClass now) is that when the controller is starting, it should try to set the indexer, or fail, outside of the reconciliation loop
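
The per-controller tracking idea in the two bullets above could look roughly like the following sketch. Everything here is hypothetical (`starter`, `startTracker`, `flakyController`, `simulate`); in particular, a real controller-runtime `Start` normally blocks until its context is canceled, so waiting on the WaitGroup is only viable here because the fake `Start` returns immediately.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// starter is a hypothetical stand-in for a dependent controller.
type starter interface {
	Start(ctx context.Context) error
}

// startTracker remembers which controllers are considered started.
type startTracker struct {
	started sync.Map       // controller name -> struct{}
	wg      sync.WaitGroup // lets the demo wait for launched goroutines
}

// ensureStarted launches every controller that is not yet tracked.
// A controller whose Start returns an error is untracked again, so the
// next call retries it.
func (t *startTracker) ensureStarted(ctx context.Context, controllers map[string]starter) {
	for name, c := range controllers {
		if _, already := t.started.LoadOrStore(name, struct{}{}); already {
			continue
		}
		t.wg.Add(1)
		go func(name string, c starter) {
			defer t.wg.Done()
			if err := c.Start(ctx); err != nil {
				t.started.Delete(name) // allow a retry on the next sync
			}
		}(name, c)
	}
}

func (t *startTracker) isStarted(name string) bool {
	_, ok := t.started.Load(name)
	return ok
}

// flakyController fails its first Start and succeeds afterwards.
type flakyController struct{ calls int }

func (f *flakyController) Start(ctx context.Context) error {
	f.calls++
	if f.calls == 1 {
		return errors.New("cannot start controller")
	}
	return nil
}

// simulate runs the given number of syncs and reports, after each one,
// whether the "status" controller is tracked as started.
func simulate(syncs int) []bool {
	t := &startTracker{}
	controllers := map[string]starter{"status": &flakyController{}}
	results := make([]bool, 0, syncs)
	for i := 0; i < syncs; i++ {
		t.ensureStarted(context.Background(), controllers)
		t.wg.Wait()
		results = append(results, t.isStarted("status"))
	}
	return results
}

func main() {
	fmt.Println(simulate(3)) // prints "[false true true]"
}
```

The point of the sketch is only that a failed start clears its own entry, so a single global flag no longer hides the failure from later syncs.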

Member

as an example, we should be setting the indexer as part of the manager bootstrap IMO: kubernetes-sigs/cluster-api-ipam-provider-in-cluster@d19ce78/pkg/ipamutil/reconciler.go#L91-L93

Contributor Author

@rikatz You raise a fair point about Start failures. In this path, IndexField has already succeeded (so the CRD is registered and cache sync should work), but I agree it could be more robust.

On the broader question - to my "new to this repo" understanding, the indexer is in Reconcile because the GatewayClass CRD may not exist when New() runs, so this controller creates it in ensureGatewayAPICRDs().

Moving the indexer to bootstrap would change that assumption. @Miciah what do you think about @rikatz 's suggestion?

Member

yes that's right. The indexer exists here because otherwise the CRD may not be available. For now, let's keep it as is and we have a task for 4.23 to fix the controllers startup and indexer startup

Member

FTR the story for refactoring redhat.atlassian.net/browse/NE-1986

I will try to take care of it on 4.23

Comment on lines +166 to +201
// ensureDependentControllers indexes GatewayClass resources and starts
// dependent controllers exactly once.
func (r *reconciler) ensureDependentControllers(ctx context.Context) error {
r.mu.Lock()
defer r.mu.Unlock()

if r.controllersStarted {
return nil
}

// Index gateway classes based on their spec.controllerName
if err := r.fieldIndexer.IndexField(
context.Background(),
&gatewayapiv1.GatewayClass{},
operatorcontroller.GatewayClassIndexFieldName,
client.IndexerFunc(func(o client.Object) []string {
gatewayclass, ok := o.(*gatewayapiv1.GatewayClass)
if !ok {
return []string{}
}
return []string{string(gatewayclass.Spec.ControllerName)}
})); err != nil {
return fmt.Errorf("failed to add field indexer: %w", err)
}
for i := range r.config.DependentControllers {
c := &r.config.DependentControllers[i]
go func() {
if err := (*c).Start(ctx); err != nil {
log.Error(err, "cannot start controller")
}
}()
}

r.controllersStarted = true
return nil
}
Contributor

Just a suggestion—it would be helpful to reviewers or people looking at the change history in the future if you put the refactoring in a separate commit from the logic change.

Contributor Author

DONE

@Miciah
Contributor

Miciah commented Mar 13, 2026

I think this change is good. CodeRabbit had a similar suggestion here: #1326 (comment)

As I mentioned in #1381, we should have an OCPBUGS ticket. Then the PR title and the commit message should both reference that ticket.

@candita
Contributor

candita commented Mar 13, 2026

@RamLavi can you please open a Jira bug for this issue when possible?

Extract the inline sync.Once block from Reconcile into a dedicated
ensureDependentControllers method. No functional change.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ram Lavi <ralavi@redhat.com>
@RamLavi RamLavi force-pushed the change_Once_to_Retry branch from 27c49cf to d8f424e Compare March 15, 2026 07:59
@RamLavi
Contributor Author

RamLavi commented Mar 15, 2026

Change: split commits to pure refactor and code change. Make err message more neutral.

@RamLavi RamLavi changed the title gatewayapi_controller: Replace sync.Once with retry for GatewayClass field indexer setup [OCPBUGS-78523] gatewayapi_controller: Replace sync.Once with retry for GatewayClass field indexer setup Mar 16, 2026
@RamLavi RamLavi force-pushed the change_Once_to_Retry branch from d8f424e to 3c97e0f Compare March 16, 2026 09:53
@RamLavi RamLavi changed the title [OCPBUGS-78523] gatewayapi_controller: Replace sync.Once with retry for GatewayClass field indexer setup OCPBUGS-78523: gatewayapi_controller: Replace sync.Once with retry for GatewayClass field indexer setup Mar 16, 2026
@RamLavi
Contributor Author

RamLavi commented Mar 16, 2026

@Miciah PTAL

@rikatz
Member

rikatz commented Mar 17, 2026

/assign
/cc

r.mu.Lock()
defer r.mu.Unlock()

if r.controllersStarted {
Member

nit: given that this will always be true once the controllers are started, you can probably put this check before locking the mutex

Contributor Author

good point but keeping it by the book is tidier. I prefer it as is if you don't object of course.

Contributor

Well, why do we need the mutex at all? Reconcile doesn't need to be re-entrant, does it? I feel like I'm missing something.

Either we need the mutex for reads and writes, or we don't need the mutex at all.
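
For context on the "reads and writes" point: if a check before the lock were wanted and concurrent callers were possible, the unlocked read would itself need synchronization. One common shape is sketched below, using `sync/atomic`'s `atomic.Bool` (Go 1.19+) for the fast path. This is an illustration only, not what the PR does; `guard` and `countSetups` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// guard allows a lock-free fast path while keeping setup single-flight.
type guard struct {
	mu      sync.Mutex
	started atomic.Bool
}

// ensure runs setup until it succeeds once; later calls return early.
func (g *guard) ensure(setup func() error) error {
	// Fast path: atomic read, safe without holding the mutex.
	if g.started.Load() {
		return nil
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	// Re-check under the lock: another caller may have finished setup
	// while this one was waiting.
	if g.started.Load() {
		return nil
	}
	if err := setup(); err != nil {
		return err
	}
	g.started.Store(true)
	return nil
}

// countSetups calls ensure from many goroutines and reports how many
// times the setup function actually ran.
func countSetups(callers int) int {
	var (
		g      guard
		wg     sync.WaitGroup
		setups atomic.Int64
	)
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = g.ensure(func() error {
				setups.Add(1)
				return nil
			})
		}()
	}
	wg.Wait()
	return int(setups.Load())
}

func main() {
	fmt.Println("setup ran", countSetups(50), "time(s)")
}
```

With a single reconcile worker none of this is required, which is the other branch of the either/or above.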

Contributor

@coderabbitai, why do we need the mutex?

@Miciah, the mutex is not strictly needed here.

controller.New is called without setting MaxConcurrentReconciles, so it defaults to 1. Controller-runtime processes reconcile requests sequentially from the work queue, meaning Reconcile is never called concurrently. Since ensureDependentControllers is only called from Reconcile, there is no concurrent access to controllersStarted, and the mutex provides no correctness guarantee that a plain bool check wouldn't already give.

The mutex is purely defensive — it guards against a hypothetical future change to MaxConcurrentReconciles > 1. If that future-proofing isn't desired, both mu and the lock/unlock calls can be removed, leaving just the controllersStarted bool check:

func (r *reconciler) ensureDependentControllers(ctx context.Context) error {
    if r.controllersStarted {
        return nil
    }
    // ... setup ...
    r.controllersStarted = true
    return nil
}

This is simpler and matches the actual single-threaded reconciliation model. If concurrent reconciliation is ever introduced, a data race detector would catch the unsynchronized access and the mutex could be added back at that point.


// Index gateway classes based on their spec.controllerName
if err := r.fieldIndexer.IndexField(
context.Background(),
Member

you are receiving a context on the function signature, why not use it here?

Member

(OK, it may make sense to keep context.Background here for now to preserve the former behavior, so ignore this for now)

Contributor Author

DONE

@RamLavi RamLavi force-pushed the change_Once_to_Retry branch from 3c97e0f to c13b8df Compare March 17, 2026 19:01
@RamLavi
Contributor Author

RamLavi commented Mar 17, 2026

Change: Use input param ctx for IndexField()

@RamLavi
Contributor Author

RamLavi commented Mar 17, 2026

Change: Use input param ctx for IndexField()

[update] did not notice the comment on code rabbit, will comment on that as well

}

// Index gateway classes based on their spec.controllerName
if err := r.fieldIndexer.IndexField(
Member

@RamLavi sorry, I had a meeting with @Miciah and I may have misguided you: using context.Background here should be fine, as it was the original behavior (I missed it, sorry). Using the context from Reconcile here can be a bad thing right now, given that the parent context can be canceled (it comes from the reconciliation process, not from the main process or the manager), which could cause undesired behavior
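
The cancellation concern can be shown with plain `context` semantics, independent of controller-runtime: a context derived from a canceled parent is itself canceled, while `context.Background()` never is. The sketch below is a generic illustration; `errsAfterCancel` is a hypothetical helper, and how the real Reconcile context is scoped is taken from the comment above rather than verified here.

```go
package main

import (
	"context"
	"fmt"
)

// errsAfterCancel derives a request-scoped context from a parent,
// cancels the parent, and reports the error state of the derived
// context and of context.Background().
func errsAfterCancel() (requestScoped, background error) {
	parent, cancel := context.WithCancel(context.Background())
	reqCtx, reqCancel := context.WithCancel(parent)
	defer reqCancel()

	// Canceling the parent also cancels every context derived from it.
	cancel()

	return reqCtx.Err(), context.Background().Err()
}

func main() {
	reqErr, bgErr := errsAfterCancel()
	fmt.Println("request-scoped context:", reqErr)
	fmt.Println("background context:", bgErr)
}
```

Any long-lived work handed the request-scoped context would observe that cancellation, which is the "undesired behavior" a setup step tied to a single reconcile could hit.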

Contributor Author

@RamLavi RamLavi Mar 17, 2026

sure, changed the ctx back
DONE

@RamLavi RamLavi force-pushed the change_Once_to_Retry branch from c13b8df to 6a9bfe2 Compare March 17, 2026 21:23
@RamLavi
Contributor Author

RamLavi commented Mar 17, 2026

Change: return the ctx to what it was before


@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
pkg/operator/controller/gatewayapi/controller.go (1)

125-126: ⚠️ Potential issue | 🟠 Major

One global started flag still makes controller start failures unrecoverable.

Line 200 flips controllersStarted as soon as the goroutines are launched, so an immediate Start(ctx) error only gets logged and later reconciles never retry that controller. Split the state into “indexer ready” plus per-controller startup tracking, or clear failed controllers from the started set so this path can recover.

Also applies to: 191-200


📥 Commits

Reviewing files that changed from the base of the PR and between c13b8df and 6a9bfe2.

📒 Files selected for processing (1)
  • pkg/operator/controller/gatewayapi/controller.go

@rikatz
Member

rikatz commented Mar 17, 2026

On controller_test.go, please add as part of the test TestReconcileOnlyStartsControllerOnce an additional assertion:

assert.True(t, reconciler.controllersStarted)

After each of the Reconcile() calls

@RamLavi RamLavi force-pushed the change_Once_to_Retry branch from 6a9bfe2 to b72652e Compare March 18, 2026 06:54
The ensureDependentControllers method uses sync.Once to add a
GatewayClass field indexer and start dependent controllers. If the
GatewayClass CRD is not yet registered when the first reconcile runs,
IndexField fails and sync.Once prevents any subsequent retry. This
would leave the status_controller unable to list GatewayClass
resources, potentially preventing the ingress ClusterOperator from
reporting status conditions.

While this race has not been widely observed, the pattern is incorrect
for fallible operations. Replace sync.Once with a mutex-guarded bool
that allows retries on IndexField failure, requeueing every 10 seconds
until the CRD is established.

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-78523

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ram Lavi <ralavi@redhat.com>
@RamLavi RamLavi force-pushed the change_Once_to_Retry branch from b72652e to f781f7e Compare March 18, 2026 06:54
@RamLavi
Contributor Author

RamLavi commented Mar 18, 2026

Change: Add controllersStarted assertions after each reconcile

@openshift-ci-robot
Contributor

@rikatz: This PR has been marked as verified by @rikatz verified the execution as per comment https://github.com/openshift/cluster-ingress-operator/pull/1382#issuecomment-4083255582.

Details

In response to this:

/lgtm
/verified by @rikatz verified the execution as per comment #1382 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@rikatz
Member

rikatz commented Mar 19, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 19, 2026
@openshift-ci-robot
Contributor

@rikatz: This pull request references Jira Issue OCPBUGS-78523, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@rikatz
Member

rikatz commented Mar 19, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 19, 2026
@openshift-ci-robot
Contributor

@rikatz: This pull request references Jira Issue OCPBUGS-78523, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (iamin@redhat.com), skipping review request.

@rikatz
Member

rikatz commented Mar 19, 2026

/retest-required

@Miciah
Contributor

Miciah commented Mar 19, 2026

#1382 (comment) made me realize I don't understand why we need the mutex at all. That said, having an extra mutex shouldn't hurt anything (I don't see any risk of deadlock), so it isn't a blocking issue.

/lgtm

@rikatz
Member

rikatz commented Mar 19, 2026

The e2e-aws-ovn-hypershift-conformance test is permafailing, and this is being verified with the Hypershift team. I am discussing opening an OCPBUG for it.

In the meantime, I am overriding this test so that this fix can be merged.

/override ci/prow/e2e-aws-ovn-hypershift-conformance

@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2026

@rikatz: Overrode contexts on behalf of rikatz: ci/prow/e2e-aws-ovn-hypershift-conformance

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rikatz
Member

rikatz commented Mar 20, 2026

/retest-required

@rikatz
Member

rikatz commented Mar 20, 2026

/test e2e-hypershift

@rikatz
Member

rikatz commented Mar 20, 2026

/retest

@rikatz
Member

rikatz commented Mar 21, 2026

/retest-required

@rikatz
Member

rikatz commented Mar 21, 2026

/override ci/prow/e2e-aws-ovn-hypershift-conformance

Test is permafailing and has an ocpbug open

@openshift-ci
Contributor

openshift-ci bot commented Mar 21, 2026

@rikatz: Overrode contexts on behalf of rikatz: ci/prow/e2e-aws-ovn-hypershift-conformance

@rikatz
Member

rikatz commented Mar 21, 2026

For the override: https://redhat.atlassian.net/browse/OCPBUGS-78977

@rikatz
Member

rikatz commented Mar 21, 2026

/retest-required

@gcs278
Contributor

gcs278 commented Mar 23, 2026

Since this touches some gwapi code, let's test our new Tech Preview no-OLM installation path out of an abundance of caution. I highly doubt this change will have any impact, but better safe than sorry:
/test e2e-aws-operator-techpreview

@gcs278
Contributor

gcs278 commented Mar 24, 2026

unrelated hypershift failures:

fixture.go:333: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
fixture.go:340: Failed to clean up 9 remaining resources for guest cluster

/test e2e-hypershift

@gcs278
Contributor

gcs278 commented Mar 24, 2026

I asked the hypershift team about the Teardown failures we are seeing quite often.

/test e2e-hypershift

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 0c1dee7 and 2 for PR HEAD f781f7e in total

@openshift-ci
Contributor

openshift-ci bot commented Mar 25, 2026

@RamLavi: all tests passed!

Full PR test history. Your PR dashboard.

@openshift-merge-bot openshift-merge-bot bot merged commit ccf658f into openshift:master Mar 25, 2026
19 checks passed
@openshift-ci-robot
Contributor

@RamLavi: Jira Issue Verification Checks: Jira Issue OCPBUGS-78523
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-78523 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

@openshift-merge-robot
Contributor

Fix included in accepted release 4.22.0-0.nightly-2026-03-25-221249

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria.

Development

Successfully merging this pull request may close these issues.

[OCPBUGS-78523] gatewayapi_controller: sync.Once prevents retry of GatewayClass field indexer, permanently breaking ingress ClusterOperator status

7 participants