OCPBUGS-78523: gatewayapi_controller: Replace sync.Once with retry for GatewayClass field indexer setup (#1382)
Conversation
Actionable comments posted: 2
📒 Files selected for processing (1)
pkg/operator/controller/gatewayapi/controller.go
```go
if err := r.ensureDependentControllers(ctx); err != nil {
	log.Error(err, "GatewayClass CRD not yet established, will retry")
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
```
Don't collapse every setup failure into the CRD-not-established retry path.
This treats all ensureDependentControllers errors as the expected transient bootstrap case. If IndexField fails for any other reason, the reconcile loop will silently spin every 10 seconds and never surface the real failure.
Possible shape:

```diff
 if err := r.ensureDependentControllers(ctx); err != nil {
-	log.Error(err, "GatewayClass CRD not yet established, will retry")
-	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
+	if isRetryableGatewayClassIndexerError(err) {
+		log.Info("GatewayClass CRD not yet established, will retry", "after", 10*time.Second)
+		return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
+	}
+	return reconcile.Result{}, err
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/controller/gatewayapi/controller.go` around lines 158 - 160, The
current reconcile collapses all errors from ensureDependentControllers into a
CRD-not-established retry path; update the logic so ensureDependentControllers
(and any calls like IndexField) return a distinguishable error (e.g., a sentinel
error var ErrCRDNotEstablished or a typed error) and then in the reconcile
caller check that error specifically: if errors.Is(err, ErrCRDNotEstablished)
then log and requeue after 10s, otherwise return the original err (not a
requeue) so controller-runtime surfaces and backoffs; alternatively use
errors.As to detect a CRD-not-established error type and treat all other errors
as fatal by returning err.
made err more neutral
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Check the current state of the Reconcile function's ensureDependentControllers error handling
rg -n "ensureDependentControllers|RequeueAfter|controllersStarted|isRetryable|ErrCRD|IndexField" pkg/operator/controller/gatewayapi/controller.go
```
@RamLavi, thanks for the update — but the core concern is about logic, not the message wording. The code still collapses every ensureDependentControllers error into the same silent 10-second retry path (return reconcile.Result{RequeueAfter: 10 * time.Second}, nil), regardless of the failure cause.
If IndexField ever fails for a reason other than the CRD not being established yet (permissions, internal indexer state, etc.), the reconciler will spin silently every 10 seconds and never report the real error to controller-runtime or its backoff machinery.
The minimal fix is to distinguish the two cases, for example using a sentinel:

```go
// package-level
var errGatewayClassCRDNotEstablished = errors.New("GatewayClass CRD not yet established")

// in ensureDependentControllers, wrap the IndexField error:
if err := r.fieldIndexer.IndexField(...); err != nil {
	if meta.IsNoMatchError(err) || isCRDNotFoundError(err) {
		return fmt.Errorf("%w: %v", errGatewayClassCRDNotEstablished, err)
	}
	return err // non-transient; surface immediately
}

// in Reconcile:
if err := r.ensureDependentControllers(ctx); err != nil {
	if errors.Is(err, errGatewayClassCRDNotEstablished) {
		log.Info("GatewayClass CRD not yet established, will retry", "after", 10*time.Second)
		return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
	}
	return reconcile.Result{}, err // surface unexpected errors
}
```

This ensures only the expected transient bootstrap case is silently retried; anything else is bubbled up properly.
@coderabbitai while I agree with the overall solution, establishing this pattern right now is not desired in this repo: an error such as a missing permission (which may still be being applied, since CVO and HyperShift apply these asynchronously) could cause a worse situation of exponential backoff. I think in this case requeuing every 10 seconds is acceptable and won't generate much noise in the logs.
@rikatz, that makes sense — given the async nature of CVO and Hypershift applying permissions, a uniform 10-second requeue avoids the risk of exponential backoff on errors that are transiently non-retryable. Understood, I'll stand down on this suggestion.
If I understand correctly, we just won't update status in the clusteroperator resource if the index isn't initialized—right? We would return here before ever updating the clusteroperator status:
(See cluster-ingress-operator/pkg/operator/controller/status/controller.go, lines 248 to 250, at f781f7e.)
Worse, after an upgrade, the clusteroperator status might have stale status conditions reporting that everything is fine. If I have that right, we really need to improve the logic to surface errors such as "Index [...] does not exist" from getOperatorState.
So that's all to say our error handling is already lacking, and I don't think it's too important whether we have a 10-second retry here or whether we have an exponential backoff (if anything, the 10-second retry might actually make the error state more obvious if the cluster-admin does think to check the logs); in any case, we really need to follow up to improve the error reporting.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```go
for i := range r.config.DependentControllers {
	c := &r.config.DependentControllers[i]
	go func() {
		if err := (*c).Start(ctx); err != nil {
			log.Error(err, "cannot start controller")
		}
	}()
}

r.controllersStarted = true
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
# First, let's understand the full context around the startup code
cat -n pkg/operator/controller/gatewayapi/controller.go | head -220 | tail -70

# Check where controllersStarted is used in the file
rg -n "controllersStarted" pkg/operator/controller/gatewayapi/controller.go

# Look for any error handling or retry logic related to controller startup
rg -A 5 -B 5 "controllersStarted\s*=" pkg/operator/controller/gatewayapi/controller.go

# Check the Start method signature and where DependentControllers is defined
rg -n -B 2 -A 2 "DependentControllers" pkg/operator/controller/gatewayapi/controller.go
```
A dependent-controller startup failure is silently swallowed and permanently blocks further initialization.
controllersStarted is set to true immediately after spawning goroutines (line 199), before any Start(ctx) call can complete or report an error. If a controller fails to start, the error is logged but never propagated—the function returns success (nil). Since controllersStarted is now true, the reconciliation loop will skip this block forever (line 172-174), and the operator remains partially initialized with no recovery path. Either wait for all controller starts to complete before setting the flag, or implement a recovery mechanism that allows retrying failed startups.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/controller/gatewayapi/controller.go` around lines 190 - 199, The
current loop launches goroutines for each r.config.DependentControllers element
and sets r.controllersStarted = true immediately, which can permanently hide
Start(ctx) failures; change this to wait for all Start(ctx) calls to complete
and surface any errors before setting r.controllersStarted: iterate over
r.config.DependentControllers, spawn goroutines that call (*c).Start(ctx) but
use a sync.WaitGroup plus an error channel (or start them serially), collect and
return the first error (or aggregate) if any Start(ctx) fails, and only set
r.controllersStarted = true after the WaitGroup completes with no errors so
failed startups can be retried by subsequent reconciliations.
The existing pattern is unchanged by this PR and is outside its scope (which is changing the sync.Once mechanism)
@RamLavi, understood — the goroutine-launch pattern for DependentControllers.Start() predates this PR and is not changed by it. That's a fair scope boundary. Would you like me to open a follow-up issue to track the silent-swallow concern for a future PR?
@RamLavi but the comment is right. If you set the bool to true after this change and a controller fails to start, this will behave like the former sync.Once.
You will end up in a situation where the indexer is added, but if a controller fails to start, you will never try again because controllersStarted is already marked as true.
I think the approach here needs to be something like:
- Establish a sync.Map inside the reconciler struct that records the started controllers; a controller that fails to start is not added.
- On every sync, compare which controllers are started and try to start the missing ones.
@Miciah I don't understand why the indexer is not set up outside of the reconciliation. IIRC the right pattern here (and given we always ship with GatewayClass now) should be: once the controller is starting, it should try to set the indexer or fail outside of the reconciliation loop.
as an example, we should be setting the indexer as part of the manager bootstrap IMO: kubernetes-sigs/cluster-api-ipam-provider-in-cluster@d19ce78/pkg/ipamutil/reconciler.go#L91-L93
@rikatz You raise a fair point about Start failures. In this path, IndexField has already succeeded (so the CRD is registered and cache sync should work), but I agree it could be more robust.
On the broader question - to my "new to this repo" understanding, the indexer is in Reconcile because the GatewayClass CRD may not exist when New() runs, so this controller creates it in ensureGatewayAPICRDs().
Moving the indexer to bootstrap would change that assumption. @Miciah what do you think about @rikatz 's suggestion?
yes, that's right. The indexer exists here because otherwise the CRD may not be available. For now, let's keep it as is; we have a task for 4.23 to fix the controllers' startup and indexer startup.
FTR, the story for the refactoring: redhat.atlassian.net/browse/NE-1986
I will try to take care of it in 4.23.
```go
// ensureDependentControllers indexes GatewayClass resources and starts
// dependent controllers exactly once.
func (r *reconciler) ensureDependentControllers(ctx context.Context) error {
	r.mu.Lock()
	defer r.mu.Unlock()

	if r.controllersStarted {
		return nil
	}

	// Index gateway classes based on their spec.controllerName
	if err := r.fieldIndexer.IndexField(
		context.Background(),
		&gatewayapiv1.GatewayClass{},
		operatorcontroller.GatewayClassIndexFieldName,
		client.IndexerFunc(func(o client.Object) []string {
			gatewayclass, ok := o.(*gatewayapiv1.GatewayClass)
			if !ok {
				return []string{}
			}
			return []string{string(gatewayclass.Spec.ControllerName)}
		})); err != nil {
		return fmt.Errorf("failed to add field indexer: %w", err)
	}

	for i := range r.config.DependentControllers {
		c := &r.config.DependentControllers[i]
		go func() {
			if err := (*c).Start(ctx); err != nil {
				log.Error(err, "cannot start controller")
			}
		}()
	}

	r.controllersStarted = true
	return nil
}
```
Just a suggestion—it would be helpful to reviewers or people looking at the change history in the future if you put the refactoring in a separate commit from the logic change.
I think this change is good. CodeRabbit had a similar suggestion here: #1326 (comment). As I mentioned in #1381, we should have an OCPBUGS ticket. Then the PR title and the commit message should both reference that ticket.

@RamLavi can you please open a Jira bug for this issue when possible?
Extract the inline sync.Once block from Reconcile into a dedicated ensureDependentControllers method. No functional change. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ram Lavi <ralavi@redhat.com>
Force-pushed 27c49cf to d8f424e.
Change: split commits into a pure refactor and the code change; made the error message more neutral.
Force-pushed d8f424e to 3c97e0f.
@Miciah PTAL

/assign
```go
r.mu.Lock()
defer r.mu.Unlock()

if r.controllersStarted {
```
nit: given that once the controller is started this will always be true, you can probably put this check before locking the mutex
good point, but keeping it by the book is tidier. I prefer it as is, if you don't object of course.
Well, why do we need the mutex at all? Reconcile doesn't need to be re-entrant, does it? I feel like I'm missing something.
Either we need the mutex for reads and writes, or we don't need the mutex at all.
@Miciah, the mutex is not strictly needed here.
controller.New is called without setting MaxConcurrentReconciles, so it defaults to 1. Controller-runtime processes reconcile requests sequentially from the work queue, meaning Reconcile is never called concurrently. Since ensureDependentControllers is only called from Reconcile, there is no concurrent access to controllersStarted, and the mutex provides no correctness guarantee that a plain bool check wouldn't already give.
The mutex is purely defensive — it guards against a hypothetical future change to MaxConcurrentReconciles > 1. If that future-proofing isn't desired, both mu and the lock/unlock calls can be removed, leaving just the controllersStarted bool check:
```go
func (r *reconciler) ensureDependentControllers(ctx context.Context) error {
	if r.controllersStarted {
		return nil
	}
	// ... setup ...
	r.controllersStarted = true
	return nil
}
```

This is simpler and matches the actual single-threaded reconciliation model. If concurrent reconciliation is ever introduced, a data race detector would catch the unsynchronized access, and the mutex could be added back at that point.
```go
// Index gateway classes based on their spec.controllerName
if err := r.fieldIndexer.IndexField(
	context.Background(),
```
you are receiving a context in the function signature, why not use it here?
(ok, it may make sense to use context.Background here for now to keep the former behavior, so ignore it for now)
Force-pushed 3c97e0f to c13b8df.
Change: use the input param ctx for IndexField().

[update] I did not notice the CodeRabbit comment on the code; will comment on that as well.
```go
}

// Index gateway classes based on their spec.controllerName
if err := r.fieldIndexer.IndexField(
```
@RamLavi sorry, I had a meeting with @Miciah and I may have misguided you: using context.Background here should be good, as it was the original behavior (I missed it, sorry). Using the context from Reconcile here can be a bad thing right now, given that the parent context can be canceled (it comes from the reconciliation process, not from the main process or the manager) and cause undesired behavior.
sure, changed the ctx back
DONE
Force-pushed c13b8df to 6a9bfe2.
Change: returned the ctx to what it was before.
♻️ Duplicate comments (1)

pkg/operator/controller/gatewayapi/controller.go (1)

Lines 125-126: ⚠️ Potential issue | 🟠 Major

One global started flag still makes controller start failures unrecoverable.

Line 200 flips `controllersStarted` as soon as the goroutines are launched, so an immediate `Start(ctx)` error only gets logged, and later reconciles never retry that controller. Split the state into "indexer ready" plus per-controller startup tracking, or clear failed controllers from the started set so this path can recover.

Also applies to: lines 191-200
📒 Files selected for processing (1)
pkg/operator/controller/gatewayapi/controller.go
On controller_test.go, please add to the test TestReconcileOnlyStartsControllerOnce an additional assertion on the expected controllersStarted state after each of the Reconcile() calls.
Force-pushed 6a9bfe2 to b72652e.
The ensureDependentControllers method uses sync.Once to add a GatewayClass field indexer and start dependent controllers. If the GatewayClass CRD is not yet registered when the first reconcile runs, IndexField fails and sync.Once prevents any subsequent retry. This would leave the status_controller unable to list GatewayClass resources, potentially preventing the ingress ClusterOperator from reporting status conditions. While this race has not been widely observed, the pattern is incorrect for fallible operations. Replace sync.Once with a mutex-guarded bool that allows retries on IndexField failure, requeueing every 10 seconds until the CRD is established. Fixes: https://redhat.atlassian.net/browse/OCPBUGS-78523 Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ram Lavi <ralavi@redhat.com>
Force-pushed b72652e to f781f7e.
Change: added controllersStarted assertions after each reconcile.
@rikatz: This PR has been marked as verified.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@rikatz: This pull request references Jira Issue OCPBUGS-78523, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@rikatz: This pull request references Jira Issue OCPBUGS-78523, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

No GitHub users were found matching the public email listed for the QA contact in Jira (iamin@redhat.com), skipping review request.
/retest-required
#1382 (comment) made me realize I don't understand why we need the mutex at all. That said, having an extra mutex shouldn't hurt anything (I don't see any risk of deadlock), so it isn't a blocking issue.

/lgtm
The e2e-aws-ovn-hypershift-conformance test is permafailing and is being verified with the Hypershift team. I am discussing opening an OCPBUG for it. In the meantime, I am overriding this test in favor of getting this fix merged.

/override ci/prow/e2e-aws-ovn-hypershift-conformance
@rikatz: Overrode contexts on behalf of rikatz: ci/prow/e2e-aws-ovn-hypershift-conformance

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retest-required

/test e2e-hypershift

/retest

/retest-required

/override ci/prow/e2e-aws-ovn-hypershift-conformance

Test is permafailing and has an OCPBUG open.
@rikatz: Overrode contexts on behalf of rikatz: ci/prow/e2e-aws-ovn-hypershift-conformance
For the override: https://redhat.atlassian.net/browse/OCPBUGS-78977
/retest-required
Since this touches some gwapi code, let's test our new Tech Preview no-OLM installation path out of an abundance of caution. I highly doubt this will impact it, but better safe than sorry:
Unrelated hypershift failures:

/test e2e-hypershift
I asked the hypershift team about the Teardown failures we are seeing quite often.

/test e2e-hypershift
@RamLavi: all tests passed! Full PR test history. Your PR dashboard.
@RamLavi: Jira Issue Verification Checks: Jira Issue OCPBUGS-78523 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
Fix included in accepted release 4.22.0-0.nightly-2026-03-25-221249
The gatewayapi_controller uses sync.Once to add a GatewayClass field indexer and start dependent controllers. If the GatewayClass CRD is not yet registered with the API server when the first reconcile runs, the IndexField call fails and sync.Once prevents any subsequent retry.
This permanently breaks the status_controller's ability to list GatewayClass resources via the indexed field, causing the ingress ClusterOperator to never report status conditions.
Replace sync.Once with a mutex-guarded bool that allows retries on IndexField failure, requeueing every 10 seconds until the CRD is established. Dependent controllers are started only after the indexer succeeds.
Fixes: #1381
Fixes: OCPBUGS-78523