
fix(google): do not retry HTTP 409 Conflict errors #6171

Draft
dtuck9 wants to merge 1 commit into kubernetes-sigs:master from dtuck9:do-not-retry-cloud-dns-409s

Conversation

@dtuck9
Contributor

@dtuck9 dtuck9 commented Feb 6, 2026

What does it do ?

HTTP 409 (Conflict) errors from Google Cloud DNS indicate permanent state conflicts (e.g., resource already exists) that won't resolve by retrying. Previously, all API errors were wrapped as SoftErrors, causing the controller to retry indefinitely.

This change adds detection for 409 errors and returns them as regular errors (not SoftErrors), preventing unnecessary retry loops while maintaining retry behavior for other transient errors.

Motivation

We have a high-volume use case in which we submit large batches of DNS record changes in short amounts of time from several different clients, and occasionally overload the Google Cloud DNS API servers (or more accurately, sub-services that their API relies on). When this occurs, we receive a vague 502 error despite the batch being successful, followed by endless 409 (Conflict) errors that block all remaining batches indefinitely, requiring a good deal of manual cleanup and potentially hours of customer impact.

After engaging the Cloud DNS engineering team on multiple occasions, we were provided the guidance that:

  1. Batches are entirely successful or unsuccessful. There will never be a partially successful batch.
  2. Consequently, 409s should never be retried, as a 409 indicates that the entire batch has already succeeded.

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly
    • I could not find any relevant end user documentation

HTTP 409 (Conflict) errors from Google Cloud DNS indicate permanent
state conflicts (e.g., resource already exists) that won't resolve
by retrying. Previously, all API errors were wrapped as SoftErrors,
causing the controller to retry indefinitely.

This change adds detection for 409 errors and returns them as regular
errors (not SoftErrors), preventing unnecessary retry loops while
maintaining retry behavior for other transient errors.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 6, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mloiseleur for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the provider Issues or PRs related to a provider label Feb 6, 2026
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 6, 2026
@k8s-ci-robot
Contributor

Hi @dtuck9. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 6, 2026
@ivankatliarchuk
Member

I'm not sure that assumption holds. External-DNS is a controller: it shouldn't care where a resource exists or make guesses about it.

It might be worth taking a look at Crossplane or similar controllers to see how they handle this.

This should be handled with max retries, exponential backoff, and similar mechanisms. Infrastructure conflicts go well beyond deciding which service should crash.

In practice, even if a service does crash, it will simply restart in most cases.

@dtuck9
Contributor Author

dtuck9 commented Feb 6, 2026

This is based on guidance directly from the Cloud DNS engineering team. Should it not align with their guidance?

@dtuck9
Contributor Author

dtuck9 commented Feb 6, 2026

While I understand your input, it also calls an API and per the team that owns the API, is not handling the specific, permanent, non-retryable error appropriately. It is instead treating a 409 as though it's a transient error and blocking subsequent batches from being processed in the interim.

Most controllers that encounter a transient error just requeue the individual item for retry, but that's not possible with the batching that the Google provider (rightfully) does. And a permanent failure, as is the case here, typically gets logged and then dropped and potentially alerted on.

That being said, we would be willing to take the max retry approach, but would prefer retrying in-line as opposed to waiting several iterations of the batch interval as this would lead to several minutes' worth of batches piling up as there isn't really a way to requeue an entire batch without significant refactoring. The exponential backoff approach would have similar consequences.

@ivankatliarchuk
Member

We should focus on identifying and fixing the root cause. Restarting or intentionally crashing the external-dns service won’t solve the problem. It doesn’t address the DNS API rate limits and is likely to create additional issues at the Kubernetes level.

If the service enters a crash loop due to informer resyncs, it can quickly exhaust Kubernetes API resources through repeated LIST/WATCH calls. Crashing on resource conflicts only makes this worse if the underlying trigger remains.

Each time an ExternalDNS pod restarts, it has to resync its informer cache, which results in full LIST calls to the Kubernetes API for resources like EndpointSlices and Ingresses. If a fatal error occurs during sync, the process exits, kubelet restarts the pod, and the same cycle repeats.

@dtuck9
Contributor Author

dtuck9 commented Feb 9, 2026

We should focus on identifying and fixing the root cause. Restarting or intentionally crashing the external-dns service won’t solve the problem. It doesn’t address the DNS API rate limits and is likely to create additional issues at the Kubernetes level.

Is that what returning a non-SoftError does? I had left this in draft because I was still analyzing the impact, and hadn't gotten that far yet. I agree we wouldn't want to do that. I would prefer to drop the batch in case of a 409. What are your thoughts on dropping the batch altogether?

@ivankatliarchuk
Member

Changing the error type does not drop the batch.

Any error returned from RunOnce that is not provider.SoftError causes external-dns to exit. With the proposed logic, a 409 propagates up to Controller.Run, which calls log.Fatalf, exiting the process.

Call chain:

ApplyChanges → RunOnce → Run → log.Fatalf → os.Exit(1)

What actually happens:

  • 409 error occurs
  • external-dns crashes
  • kubelet restarts the pod
  • informers resync
  • same desired state is recomputed
  • same batch is attempted again

This is a crash loop, not a batch drop.

More broadly:

  • this is a breaking change for many users
  • the conflict is environmental/configuration-related and should be fixed at the source, not masked by external-dns
  • external-dns should not guess user intent or hide misconfiguration
  • provider-specific behaviour at this layer should be avoided

The conflict itself could be due to record merging producing invalid input, or a race with another DNS controller.

@ivankatliarchuk
Member

If, for whatever reason, cleaning up the environment or adjusting processes to reduce conflicts is not an option, it may be worth considering Google SDK-level retries with backoff. But I'm not too sure: a 409 is a state conflict, and the root cause has to be resolved.

Provider layer responsibility

  • Apply bounded retries only for retryable status codes
  • Surface 409 as a clear, non-retryable failure
  • Attach diagnostics (which record / RRSet caused it)

Controller responsibility

  • React to SoftError vs hard error
  • Remain provider-agnostic

Non-retryable errors that indicate client-side state or semantic problems:

  • 400 – Invalid argument / malformed request
  • 401 / 403 – Auth or permission issues
  • 404 – Resource not found
  • 409 – Conflict (state mismatch, duplicate records, precondition failure)

For 409 specifically:

  • the request violates current resource state
  • retrying the same request without changing input will deterministically fail

In our case the caller must either:

  • change the request, or
  • reconcile external state

Aka

  • Change the request → modify the desired state that external-dns is trying to apply so it no longer conflicts with existing records (e.g., avoid creating overlapping A/AAAA/CNAME records).
  • Reconcile external state → fix the current DNS state in the provider to match what external-dns expects (delete or update conflicting records) before retrying.

@dtuck9
Contributor Author

dtuck9 commented Feb 10, 2026

this is a breaking change for many users

I am not going to return a hard error and crash the service. I would like to reach an agreement on an acceptable approach before sinking more time into this, though. Aside from returning the hard error, I don't really agree that this is a breaking change for many users: a 409 shouldn't be retried to begin with, and what user would actually expect and rely on that behavior?

the conflict is environmental/configuration-related and should be fixed at the source, not masked by external-dns

Any external-dns user could hit this situation. The Cloud DNS API returns a 502 despite a successful batch. Then external-dns retries because of a transient 502, and gets a 409 because of the previously successful batch. And then external-dns retries a 409.

While I agree that the Cloud DNS API shouldn't return a 5XX on a successful batch, that is an orthogonal issue on which we are also working with GCP. Let's set that aside.

Per your own comment, you have a 409 listed in Non-retryable errors that indicate client-side state or semantic problems:

  • the request violates current resource state
  • retrying the same request without changing input will deterministically fail

So why is external-dns currently retrying a 409? And why would handling this in the Google SDK be more appropriate?

Additionally, there are no partial batch successes per the Cloud DNS engineer. So if a batch gets a 409 on a retry attempt, then all records within the batch would be a conflict. There's no valid way to retry the same request with different input as a result -- they'd all 409.

We could retry smaller batches by breaking up the larger batch, but per the Cloud DNS engineer, these would all fail anyway with a 409 and would just delay the inevitable.

To your point, though, perhaps that would handle the case in which a 409 is encountered without the previously successful batch (i.e., a genuine conflict where multiple sources are trying to claim and create the same DNS record), so I would be fine with this approach as there would be a finite number of retries and external-dns would eventually move on.

external-dns should not guess user intent or hide misconfiguration

There's no misconfiguration here, and I don't agree with the suggestion that external-dns is trying to "guess user intent".

provider-specific behaviour at this layer should be avoided

Do the APIs within each provider similarly return 409s? Each provider is inherently different as they all have different API contracts and behaviors, so why should we avoid provider-specific behavior at the individual provider layer?

@ivankatliarchuk
Member

This 409 conflict should not be handled by the Google SDK (in fact, the SDK should treat this error as non-retryable), and external-dns should not handle it in some specific way either.

This is not a software problem. It’s a state problem, an environmental inconsistency aka broken state. No matter how external-dns behaves, the state will still be broken.

This translates to:

  • remove or update conflicting DNS records
  • ensure single-writer ownership per zone / record set
  • fix record definitions that merge into invalid RRsets
  • pause or scale external-dns to zero during environmental inconsistencies

^ This is normal operational hygiene, not a tooling failure.

We don’t have a defined conflict-resolution model today.

Right now external-dns:

  • detects desired vs current state
  • computes a plan
  • assumes the plan is internally consistent

But when conflicts exist there is no policy and no contract for:

  • dropping records
  • rejecting records
  • partial application
  • best-effort behavior

If we don't know whether external-dns should drop conflicting records, reject the whole plan, apply a subset, or block forever, then choosing behaviour via error-handling side effects (crash vs retry vs skip) for a specific provider is not the ideal option.

If it isn't clear which records are conflicting, add a metric and warning logs to make it visible to users.

The other reviewers are extremely busy. We’ll see what their view is, but it may take a while.

@dtuck9
Contributor Author

dtuck9 commented Feb 11, 2026

I would propose that if a batch fails due to a 409, it is split into two batches and each half is retried, continuing until a batch contains a single record; if that single-record batch still 409s, it is logged and dropped.
