
fix(google): do not retry HTTP 409 Conflict errors #6171

Draft
dtuck9 wants to merge 1 commit into kubernetes-sigs:master from dtuck9:do-not-retry-cloud-dns-409s

Conversation

@dtuck9
Contributor

@dtuck9 dtuck9 commented Feb 6, 2026

What does it do ?

HTTP 409 (Conflict) errors from Google Cloud DNS indicate permanent state conflicts (e.g., resource already exists) that won't resolve by retrying. Previously, all API errors were wrapped as SoftErrors, causing the controller to retry indefinitely.

This change adds detection for 409 errors and returns them as regular errors (not SoftErrors), preventing unnecessary retry loops while maintaining retry behavior for other transient errors.

Motivation

We have a high-volume use case in which we submit large batches of DNS record changes in short amounts of time from several different clients, and occasionally overload the Google Cloud DNS API servers (or more accurately, sub-services that their API relies on). When this occurs, we receive a vague 502 error despite the batch being successful, followed by endless 409 (Conflict) errors that block all remaining batches indefinitely, requiring a good deal of manual cleanup and potentially hours of customer impact.

After engaging the Cloud DNS engineering team on multiple occasions, we were provided the guidance that:

  1. Batches are entirely successful or unsuccessful. There will never be a partially successful batch.
  2. Consequently, 409s should never be retried, as a 409 indicates that the entire batch has already succeeded.

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly
    • I could not find any relevant end user documentation

HTTP 409 (Conflict) errors from Google Cloud DNS indicate permanent
state conflicts (e.g., resource already exists) that won't resolve
by retrying. Previously, all API errors were wrapped as SoftErrors,
causing the controller to retry indefinitely.

This change adds detection for 409 errors and returns them as regular
errors (not SoftErrors), preventing unnecessary retry loops while
maintaining retry behavior for other transient errors.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 6, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mloiseleur for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the provider Issues or PRs related to a provider label Feb 6, 2026
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 6, 2026
@k8s-ci-robot
Contributor

Hi @dtuck9. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 6, 2026
@ivankatliarchuk
Member

I'm not sure that assumption holds. External-DNS is a controller: it shouldn't care where a resource exists or make guesses about it.

It might be worth taking a look at Crossplane or similar controllers to see how they handle this.

This should be handled with max retries, exponential backoff, and similar mechanisms. Infrastructure conflicts go well beyond deciding which service should crash.

In practice, even if a service does crash, it will simply restart in most cases.

@dtuck9
Contributor Author

dtuck9 commented Feb 6, 2026

This is based on guidance directly from the Cloud DNS engineering team. Should it not align with their guidance?

@dtuck9
Contributor Author

dtuck9 commented Feb 6, 2026

While I understand your input, it also calls an API and per the team that owns the API, is not handling the specific, permanent, non-retryable error appropriately. It is instead treating a 409 as though it's a transient error and blocking subsequent batches from being processed in the interim.

Most controllers that encounter a transient error just requeue the individual item for retry, but that's not possible with the batching that the Google provider (rightfully) does. And a permanent failure, as is the case here, typically gets logged and then dropped and potentially alerted on.

That being said, we would be willing to take the max retry approach, but would prefer retrying in-line as opposed to waiting several iterations of the batch interval as this would lead to several minutes' worth of batches piling up as there isn't really a way to requeue an entire batch without significant refactoring. The exponential backoff approach would have similar consequences.

@ivankatliarchuk
Member

We should focus on identifying and fixing the root cause. Restarting or intentionally crashing the external-dns service won’t solve the problem. It doesn’t address the DNS API rate limits and is likely to create additional issues at the Kubernetes level.

If the service enters a crash loop due to informer resyncs, it can quickly exhaust Kubernetes API resources through repeated LIST/WATCH calls. Crashing on resource conflicts only makes this worse if the underlying trigger remains.

Each time an ExternalDNS pod restarts, it has to resync its informer cache, which results in full LIST calls to the Kubernetes API for resources like EndpointSlices and Ingresses. If a fatal error occurs during sync, the process exits, kubelet restarts the pod, and the same cycle repeats.

@dtuck9
Contributor Author

dtuck9 commented Feb 9, 2026

We should focus on identifying and fixing the root cause. Restarting or intentionally crashing the external-dns service won’t solve the problem. It doesn’t address the DNS API rate limits and is likely to create additional issues at the Kubernetes level.

Is that what returning a non-SoftError does? I had left this in draft because I was still analyzing the impact, and hadn't gotten that far yet. I agree we wouldn't want to do that. I would prefer to drop the batch in case of a 409. What are your thoughts on dropping the batch altogether?

@ivankatliarchuk
Member

Changing the error type does not drop the batch.

Any error returned from RunOnce that is not provider.SoftError causes external-dns to exit. With the proposed logic, a 409 propagates up to Controller.Run, which calls log.Fatalf, exiting the process.

Call chain:

ApplyChanges → RunOnce → Run → log.Fatalf → os.Exit(1)

What actually happens:

  • 409 error occurs
  • external-dns crashes
  • kubelet restarts the pod
  • informers resync
  • same desired state is recomputed
  • same batch is attempted again

This is a crash loop, not a batch drop.

More broadly:

  • this is a breaking change for many users
  • the conflict is environmental/configuration-related and should be fixed at the source, not masked by external-dns
  • external-dns should not guess user intent or hide misconfiguration
  • provider-specific behaviour at this layer should be avoided

The conflict itself could be due to record merging producing invalid input, or a race with another DNS controller.

@ivankatliarchuk
Member

If, for whatever reason, cleaning up the environment or adjusting processes to reduce conflicts is not an option, it may be worth considering Google SDK-level retries with backoff. But I'm not too sure: a 409 is a state conflict, and the root cause has to be resolved.

Provider layer responsibility

  • Apply bounded retries only for retryable status codes
  • Surface 409 as a clear, non-retryable failure
  • Attach diagnostics (which record / RRSet caused it)

Controller responsibility

  • React to SoftError vs hard error
  • Remain provider-agnostic

Non-retryable errors that indicate client-side state or semantic problems:

  • 400 – Invalid argument / malformed request
  • 401 / 403 – Auth or permission issues
  • 404 – Resource not found
  • 409 – Conflict (state mismatch, duplicate records, precondition failure)

For 409 specifically:

  • the request violates current resource state
  • retrying the same request without changing input will deterministically fail

In our case the caller must either:

  • change the request, or
  • reconcile external state

Aka

  • Change the request → modify the desired state that external-dns is trying to apply so it no longer conflicts with existing records (e.g., avoid creating overlapping A/AAAA/CNAME records).
  • Reconcile external state → fix the current DNS state in the provider to match what external-dns expects (delete or update conflicting records) before retrying.

@dtuck9
Contributor Author

dtuck9 commented Feb 10, 2026

this is a breaking change for many users

I am not going to return a hard error and crash the service. I would like to reach an agreement on an acceptable approach before sinking more time into this, though. Aside from returning the hard error, I don't really agree that this is a breaking change for many users: a 409 shouldn't be retried to begin with, and what user would actually expect and rely on that behavior?

the conflict is environmental/configuration-related and should be fixed at the source, not masked by external-dns

Any external-dns user could hit this situation. The Cloud DNS API returns a 502 despite a successful batch. Then external-dns retries because of a transient 502, and gets a 409 because of the previously successful batch. And then external-dns retries a 409.

While I agree that the Cloud DNS API shouldn't return a 5XX on a successful batch, that is an orthogonal issue on which we are also working with GCP. Let's set that aside.

Per your own comment, you have a 409 listed in Non-retryable errors that indicate client-side state or semantic problems:

  • the request violates current resource state
  • retrying the same request without changing input will deterministically fail

So why is external-dns currently retrying a 409? And why would handling this in the Google SDK be more appropriate?

Additionally, there are no partial batch successes per the Cloud DNS engineer. So if a batch gets a 409 on a retry attempt, then all records within the batch would be a conflict. There's no valid way to retry the same request with different input as a result -- they'd all 409.

We could retry smaller batches by breaking up the larger batch, but per the Cloud DNS engineer, these would all fail anyway with a 409 and would just delay the inevitable.

To your point, though, perhaps that would handle the case in which a 409 is encountered without the previously successful batch (i.e., a genuine conflict where multiple sources are trying to claim and create the same DNS record), so I would be fine with this approach as there would be a finite number of retries and external-dns would eventually move on.

external-dns should not guess user intent or hide misconfiguration

There's no misconfiguration here, and I don't agree with the suggestion that external-dns is trying to "guess user intent".

provider-specific behaviour at this layer should be avoided

Do the APIs within each provider similarly return 409s? Each provider is inherently different as they all have different API contracts and behaviors, so why should we avoid provider-specific behavior at the individual provider layer?

@ivankatliarchuk
Member

This 409 conflict should not be handled by the Google SDK (in fact, the SDK should treat this error as non-retryable), and external-dns should not handle it in some specific way either.

This is not a software problem. It’s a state problem, an environmental inconsistency aka broken state. No matter how external-dns behaves, the state will still be broken.

This translates to:

  • remove or update conflicting DNS records
  • ensure single-writer ownership per zone / record set
  • fix record definitions that merge into invalid RRsets
  • pause or scale external-dns to zero during environmental inconsistencies

^ This is normal operational hygiene, not a tooling failure.

We don’t have a defined conflict-resolution model today.

Right now external-dns:

  • detects desired vs current state
  • computes a plan
  • assumes the plan is internally consistent

But when conflicts exist there is no policy and no contract for:

  • dropping records
  • rejecting records
  • partial application
  • best-effort behavior

If we don't know whether external-dns should drop conflicting records, reject the whole plan, apply a subset, or block forever, then choosing behaviour via error-handling side effects (crash vs retry vs skip) for a specific provider is not the ideal option.

If it isn't clear which records are conflicting, add a metric and warning logs to make it visible to users.

The other reviewers are extremely busy. We’ll see what their view is, but it may take a while.

@dtuck9
Contributor Author

dtuck9 commented Feb 11, 2026

I would propose that if a batch fails due to a 409, it is split into two batches and each half is retried, continuing until a batch contains a single record; if that single-record batch still 409s, it is logged and dropped.
