fix(google): do not retry HTTP 409 Conflict errors #6171
dtuck9 wants to merge 1 commit into `kubernetes-sigs:master`
Conversation
HTTP 409 (Conflict) errors from Google Cloud DNS indicate permanent state conflicts (e.g., resource already exists) that won't resolve by retrying. Previously, all API errors were wrapped as SoftErrors, causing the controller to retry indefinitely. This change adds detection for 409 errors and returns them as regular errors (not SoftErrors), preventing unnecessary retry loops while maintaining retry behavior for other transient errors. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
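The change described above can be sketched roughly as follows. This is not the PR's actual code: `apiError` is a simplified stand-in for `googleapi.Error`, and `softError` stands in for external-dns's soft-error wrapper; both names and shapes are assumptions made to keep the sketch self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a simplified stand-in for googleapi.Error; only the HTTP
// status code matters for this sketch.
type apiError struct{ Code int }

func (e *apiError) Error() string { return fmt.Sprintf("googleapi: Error %d", e.Code) }

// softError marks an error as retryable by the controller loop, standing
// in for external-dns's SoftError wrapper (hypothetical shape).
type softError struct{ err error }

func (e *softError) Error() string { return "soft error: " + e.err.Error() }

// classify wraps API errors as soft (retryable) errors, except for
// HTTP 409 Conflict, which is returned unwrapped so the controller
// does not retry a permanent state conflict.
func classify(err error) error {
	if err == nil {
		return nil
	}
	var ae *apiError
	if errors.As(err, &ae) && ae.Code == http.StatusConflict {
		return err // permanent state conflict: do not retry
	}
	return &softError{err: err} // everything else stays retryable
}

func main() {
	fmt.Println(classify(&apiError{Code: 502})) // wrapped: still retried
	fmt.Println(classify(&apiError{Code: 409})) // unwrapped: not retried
}
```

The point of the sketch is only the branch on the status code: every other error keeps the existing retry semantics.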
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: *(no one yet)*. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment. |
|
Hi @dtuck9. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
I'm not sure that assumption holds. External-DNS is a controller - it shouldn't care where a resource exists or make guesses about it. It might be worth taking a look at Crossplane or similar controllers to see how they handle this. This should be handled with max retries, exponential backoff, and similar mechanisms. Infrastructure conflicts go well beyond deciding which service should crash. In practice, even if a service does crash, it will simply restart in most cases. |
|
This is based on guidance directly from the Cloud DNS engineering team. Should it not align with their guidance? |
|
While I understand your input, external-dns also calls an API, and per the team that owns that API, it is not handling this specific, permanent, non-retryable error appropriately. It is instead treating a 409 as though it were a transient error and blocking subsequent batches from being processed in the interim. Most controllers that encounter a transient error just requeue the individual item for retry, but that's not possible with the batching that the Google provider (rightfully) does. A permanent failure, as is the case here, typically gets logged and then dropped, and potentially alerted on.

That being said, we would be willing to take the max-retry approach, but would prefer retrying in-line as opposed to waiting several iterations of the batch interval, since the latter would lead to several minutes' worth of batches piling up; there isn't really a way to requeue an entire batch without significant refactoring. The exponential backoff approach would have similar consequences. |
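A minimal sketch of the in-line bounded-retry idea mentioned above, under stated assumptions: the function names, the `errTransient` sentinel, and the submit callback are all hypothetical, not external-dns code. The batch is retried immediately with exponential backoff instead of being requeued across batch intervals, and a persistent failure is logged and given up on so later batches are not blocked.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a retryable API error (hypothetical sentinel).
var errTransient = errors.New("transient API error")

// submitWithRetry retries a batch in-line up to maxAttempts times with
// simple exponential backoff. After the final failed attempt, it logs
// and returns the error rather than blocking subsequent batches.
func submitWithRetry(batch []string, submit func([]string) error, maxAttempts int) error {
	backoff := 10 * time.Millisecond
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = submit(batch); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	fmt.Printf("giving up on batch after %d attempts: %v\n", maxAttempts, err)
	return err
}

func main() {
	calls := 0
	// Hypothetical submit that succeeds on the third attempt.
	submit := func(batch []string) error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	err := submitWithRetry([]string{"a", "b"}, submit, 5)
	fmt.Println("calls:", calls, "err:", err)
}
```

The finite attempt count is what distinguishes this from the current behavior: the batch either succeeds within a bounded window or is dropped, instead of being retried indefinitely.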
|
We should focus on identifying and fixing the root cause. Restarting or intentionally crashing the external-dns service won’t solve the problem. It doesn’t address the DNS API rate limits and is likely to create additional issues at the Kubernetes level. If the service enters a crash loop due to informer resyncs, it can quickly exhaust Kubernetes API resources through repeated LIST/WATCH calls. Crashing on resource conflicts only makes this worse if the underlying trigger remains. Each time an ExternalDNS pod restarts, it has to resync its informer cache, which results in full LIST calls to the Kubernetes API for resources like EndpointSlices and Ingresses. If a fatal error occurs during sync, the process exits, kubelet restarts the pod, and the same cycle repeats. |
Is that what returning a non-`SoftError` error would lead to? |
|
Changing the error type does not mean the batch is dropped. Any error returned from the provider propagates up the call chain, so what actually happens is that the process exits and restarts. This is a crash loop, not a batch drop. More broadly:
The conflict itself could be due to record merging producing invalid input, or a race with another DNS controller. |
|
If for whatever reason cleaning up the environment or looking at processes to reduce conflicts is not an option, it is worth considering Google SDK-level retries + backoff, but I'm not too sure: a 409 is a state conflict, and the root cause has to be resolved. Provider layer responsibility
Controller responsibility
Non-retryable errors that indicate client-side state or semantic problems:
For 409 specifically:
In our case the caller must either:
Aka
|
I am not going to return a hard error and crash the service. I would like to reach an agreement on an acceptable approach before sinking more time into this, though. Aside from returning the hard error, I don't really agree that this is a breaking change for many users: a 409 shouldn't be retried to begin with, and what user would actually expect and rely on that behavior?
Any external-dns user could hit this situation. The Cloud DNS API returns a 502 despite a successful batch. Then external-dns retries because of a transient 502, and gets a 409 because of the previously successful batch. And then external-dns retries a 409. While I agree that the Cloud DNS API shouldn't return a 5XX on a successful batch, that is an orthogonal issue on which we are also working with GCP. Let's set that aside. Per your own comment, you have a 409 listed in
So why is external-dns currently retrying a 409? And why would handling this in the Google SDK be more appropriate? Additionally, there are no partial batch successes per the Cloud DNS engineer. So if a batch gets a 409 on a retry attempt, then all records within the batch would be a conflict. There's no valid way to retry the same request with different input as a result -- they'd all 409. We could retry smaller batches by breaking up the larger batch, but per the Cloud DNS engineer, these would all fail anyway with a 409 and would just delay the inevitable. To your point, though, perhaps that would handle the case in which a 409 is encountered without the previously successful batch (i.e., a genuine conflict where multiple sources are trying to claim and create the same DNS record), so I would be fine with this approach as there would be a finite number of retries and external-dns would eventually move on.
There's no misconfiguration here, and I don't agree with the suggestion that external-dns is trying to "guess user intent".
Do the APIs within each provider similarly return 409s? Each provider is inherently different as they all have different API contracts and behaviors, so why should we avoid provider-specific behavior at the individual provider layer? |
|
This 409 conflict should not be handled by the Google SDK (in fact, this error should be treated as non-retryable by the SDK), and external-dns should not handle it in some specific way either. This is not a software problem. It's a state problem: an environmental inconsistency, aka broken state. No matter how external-dns behaves, the state will still be broken. This translates to:
^ This is normal operational hygiene, not a tooling failure. We don’t have a defined conflict-resolution model today. Right now external-dns:
But when conflicts exist there is no policy and no contract for:
If we don’t know whether external-dns should drop conflicting records, reject the whole plan, apply a subset, or block forever, then choosing behaviour via error-handling side effects (crash vs retry vs skip) for a specific provider is not the ideal option. If it is not clear which records are conflicting, add a metric and warning logs to make it visible for users. The other reviewers are extremely busy. We’ll see what their view is, but it may take a while. |
|
I would propose that if a batch fails due to a 409, the batch is divided into 2 batches and retried, continuing until there's a batch of 1 record; if that still 409s, it is logged and dropped.
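The bisection proposal above can be sketched as follows. All names here are hypothetical (not external-dns code), and `errConflict` stands in for an HTTP 409 from the Cloud DNS API: on a conflict, the batch is split in half and each half retried, until a single-record batch that still conflicts is logged and dropped.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for an HTTP 409 Conflict from the API (hypothetical sentinel).
var errConflict = errors.New("googleapi: Error 409: Conflict")

// submitWithBisect submits a batch of record changes; on a 409 it splits
// the batch in two and retries each half recursively. A single-record
// batch that still conflicts is logged and dropped, so the rest of the
// batch is not blocked by one conflicting record.
func submitWithBisect(batch []string, submit func([]string) error) (dropped []string) {
	if len(batch) == 0 {
		return nil
	}
	err := submit(batch)
	if err == nil {
		return nil
	}
	if !errors.Is(err, errConflict) {
		// Non-409 errors keep the existing retry semantics (out of scope here).
		return nil
	}
	if len(batch) == 1 {
		fmt.Printf("dropping conflicting record: %s\n", batch[0])
		return batch
	}
	mid := len(batch) / 2
	dropped = append(dropped, submitWithBisect(batch[:mid], submit)...)
	dropped = append(dropped, submitWithBisect(batch[mid:], submit)...)
	return dropped
}

func main() {
	// Hypothetical conflict set: "b" already exists upstream.
	conflicting := map[string]bool{"b": true}
	submit := func(batch []string) error {
		for _, r := range batch {
			if conflicting[r] {
				return errConflict
			}
		}
		return nil
	}
	fmt.Println("dropped:", submitWithBisect([]string{"a", "b", "c", "d"}, submit))
}
```

For a batch of n records this costs O(log n) extra API calls per conflicting record, and - as noted in the discussion - if the whole batch genuinely conflicts, every subdivision also 409s, so the retries only delay the same outcome.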
What does it do?

HTTP 409 (Conflict) errors from Google Cloud DNS indicate permanent state conflicts (e.g., resource already exists) that won't resolve by retrying. Previously, all API errors were wrapped as SoftErrors, causing the controller to retry indefinitely. This change adds detection for 409 errors and returns them as regular errors (not SoftErrors), preventing unnecessary retry loops while maintaining retry behavior for other transient errors.

Motivation
We have a high-volume use case in which we submit large batches of DNS record changes in short amounts of time from several different clients, and occasionally overload the Google Cloud DNS API servers (or more accurately, sub-services that their API relies on). When this occurs, we receive a vague 502 error despite the batch being successful, followed by endless 409 (Conflict) errors that block all remaining batches indefinitely, requiring a good deal of manual cleanup and potentially hours of customer impact.
After engaging the Cloud DNS engineering team on multiple occasions, we were provided the guidance that: