
feat: add --kube-api-cache-sync-timeout flag for configurable cache sync timeout#6104

Closed
AndrewCharlesHay wants to merge 1 commit into kubernetes-sigs:master from AndrewCharlesHay:feat/configurable-timeout

Conversation

@AndrewCharlesHay
Contributor

@AndrewCharlesHay AndrewCharlesHay commented Jan 11, 2026

Summary

Add a new --kube-api-cache-sync-timeout flag (default: 60s) to configure the timeout for Kubernetes informer cache sync operations during startup. This applies to all informer-based sources and the CRD source.

Fixes #6091 #5636

Changes

  • Add CacheSyncTimeout field to Config, plumb through to all sources
  • Add timeout parameter to informers.WaitForCacheSync
  • Add cache sync wait to CRD source (previously had none)
  • Remove pre-Start() WaitForCacheSync calls in service source (bug: was waiting before factory started)
  • Export DefaultCacheSyncTimeout constant for shared use
  • Values <= 0 fall back to the default (60s) — no infinite wait allowed
  • --request-timeout remains unchanged for HTTP client requests

Notes

  • The flag name --kube-api-cache-sync-timeout avoids exposing informer internals to users
  • Fail-fast behavior is preserved: if cache sync times out, the source returns an error immediately

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 11, 2026
@ivankatliarchuk
Member

Have you been able to reproduce this issue or observe it in practice? We run clusters with a fairly large amount of resources and haven’t been able to hit any cache-sync limits. Something like 100k+ pods just take 2-5 seconds.

In our experience, when this fails it’s usually due to RBAC issues or interference from other controllers, such as Gateway API controllers. Increasing this number typically doesn’t solve the underlying problem.

Personally, I’d lean toward making this configurable, or even removing it entirely, but I see it more as a signal that something is wrong (for example a regression in controller-runtime or apimachinery) rather than a capacity issue.

@k8s-ci-robot k8s-ci-robot added provider Issues or PRs related to a provider size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 12, 2026
@coveralls

coveralls commented Jan 12, 2026

Coverage Report for PR #6104

Coverage increased (+0.01%) to 80.536%

Diff Coverage: No coverable lines changed

Coverage Regressions

47 previously-covered lines in 3 files lost coverage.

File                     Lines lost   Coverage
informers/informers.go   2            90.0%
crd.go                   8            89.12%
store.go                 37           65.18%

Coverage Status
Change from base Build 24036944238: 0.01%
Covered Lines: 17167
Relevant Lines: 21316

💛 - Coveralls

@AndrewCharlesHay
Contributor Author

@ivankatliarchuk Fair point. I've looked into RBAC and didn't see anything wrong; I'll keep digging. Do you want to keep this PR as a feature, or should I close it?

@ivankatliarchuk
Member

How to reproduce the condition you are facing? How many resources are in the cluster? Which source is timing out?

@AndrewCharlesHay
Contributor Author

AndrewCharlesHay commented Jan 12, 2026

@ivankatliarchuk

  1. Reproduction & Symptoms:
    I am observing this consistently on v0.20.0. The pod enters a CrashLoopBackOff and terminates after exactly 60 seconds of runtime (e.g., Start: 10:25:48, Finish: 10:26:48). This aligns perfectly with the hardcoded time.Minute timeout for WaitForCacheSync in informers.go. Despite us setting --request-timeout=2m, the flag is ignored for the sync operation in the current release.
  2. Resource Scale:
    My cluster is relatively small (~600 resources total). This suggests the bottleneck is not object volume but rather API server latency or network conditions in our environment.
  3. Source of Timeout:
    The timeout occurs during the initialization of the shared informers used by the service and ingress sources.
  4. RBAC Investigation:
    I verified that permissions are not the cause. The ServiceAccount has full [get, watch, list] access to all required resources, and there are no 403 Forbidden errors in the logs.

Regardless of the specific root cause in our environment, I think making the timeout configurable is a valuable improvement. It aligns with the intent of the existing --request-timeout flag and provides an escape hatch for users facing similar latency issues without forcing a hardcoded limit.

@ivankatliarchuk
Member

Gotcha. So I see a few issues here:

  1. The request-timeout flag is specifically for API requests, while the error this PR addresses is about informer sync operations timing out, i.e., the internal watch/sync loop rather than a single HTTP request.
  2. Based on the explanation here and in issue Bug: Cloudflare Provider crashes with "Identical record already exists" (81058) when using Region Key #6091 (comment), pods crashing in constant loops from informer resyncs can indeed risk exhausting Kubernetes API resources through repeated LIST/WATCH calls, and increasing the informer sync timeout would exacerbate that if the underlying crash trigger persists: an infinite loop and a DoS.
  • Each ExternalDNS pod start triggers an informer cache sync, which issues full LIST calls to the Kubernetes API for resources like EndpointSlices/Ingresses; if the sync handler hits a fatal error (like the Cloudflare 81058 conflict), the process exits immediately, the kubelet restarts it, and the cycle repeats.
  • The Kubernetes API server has rate limits on LIST/WATCH; rapid pod restarts create a feedback loop of LIST requests that can hit those limits, causing further informer sync failures.
  3. The documentation advises things I don't fully agree with, based on my previous comments.

@ivankatliarchuk
Member

If we're going to increase this timeout without fixing the root cause: when crashes happen during sync (not single requests), a longer informer timeout means each pod lives slightly longer but makes more API calls before crashing, worsening the DoS effect you describe.

Preventive measures:

  • Set restartPolicy/backoffLimit in the Deployment to avoid restart loops;
  • use client-side (kubeconfig) rate limiting or cluster API priority/quotas.

@ivankatliarchuk
Member

We could still have a flag for cache resync; the problem is that it's very difficult to use correctly. Increasing it will not necessarily solve the issue, but will hide the root cause and do more harm; decreasing it, nobody knows what will happen.

I'm not sure what others consider, but in my view there should be a soft error. Issues:

  1. Don't crash when cloudflare returns 502 #5225
  2. Cloudflare 5XX responses crash pod #4876

Not related to Cloudflare, but here is why the more DR-ready approach is to log and retry instead of crashing (soft vs. hard errors): #5794 (comment)

@k8s-ci-robot k8s-ci-robot added apis Issues or PRs related to API change size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 13, 2026
@AndrewCharlesHay
Contributor Author

@ivankatliarchuk Thank you for the thorough review. You were right about the cause.

The actual issue was a missing RBAC permission for discovery.k8s.io/endpointslices (required for EKS 1.33+ where Endpoints API is deprecated). Adding this fixed the CrashLoopBackOff immediately.

Changes made based on your feedback:

Renamed the flag from --request-timeout to --informer-sync-timeout to accurately reflect what it controls (informer cache sync, not HTTP requests). The old flag is kept for backward compatibility but marked deprecated.

Implemented soft error handling - WaitForCacheSync now logs warnings and continues instead of returning errors. This prevents the crash loop → API server DoS feedback loop you described. ExternalDNS will operate with potentially stale data rather than repeatedly crashing.

Updated documentation with stronger caveats:

  • Added a prominent warning to investigate RBAC, network, and API server health first
  • Included an RBAC example showing the endpointslices permission
  • Clarified this is a last resort, not a first fix

The soft error approach aligns with your suggestion to "log and try instead of crash"; it's more DR-ready and prevents the LIST/WATCH flood that occurs during crash loops.

//
// The function returns nil to allow the application to continue operating with potentially
// stale cache data, which is preferable to crashing repeatedly.
func WaitForCacheSync(ctx context.Context, factory informerFactory, timeout time.Duration) error {
Member

@ivankatliarchuk ivankatliarchuk Jan 13, 2026


These changes are not right and are quite a big behaviour change.

The external-dns should crash when the informer fails, as this logic is invoked from the constructor, aka fail fast. Operators must tune backoffLimit to find the right balance between crashing and giving up.

The Cloudflare API is what should soft-error, not the informer.

Comment thread docs/advanced/informer-sync-timeout.md Outdated
The default value is `60s`. Setting the value to `0s` uses the default timeout.

!!! note "Flag Deprecation"
The `--request-timeout` flag is deprecated. Use `--informer-sync-timeout` instead, as it more accurately
Member


Why is the --request-timeout flag marked deprecated? It has its use cases.

Contributor Author


My bad, I got confused about what it was used for. Do you want me to just close this PR, or would you find being able to set the timeout a useful feature?

Member


There were 2 requests, and there could be more:

  1. Informer timed out on 60s #5636
  2. Add flag for setting the cache sync timeout #2999

I’ll leave the decision to you and can review the code if the other reviewers are comfortable with the change. I don’t have a strong opinion on it.

Both issues mention informer timeouts, but the root cause is usually something else, at least for the cases in those issues.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 16, 2026
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from e318372 to f3f4c13 Compare March 24, 2026 00:56
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ivankatliarchuk. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 24, 2026
@AndrewCharlesHay
Contributor Author

Friendly ping @ivankatliarchuk — rebased and addressed the feedback from your review. Would appreciate another look when you have time!

@ivankatliarchuk
Member

Aloha!

I think you need to:

  • change the title and description to match the changes
  • fix a bug in your code; it's worth executing against a cluster, and you'll be able to see it ;-)
  • rename the flag to something like --kube-api-<something>
  • handle the crd source as well; it does not support the same pattern yet, so you could try to implement it or just add flag support directly

A flag named after "informer" is too specific, and users may not even know what that beast is.

You also need to think about documentation and common-sense use cases. Just a heads up: I'll try to share statistics for a quite large cluster, where cache sync takes milliseconds; if it takes seconds, we tune the kube API.

On my phone, will try to do a deeper review tomorrow.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2026
@AndrewCharlesHay AndrewCharlesHay changed the title feat: make RequestTimeout configurable for all sources feat: add --kube-api-cache-sync-timeout flag for configurable cache sync timeout Mar 31, 2026
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 31, 2026
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from c50955a to 85c6644 Compare March 31, 2026 19:42
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2026
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from 85c6644 to e72f4cc Compare March 31, 2026 20:07
Comment thread docs/flags.md Outdated
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from e72f4cc to ac63669 Compare April 1, 2026 12:23
@AndrewCharlesHay
Contributor Author

Rebased on master and addressed your feedback:

  1. Flag renamed to --kube-api-cache-sync-timeout (fits the --kube-api-* family alongside --kube-api-qps and --kube-api-burst)
  2. Timeout is now always enforced: values <= 0 fall back to the default (60s) instead of allowing an indefinite hang. Exported informers.DefaultCacheSyncTimeout as a shared constant.
  3. CRD source now uses the same flag via startAndSync, completing parity across all source types
  4. PR title and description updated to match the current state

Thanks again for the thorough review — you were right about the RBAC root cause in my original case (missing discovery.k8s.io/endpointslices on EKS 1.33+).

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 6, 2026
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from ac63669 to 10c6b85 Compare April 6, 2026 18:11
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 6, 2026
…ync timeout

Add a new --kube-api-cache-sync-timeout flag (default: 60s) to configure
the timeout for Kubernetes informer cache sync operations during startup.
This applies to all informer-based sources and the CRD source.

Values <= 0 fall back to the default (60s). The --request-timeout flag
remains unchanged for HTTP client requests.

Changes:
- Add CacheSyncTimeout field to Config, plumb through to all sources
- Add timeout parameter to informers.WaitForCacheSync
- Add cache sync wait to CRD source (previously had none)
- Remove pre-Start() WaitForCacheSync calls in service source (bug fix)
- Export DefaultCacheSyncTimeout constant for shared use

Signed-off-by: Andrew Hay <andrew.hay@benchmarkanalytics.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from 10c6b85 to 12ca27b Compare April 10, 2026 18:23
@linux-foundation-easycla

CLA Missing ID CLA Not Signed

One or more co-authors of this pull request were not found. You must specify co-authors in commit message trailer via:

Co-authored-by: name <email>

Supported Co-authored-by: formats include:

  1. Anything <id+login@users.noreply.github.com> - it will locate your GitHub user by the id part.
  2. Anything <login@users.noreply.github.com> - it will locate your GitHub user by the login part.
  3. Anything <public-email> - it will locate your GitHub user by public-email part. Note that this email must be made public on Github.
  4. Anything <other-email> - it will locate your GitHub user by other-email part but only if that email was used before for any other CLA as a main commit author.
  5. login <any-valid-email> - it will locate your GitHub user by login part, note that login part must be at least 3 characters long.

Alternatively, if the co-author should not be included, remove the Co-authored-by: line from the commit message.

Please update your commit message(s) by doing git commit --amend and then git push [--force] and then request re-running CLA check via commenting on this pull request:

/easycla

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2026
@k8s-ci-robot
Contributor

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 10, 2026
@k8s-ci-robot
Contributor

@AndrewCharlesHay: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-external-dns-licensecheck 12ca27b link true /test pull-external-dns-licensecheck
pull-external-dns-unit-test 12ca27b link true /test pull-external-dns-unit-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 10, 2026
@AndrewCharlesHay
Contributor Author

Closing in favor of a rebased PR on the latest master with a cleaner diff (22 files instead of 75). Addresses all review feedback.
