
feat: add --kube-api-cache-sync-timeout flag for configurable cache sync timeout#6104

Closed
AndrewCharlesHay wants to merge 1 commit into kubernetes-sigs:master from AndrewCharlesHay:feat/configurable-timeout

Conversation

@AndrewCharlesHay
Contributor

@AndrewCharlesHay AndrewCharlesHay commented Jan 11, 2026

Summary

Add a new --kube-api-cache-sync-timeout flag (default: 60s) to configure the timeout for Kubernetes informer cache sync operations during startup. This applies to all informer-based sources and the CRD source.

Fixes #6091 #5636

Changes

  • Add CacheSyncTimeout field to Config, plumb through to all sources
  • Add timeout parameter to informers.WaitForCacheSync
  • Add cache sync wait to CRD source (previously had none)
  • Remove pre-Start() WaitForCacheSync calls in service source (bug: was waiting before factory started)
  • Export DefaultCacheSyncTimeout constant for shared use
  • Values <= 0 fall back to the default (60s) — no infinite wait allowed
  • --request-timeout remains unchanged for HTTP client requests

Notes

  • The flag name --kube-api-cache-sync-timeout avoids exposing informer internals to users
  • Fail-fast behavior is preserved: if cache sync times out, the source returns an error immediately

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 11, 2026
@ivankatliarchuk
Member

Have you been able to reproduce this issue or observe it in practice? We run clusters with a fairly large amount of resources and haven’t been able to hit any cache-sync limits. Something like 100k+ pods just take 2-5 seconds.

In our experience, when this fails it’s usually due to RBAC issues or interference from other controllers, such as Gateway API controllers. Increasing this number typically doesn’t solve the underlying problem.

Personally, I’d lean toward making this configurable, or even removing it entirely, but I see it more as a signal that something is wrong (for example a regression in controller-runtime or apimachinery) rather than a capacity issue.

@k8s-ci-robot k8s-ci-robot added provider Issues or PRs related to a provider size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 12, 2026
@coveralls

coveralls commented Jan 12, 2026

Coverage Report for PR #6104

Coverage increased (+0.01%) to 80.536%

Diff Coverage: No coverable lines changed

Coverage Regressions

47 previously-covered lines in 3 files lost coverage.

File                     Lines lost   Coverage
informers/informers.go   2            90.0%
crd.go                   8            89.12%
store.go                 37           65.18%

Coverage Status
Change from base Build 24036944238: 0.01%
Covered Lines: 17167
Relevant Lines: 21316

💛 - Coveralls

@AndrewCharlesHay
Contributor Author

@ivankatliarchuk Fair point. I've looked into RBAC and didn't see anything wrong; I'll keep digging. Do you want to keep this PR as a feature, or should I close it?

@ivankatliarchuk
Member

How to reproduce the condition you are facing? How many resources are in the cluster? Which source is timing out?

@AndrewCharlesHay
Contributor Author

AndrewCharlesHay commented Jan 12, 2026

@ivankatliarchuk

  1. Reproduction & Symptoms:
    I am observing this consistently on v0.20.0. The pod enters a CrashLoopBackOff and terminates after exactly 60 seconds of runtime (e.g., Start: 10:25:48, Finish: 10:26:48). This aligns perfectly with the hardcoded time.Minute timeout for WaitForCacheSync in informers.go. Despite us setting --request-timeout=2m, the flag is ignored for the sync operation in the current release.
  2. Resource Scale:
    My cluster is relatively small (~600 resources total). This suggests the bottleneck is not object volume but rather API server latency or network conditions in our environment.
  3. Source of Timeout:
    The timeout occurs during the initialization of the shared informers used by the service and ingress sources.
  4. RBAC Investigation:
    I verified that permissions are not the cause. The ServiceAccount has full [get, watch, list] access to all required resources, and there are no 403 Forbidden errors in the logs.

Regardless of the specific root cause in our environment, I think making the timeout configurable is a valuable improvement. It aligns with the intent of the existing --request-timeout flag and provides an escape hatch for users facing similar latency issues without forcing a hardcoded limit.

@ivankatliarchuk
Member

Gotcha. So I see a few issues here:

  1. The request-timeout flag is specifically for API requests, while the error this PR addresses is about informer sync operations timing out, i.e., the internal watch/sync loop rather than a single HTTP request.
  2. Based on the explanation here and in issue Bug: Cloudflare Provider crashes with "Identical record already exists" (81058) when using Region Key #6091 (comment), pods crashing in constant loops from informer resyncs can indeed risk exhausting Kubernetes API resources through repeated LIST/WATCH calls, and increasing the informer sync timeout would exacerbate that if the underlying crash trigger persists: an infinite loop and a DoS.
  • Each ExternalDNS pod start triggers an informer cache sync, which issues full LIST calls to the Kubernetes API for resources like EndpointSlices/Ingresses; if the sync handler hits a fatal error (like the Cloudflare 81058 conflict), the process exits immediately, the kubelet restarts it, and the cycle repeats.
  • The Kubernetes API server has rate limits on LIST/WATCH; rapid pod restarts create a feedback loop of LIST requests that can hit those limits, causing further informer sync failures.
  3. The documentation advises things I don't fully agree with, based on my previous comments.

@ivankatliarchuk
Member

If we're going to increase this timeout without fixing the root cause: when crashes happen during sync (not single requests), a longer informer timeout means each pod lives slightly longer but makes more API calls before crashing, worsening the DoS effect you describe.

Preventive measures:

  • Set restartPolicy/backoffLimit in the Deployment to avoid restart loops;
  • use client-side (kubeconfig) rate limiting or cluster API priority/quotas.

@ivankatliarchuk
Member

We could still have a flag for cache resync; the problem is that it's very difficult to use correctly. Increasing it will not necessarily solve the issue, but will hide the root cause and do more harm; decreasing it, nobody knows what will happen.

I'm not sure what others consider, but in my view there should be a soft error. Issues:

  1. Don't crash when cloudflare returns 502 #5225
  2. Cloudflare 5XX responses crash pod #4876

Not related to Cloudflare, but here is why the more DR-ready approach is to log and retry instead of crashing (soft vs. hard errors): #5794 (comment)

@k8s-ci-robot k8s-ci-robot added apis Issues or PRs related to API change size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 13, 2026
@AndrewCharlesHay
Contributor Author

@ivankatliarchuk Thank you for the thorough review. You were right about the cause.

The actual issue was a missing RBAC permission for discovery.k8s.io/endpointslices (required for EKS 1.33+ where Endpoints API is deprecated). Adding this fixed the CrashLoopBackOff immediately.

Changes made based on your feedback:

Renamed the flag from --request-timeout to --informer-sync-timeout to accurately reflect what it controls (informer cache sync, not HTTP requests). The old flag is kept for backward compatibility but marked deprecated.

Implemented soft error handling - WaitForCacheSync now logs warnings and continues instead of returning errors. This prevents the crash loop → API server DoS feedback loop you described. ExternalDNS will operate with potentially stale data rather than repeatedly crashing.

Updated documentation with stronger caveats:

  • Added a prominent warning to investigate RBAC, network, and API server health first
  • Included an RBAC example showing the endpointslices permission
  • Clarified this is a last resort, not a first fix

The soft error approach aligns with your suggestion to "log and try instead of crash"; it's more DR-ready and prevents the LIST/WATCH flood that occurs during crash loops.

//
// The function returns nil to allow the application to continue operating with potentially
// stale cache data, which is preferable to crashing repeatedly.
func WaitForCacheSync(ctx context.Context, factory informerFactory, timeout time.Duration) error {
Member

@ivankatliarchuk ivankatliarchuk Jan 13, 2026


These changes are not right and are quite a big behaviour change.

The external-dns should crash when the informer fails, as this logic is invoked from the constructor, aka fail fast. Operators must tune backoffLimit to find the right balance between crashing and giving up.

The Cloudflare API is what should soft-error, not the informer.

Comment thread docs/advanced/informer-sync-timeout.md Outdated
The default value is `60s`. Setting the value to `0s` uses the default timeout.

!!! note "Flag Deprecation"
The `--request-timeout` flag is deprecated. Use `--informer-sync-timeout` instead, as it more accurately
Member


Why is the --request-timeout flag marked deprecated? It has its use cases.

Contributor Author


My bad, I got confused about what it was used for. Do you want me to just close this PR, or would you find being able to set the timeout a useful feature?

Member


There were 2 requests, and there could be more:

  1. Informer timed out on 60s #5636
  2. Add flag for setting the cache sync timeout #2999

I’ll leave the decision to you and can review the code if the other reviewers are comfortable with the change. I don’t have a strong opinion on it.

Both issues mention informer timeouts, but the root cause is usually something else, at least for the cases in those issues.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 16, 2026
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from e318372 to f3f4c13 Compare March 24, 2026 00:56
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ivankatliarchuk. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 24, 2026
@AndrewCharlesHay
Contributor Author

Friendly ping @ivankatliarchuk — rebased and addressed the feedback from your review. Would appreciate another look when you have time!

@ivankatliarchuk
Member

Aloha!

I think you need to:

  • change the title and description to match the changes
  • fix a bug in your code; it's worth executing against a cluster, and you'll be able to see it ;-)
  • rename the flag to something like --kube-api-<something>
  • handle the crd source as well; it does not support the same pattern yet, so you could try to implement it or just add flag support directly

A flag named after "informer" is too specific, and users may not even know what that beast is.

You also need to think about documentation and common-sense use cases. Just a heads up: I'll try to share statistics for a quite large cluster, where cache sync takes milliseconds; if it takes seconds, we tune the kube API.

On my phone, will try to do a deeper review tomorrow.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2026
@AndrewCharlesHay AndrewCharlesHay changed the title feat: make RequestTimeout configurable for all sources feat: add --kube-api-cache-sync-timeout flag for configurable cache sync timeout Mar 31, 2026
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 31, 2026
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from c50955a to 85c6644 Compare March 31, 2026 19:42
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2026
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from 85c6644 to e72f4cc Compare March 31, 2026 20:07
Comment thread docs/flags.md Outdated
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from e72f4cc to ac63669 Compare April 1, 2026 12:23
@AndrewCharlesHay
Contributor Author

Rebased on master and addressed your feedback:

  1. Flag renamed to --kube-api-cache-sync-timeout (fits the --kube-api-* family alongside --kube-api-qps and --kube-api-burst)
  2. Timeout is now always enforced: values <= 0 fall back to the default (60s) instead of allowing an indefinite hang. Exported informers.DefaultCacheSyncTimeout as a shared constant.
  3. CRD source now uses the same flag via startAndSync, completing parity across all source types
  4. PR title and description updated to match the current state

Thanks again for the thorough review — you were right about the RBAC root cause in my original case (missing discovery.k8s.io/endpointslices on EKS 1.33+).

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 6, 2026
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from ac63669 to 10c6b85 Compare April 6, 2026 18:11
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 6, 2026
…ync timeout

Add a new --kube-api-cache-sync-timeout flag (default: 60s) to configure
the timeout for Kubernetes informer cache sync operations during startup.
This applies to all informer-based sources and the CRD source.

Values <= 0 fall back to the default (60s). The --request-timeout flag
remains unchanged for HTTP client requests.

Changes:
- Add CacheSyncTimeout field to Config, plumb through to all sources
- Add timeout parameter to informers.WaitForCacheSync
- Add cache sync wait to CRD source (previously had none)
- Remove pre-Start() WaitForCacheSync calls in service source (bug fix)
- Export DefaultCacheSyncTimeout constant for shared use

Signed-off-by: Andrew Hay <andrew.hay@benchmarkanalytics.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AndrewCharlesHay AndrewCharlesHay force-pushed the feat/configurable-timeout branch from 10c6b85 to 12ca27b Compare April 10, 2026 18:23
@linux-foundation-easycla

CLA Missing ID CLA Not Signed

One or more co-authors of this pull request were not found. You must specify co-authors in commit message trailer via:

Co-authored-by: name <email>

Supported Co-authored-by: formats include:

  1. Anything <id+login@users.noreply.github.com> - it will locate your GitHub user by the id part.
  2. Anything <login@users.noreply.github.com> - it will locate your GitHub user by the login part.
  3. Anything <public-email> - it will locate your GitHub user by public-email part. Note that this email must be made public on Github.
  4. Anything <other-email> - it will locate your GitHub user by other-email part but only if that email was used before for any other CLA as a main commit author.
  5. login <any-valid-email> - it will locate your GitHub user by login part, note that login part must be at least 3 characters long.

Alternatively, if the co-author should not be included, remove the Co-authored-by: line from the commit message.

Please update your commit message(s) by doing git commit --amend and then git push [--force] and then request re-running CLA check via commenting on this pull request:

/easycla

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2026
@k8s-ci-robot
Contributor

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 10, 2026
@k8s-ci-robot
Contributor

@AndrewCharlesHay: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-external-dns-licensecheck 12ca27b link true /test pull-external-dns-licensecheck
pull-external-dns-unit-test 12ca27b link true /test pull-external-dns-unit-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 10, 2026
@AndrewCharlesHay
Contributor Author

Closing in favor of a rebased PR on the latest master with a cleaner diff (22 files instead of 75). Addresses all review feedback.
