feat: add --kube-api-cache-sync-timeout flag for configurable cache sync timeout (#6104)
Conversation
Have you been able to reproduce this issue or observe it in practice? We run clusters with a fairly large amount of resources and haven't been able to hit any cache-sync limits. Something like 100k+ pods takes just 2-5 seconds. In our experience, when this fails it's usually due to RBAC issues or interference from other controllers, such as Gateway API controllers. Increasing this number typically doesn't solve the underlying problem. Personally, I'd lean toward making this configurable, or even removing it entirely, but I see it more as a signal that something is wrong, for example a regression in controller-runtime or apimachinery, rather than a capacity issue.
Coverage Report for PR #6104
Coverage increased (+0.01%) to 80.536%
Diff Coverage: No coverable lines changed
Coverage Regressions: 47 previously-covered lines in 3 files lost coverage.
💛 - Coveralls
@ivankatliarchuk Fair point. I've looked into RBAC and didn't see anything wrong. I'll look some more. Do you want to keep this PR just as a feature, or close it?
Force-pushed 0e090f5 to bf43904
How can we reproduce the condition you are facing? How many resources are in the cluster? Which source is timing out?
Gotcha. So I see a few issues here:
If we increase this timeout without fixing the root cause: when crashes happen during sync (not on single requests), a longer informer timeout means each pod lives slightly longer but makes more API calls before crashing, worsening the DoS effect you describe. The preventive measures
We could still have a flag for cache resync; the problem is that it's very difficult to use correctly. Increasing it will not necessarily solve the issue, but will hide the root cause and do more harm; decreasing it, nobody knows what will happen. I'm not sure what others think, but in my view there should be a soft error. Not related to Cloudflare, but a more DR-ready approach is to log and retry instead of crashing, aka soft vs hard errors. See #5794 (comment).
@ivankatliarchuk Thank you for the thorough review. You were right about the cause: the actual issue was a missing RBAC permission for discovery.k8s.io/endpointslices (required for EKS 1.33+, where the Endpoints API is deprecated). Adding this fixed the CrashLoopBackOff immediately. Changes made based on your feedback:

- Renamed the flag from
- Implemented soft error handling: WaitForCacheSync now logs warnings and continues instead of returning errors. This prevents the crash loop → API server DoS feedback loop you described. ExternalDNS will operate with potentially stale data rather than repeatedly crashing.
- Updated documentation with stronger caveats: added a prominent warning to investigate RBAC, network, and API server health first.
```go
// ...
// The function returns nil to allow the application to continue operating with potentially
// stale cache data, which is preferable to crashing repeatedly.
func WaitForCacheSync(ctx context.Context, factory informerFactory, timeout time.Duration) error {
```
These changes are not right, and this is quite a big behaviour change.
external-dns should crash when the informer fails, as this logic is invoked from the constructor (aka fail fast). Operators must tune backoffLimit to find the right balance between crashing and giving up.
The Cloudflare API should soft-error, not the informer.
The default value is `60s`. Setting the value to `0s` uses the default timeout.
!!! note "Flag Deprecation"
    The `--request-timeout` flag is deprecated. Use `--informer-sync-timeout` instead, as it more accurately
Why is the flag `--request-timeout` marked deprecated? It has its use cases.
My bad, I got confused about what it was used for. Do you want me to just close this PR, or would being able to set the timeout be a useful feature?
There were 2 requests, could be more
I’ll leave the decision to you and can review the code if the other reviewers are comfortable with the change. I don’t have a strong opinion on it.
Both issues mention informer timeouts, but the root cause is usually something else, at least for the cases in the issues.
Force-pushed e318372 to f3f4c13
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Friendly ping @ivankatliarchuk: rebased and addressed the feedback from your review. Would appreciate another look when you have time!
Aloha! I think you need
A flag named after informers is too specific, and users may not even know what that beast is. You need to think about documentation and common-sense use cases. Just a heads up: I'll try to share statistics for a quite large cluster, where cache sync takes milliseconds; if it takes seconds, we tune the kube API. On my phone, will try to do a deeper review tomorrow.
Force-pushed c50955a to 85c6644
Force-pushed 85c6644 to e72f4cc
Force-pushed e72f4cc to ac63669
Rebased on master and addressed your feedback:
Thanks again for the thorough review; you were right about the RBAC root cause in my original case (missing
Force-pushed ac63669 to 10c6b85
…ync timeout

Add a new --kube-api-cache-sync-timeout flag (default: 60s) to configure the timeout for Kubernetes informer cache sync operations during startup. This applies to all informer-based sources and the CRD source. Values <= 0 fall back to the default (60s). The --request-timeout flag remains unchanged for HTTP client requests.

Changes:
- Add CacheSyncTimeout field to Config, plumb through to all sources
- Add timeout parameter to informers.WaitForCacheSync
- Add cache sync wait to CRD source (previously had none)
- Remove pre-Start() WaitForCacheSync calls in service source (bug fix)
- Export DefaultCacheSyncTimeout constant for shared use

Signed-off-by: Andrew Hay <andrew.hay@benchmarkanalytics.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed 10c6b85 to 12ca27b
One or more co-authors of this pull request were not found. You must specify co-authors in a commit message trailer via: Supported
Alternatively, if the co-author should not be included, remove the
Please update your commit message(s) by doing
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@AndrewCharlesHay: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Closing in favor of a rebased PR on the latest master with a cleaner diff (22 files instead of 75). Addresses all review feedback.
Summary
Add a new `--kube-api-cache-sync-timeout` flag (default: 60s) to configure the timeout for Kubernetes informer cache sync operations during startup. This applies to all informer-based sources and the CRD source.

Fixes #6091 #5636
Changes
- Add `CacheSyncTimeout` field to Config, plumb through to all sources
- Add timeout parameter to `informers.WaitForCacheSync`
- Remove pre-`Start()` `WaitForCacheSync` calls in service source (bug: was waiting before factory started)
- Export `DefaultCacheSyncTimeout` constant for shared use
- `--request-timeout` remains unchanged for HTTP client requests

Notes

- `--kube-api-cache-sync-timeout` avoids exposing informer internals to users