Add cache at provider level #3402
Conversation
Force-pushed from 8f68fca to a891979 (Compare)
@tjamet do you want to rebase it?
/ok-to-test
provider/cached_provider.go
Outdated
log.Info("Records cache provider: refreshing records list cache") | ||
c.cache, c.err = c.Provider.Records(ctx) | ||
if c.err != nil { | ||
log.Errorf("Records cache provider: list records failed: %v", c.err) |
Shouldn't both log and return an error.
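For concreteness, a minimal sketch of that suggestion against the hunk above, keeping the c.err field as it exists at this point in the review (an illustration only, not necessarily the code the PR ended up with):

```go
// Sketch only: keep the log line, but also surface the failure to the caller
// instead of silently falling through with a stale (or nil) cache.
if c.err != nil {
	log.Errorf("Records cache provider: list records failed: %v", c.err)
	return nil, c.err
}
```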
provider/cached_provider.go
Outdated
		c.lastRead = time.Now()
		cachedRecordsCallsTotal.WithLabelValues("false").Inc()
	} else {
		log.Info("Records cache provider: using records list from cache")
This seems a bit chatty. Perhaps debug level would be better?
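A small sketch of the suggested change, assuming the logrus logger aliased as log that the file already uses:

```go
// Cache hits happen on every reconcile loop, so log them at debug level only;
// they then show up when external-dns runs with debug logging enabled.
log.Debug("Records cache provider: using records list from cache")
```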
provider/cached_provider.go
Outdated
func (c *CachedProvider) Records(ctx context.Context) ([]*endpoint.Endpoint, error) {
	if c.needRefresh() {
		log.Info("Records cache provider: refreshing records list cache")
		c.cache, c.err = c.Provider.Records(ctx)
There's actually no reason to store err in the CachedProvider. If this call returns a non-nil error, then it should just set c.cache = nil.
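Sketched out, the proposal looks roughly like this; RefreshDelay is a placeholder name for whatever TTL field the PR actually uses, and needRefresh is reconstructed from the calls visible in the diff:

```go
// Sketch only: no err field on the struct. A failed refresh clears the cache,
// and a nil cache counts as stale, so the next Records() call retries.
type CachedProvider struct {
	Provider
	RefreshDelay time.Duration
	lastRead     time.Time
	cache        []*endpoint.Endpoint
}

func (c *CachedProvider) needRefresh() bool {
	return c.cache == nil || time.Since(c.lastRead) > c.RefreshDelay
}
```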
This would be a really useful feature.
Really looking forward to having this feature.
The txt registry has a similar cache. It would make sense to pull this to a separate layer above the registry.
Hi!
**Description**

In the current implementation, DNS providers are called to list all records on every loop. This is expensive in terms of the number of requests to the provider and may result in being rate limited, as reported in #1293 and #3397.

In our case, we have approximately 20,000 records in our AWS Hosted Zone. The ListResourceRecordSets API call allows a maximum of 300 items per call. That requires 67 API calls per external-dns deployment during every sync period.

With this, we introduce an optional generic caching mechanism at the provider level that re-uses the latest known list of records for a given time. This avoids expensive Provider calls to list all records for each object modification that does not change the actual record (annotations, statuses, ingress routing, ...).

This introduces 2 trade-offs:

1. Any changes or corruption directly on the provider side will take longer to detect and to resolve, up to the cache time.
2. Any conflicting records injected into the DNS provider during the cache validity (for example, by a different external-dns instance) will cause the first iteration of the next reconcile loop to fail, and hence add a delay until the next retry.

**Checklist**

- [X] Unit tests updated
- [X] End user documentation updated

Change-Id: I0bdcfa994ac1b76acedb05d458a97c080284c5aa
Change-Id: Icaf1ffe34e75c320d4efbb428f831deb8784cd11
Change-Id: I3f4646cabd66216fd028fbae3adf68129a8a2cbf
Change-Id: I13da2aa28eef3e2c8e81b502321c4dc137087b2d
Force-pushed from 8dcbacb to 4f50112 (Compare)
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@szuecs I have rebased the code atop the head of master.
@johngmyers Thanks for the input. Putting the cache at a lower level increases the cache hit rate and further reduces the calls to the service provider. That said, I do agree the txt cache looks redundant here, and we could deprecate or remove it.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@tjamet: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@tjamet Do you think you can rebase this PR?
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Looking forward to this feature.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@tjamet - Any chance to revisit this PR? Anything I can do to assist?
Hi! Let me know if there is anything else I should take a look at.
/reopen
@tjamet: Failed to re-open PR: state cannot be changed. The add-provider-cache branch was force-pushed or recreated. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Description
In the current implementation, DNS providers are called to list all records on every loop. This is expensive in terms of the number of requests to the provider and may result in being rate limited, as reported in #1293 and #3397.
In our case, we have approximately 20,000 records in our AWS Hosted Zone. The ListResourceRecordSets API call allows a maximum of 300 items per call, so listing the zone requires ceil(20,000 / 300) = 67 API calls per external-dns deployment during every sync period.
With this, we introduce an optional generic caching mechanism at the provider level that re-uses the latest known list of records for a given time. This avoids expensive Provider calls to list all records for each object modification that does not change the actual record (annotations, statuses, ingress routing, ...).
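For readers who only want the shape of the change, here is a minimal sketch of such a provider-level cache, reconstructed from the snippets visible in the review threads above (CachedProvider, needRefresh, lastRead). RefreshDelay is a placeholder name for the TTL field, and the Prometheus counter and other details are omitted, so the actual provider/cached_provider.go in this PR may differ:

```go
package provider

import (
	"context"
	"time"

	log "github.com/sirupsen/logrus"

	"sigs.k8s.io/external-dns/endpoint"
)

// CachedProvider decorates any Provider and re-uses the last Records() result
// until RefreshDelay has elapsed, so reconcile loops triggered by changes that
// do not affect DNS records never hit the provider API.
type CachedProvider struct {
	Provider
	RefreshDelay time.Duration
	lastRead     time.Time
	cache        []*endpoint.Endpoint
}

// needRefresh treats a nil cache as stale, so a failed refresh is retried on
// the next call without storing the error.
func (c *CachedProvider) needRefresh() bool {
	return c.cache == nil || time.Since(c.lastRead) > c.RefreshDelay
}

func (c *CachedProvider) Records(ctx context.Context) ([]*endpoint.Endpoint, error) {
	if c.needRefresh() {
		log.Info("Records cache provider: refreshing records list cache")
		records, err := c.Provider.Records(ctx)
		if err != nil {
			c.cache = nil
			return nil, err
		}
		c.cache = records
		c.lastRead = time.Now()
		return c.cache, nil
	}
	log.Debug("Records cache provider: using records list from cache")
	return c.cache, nil
}
```

Wiring it up is then just a matter of wrapping the concrete provider, for example &CachedProvider{Provider: awsProvider, RefreshDelay: 30 * time.Second}; how this PR actually exposes the option (flag name, default cache time) is not shown in this sketch.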
This introduces 2 trade-offs:
1. Any changes or corruption directly on the provider side will take longer to detect and to resolve, up to the cache time.
2. Any conflicting records injected into the DNS provider during the cache validity (for example, by a different external-dns instance) will cause the first iteration of the next reconcile loop to fail, and hence add a delay until the next retry.
Checklist
- [X] Unit tests updated
- [X] End user documentation updated