added sleep time between paging ListResourceRecordSet #4611

Closed
wants to merge 6 commits into from

oferzern

Description

Added a paging interval between AWS ListResourceRecordSets calls.
This is a major issue for setups that run N clusters against a single AWS Route 53 account: paging through a few thousand records causes the rate limit to be hit every few seconds.

With this update I've added a new flag for an AWS paging interval. It gives better control over rate limits for ListResourceRecordSets, since nothing currently exists to mitigate rate limit issues caused by this call (whereas there are a few options for ListHostedZones).

Note: first time here, would love any feedback :)

Checklist

  • Unit tests updated
  • End user documentation updated

linux-foundation-easycla bot commented Jul 14, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign szuecs for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jul 14, 2024
@k8s-ci-robot
Contributor

Welcome @oferzern!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @oferzern. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 14, 2024
@oferzern
Author

/easycla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 14, 2024
@mloiseleur
Contributor

🤔 I'm wondering, why can't you use --interval?

@oferzern
Author

Hi @mloiseleur
I'm using --interval, but it only affects the hosted zone listing, and we see ListResourceRecordSets being called constantly.
With 3 zones (about 18k records in total) in the same account and 6 EKS clusters, these calls keep us at the API rate limit 24/7.
I believe spacing out the paging calls will let us keep short update intervals (since there's no record caching) without hitting the rate limits.
Just a few days ago, it took more than 30 minutes for external-dns to create a new record because of these rate limits.

Maybe I'm missing something?

@mloiseleur
Contributor

There are already many parameters in external-dns.

So, wdyt about finding out why ListResourceRecordSets is not using --interval and fixing that?

@oferzern
Author

Because, as I see it, the two calls (listing hosted zones and listing record sets) should be configured differently.
The current interval is only implemented for hosted zones, since they don't change often and it's reasonable to wait an hour or more for them to sync.

Records are different: new services are added every few minutes, so I believe ListResourceRecordSets is intentionally not affected by the interval flag.

Does that make sense?

@mloiseleur
Contributor

@Raffo @szuecs Wdyt ?

@Raffo
Contributor

Raffo commented Jul 26, 2024

I'm definitely not a fan of having a sleep, it's always a hack when we rely on something like this. I think either extending the current interval implementation or finding an entirely different solution would be better.

@oferzern
Author

> I'm definitely not a fan of having a sleep, it's always a hack when we rely on something like this. I think either extending the current interval implementation or finding an entirely different solution would be better.

@Raffo I agree that a sleep isn't ideal. But even with a separate interval for querying ListResourceRecordSets (e.g. PR #4597), each sync would still make several calls back-to-back (number of records / 300), hitting the rate limit again.
We have about 9k records, so each time ListResourceRecordSets is triggered it is called 28 times.
Since we see this across multiple clusters, I tried to come up with a solution that increases the time between paging calls.

If there's a more elegant idea or approach, I'd love to hear it and I'll work on it.
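A worked version of the page-count arithmetic in Go, under stated assumptions: each ListResourceRecordSets call returns at most 300 record sets (the divisor used in the thread), the ~18k records mentioned earlier are split evenly across the 3 zones (hypothetical), and the account is subject to Route 53's quota of five API requests per second per account (treated here as an assumption). A flat 9,000 records gives 30 pages, in line with the roughly 28 calls quoted above. The helper and constant names are hypothetical.

```go
package main

import "fmt"

// maxItemsPerPage: Route 53 returns at most 300 record sets per
// ListResourceRecordSets call (the divisor quoted in the thread).
const maxItemsPerPage = 300

// requestsPerSecond is an assumption: Route 53's API quota of five
// requests per second per AWS account, shared by every cluster.
const requestsPerSecond = 5.0

// pagesPerSync returns how many ListResourceRecordSets calls one full read
// of the given zones takes: ceil(records/300) per zone, at least 1 per zone.
func pagesPerSync(recordsPerZone []int) int {
	total := 0
	for _, n := range recordsPerZone {
		pages := (n + maxItemsPerPage - 1) / maxItemsPerPage
		if pages < 1 {
			pages = 1 // even an empty listing costs one request
		}
		total += pages
	}
	return total
}

func main() {
	// One zone with 9,000 records (the "about 9k" above).
	fmt.Println(pagesPerSync([]int{9000})) // 30

	// Hypothetical even split of ~18k records over 3 zones, read by 6 clusters.
	perCluster := pagesPerSync([]int{6000, 6000, 6000})
	clusters := 6
	total := perCluster * clusters
	// Seconds of the shared account quota consumed by one round of syncs.
	fmt.Println(perCluster, total, float64(total)/requestsPerSecond) // 60 360 72
}
```

Because the quota is per account rather than per cluster, each additional cluster adds its full set of pages to the same shared budget.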

@mloiseleur
Contributor

@oferzern Adding a sleep in the controller comes with many drawbacks. @Raffo has already described a path to a more reliable approach:

> I think either extending the current interval implementation or finding an entirely different solution would be better.

@k8s-ci-robot
Contributor

PR needs rebase.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 7, 2024
@mloiseleur
Contributor

We won't add a sleep time in the controller.
Feel free to re-open it (or open a new one) if we missed something.
/close

@k8s-ci-robot
Contributor

@mloiseleur: Closed this PR.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
6 participants