
JENKINS-64418: add simple exponential backoff to the sleep for bitbucket api rate limits #414

Draft
wants to merge 2 commits into master

Conversation


@b-dean b-dean commented Dec 10, 2020

When we hit the rate limits for the Bitbucket API, everything slows to a crawl. Other API requests (such as fetching the Jenkinsfile for multibranch pipelines) are also rate limited and become slower.

Adding exponential backoff with a minimum of 100 ms and a maximum of 1 minute should alleviate the constant retry loop that hits the limit every 5 seconds.

See JENKINS-64418.
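A minimal sketch of the delay calculation being described (illustrative only, not the PR's exact code):

    // Illustrative sketch: the delay doubles with each attempt,
    // starting at 100 ms and capped at 1 minute.
    private static final long MIN_BACKOFF_MILLIS = 100L;
    private static final long MAX_BACKOFF_MILLIS = 60_000L;

    static long backoffMillis(int attempts) {
        long delay = MIN_BACKOFF_MILLIS << Math.min(attempts, 30); // bound the shift to avoid overflow
        return Math.min(delay, MAX_BACKOFF_MILLIS);
    }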

Your checklist for this pull request

  • Make sure you are requesting to pull a topic/feature/bugfix branch (right side) and not your master branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or in Jenkins JIRA
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Did you provide a test case that demonstrates the feature works or fixes the issue?

@b-dean b-dean changed the title add simple exponential backoff to the sleep for bitbucket api rate limits JENKINS-64418: add simple exponential backoff to the sleep for bitbucket api rate limits Dec 10, 2020

artheus commented Dec 22, 2020

Looking at Atlassian's documentation for "Adjusting your code for rate limiting", I would argue that, of the three alternatives proposed there, the third option makes the most sense here.

See: https://confluence.atlassian.com/adminjiraserver/adjusting-your-code-for-rate-limiting-987143384.html

The third option says that clients can adjust the requests they send based on the rate-limit HTTP headers (a rough sketch of honoring them follows the list):

  • x-ratelimit-interval-seconds: The time interval in seconds. You get a batch of new tokens every time interval.
  • x-ratelimit-fillrate: The number of tokens you get every time interval.
  • retry-after: The number of seconds you need to wait for new tokens. Make sure that your rate assumes waiting longer than this value.
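As a sketch only (not an implementation, and assuming the server actually sends these headers), honoring them inside the plugin's existing retry loop could look roughly like this, with a fixed fallback sleep when the headers are absent:

    // Illustrative only: prefer the server's own guidance over a guessed backoff.
    long waitMillis = 5_000L; // fallback when no rate-limit headers are present
    Header retryAfter = response.getFirstHeader("retry-after");
    if (retryAfter != null) {
        waitMillis = Long.parseLong(retryAfter.getValue().trim()) * 1_000L;
    } else {
        Header interval = response.getFirstHeader("x-ratelimit-interval-seconds");
        if (interval != null) {
            waitMillis = Long.parseLong(interval.getValue().trim()) * 1_000L;
        }
    }
    Thread.sleep(waitMillis);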

I argue that this would be the way to go, because of the two drawbacks the documentation mentions for exponential backoff:

⛔ High impact on a Jira instance because of concurrency. We’re assuming most active users will send requests whenever they’re available. This window will be similar for all users, making spikes in Jira performance. The same applies to threads — most will either be busy at the same time or idle.

⛔ Unpredictable. If you need to make a few critical requests, you can’t be sure all of them will be successful.

@bitwiseman
Contributor

@artheus
Agreed, that is a better way.
If you take a look at https://github.com/jenkinsci/github-branch-source-plugin/blob/master/src/main/java/org/jenkinsci/plugins/github_branch_source/ApiRateLimitChecker.java it uses similar header info from GitHub to reason over rate limits and choose a strategy. However, that will take considerably more work. Would you be interested in submitting a PR?

Also, take a look at #187 (comment). There might be a way to mitigate the behavior in some cases by using caching.


artheus commented Dec 31, 2020

@bitwiseman I will start working on a PR for this.
I do not really agree with making caching part of the plugin, though. I would much rather recommend that users put a caching reverse proxy between Jenkins and Bitbucket; that way users stay in control of the caching layer and the plugin keeps working as expected.
Creating a caching layer here would make this a much larger job, as we would have to make a large number of separate decisions, e.g.:

  • In which scope to cache (per job, per run, per Bitbucket project, or globally)
  • In RAM or on disk?
    • If on disk, where is the best place to store the caches?
  • How do we invalidate caches?
  • What is a good timeout for caching data that is expected to change?
  • If a synchronization job is run and cached data is used, that would be a complete waste of resources (RAM, CPU, ...)

I think that the scope for this issue should be just to add a way to properly comply with the Bitbucket rate limits.
We could create a separate issue, PR or project for a caching layer.


@bitwiseman bitwiseman left a comment


We need a test that exercises this.

@@ -866,7 +869,8 @@ protected CloseableHttpResponse executeMethod(HttpHost host, HttpRequestBase htt
     httpMethod.setConfig(requestConfig.build());

     CloseableHttpResponse response = client.execute(host, httpMethod, requestContext);
-    while (response.getStatusLine().getStatusCode() == API_RATE_LIMIT_CODE) {
+    Random random = new Random();
+    for (int attempts = 0; response.getStatusLine().getStatusCode() == API_RATE_LIMIT_CODE; attempts++) {

The minimum sleep should remain 5 seconds. Once the limit is hit, the client needs to wait at least on the order of seconds before trying again.

A maximum of around 2 or 3 minutes sounds good.

We don't currently limit the number of attempts. That shouldn't change. We should hit the max and stay there.
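Roughly, the retry loop would then behave like this sketch (illustrative values, not the final change: unlimited attempts, a 5-second floor, and a cap of about 2 minutes):

    // Sketch of the requested behavior: keep retrying until the rate limit clears,
    // doubling the sleep each time and then staying at the cap.
    long sleepMillis = 5_000L;
    final long maxSleepMillis = 120_000L;
    while (response.getStatusLine().getStatusCode() == API_RATE_LIMIT_CODE) {
        Thread.sleep(sleepMillis);
        sleepMillis = Math.min(sleepMillis * 2, maxSleepMillis); // hit the max and stay there
        response.close();
        response = client.execute(host, httpMethod, requestContext);
    }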

@bitwiseman
Contributor

@artheus

I see your point about a caching reverse-proxy.

In the best of worlds, people would implement an additional system that gave them more control. However, most Jenkins systems do not exist in the best of worlds; most don't have what you are describing. As such, we have to look for simple changes we can make in this plugin to mitigate the issue, such as turning on the local caching provided by the client HTTP library.

OkHttp makes adding local, on-disk, size-constrained caching relatively straightforward. It looks like Apache's CachingHttpClients provides something similar, either in memory or on disk.

Take a look at the github-branch-source implementation. There is a cache for each credential (since different users are likely to see different results) and each cache is size-limited on-disk.

The model here is that queries are always sent to the server (with an ETag from any previous result); the server then either returns new data or tells the client that its cached copy is still valid. Cache invalidation is done automatically and there is no timeout: if a result hasn't changed in a day or a week, it can still be used as long as it is valid.
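For reference, a minimal sketch of what turning on Apache's client-side cache could look like; the directory layout and the cacheRoot/credentialsId names below are assumptions, not code from either plugin:

    import java.io.File;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.cache.CacheConfig;
    import org.apache.http.impl.client.cache.CachingHttpClients;

    // An HTTP-cache-aware client: responses are stored on disk and revalidated
    // with ETags automatically, so stale entries refresh without a manual timeout.
    CacheConfig cacheConfig = CacheConfig.custom()
            .setMaxCacheEntries(1_000)
            .setMaxObjectSize(1_024 * 1_024) // 1 MB per entry keeps the cache bounded
            .setSharedCache(false)           // treat the cache as private to one credential
            .build();
    CloseableHttpClient client = CachingHttpClients.custom()
            .setCacheConfig(cacheConfig)
            .setCacheDir(new File(cacheRoot, credentialsId)) // hypothetical per-credential directory
            .build();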

If a synchronization job is run, and cached data is used, that would be a complete waste of resources (RAM, CPU...)

A synchronization job? I'm not sure what you mean, but the most common scenario overall is a mixture: some query results are still valid and the cache is used, and some are not and new data is retrieved.

In any case...
I completely support the plugin properly complying with rate limits. I look forward to your PR. That would certainly help this issue as well.

In the meantime, I've requested a few changes to this PR and will happily merge it as a first step once they are made.

Thanks for contributing!

@b-dean b-dean marked this pull request as draft February 4, 2021 19:21

b-dean commented Feb 4, 2021

@artheus, that article you linked to from Atlassian is for JIRA, not Bitbucket. I logged all the headers that come back with the 429 response, and there is no retry-after header:

Server: nginx
Vary: Authorization, cookie, user-context
Cache-Control: no-cache, no-store
Content-Type: text/plain
X-B3-TraceId: 1a4e3ec6e8f2fe21
X-Dc-Location: ash2
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Date: Thu, 04 Feb 2021 21:55:33 GMT
X-Served-By: app-3002
X-Static-Version: 6d8cc6ef88d5
X-Credential-Type: apikey
X-Render-Time: 0.0227448940277
X-Accepted-OAuth-Scopes: repository
Connection: Keep-Alive
X-Version: 6d8cc6ef88d5
X-Request-Count: 152
X-Frame-Options: SAMEORIGIN
Content-Length: 46

I thought about X-Request-Count but someone else asked Atlassian and they said it means nothing.
