
Controllers should rate limit themselves and space retries out across a longer period of time #1561

Merged: 2 commits, Apr 6, 2015

Conversation

@bparees (Contributor) commented Apr 1, 2015

No description provided.

@bparees bparees changed the title add delays when handling retries so we don't tight loop [DO_NOT_MERGE] add delays when handling retries so we don't tight loop Apr 1, 2015
@bparees (Contributor, Author) commented Apr 1, 2015

The throttle.go change will be removed from this PR since it's being made upstream.

@smarterclayton @ironcladlou ptal.

With this change we get a burst of up to 10 retries as fast as possible; otherwise we're limited to 1 retry per second (for each controller's retry handler). Call it a poor man's backoff.

@smarterclayton (Contributor):

So in the presence of retries the throughput of the controller drops to 10/second? Seems reasonable.

@bparees (Contributor, Author) commented Apr 1, 2015

@smarterclayton it's a little harder to quantify than that... if you have nothing but retries in your queue, the rate will drop to 1 per second (not 10 per second).

If you have a mix of retries and new items in your queue, we'll process the new items as fast as possible, but every time we come to a retry item we'll get held up for a second while we wait for it to be enqueued (assuming we've already drained our burst allocation).

In other words, we are throttling the rate at which you can enqueue a retry, not the rate at which you can process an event.

@smarterclayton (Contributor):

This blocks the controller loop so the controller won't make progress.


@bparees (Contributor, Author) commented Apr 1, 2015

It'll make progress more slowly in the presence of retries, but again the delay only applies when a retry is processed: if you have 1 retry and 100 new events, it'll delay 1s, requeue the retry, then process the 100 new events as fast as possible, then delay another 1s on the retry event (assuming it continues to fail).

I don't think there's a better approach short of the available-time-aware FIFO we discussed, which would actually bypass dequeuing events until their next retry time was reached.

@bparees (Contributor, Author) commented Apr 1, 2015

And note that if processing the other 100 events takes longer than 1s, the retry event will be requeueable immediately, because a token will be available again.

@smarterclayton (Contributor):

> it'll delay 1s, requeue the retry, then process the 100 new events as fast as possible

How? The retry handler is invoked from handleOne, and is single threaded.

@bparees (Contributor, Author) commented Apr 1, 2015

Not following your point, but per our other discussion I'll ensure all our controllers configure a maximum of 60 retries or so (effectively ensuring at least 50 seconds of retrying in the worst case).

@smarterclayton (Contributor):

And double check the sync period (it should probably be nonzero but several minutes apart).

@bparees (Contributor, Author) commented Apr 2, 2015

Updated build controllers to retry up to 60 times instead of forever. Deploy controllers already retry only once.

Sync period is 2 minutes for the deploy and build controller watches.

@bparees (Contributor, Author) commented Apr 2, 2015

This will fix #1555.

@bparees (Contributor, Author) commented Apr 2, 2015

[test]

@smarterclayton (Contributor):

Should deployment config retry more times?

@bparees (Contributor, Author) commented Apr 2, 2015

That's a question for @ironcladlou; I haven't studied the deployment logic enough to know what's appropriate there.

@bparees (Contributor, Author) commented Apr 2, 2015

@ironcladlou added a unit test for the rate limiting behavior.

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_openshift3/1707/)

@bparees (Contributor, Author) commented Apr 3, 2015

Note that the behavior of this is such that every time we re-sync the image repo changes, we attempt to trigger the build again and ultimately fail (after 60 retries) because the To is not valid.

We could fix this by updating the BuildConfig so we recognize we've already attempted to trigger a build for this imagechange (successfully or otherwise). We could also go ahead and create the build in a failed state when there is something invalid about it (like a bad To).

Both of those options go against my original design thought, which was that you should be able to correct your errors (i.e. create the missing imagerepo), but the latter approach (creating the build but immediately marking it failed) is starting to have more appeal to me. Not sure it needs to hold up this merge though.

@bparees bparees changed the title [DO_NOT_MERGE] add delays when handling retries so we don't tight loop add delays when handling retries so we don't tight loop Apr 3, 2015
@bparees (Contributor, Author) commented Apr 3, 2015

@smarterclayton I think this is good to go, assuming I did the copy from the upstream commit correctly.

@bparees (Contributor, Author) commented Apr 3, 2015

[test]

@smarterclayton (Contributor):

The thing is I really do want to be able to create things out of order and have it "just work".

    kutil.HandleError(err)
    if count > 60 {

@smarterclayton (Contributor): The comment should really be on the func, and then you can simplify this to return count < 60. I would rename the method to limitedLogAndRetry.

@bparees (Contributor, Author): Good idea, will do.

    return &QueueRetryManager{
        queue:     queue,
        keyFunc:   keyFn,
        retryFunc: retryFn,
        retries:   make(map[string]int),
        limiter:   kutil.NewTokenBucketRateLimiter(rate, burst),
@smarterclayton (Contributor): I think it would be better to take kutil.RateLimiter as the argument, not the args. That would enable you to properly test this.

@bparees (Contributor, Author): It also means any consumers of this now have to import, understand, and instantiate another type instead of two primitives. I'm not convinced that's an improvement from a usability and readability perspective. I like that the RateLimiter interface is currently encapsulated so users of QueueRetryManager don't need to be aware of it, the potential benefits with respect to test code (that you mention below) notwithstanding.

@smarterclayton (Contributor):

Eh... two arguments that you have to explain are not any worse to me than a copy-and-paste constructor (which is going to be pretty name-clear). Taking the interface is straight up better: it means if you want a complex rate limiter you don't have to change the class. Also, Factory is kind of a "configure a bunch of things", including other lower-level bits. I don't like encapsulating function that isn't specific to the class (the "how" of limiting is orthogonal to the need to limit).

@smarterclayton (Contributor): An example here: if we want to add a new rate-limited token behavior, we shouldn't have to take more arguments to QueueRetryManager.
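The interface-based constructor being argued for could then be exercised with a fake limiter in tests, along these lines (a sketch with hypothetical names, not the merged code):

```go
package main

import "fmt"

// RateLimiter is the pluggable dependency: the "how" of limiting stays
// orthogonal to QueueRetryManager's need to limit.
type RateLimiter interface {
	Accept() // a real implementation blocks until an action may proceed
}

// fakeLimiter is a test double that records calls instead of blocking,
// which is exactly what taking the interface as an argument enables.
type fakeLimiter struct{ calls int }

func (f *fakeLimiter) Accept() { f.calls++ }

// QueueRetryManager holds whatever limiter it was constructed with.
type QueueRetryManager struct {
	limiter RateLimiter
}

// NewQueueRetryManager takes the interface rather than (rate, burst)
// primitives, so swapping in a more complex limiter needs no new arguments.
func NewQueueRetryManager(l RateLimiter) *QueueRetryManager {
	return &QueueRetryManager{limiter: l}
}

func (q *QueueRetryManager) requeue() {
	q.limiter.Accept() // the limiting strategy is now pluggable
}

func main() {
	f := &fakeLimiter{}
	q := NewQueueRetryManager(f)
	q.requeue()
	q.requeue()
	fmt.Println(f.calls)
}
```

In production the caller passes a token-bucket limiter; in tests the fake makes the rate-limiting behavior observable without real clock delays, which is the testability point being made above.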

@bparees (Contributor, Author) commented Apr 4, 2015

> The thing is I really do want to be able to create things out of order and have it "just work".

@smarterclayton is that an argument for, or against, failing builds like this? It sounds like an argument against. Certainly it's an argument against immediately failing them. Failing them after some period of time is something we could consider too (we know when the Build object was created; if we can't validly resolve it after X time, fail). But again, outside the scope of this PR.

@bparees bparees force-pushed the tight_loops branch 2 times, most recently from 56da6ff to fe493ca Compare April 4, 2015 02:29
@smarterclayton (Contributor):

Post 3.0 we should come back to it.

@bparees (Contributor, Author) commented Apr 6, 2015

@smarterclayton reworked with the RateLimiter interface... ptal.

@smarterclayton (Contributor):

LGTM [merge]

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_openshift3/1472/) (Image: devenv-fedora_1209)

@bparees (Contributor, Author) commented Apr 6, 2015

Router test failures. @pweil- any known flakiness in the router that's being looked into?

[merge]

@openshift-bot (Contributor):

Evaluated for origin up to e218a54

@pweil- commented Apr 6, 2015

@bparees #1603 is what I know about (issues with the probes happening too quickly). There have been some integration test failures that folks have mentioned, but we've been unable to reproduce them.

openshift-bot pushed a commit that referenced this pull request Apr 6, 2015
@openshift-bot openshift-bot merged commit 58fb418 into openshift:master Apr 6, 2015
@bparees bparees deleted the tight_loops branch April 7, 2015 14:01
@smarterclayton smarterclayton changed the title add delays when handling retries so we don't tight loop Controllers should rate limit themselves and space retries out across a longer period of time Apr 8, 2015
jboyd01 pushed a commit to jboyd01/origin that referenced this pull request Feb 6, 2018
…service-catalog/' changes from d969acde90..b69b4a6c80

git-subtree-dir: cmd/service-catalog/go/src/github.com/kubernetes-incubator/service-catalog
git-subtree-split: b69b4a6c8003f25d040e3087c7b1b16d1854a9e9