
Controllers should rate limit themselves and space retries out across a longer period of time #1561

Merged: 2 commits, Apr 6, 2015

Conversation

@bparees (Contributor) commented Apr 1, 2015

No description provided.

@bparees bparees changed the title add delays when handling retries so we don't tight loop [DO_NOT_MERGE] add delays when handling retries so we don't tight loop Apr 1, 2015
@bparees (Contributor, Author) commented Apr 1, 2015

The throttle.go change will be removed from this PR since it's being made upstream.

@smarterclayton @ironcladlou ptal.

With this change we get a burst of up to 10 retries as fast as possible; otherwise we're limited to 1 retry per second (for each controller's retry handler). Call it a poor man's backoff.

@smarterclayton (Contributor):

So in the presence of retries the throughput of the controller drops to 10/second? Seems reasonable.

@bparees (Contributor, Author) commented Apr 1, 2015

@smarterclayton it's a little harder to quantify than that... if you have nothing but retries in your queue, the rate will drop to 1 per second (not 10 per second).

If you have a mix of retries and new items in your queue, we'll process the new items as fast as possible, but every time we come to a retry item we'll get held up for a second while we wait for it to be enqueued (assuming we've already drained our burst allocation).

In other words, we are throttling the rate at which you can enqueue a retry, not the rate at which you can process an event.

@smarterclayton (Contributor):

This blocks the controller loop so the controller won't make progress.


@bparees (Contributor, Author) commented Apr 1, 2015

It'll make progress more slowly in the presence of retries, but again the delay only applies when a retry is processed: if you have 1 retry and 100 new events, it'll delay 1s, requeue the retry, then process the 100 new events as fast as possible, then delay another 1s on the retry event (assuming it continues to fail).

I don't think there's a better approach short of the available-time-aware FIFO we discussed, which would actually bypass dequeuing events until their next retry time was reached.

@bparees (Contributor, Author) commented Apr 1, 2015

And note that if processing the other 100 events takes longer than 1s, the retry event will be requeueable immediately, because a token will be available again.

@smarterclayton (Contributor):

> it'll delay 1s, requeue the retry, then process the 100 new events as fast as possible

How? The retry handler is invoked from handleOne, and is single threaded.

@bparees (Contributor, Author) commented Apr 1, 2015

Not following your point, but per our other discussion I'll ensure all our controllers configure a maximum of 60 retries or so (effectively ensuring at least 50 seconds of retrying in the worst case).

@smarterclayton (Contributor):

And double check the sync period (it should probably be nonzero but several minutes apart).

@bparees (Contributor, Author) commented Apr 2, 2015

Updated build controllers to retry up to 60 times instead of forever. Deploy controllers already retry only once.

Sync period is 2 minutes for the deploy and build controller watches.

@bparees (Contributor, Author) commented Apr 2, 2015

This will fix #1555.

@bparees (Contributor, Author) commented Apr 2, 2015

[test]

@smarterclayton (Contributor):

Should deployment config retry more times?

@bparees (Contributor, Author) commented Apr 2, 2015

That's a question for @ironcladlou; I haven't studied the deployment logic enough to know what's appropriate there.

@bparees (Contributor, Author) commented Apr 2, 2015

@ironcladlou added a unit test for the rate limiting behavior.

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_openshift3/1707/)

@bparees (Contributor, Author) commented Apr 3, 2015

Note that the behavior of this is such that every time we re-sync the image repo changes, we attempt to trigger the build again and ultimately fail (after 60 retries) because the To is not valid.

We could fix this by updating the BuildConfig so we recognize we've already attempted to trigger a build for this imagechange (successfully or otherwise). We could also go ahead and create the build in a failed state when there is something invalid about it (like a bad To).

Both of those options go against my original design thought, which was that you should be able to correct your errors (i.e. create the missing imagerepo), but the latter approach (creating the build but immediately marking it failed) is starting to have more appeal to me. Not sure it needs to hold up this merge though.

@bparees bparees changed the title [DO_NOT_MERGE] add delays when handling retries so we don't tight loop add delays when handling retries so we don't tight loop Apr 3, 2015
@bparees (Contributor, Author) commented Apr 3, 2015

@smarterclayton I think this is good to go, assuming I did the copy from the upstream commit correctly.

@bparees (Contributor, Author) commented Apr 3, 2015

[test]

@smarterclayton (Contributor):

The thing is I really do want to be able to create things out of order and have it "just work".

    kutil.HandleError(err)
    if count > 60 {

@smarterclayton (Contributor): The comment should really be on the func, and then you can simplify this to return count < 60. I would rename the method to limitedLogAndRetry.

@bparees (Contributor, Author): Good idea, will do.

    return &QueueRetryManager{
        queue:     queue,
        keyFunc:   keyFn,
        retryFunc: retryFn,
        retries:   make(map[string]int),
        limiter:   kutil.NewTokenBucketRateLimiter(rate, burst),
@smarterclayton (Contributor): I think it would be better to take kutil.RateLimiter as the argument, not the args. That would enable you to properly test this.

@bparees (Contributor, Author): It also means any consumers of this now have to import, understand, and instantiate another type instead of two primitives. I'm not convinced that's an improvement from a usability and readability perspective. I like that the RateLimiter interface is currently encapsulated so users of QueueRetryManager don't need to be aware of it, the potential benefits with respect to test code (that you mention below) notwithstanding.

@smarterclayton (Contributor):

Eh... two arguments that you have to explain are not any worse to me than a copy-and-paste constructor (which is going to be pretty name-clear). Taking the interface is straight up better: it means if you want a complex rate limiter you don't have to change the class. Also, Factory is kind of a "configure a bunch of things", including other lower-level bits. I don't like encapsulating function that isn't specific to the class (the "how" of limiting is orthogonal to the need to limit).

@smarterclayton (Contributor): An example here: if we want to add a new rate-limited token behavior, we shouldn't have to take more arguments to QueueRetryManager.
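The interface-based constructor being argued for could then be exercised with a fake limiter in tests, along these lines (a sketch with hypothetical names, not the merged code):

```go
package main

import "fmt"

// RateLimiter is the pluggable dependency: the "how" of limiting stays
// orthogonal to QueueRetryManager's need to limit.
type RateLimiter interface {
	Accept() // a real implementation blocks until an action may proceed
}

// fakeLimiter is a test double that records calls instead of blocking,
// which is exactly what taking the interface as an argument enables.
type fakeLimiter struct{ calls int }

func (f *fakeLimiter) Accept() { f.calls++ }

// QueueRetryManager holds whatever limiter it was constructed with.
type QueueRetryManager struct {
	limiter RateLimiter
}

// NewQueueRetryManager takes the interface rather than (rate, burst)
// primitives, so swapping in a more complex limiter needs no new arguments.
func NewQueueRetryManager(l RateLimiter) *QueueRetryManager {
	return &QueueRetryManager{limiter: l}
}

func (q *QueueRetryManager) requeue() {
	q.limiter.Accept() // the limiting strategy is now pluggable
}

func main() {
	f := &fakeLimiter{}
	q := NewQueueRetryManager(f)
	q.requeue()
	q.requeue()
	fmt.Println(f.calls)
}
```

In production the caller passes a token-bucket limiter; in tests the fake makes the rate-limiting behavior observable without real clock delays, which is the testability point being made above.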

@bparees (Contributor, Author) commented Apr 4, 2015

> The thing is I really do want to be able to create things out of order and have it "just work".

@smarterclayton is that an argument for, or against, failing builds like this? It sounds like an argument against. Certainly it's an argument against immediately failing them. Failing them after some period of time is something we could consider too (we know when the Build object was created; if we can't validly resolve it after X time, fail). But again, outside the scope of this PR.

@bparees bparees force-pushed the tight_loops branch 2 times, most recently from 56da6ff to fe493ca Compare April 4, 2015 02:29
@smarterclayton (Contributor):

Post 3.0 we should come back to it.

@bparees (Contributor, Author) commented Apr 6, 2015

@smarterclayton reworked with the RateLimiter interface... ptal.

@smarterclayton (Contributor):

LGTM [merge]

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_openshift3/1472/) (Image: devenv-fedora_1209)

@bparees (Contributor, Author) commented Apr 6, 2015

Router test failures. @pweil- any known flakiness in the router that's being looked into?

[merge]

@openshift-bot (Contributor):

Evaluated for origin up to e218a54

@pweil- commented Apr 6, 2015

@bparees #1603 is what I know about (issues with the probes happening too quickly). There have been some integration test failures that folks have mentioned, but we've been unable to reproduce them.

openshift-bot pushed a commit that referenced this pull request Apr 6, 2015
@openshift-bot openshift-bot merged commit 58fb418 into openshift:master Apr 6, 2015
@bparees bparees deleted the tight_loops branch April 7, 2015 14:01
@smarterclayton smarterclayton changed the title add delays when handling retries so we don't tight loop Controllers should rate limit themselves and space retries out across a longer period of time Apr 8, 2015
jboyd01 pushed a commit to jboyd01/origin that referenced this pull request Feb 6, 2018
…service-catalog/' changes from d969acde90..b69b4a6c80

git-subtree-dir: cmd/service-catalog/go/src/github.com/kubernetes-incubator/service-catalog
git-subtree-split: b69b4a6c8003f25d040e3087c7b1b16d1854a9e9