
Conversation

wking (Member) commented Dec 18, 2018

We've been hitting Route 53 rate limits in the busy CI account:

level=debug msg="Deleting Route53 zones (map[openshiftClusterID:5b0921a0-5e21-4ebf-a5f9-396a92526ec1])"
level=debug msg="Deleting Route53 zones (map[kubernetes.io/cluster/ci-op-piz2m00h-1d3f3:owned])"
level=debug msg="error converting r53Zones to native AWS objects: Throttling: Rate exceeded\n\tstatus code: 400, request id: 80e10c03-0306-11e9-b9b6-abeb053f0218"
level=debug msg="Exiting deleting Route53 zones (map[kubernetes.io/cluster/ci-op-piz2m00h-1d3f3:owned])"
level=debug msg="error converting r53Zones to native AWS objects: Throttling: Rate exceeded\n\tstatus code: 400, request id: 81cd4026-0306-11e9-9710-21e3250d9953"
level=debug msg="Exiting deleting Route53 zones (map[openshiftClusterID:5b0921a0-5e21-4ebf-a5f9-396a92526ec1])"

We've had trouble with Route 53 rate limits before; see discussion in openshift/hive@f945dbb3 (openshift/hive#113). With this commit, instead of bailing part way through listing tags for all the hosted zones, we just retry that particular zone until it goes through and keep going on tags for the whole list. This should reduce our overall load on the Route 53 APIs.

The current strings.Contains comparison is a bit of a hack based on the above log excerpt. But there doesn't seem to be a strongly-typed approach to this; see the internal isCodeThrottle in vendor/github.com/aws/aws-sdk-go/aws/request/retryer.go.

I've also taken the opportunity to pull the Hive package over into our repository (and I've dropped it from vendor/). We'd been planning on doing this for a while, but hadn't gotten around to it before. Once this PR lands, I'll file one with Hive so they can vendor us for hiveutil. CC @joelddiaz

We've been meaning to take this off the hands of the Hive folks for a
while.  Finally copy it over (I'll drop the vendored copy soon).  This
is a verbatim copy of the file as it stands with
openshift/hive@ad6f8d5b (Merge pull request openshift/hive#143 from
abutcher/capischeme, 2018-12-18).
The fact that it's a subpackage of pkg/destroy is sufficient context
without repeating "deprovision" in the package name.  And the fact
that the deprovision is tag based is an implementation detail that
doesn't need to be surfaced in the package name.

Also drop the copyright header.  We're also an Apache-2.0 project, so
this just removes the copyright holder information from the header.
And "the Kubernetes Authors" wasn't all that helpful anyway.  If folks
want to see who authored the contents, they should check the Git
history.  I'm breaking that history here, so here's a list of authors
in case the hive repository goes away or something: Abhinav Dahiya,
Dan Mace, Devan Goodwin, Joel Diaz, Miciah Masters, Thomas Wiest,
and me.
Generated with:

  $ emacs Gopkg.toml  # drop the hive constraint
  $ dep ensure

using:

  $ dep version
  dep:
   version     : v0.5.0
   build date  :
   git hash    : 22125cf
   go version  : go1.10.3
   go compiler : gc
   platform    : linux/amd64
   features    : ImportDuringSolve=false
openshift-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files) and approved (Indicates a PR has been approved by an approver from all required OWNERS files) labels on Dec 18, 2018
abhinavdahiya (Contributor) commented:

Can't we move the throttled-error handling up to the runners here?

That way all throttled requests are handled in a common place.

abhinavdahiya (Contributor) commented Dec 18, 2018

The current strings.Contains comparison is a bit of a hack based on the above log excerpt. But there doesn't seem to be a strongly-typed approach to this; see the internal isCodeThrottle in vendor/github.com/aws/aws-sdk-go/aws/request/retryer.go.

Also, why can't we use IsErrorThrottle?

We could switch the implementation of destroy/aws to use github.com/pkg/errors so that we can call Cause().

We've been hitting Route 53 rate limits in the busy CI account:

  level=debug msg="Deleting Route53 zones (map[openshiftClusterID:5b0921a0-5e21-4ebf-a5f9-396a92526ec1])"
  level=debug msg="Deleting Route53 zones (map[kubernetes.io/cluster/ci-op-piz2m00h-1d3f3:owned])"
  level=debug msg="error converting r53Zones to native AWS objects: Throttling: Rate exceeded\n\tstatus code: 400, request id: 80e10c03-0306-11e9-b9b6-abeb053f0218"
  level=debug msg="Exiting deleting Route53 zones (map[kubernetes.io/cluster/ci-op-piz2m00h-1d3f3:owned])"
  level=debug msg="error converting r53Zones to native AWS objects: Throttling: Rate exceeded\n\tstatus code: 400, request id: 81cd4026-0306-11e9-9710-21e3250d9953"
  level=debug msg="Exiting deleting Route53 zones (map[openshiftClusterID:5b0921a0-5e21-4ebf-a5f9-396a92526ec1])"

We've had trouble with Route 53 rate limits before; see discussion in
openshift/hive@f945dbb3 (awstagdeprovision: Ignore more errors,
2018-11-27, openshift/hive#113).  With this commit, instead of bailing
part way through listing tags for all the hosted zones, we just retry
that particular zone until it goes through and keep going on tags for
the whole list.  This should reduce our overall load on the Route 53
APIs.
wking force-pushed the local-aws-destroy branch from 4d7b82d to 6447e9c on December 19, 2018 00:00
wking (Member, author) commented Dec 19, 2018

Also why can't we use IsErrorThrottle

Because I hadn't found it ;). Done now with 4d7b82d -> 6447e9c.

wking (Member, author) commented Dec 19, 2018

Can't we move the throttled-error handling up to the runners here?

That way all throttled requests are handled in a common place.

The existing exponential backoff there is handling throttling. 6447e9c adds special handling for Route 53 because:

  • Route 53's per-account rate limits are especially low compared to other services, and

  • When we do get rate limited, we want to keep the successfully-resolved tags from that run instead of throwing them all out and starting again later. So instead of:

    1. Get zones.
    2. Get tags for zones 1, 2, 3.
    3. Try and get tags for zone 4.
    4. Hit a rate limit on zone 4, sleep and start over at 1.

    we now have:

    1. Get zones.
    2. Get tags for zones 1, 2, 3.
    3. Try and get tags for zone 4.
    4. Hit a rate limit on zone 4, sleep and start over at 3.

To avoid [1]:

  2018/12/19 00:01:38 Executing test golint
/go/src/github.com/openshift/installer/pkg/destroy/aws/aws.go:33:6: type name will be used as aws.AWSFilter by other packages, and that stutters; consider calling this Filter
  Found 1 lint suggestions; failing.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/940/pull-ci-openshift-installer-master-golint/1879/build-log.txt
abhinavdahiya (Contributor) commented:

Can't we move the throttled-error handling up to the runners here?
That way all throttled requests are handled in a common place.

The existing exponential backoff there is handling throttling. 6447e9c adds special handling for Route 53 because:

* Route 53's per-account rate limits are especially low compared to other services, and

This seems like a good enough reason to do Route 53-only handling.

* When we do get rate limited, we want to keep the successfully-resolved tags from that run instead of throwing them all out and starting again later.  So instead of:
  
  1. Get zones.
  2. Get tags for zones 1, 2, 3.
  3. Try and get tags for zone 4.
  4. Hit a rate limit on zone 4, sleep and start over at 1.
  
  we now have:
  
  1. Get zones.
  2. Get tags for zones 1, 2, 3.
  3. Try and get tags for zone 4.
  4. Hit a rate limit on zone 4, sleep and start over at 3.

wking (Member, author) commented Dec 19, 2018

All green, and teardown was so fast I missed the logs :p.

2018/12/19 00:16:18 Running pod e2e-aws
2018/12/19 00:35:09 Container setup in pod e2e-aws completed successfully
2018/12/19 00:57:40 Container test in pod e2e-aws completed successfully
2018/12/19 01:01:48 Container teardown in pod e2e-aws completed successfully
2018/12/19 01:01:48 Pod e2e-aws succeeded after 45m30s

abhinavdahiya (Contributor) commented:

/lgtm

openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged) label on Dec 19, 2018
openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-robot merged commit a7468d1 into openshift:master on Dec 19, 2018
wking deleted the local-aws-destroy branch on December 19, 2018 02:27
wking added a commit to wking/openshift-installer that referenced this pull request Dec 19, 2018
Generated with:

  $ dep ensure

using:

  $ dep version
  dep:
   version     : v0.5.0
   build date  :
   git hash    : 22125cf
   go version  : go1.10.3
   go compiler : gc
   platform    : linux/amd64
   features    : ImportDuringSolve=false

I hadn't realized I'd need this after 6447e9c (pkg/destroy/aws: Don't
give up on Route 53 rate limits, 2018-12-18, openshift#940) added a direct
consumer of this package.
akostadinov commented:
#957 seems related to this. Failing on zone list rate limit. Can we avoid listing a huge number of zones somehow?
