
Conversation

wking (Member) commented Dec 18, 2018

We've been hitting Route 53 rate limits in the busy CI account:

level=debug msg="Deleting Route53 zones (map[openshiftClusterID:5b0921a0-5e21-4ebf-a5f9-396a92526ec1])"
level=debug msg="Deleting Route53 zones (map[kubernetes.io/cluster/ci-op-piz2m00h-1d3f3:owned])"
level=debug msg="error converting r53Zones to native AWS objects: Throttling: Rate exceeded\n\tstatus code: 400, request id: 80e10c03-0306-11e9-b9b6-abeb053f0218"
level=debug msg="Exiting deleting Route53 zones (map[kubernetes.io/cluster/ci-op-piz2m00h-1d3f3:owned])"
level=debug msg="error converting r53Zones to native AWS objects: Throttling: Rate exceeded\n\tstatus code: 400, request id: 81cd4026-0306-11e9-9710-21e3250d9953"
level=debug msg="Exiting deleting Route53 zones (map[openshiftClusterID:5b0921a0-5e21-4ebf-a5f9-396a92526ec1])"

We've had trouble with Route 53 rate limits before; see discussion in openshift/hive@f945dbb3 (openshift/hive#113). With this commit, instead of bailing part way through listing tags for all the hosted zones, we just retry that particular zone until it goes through and keep going on tags for the whole list. This should reduce our overall load on the Route 53 APIs.

The current strings.Contains comparison is a bit of a hack based on the above log excerpt. But there doesn't seem to be a strongly-typed approach to this; see the internal isCodeThrottle in vendor/github.com/aws/aws-sdk-go/aws/request/retryer.go.

I've also taken the opportunity to pull the Hive package over into our repository (and I've dropped it from vendor/). We'd been planning on doing this for a while, but hadn't gotten around to it before. Once this PR lands, I'll file one with Hive so they can vendor us for hiveutil. CC @joelddiaz

We've been meaning to take this off the hands of the Hive folks for a
while.  Finally copy it over (I'll drop the vendored copy soon).  This
is a verbatim copy of the file as it stands with
openshift/hive@ad6f8d5b (Merge pull request openshift/hive#143 from
abutcher/capischeme, 2018-12-18).
The fact that it's a subpackage of pkg/destroy is sufficient context
without repeating "deprovision" in the package name.  And the fact
that the deprovision is tag based is an implementation detail that
doesn't need to be surfaced in the package name.

Also drop the copyright header.  We're also an Apache-2.0 project, so
this just removes the copyright holder information from the header.
And "the Kubernetes Authors" wasn't all that helpful anyway.  If folks
want to see who authored the contents, they should check the Git
history.  I'm breaking that history here, so here's a list of authors
in case the hive repository goes away or something: Abhinav Dahiya,
Dan Mace, Devan Goodwin, Joel Diaz, Miciah Masters, Thomas Wiest,
and me.
Generated with:

  $ emacs Gopkg.toml  # drop the hive constraint
  $ dep ensure

using:

  $ dep version
  dep:
   version     : v0.5.0
   build date  :
   git hash    : 22125cf
   go version  : go1.10.3
   go compiler : gc
   platform    : linux/amd64
   features    : ImportDuringSolve=false
openshift-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files) and approved (Indicates a PR has been approved by an approver from all required OWNERS files) labels on Dec 18, 2018
abhinavdahiya (Contributor) commented:

Can't we move the throttled-error handling up to the runners here?

That way all throttled requests are handled in a common place.

abhinavdahiya (Contributor) commented Dec 18, 2018

The current strings.Contains comparison is a bit of a hack based on the above log excerpt. But there doesn't seem to be a strongly-typed approach to this; see the internal isCodeThrottle in vendor/github.com/aws/aws-sdk-go/aws/request/retryer.go.

Also, why can't we use IsErrorThrottle?

We could switch the implementation of destroy/aws to use github.com/pkg/errors so that we can call Cause().

We've been hitting Route 53 rate limits in the busy CI account:

  level=debug msg="Deleting Route53 zones (map[openshiftClusterID:5b0921a0-5e21-4ebf-a5f9-396a92526ec1])"
  level=debug msg="Deleting Route53 zones (map[kubernetes.io/cluster/ci-op-piz2m00h-1d3f3:owned])"
  level=debug msg="error converting r53Zones to native AWS objects: Throttling: Rate exceeded\n\tstatus code: 400, request id: 80e10c03-0306-11e9-b9b6-abeb053f0218"
  level=debug msg="Exiting deleting Route53 zones (map[kubernetes.io/cluster/ci-op-piz2m00h-1d3f3:owned])"
  level=debug msg="error converting r53Zones to native AWS objects: Throttling: Rate exceeded\n\tstatus code: 400, request id: 81cd4026-0306-11e9-9710-21e3250d9953"
  level=debug msg="Exiting deleting Route53 zones (map[openshiftClusterID:5b0921a0-5e21-4ebf-a5f9-396a92526ec1])"

We've had trouble with Route 53 rate limits before; see discussion in
openshift/hive@f945dbb3 (awstagdeprovision: Ignore more errors,
2018-11-27, openshift/hive#113).  With this commit, instead of bailing
part way through listing tags for all the hosted zones, we just retry
that particular zone until it goes through and keep going on tags for
the whole list.  This should reduce our overall load on the Route 53
APIs.
wking force-pushed the local-aws-destroy branch from 4d7b82d to 6447e9c on December 19, 2018 00:00
wking (Member, author) commented Dec 19, 2018

Also why can't we use IsErrorThrottle

Because I hadn't found it ;). Done now with 4d7b82d -> 6447e9c.

wking (Member, author) commented Dec 19, 2018

Can't we move the throttled-error handling up to the runners here?

That way all throttled requests are handled in a common place.

The existing exponential backoff there is handling throttling. 6447e9c adds special handling for Route 53 because:

  • Route 53's per-account rate limits are especially low compared to other services, and

  • When we do get rate limited, we want to keep the successfully-resolved tags from that run instead of throwing them all out and starting again later. So instead of:

    1. Get zones.
    2. Get tags for zones 1, 2, 3.
    3. Try and get tags for zone 4.
    4. Hit a rate limit on zone 4, sleep and start over at 1.

    we now have:

    1. Get zones.
    2. Get tags for zones 1, 2, 3.
    3. Try and get tags for zone 4.
    4. Hit a rate limit on zone 4, sleep and start over at 3.

To avoid [1]:

  2018/12/19 00:01:38 Executing test golint
/go/src/github.com/openshift/installer/pkg/destroy/aws/aws.go:33:6: type name will be used as aws.AWSFilter by other packages, and that stutters; consider calling this Filter
  Found 1 lint suggestions; failing.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/940/pull-ci-openshift-installer-master-golint/1879/build-log.txt
abhinavdahiya (Contributor) commented:

Can't we move the throttled-error handling up to the runners here?
That way all throttled requests are handled in a common place.

The existing exponential backoff there is handling throttling. 6447e9c adds special handling for Route 53 because:

* Route 53's per-account rate limits are especially low compared to other services, and

This seems like a good enough reason to do Route 53-only handling.

* When we do get rate limited, we want to keep the successfully-resolved tags from that run instead of throwing them all out and starting again later.  So instead of:
  
  1. Get zones.
  2. Get tags for zones 1, 2, 3.
  3. Try and get tags for zone 4.
  4. Hit a rate limit on zone 4, sleep and start over at 1.
  
  we now have:
  
  1. Get zones.
  2. Get tags for zones 1, 2, 3.
  3. Try and get tags for zone 4.
  4. Hit a rate limit on zone 4, sleep and start over at 3.

wking (Member, author) commented Dec 19, 2018

All green, and teardown was so fast I missed the logs :p.

2018/12/19 00:16:18 Running pod e2e-aws
2018/12/19 00:35:09 Container setup in pod e2e-aws completed successfully
2018/12/19 00:57:40 Container test in pod e2e-aws completed successfully
2018/12/19 01:01:48 Container teardown in pod e2e-aws completed successfully
2018/12/19 01:01:48 Pod e2e-aws succeeded after 45m30s

abhinavdahiya (Contributor) commented:

/lgtm

openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged) label on Dec 19, 2018
openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-robot merged commit a7468d1 into openshift:master on Dec 19, 2018
wking deleted the local-aws-destroy branch on December 19, 2018 02:27
wking added a commit to wking/openshift-installer that referenced this pull request Dec 19, 2018
Generated with:

  $ dep ensure

using:

  $ dep version
  dep:
   version     : v0.5.0
   build date  :
   git hash    : 22125cf
   go version  : go1.10.3
   go compiler : gc
   platform    : linux/amd64
   features    : ImportDuringSolve=false

I hadn't realized I'd need this after 6447e9c (pkg/destroy/aws: Don't
give up on Route 53 rate limits, 2018-12-18, openshift#940) added a direct
consumer of this package.
akostadinov commented:
#957 seems related to this. Failing on zone list rate limit. Can we avoid listing a huge number of zones somehow?
