Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Aug 14, 2019

Sometimes CI leaks untagged security groups and subnets. Because we are allowed to remove all resources from within a cluster-owned VPC, add *ByVPC walkers to remove these indirectly-owned resources.

Sometimes CI leaks untagged subnets.  Because we are allowed to remove
all resources from within a cluster-owned VPC, add a ByVPC walker to
remove these indirectly-owned subnets.

DescribeSubnetsPages has a strange history.  It was initially added in
aws/aws-sdk-go@3664ecc7da (Add initial implementation of pagination,
2015-03-23, aws/aws-sdk-go#247) but was removed in
aws/aws-sdk-go@bad551feb8 (Add support for multi-token pagination
rules, 2015-03-27) as a later step in that same pull request.  It
finally landed in master via aws/aws-sdk-go@52cd98f1ed (Release
v1.19.30, 2019-05-14, aws/aws-sdk-go#2599), but we only vendor
v1.16.14.  It doesn't seem like "zounds of untagged subnets" is a
high-probability thing, so I'm just using the unpaginated
DescribeSubnets instead of bumping the vendor to pick up
DescribeSubnetsPages.  Even if we do overflow DescribeSubnets with
untagged subnets, VPC deletion will fail and we'll get another pass at
deleting tagged subnets when we come around to the next deleteEC2VPC
attempt.
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 14, 2019
@wking
Copy link
Member Author

wking commented Aug 14, 2019

This approach is fine for CI, where we own the whole VPC. It's going to break down when we get to bring-your-own-VPC, so we may want to consider another approach now.

  • We might be able to get the AWS Terraform provider patched to fake atomic create-and-tag behavior, by having it delete the created object if a job failed or was aborted before tagging completed.
  • We could use Terraform to completely tear down clusters that fail during Terraform provisioning, although we currently leave the resources up to help with debugging and allow the possibility of users manually completing the cluster by adding the remaining resources.

However, neither of those approaches are going to work cleanly with creation runs that are killed (e.g. by a power outage on the install host) without recording the Terraform state. So we might want the belt-and-suspenders of these *ByVPC helpers to cover the cluster-owned-VPC case and later Terraform-based (partial) deletion to cover the "we're still alive with the Terraform state" case. And we'll just continue to leak the "borrowed VPC and lost the Terraform state" case.

@wking
Copy link
Member Author

wking commented Aug 14, 2019

Hrm. The upgrade job passed, and from its installer log we can see that the one round seems to include successful results for the earlier deleteEC2RouteTablesByVPC and the later deleteEC2VPCEndpointsByVPC but with nothing (successful or otherwise) from deleteEC2SecurityGroupsByVPC or deleteEC2SubnetsByVPC:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2214/pull-ci-openshift-installer-master-e2e-aws-upgrade/1828/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep '2019-08-14T08:47:16'
time="2019-08-14T08:47:16Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53" id=vpc-028406fc8261cde53 network interface=eni-0e11139a671c067c2
time="2019-08-14T08:47:16Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53" id=vpc-028406fc8261cde53 network interface=eni-06f4f3ec769ccc88e
time="2019-08-14T08:47:16Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53" id=vpc-028406fc8261cde53 network interface=eni-0be8a73d9a5bcdc9a
time="2019-08-14T08:47:16Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53" id=vpc-028406fc8261cde53 table=rtb-02f4cc762db95fa77
time="2019-08-14T08:47:16Z" level=info msg=Deleted VPC endpoint=vpce-0ae761076c8a81705 arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53" id=vpc-028406fc8261cde53
time="2019-08-14T08:47:16Z" level=debug msg="DependencyViolation: The vpc 'vpc-028406fc8261cde53' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 2cb00845-55f8-4dd4-9b17-3cff5cb78102" arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53"

Sometimes CI leaks untagged security groups.  Because we are allowed
to remove all resources from within a cluster-owned VPC, add a ByVPC
walker to remove these indirectly-owned groups.

The default name skip avoids errors like:

  time="2019-08-22T12:39:23-07:00" level=debug msg="deleting EC2 security group sg-07c2e6d7b620fb39c: CannotDelete: the specified group: \"sg-07c2e6d7b620fb39c\" name: \"default\" cannot be deleted by a user\n\tstatus code: 400, request id: c88fd74c-77c3-41fe-badb-c53e8022226d" arn="arn:aws:ec2:us-west-2:269733383066:vpc/vpc-0c9097bf5797f611b"

Without the name guard, hitting the error would cause an early exit
from deleteEC2SecurityGroupsByVPC, and mean we never progressed
further in deleteEC2VPC, leading a hung cluster teardown.
@wking wking force-pushed the delete-subnets-by-vpc branch from 7de50d2 to 55630a3 Compare August 22, 2019 20:20
@wking
Copy link
Member Author

wking commented Aug 22, 2019

I think I fixed the teardown hang with 7de50d2bb -> 55630a3, which no longer attempts to delete security groups with the reserved default name.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Aug 22, 2019

@wking: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 7de50d2bb27ac814024a3330a22c640e68c0d639 link /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Copy link
Member Author

wking commented Aug 22, 2019

e2e-aws has lots of:

 Aug 22 21:17:28.789: INFO: Couldn't delete ns: "e2e-svcaccounts-2381": namespace e2e-svcaccounts-2381 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace e2e-svcaccounts-2381 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"}) 

which is rhbz#1727090.

/test e2e-aws

@wking
Copy link
Member Author

wking commented Aug 23, 2019

Retest request didn't stick. Trying again:

/test e2e-aws

@wking
Copy link
Member Author

wking commented Aug 23, 2019

Green enough (just scaleup still in flight). CC @abhinavdahiya for review.

@wking wking changed the title pkg/destroy/aws: Delete security groups and subnets by VPC Bug 1747519: pkg/destroy/aws: Delete security groups and subnets by VPC Aug 30, 2019
@openshift-ci-robot
Copy link
Contributor

@wking: This pull request references Bugzilla bug 1747519, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Bug 1747519: pkg/destroy/aws: Delete security groups and subnets by VPC

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 30, 2019
@abhinavdahiya
Copy link
Contributor

Do we have an idea on how many calls it adds to our already on-the-edge rate limiting..? VPC is usually the last resource, and this ByVPC is pretty expensive.

@wking
Copy link
Member Author

wking commented Sep 4, 2019

Do we have an idea on how many calls it adds...

Comparing e2e-aws (which used the new code for teardown) and e2e-aws-upgrade (which used whatever was in master at the time):

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2214/pull-ci-openshift-installer-master-e2e-aws/7377/artifacts/e2e-aws/installer/.openshift_install.log | grep 'DependencyViolation.*\(sg-\|subnet\|vpc\)' | wc -l
38
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2214/pull-ci-openshift-installer-master-e2e-aws-upgrade/1934/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'DependencyViolation.*\(sg-\|subnet\|vpc\)' | wc -l
36

So perhaps a 6% increase, although I haven't dug deeper to see if one or the other had especially slow instance removal or similar.

@abhinavdahiya
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 6fdffbb into openshift:master Sep 10, 2019
@openshift-ci-robot
Copy link
Contributor

@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1747519 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1747519: pkg/destroy/aws: Delete security groups and subnets by VPC

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the delete-subnets-by-vpc branch September 10, 2019 04:10
wking added a commit to wking/hive that referenced this pull request Sep 10, 2019
... and security groups.  Pulling in openshift/installer@55630a304
(pkg/destroy/aws: Delete security groups by VPC,
2019-08-13, openshift/installer#2214).

Generated with:

  $ sed -i 's/ae2baf820f22b9bf1ed40932e7b702d790087574/6fdffbb9c21d90ba97fc79dff3ccbeac22fb2a0e/' Gopkg.toml
  $ dep ensure

using:

  $ dep version
  dep:
   version     : v0.5.1
   build date  : 2019-03-20
   git hash    : faa61893
   go version  : go1.10.3
   go compiler : gc
   platform    : linux/amd64
   features    : ImportDuringSolve=false

The k8s.io/apimachinery/pkg/util/clock addition in Gopkg.lock is
probably catching up with c759bfa (Use expectations to ensure that
clusterDeployment controller does not create multiple overlapping
clusterprovisions, 2019-08-23, openshift#518).
wking added a commit to wking/openshift-installer that referenced this pull request Sep 11, 2019
When there are multiple untagged security groups, a security group may
fail deletion with DependencyViolation.  Before this commit, that
aborted that round of deleteEC2VPC, and when the blocking dependency
was another untagged security group, deletion would hang [1].

With this commit, we keep iterating through all untagged security
groups (just like we do for tagged security groups).  For security
group cycles, deletion will fail for the first few groups while we
drain out their ingress and egress rules.  But deletion will succeed
for at least some of the cycle.  And the rule draining will have
completely deconstructed the cycle, so the second pass through will
clear out all the security groups.

So prior to 55630a3 (pkg/destroy/aws: Delete security groups by
VPC, 2019-08-13, openshift#2214), destroy would hang on *any* untagged security
group.  And prior to this commit, it would hang when two untagged
security groups referenced each other (or, if we got unlucky, A
referenced B and our DescribeSecurityGroupsPages call always happened
to return B first).  With this commit, we should be good to go.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1747519#c5
@wking
Copy link
Member Author

wking commented Sep 11, 2019

If we pick this back to release-4.1, we want to pick #2346 back too. This change alone is still an improvement, but #2346 closes a preexisting hole that this PR missed.

wking added a commit to wking/openshift-installer that referenced this pull request Oct 11, 2019
Fix a bug from 37a7f49 (pkg/destroy/aws: Delete subnets by VPC,
2019-08-13, openshift#2214), where failFast=false lead to all deleteEC2Subnet
errors being ignored.  Luckily we have 'failFast: true' for them, so
no exposed change, but fixing this bug protects us from exposing it if
we toggle failFast in the future.
wking added a commit to wking/openshift-installer that referenced this pull request Oct 17, 2019
This allows users to feed in prexisting subnets.  I've also added
classification logic copied from the upstream Kubernetes AWS cloud
provider for categorizing private vs. public subnets.  There's no
explicit install-config field for the VPC; we'll extract that from the
given subnets.

populateSubnets is a bit heavy if all you need is the VPC, but Abhinav
prefers it [1], it would save future public/private subnet lookup by
warming those caches, and we'll only call Metadata.AWS late in
pkg/asset/cluster/tfvars.go (via a future commit), so the VPC
cache-warming logic doesn't get called at the moment anyway.

There's no verification yet; we'll get to that in follow-up work.

More on the DescribeSubnetsPagesWithContext FIXME in 37a7f49
(pkg/destroy/aws: Delete subnets by VPC, 2019-08-13, openshift#2214).

[1]: openshift#2477 (comment)
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
When there are multiple untagged security groups, a security group may
fail deletion with DependencyViolation.  Before this commit, that
aborted that round of deleteEC2VPC, and when the blocking dependency
was another untagged security group, deletion would hang [1].

With this commit, we keep iterating through all untagged security
groups (just like we do for tagged security groups).  For security
group cycles, deletion will fail for the first few groups while we
drain out their ingress and egress rules.  But deletion will succeed
for at least some of the cycle.  And the rule draining will have
completely deconstructed the cycle, so the second pass through will
clear out all the security groups.

So prior to 55630a3 (pkg/destroy/aws: Delete security groups by
VPC, 2019-08-13, openshift#2214), destroy would hang on *any* untagged security
group.  And prior to this commit, it would hang when two untagged
security groups referenced each other (or, if we got unlucky, A
referenced B and our DescribeSecurityGroupsPages call always happened
to return B first).  With this commit, we should be good to go.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1747519#c5
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
Fix a bug from 37a7f49 (pkg/destroy/aws: Delete subnets by VPC,
2019-08-13, openshift#2214), where failFast=false lead to all deleteEC2Subnet
errors being ignored.  Luckily we have 'failFast: true' for them, so
no exposed change, but fixing this bug protects us from exposing it if
we toggle failFast in the future.
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
This allows users to feed in prexisting subnets.  I've also added
classification logic copied from the upstream Kubernetes AWS cloud
provider for categorizing private vs. public subnets.  There's no
explicit install-config field for the VPC; we'll extract that from the
given subnets.

populateSubnets is a bit heavy if all you need is the VPC, but Abhinav
prefers it [1], it would save future public/private subnet lookup by
warming those caches, and we'll only call Metadata.AWS late in
pkg/asset/cluster/tfvars.go (via a future commit), so the VPC
cache-warming logic doesn't get called at the moment anyway.

There's no verification yet; we'll get to that in follow-up work.

More on the DescribeSubnetsPagesWithContext FIXME in 37a7f49
(pkg/destroy/aws: Delete subnets by VPC, 2019-08-13, openshift#2214).

[1]: openshift#2477 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants