-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Bug 1747519: pkg/destroy/aws: Delete security groups and subnets by VPC #2214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1747519: pkg/destroy/aws: Delete security groups and subnets by VPC #2214
Conversation
Sometimes CI leaks untagged subnets. Because we are allowed to remove all resources from within a cluster-owned VPC, add a ByVPC walker to remove these indirectly-owned subnets. DescribeSubnetsPages has a strange history. It was initially added in aws/aws-sdk-go@3664ecc7da (Add initial implementation of pagination, 2015-03-23, aws/aws-sdk-go#247) but was removed in aws/aws-sdk-go@bad551feb8 (Add support for multi-token pagination rules, 2015-03-27) as a later step in that same pull request. It finally landed in master via aws/aws-sdk-go@52cd98f1ed (Release v1.19.30, 2019-05-14, aws/aws-sdk-go#2599), but we only vendor v1.16.14. It doesn't seem like "zounds of untagged subnets" is a high-probability thing, so I'm just using the unpaginated DescribeSubnets instead of bumping the vendor to pick up DescribeSubnetsPages. Even if we do overflow DescribeSubnets with untagged subnets, VPC deletion will fail and we'll get another pass at deleting tagged subnets when we come around to the next deleteEC2VPC attempt.
|
This approach is fine for CI, where we own the whole VPC. It's going to break down when we get to bring-your-own-VPC, so we may want to consider another approach now.
However, neither of those approaches are going to work cleanly with creation runs that are killed (e.g. by a power outage on the install host) without recording the Terraform state. So we might want the belt-and-suspenders of these |
|
Hrm. The upgrade job passed, and from its installer log we can see that the one round seems to include successful results for the earlier $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2214/pull-ci-openshift-installer-master-e2e-aws-upgrade/1828/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep '2019-08-14T08:47:16'
time="2019-08-14T08:47:16Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53" id=vpc-028406fc8261cde53 network interface=eni-0e11139a671c067c2
time="2019-08-14T08:47:16Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53" id=vpc-028406fc8261cde53 network interface=eni-06f4f3ec769ccc88e
time="2019-08-14T08:47:16Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53" id=vpc-028406fc8261cde53 network interface=eni-0be8a73d9a5bcdc9a
time="2019-08-14T08:47:16Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53" id=vpc-028406fc8261cde53 table=rtb-02f4cc762db95fa77
time="2019-08-14T08:47:16Z" level=info msg=Deleted VPC endpoint=vpce-0ae761076c8a81705 arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53" id=vpc-028406fc8261cde53
time="2019-08-14T08:47:16Z" level=debug msg="DependencyViolation: The vpc 'vpc-028406fc8261cde53' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 2cb00845-55f8-4dd4-9b17-3cff5cb78102" arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-028406fc8261cde53" |
Sometimes CI leaks untagged security groups. Because we are allowed to remove all resources from within a cluster-owned VPC, add a ByVPC walker to remove these indirectly-owned groups. The default name skip avoids errors like: time="2019-08-22T12:39:23-07:00" level=debug msg="deleting EC2 security group sg-07c2e6d7b620fb39c: CannotDelete: the specified group: \"sg-07c2e6d7b620fb39c\" name: \"default\" cannot be deleted by a user\n\tstatus code: 400, request id: c88fd74c-77c3-41fe-badb-c53e8022226d" arn="arn:aws:ec2:us-west-2:269733383066:vpc/vpc-0c9097bf5797f611b" Without the name guard, hitting the error would cause an early exit from deleteEC2SecurityGroupsByVPC, and mean we never progressed further in deleteEC2VPC, leading a hung cluster teardown.
7de50d2 to
55630a3
Compare
|
I think I fixed the teardown hang with 7de50d2bb -> 55630a3, which no longer attempts to delete security groups with the reserved |
|
@wking: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
e2e-aws has lots of: which is rhbz#1727090. /test e2e-aws |
|
Retest request didn't stick. Trying again: /test e2e-aws |
|
Green enough (just scaleup still in flight). CC @abhinavdahiya for review. |
|
@wking: This pull request references Bugzilla bug 1747519, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Do we have an idea on how many calls it adds to our already |
Comparing e2e-aws (which used the new code for teardown) and e2e-aws-upgrade (which used whatever was in master at the time): $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2214/pull-ci-openshift-installer-master-e2e-aws/7377/artifacts/e2e-aws/installer/.openshift_install.log | grep 'DependencyViolation.*\(sg-\|subnet\|vpc\)' | wc -l
38
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2214/pull-ci-openshift-installer-master-e2e-aws-upgrade/1934/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'DependencyViolation.*\(sg-\|subnet\|vpc\)' | wc -l
36So perhaps a 6% increase, although I haven't dug deeper to see if one or the other had especially slow instance removal or similar. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhinavdahiya, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1747519 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
... and security groups. Pulling in openshift/installer@55630a304 (pkg/destroy/aws: Delete security groups by VPC, 2019-08-13, openshift/installer#2214). Generated with: $ sed -i 's/ae2baf820f22b9bf1ed40932e7b702d790087574/6fdffbb9c21d90ba97fc79dff3ccbeac22fb2a0e/' Gopkg.toml $ dep ensure using: $ dep version dep: version : v0.5.1 build date : 2019-03-20 git hash : faa61893 go version : go1.10.3 go compiler : gc platform : linux/amd64 features : ImportDuringSolve=false The k8s.io/apimachinery/pkg/util/clock addition in Gopkg.lock is probably catching up with c759bfa (Use expectations to ensure that clusterDeployment controller does not create multiple overlapping clusterprovisions, 2019-08-23, openshift#518).
When there are multiple untagged security groups, a security group may fail deletion with DependencyViolation. Before this commit, that aborted that round of deleteEC2VPC, and when the blocking dependency was another untagged security group, deletion would hang [1]. With this commit, we keep iterating through all untagged security groups (just like we do for tagged security groups). For security group cycles, deletion will fail for the first few groups while we drain out their ingress and egress rules. But deletion will succeed for at least some of the cycle. And the rule draining will have completely deconstructed the cycle, so the second pass through will clear out all the security groups. So prior to 55630a3 (pkg/destroy/aws: Delete security groups by VPC, 2019-08-13, openshift#2214), destroy would hang on *any* untagged security group. And prior to this commit, it would hang when two untagged security groups referenced each other (or, if we got unlucky, A referenced B and our DescribeSecurityGroupsPages call always happened to return B first). With this commit, we should be good to go. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1747519#c5
Fix a bug from 37a7f49 (pkg/destroy/aws: Delete subnets by VPC, 2019-08-13, openshift#2214), where failFast=false lead to all deleteEC2Subnet errors being ignored. Luckily we have 'failFast: true' for them, so no exposed change, but fixing this bug protects us from exposing it if we toggle failFast in the future.
This allows users to feed in prexisting subnets. I've also added classification logic copied from the upstream Kubernetes AWS cloud provider for categorizing private vs. public subnets. There's no explicit install-config field for the VPC; we'll extract that from the given subnets. populateSubnets is a bit heavy if all you need is the VPC, but Abhinav prefers it [1], it would save future public/private subnet lookup by warming those caches, and we'll only call Metadata.AWS late in pkg/asset/cluster/tfvars.go (via a future commit), so the VPC cache-warming logic doesn't get called at the moment anyway. There's no verification yet; we'll get to that in follow-up work. More on the DescribeSubnetsPagesWithContext FIXME in 37a7f49 (pkg/destroy/aws: Delete subnets by VPC, 2019-08-13, openshift#2214). [1]: openshift#2477 (comment)
When there are multiple untagged security groups, a security group may fail deletion with DependencyViolation. Before this commit, that aborted that round of deleteEC2VPC, and when the blocking dependency was another untagged security group, deletion would hang [1]. With this commit, we keep iterating through all untagged security groups (just like we do for tagged security groups). For security group cycles, deletion will fail for the first few groups while we drain out their ingress and egress rules. But deletion will succeed for at least some of the cycle. And the rule draining will have completely deconstructed the cycle, so the second pass through will clear out all the security groups. So prior to 55630a3 (pkg/destroy/aws: Delete security groups by VPC, 2019-08-13, openshift#2214), destroy would hang on *any* untagged security group. And prior to this commit, it would hang when two untagged security groups referenced each other (or, if we got unlucky, A referenced B and our DescribeSecurityGroupsPages call always happened to return B first). With this commit, we should be good to go. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1747519#c5
Fix a bug from 37a7f49 (pkg/destroy/aws: Delete subnets by VPC, 2019-08-13, openshift#2214), where failFast=false lead to all deleteEC2Subnet errors being ignored. Luckily we have 'failFast: true' for them, so no exposed change, but fixing this bug protects us from exposing it if we toggle failFast in the future.
This allows users to feed in prexisting subnets. I've also added classification logic copied from the upstream Kubernetes AWS cloud provider for categorizing private vs. public subnets. There's no explicit install-config field for the VPC; we'll extract that from the given subnets. populateSubnets is a bit heavy if all you need is the VPC, but Abhinav prefers it [1], it would save future public/private subnet lookup by warming those caches, and we'll only call Metadata.AWS late in pkg/asset/cluster/tfvars.go (via a future commit), so the VPC cache-warming logic doesn't get called at the moment anyway. There's no verification yet; we'll get to that in follow-up work. More on the DescribeSubnetsPagesWithContext FIXME in 37a7f49 (pkg/destroy/aws: Delete subnets by VPC, 2019-08-13, openshift#2214). [1]: openshift#2477 (comment)
Sometimes CI leaks untagged security groups and subnets. Because we are allowed to remove all resources from within a cluster-owned VPC, add
*ByVPCwalkers to remove these indirectly-owned resources.