-
Notifications
You must be signed in to change notification settings - Fork 1.5k
alibaba: implement cluster destroy #5348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
alibaba: implement cluster destroy #5348
Conversation
|
Hi @bd233. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
3fce36d to
e118273
Compare
e118273 to
13e0ac8
Compare
|
@patrickdillon I have been testing this destroy code and it does clean up the cluster. I'll go through and verify again. |
@kwoodson I have fixed it. Please verify |
|
@patrickdillon Please review whether this is the desired validation Alibaba: fix: add metadata server IP validation |
|
@kwoodson I have a question, what should the number of worker nodes in the default configuration (if I do not modify the install-config file) depend on? Is the number of all available zones available? |
|
@bd233 The zones are determined here https://github.com/openshift/installer/blob/master/pkg/asset/machines/worker.go#L239-L249 As you know, this can be determined in any method you desire. I think the current behavior is 1 worker per zone but I think the standard node count is generally 3 worker nodes. That provides sufficient space to run the ingress with HA as well as the other cluster resources (registry, console, etc). It appears that other providers do something similar. AWS looks at subnets or number of AZs. I think you will always get number of availabilityzones + 1. Due to this code: |
|
Default cluster should always have 3 compute nodes. |
|
@bd233 I've noticed that the SLB that is created by the ingress operator has DeletionProtection enabled. Therefore the ingress controller's SLB does not get removed. Also appears that the OSS bucket that is created for the bootstrap node also does not get deleted. |
patrickdillon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just reviewed the metadata server validation. This stands alone as it's own PR so I would suggest moving it to a new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because these functions are static validation--that is, they do not require connecting to the alibaba api--they should be moved to pkg/types. Specifically here: https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L465
the validatePlatform function will need to pass in the networking struct, similar to how OpenStack does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
matedataServerIP -> metadataServerIP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably check the cluster network and service network as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the IP must not be in one of the machine networks" -> "the contains 100.100.100.200 which is reserved for the metadata service"
thx, I think I understand the logic here, |
@kwoodson I think this has nothing to do with deleting protection. If you run the In addition, as you know, the Installer will create two SLB instances. The installer searches for the two SLB instances through tags
I created the cluster and deleted it, but it did not reproduce the problem. Is there any more detailed information to help me reproduce it? |
a0f6d80 to
dfb149e
Compare
|
@bd233 Let me test the latest code and verify if the SLB gets removed. I'll also look into the OSS buckets. |
patrickdillon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easier to review if you broke this in to multiple PRs. One for destroy, one for existing VPC, etc.
Also please squash where possible. I know there is at least one destroy commit that could be squashed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As staebler pointed out in the previous PR, the destroy code should never stop running if it knows that there are outstanding resources to be deleted. The destroyer should loop, attempting to destroy any resources as long as they exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is inappropriate to raise an error here.
But I think waitComplete here is redundant. The delete function of each cloud resource already contains the logic to wait for the deletion to be completed, so I should remove waitComplete.
67d0892 to
4ddbe60
Compare
|
/retitle alibaba: implement cluster destroy |
|
/ok-to-test |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: patrickdillon The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
patrickdillon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you fix this one small error to pass the test, and we should be ready to merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is causing the go-lint test to fail:
/go/src/github.com/openshift/installer/pkg/destroy/alibabacloud/alibabacloud.go:987:1: exported method ClusterUninstaller.ListTagResources should have comment or be unexported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, please continue
4ddbe60 to
4a6ee69
Compare
|
@bd233 I attempted to run this destroy code today but it continued to run. I ran this with If the record |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not love this solution and I am not proposing it but this addition was able to skip the resources that are returning an error message like DEBUG Message: The specified domain name does not exist. Refresh the page and try again.:
if ferr != nil {
if strings.Contains(ferr.Error(), "not exist") {
return true, nil
}
o.Logger.Debugf("%s: %v", f.name, ferr)
return false, nil
}
I recommend checking the ferr message for objects that do not exist and return true to continue the deletion.
@bd233 @patrickdillon WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the deleteDNSRecords function incorrectly returning an error when it can't find the domain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deleteDNSRecords is returning an error that the DNS record does not exist. I think this behavior is expected from the Alibaba API. We need to gracefully handle NoExist errors so that deletion can complete successfully.
@bd233 Can you confirm this behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to check to see if it exists before trying to delete. That is what I thought this code was doing: https://github.com/openshift/installer/pull/5348/files#diff-5d31e7ab73dc252cc09331a0790fcc8b6cd3add977402effbe19a1b1b24fba39R1249-R1255
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kwoodson @patrickdillon Sorry to reply so late.
I think this is an abnormal scenario. I read the Start to search DNS records log, but did not see Start to delete DNS records, so the error should come from records, err: = o.listrecord (basedomain), But this should not be, if the domain name does not exist, it should be returned here:
if len(domains) == 0 {
return
}
I am trying to reproduce and fix this problem as soon as possible.
I recommend checking the ferr message for objects that do not exist and return true to continue the deletion.
@bd233 @patrickdillon WDYT?
I don't think 'ferr' should be handled. This error should be resolved in 'deleteDnsRecords'. This solution may cause some resources not to be deleted
@kwoodson Are there any manual deletion operations in the process of executing the installer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bd233 Generally speaking there shouldn't be any but that doesn't stop customers from updating or modifying the cluster objects. The reason this DNS item was modified is that we set up valid DNS and removed the original before removing the cluster. That caused the missing DNS record. I think that it is not unreasonable to have components get modified or deleted outside of the cluster. I think if objects return NoExist they should be considered to be deleted and the destroy code can continue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kwoodson @patrickdillon
I think I understand this problem. Querying domainName is a fuzzy query. If baseDomain is openshift.com, then bd.openshift.com may appear in the result, and len (domains) == 0 is false, so...
I'll fix it as soon as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated this PR.
I think ListPrivateZones and ListDNSDomain in pkg/asset/installconfig/alibabacloud/client.go also has this problem. I will create a PR to fix it.
4a6ee69 to
9553bdf
Compare
patrickdillon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain (briefly) the fix for the dns issue? I am not sure I understand how it is being fixed.
Also it looks like an azure stack commit snuck in during a rebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the fix for the fuzzy domain search error? If so can you explain how this solves the problem? Also it would be nice to have a comment in the code for why we need the string manipulation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha~, about azure stack commit, it is a rebase mistake, I will updata it later.
The LIKE is default search mode on DescribeDomains api. In this mode, if you search openshift.com, can match xx.openshift.com and openshift.com.xx, but using EXACT mode, only return openshift.com.
Also it would be nice to have a comment in the code for why we need the string manipulation
I need base domain to search domain and delete it records, but I can not get base domain, so I use cluster domin to splite by . (the format of cluster domain is <cluster name>.<base domain>).
I will add comment here.
Adds destroy code for the Alibaba platform. Alibaba: fix: turn off SLB and ECS protection Before deleting ECS and SLB, turn off the protection state Alibaba: fix: update asynchronous destroy mode Refer to IMBCloud to optimize the part of destroying the cluster Alibaba: fix: destroy SLB created by ingress operator Query SLB created by ingress operator through tag, and destory it.
This commit was produced by running, and all modules verified
9553bdf to
6a08c59
Compare
|
@bd233: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
I tested this yesterday: |
|
/lgtm |
Add code for the
openshift-install destroy clustercommand for cluster using the alibaba platform.