Alibaba: fix: destroy the records of the current cluster #5421
|
Hi @bd233. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
You could consider using a structured logger here. See the Route53 destroy flow for AWS as an example:
https://github.com/openshift/installer/blob/06e3fe64f3616f6d6baff95dde5cd3389cf785aa/pkg/destroy/aws/aws.go#L1824-L1832
The logger created in that function is passed to every function it calls, so each log line carries the ID of the hosted zone.
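As a rough illustration of that pattern, the sketch below derives a child logger that carries a zone field and hands it to a helper. It uses the standard library's `log/slog` purely as a stand-in so the example is self-contained; it is not the installer's logging code, and `destroyZone`, `deleteRecord`, and the zone ID `Z123` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// deleteRecord is a hypothetical helper. It logs with whatever logger it is
// given, so it does not need to know which zone it is working in.
func deleteRecord(logger *slog.Logger, name string) {
	logger.Info("deleted record", "record", name)
}

// destroyZone derives a child logger carrying the zone ID and passes it down,
// so every log line written by the helpers includes that field.
func destroyZone(base *slog.Logger, zoneID string, records []string) {
	logger := base.With("zone", zoneID)
	for _, r := range records {
		deleteRecord(logger, r)
	}
}

// runExample captures the log output in a buffer so it can be inspected.
func runExample() string {
	var buf bytes.Buffer
	handler := slog.NewTextHandler(&buf, &slog.HandlerOptions{
		// Drop the timestamp so the output is stable between runs.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if a.Key == slog.TimeKey && len(groups) == 0 {
				return slog.Attr{}
			}
			return a
		},
	})
	destroyZone(slog.New(handler), "Z123", []string{"api.test.example.com"})
	return buf.String()
}

func main() {
	fmt.Print(runExample())
}
```

The helper never mentions the zone, yet its log line still includes `zone=Z123` because the field travels with the logger.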
@bd233 I tested this code and the code from #5411.
The records I was using were *.apps.test (created by the ingress controller) and api.test (created by the installer). I also created records for a second domain (*.apps.test1 and api.test1) to verify that they were not removed. After running destroy cluster, I verified that api.test was removed. This is an improvement, but I would also like to see the ingress record, *.apps.test, removed. We do not want to leak these resource records.
I think we will want to remove all resource records with the o.ClusterDomain. That should be relatively safe since the cluster domain should be unique.
Not exactly. We cannot guarantee that the cluster domain is unique. What we do for AWS, which I recommend following here as well, is that we only delete the public DNS record if there is also a private DNS record for the same domain.
See
installer/pkg/destroy/aws/aws.go
Lines 1832 to 1865 in d3736c2
    publicZoneID, err := getPublicHostedZone(ctx, client, id, logger)
    if err != nil {
        // In some cases AWS may return the zone in the list of tagged resources despite the fact
        // it no longer exists.
        if err.(awserr.Error).Code() == route53.ErrCodeNoSuchHostedZone {
            return nil
        }
        return err
    }
    recordSetKey := func(recordSet *route53.ResourceRecordSet) string {
        return fmt.Sprintf("%s %s", *recordSet.Type, *recordSet.Name)
    }
    publicEntries := map[string]*route53.ResourceRecordSet{}
    if len(publicZoneID) != 0 {
        err = client.ListResourceRecordSetsPagesWithContext(
            ctx,
            &route53.ListResourceRecordSetsInput{HostedZoneId: aws.String(publicZoneID)},
            func(results *route53.ListResourceRecordSetsOutput, lastPage bool) bool {
                for _, recordSet := range results.ResourceRecordSets {
                    key := recordSetKey(recordSet)
                    publicEntries[key] = recordSet
                }
                return !lastPage
            },
        )
        if err != nil {
            return err
        }
    } else {
        logger.Debug("shared public zone not found")
    }
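To make the rule concrete, here is a minimal, provider-neutral sketch of "only delete a public record if a private record with the same type and name exists". The `Record` type and `publicRecordsToDelete` function are hypothetical and are not taken from the installer; only the "type name" key mirrors the snippet above.

```go
package main

import "fmt"

// Record is a hypothetical, simplified DNS record.
type Record struct {
	Type string
	Name string
}

// recordKey mirrors the "type name" key built in the AWS snippet.
func recordKey(r Record) string {
	return fmt.Sprintf("%s %s", r.Type, r.Name)
}

// publicRecordsToDelete returns only the public records that have a matching
// record (same type and name) in the cluster's private zone.
func publicRecordsToDelete(private, public []Record) []Record {
	publicByKey := make(map[string]Record, len(public))
	for _, r := range public {
		publicByKey[recordKey(r)] = r
	}
	var matched []Record
	for _, r := range private {
		if p, ok := publicByKey[recordKey(r)]; ok {
			matched = append(matched, p)
		}
	}
	return matched
}

func main() {
	private := []Record{{Type: "A", Name: "api.test.example.com."}}
	public := []Record{
		{Type: "A", Name: "api.test.example.com."},
		{Type: "A", Name: "*.apps.test.example.com."},
		{Type: "A", Name: "api.other.example.com."},
	}
	fmt.Println(publicRecordsToDelete(private, public))
}
```

With these inputs only the public `api.test.example.com.` record is selected; the wildcard record is left alone because the private zone in this example has no counterpart for it.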
Sorry, I don't understand why the cluster domain would not be unique. Can you provide more details? Can users create two clusters with the same cluster name and base domain (I think this is not feasible)? Or is the concern that the user may manually add records containing the cluster domain?
The installer does not know about any other clusters that exist. So the installer cannot directly verify that someone has not attempted to install a cluster with the exact same cluster domain. Moreover, the destroyer must be resilient to any changes that have been made in the cloud account in between when the installation attempt occurred and when the destroyer is called. The destroyer should always err on the side of caution because the worst thing that it can do is to delete a resource that did not actually belong to the cluster.
Let's look at an example. The user attempts to create a cluster with the cluster domain test.example.com. The installation failed at some point before creating the public DNS records. The next day, the user fixes whatever was causing the installation to fail the first time and tries another installation with the same cluster domain. This installation was successful. The next day, the user realizes that there are some leftover resources from the first installation attempt because the user forgot to destroy the cluster. The user runs the destroyer on the partial cluster from the first attempt. The destroyer sees that there is a *.apps.test.example.com public DNS record, so the destroyer deletes it. Now the user cannot access the apps on their working cluster because the public DNS record has been deleted.
|
@bd233 PTAL at the comments above.
Thanks! |
Okay, thanks. I will update it as soon as possible. |
Force-pushed from 06e3fe6 to e37e5b6.
|
I'll test this PR this morning when removing my cluster. |
|
/ok-to-test |
|
@bd233: @kwoodson and I tested this today, and it is deleting all the records from the public zone with suffix of In my tests I create on the console one RR Please review @staebler's suggestion above, in this comment, about what is done in AWS. Could you review it? Thanks! |
Force-pushed from 9f84080 to 0a85f51.
|
@bd233, I just tested this PR, and the DNS records created by the installer were successfully removed from the public DNS zone, along with the private zone. In my tests, my cluster was not successfully installed, so only one record was removed (the I would also like to mention that the output will be improved once PR #5435 is merged. |
|
/retest-required |
|
@kwoodson @staebler @patrickdillon ptal? |
staebler
left a comment
Shoot. Forgot to press the "Submit review" button. Sorry for the delay.
This should not return an error.
Thank you. I also noticed a problem: we decide whether to delete a public DNS record based on the matching DNS record in the private zone, but the private zone or its records may no longer exist (for example, if the user deleted them manually). I am looking at the AWS implementation for ideas.
If the user manually deletes the private zone or the DNS record, then the public record will be leaked. There is no way around that. The same thing would happen with AWS.
OK...I have updated.
How does the Alibaba SDK handle paging? Is there a default page size? Or will the SDK return all of the results unless paging is explicitly requested? There could be a lot of records in the base domain zone.
The default page size is 20, and the SDK will not return all results. If this causes a problem, I can temporarily increase the page size, for example to 50. That is not a good solution, so I will create a new PR that adds automatic retrieval of all pages to alibabacloud.client.
I do not think that the paging issue is limited to just this call. We need to handle the possibility of paging in all of the calls that we make to the SDK.
But, yes, let's move that to a separate PR.
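For that follow-up, a paging loop could look roughly like the sketch below. It does not use the Alibaba Cloud SDK; `fetchPage` is a hypothetical stand-in for one paged call that returns the records on a page plus the total count, and `fakeZone` is an in-memory fake used only to exercise the loop.

```go
package main

import "fmt"

// fetchPage is a hypothetical stand-in for one paged SDK call. It returns the
// records on the requested page (1-based) and the total number of records.
type fetchPage func(pageNumber, pageSize int) (records []string, total int, err error)

// listAll keeps requesting pages until every record has been collected.
func listAll(fetch fetchPage, pageSize int) ([]string, error) {
	var all []string
	for page := 1; ; page++ {
		records, total, err := fetch(page, pageSize)
		if err != nil {
			return nil, err
		}
		all = append(all, records...)
		// Stop once everything is collected, or if a page comes back empty.
		if len(all) >= total || len(records) == 0 {
			return all, nil
		}
	}
}

// fakeZone builds an in-memory fetchPage over n generated record names.
func fakeZone(n int) fetchPage {
	return func(pageNumber, pageSize int) ([]string, int, error) {
		start := (pageNumber - 1) * pageSize
		if start > n {
			start = n
		}
		end := start + pageSize
		if end > n {
			end = n
		}
		page := make([]string, 0, end-start)
		for i := start; i < end; i++ {
			page = append(page, fmt.Sprintf("record-%d", i))
		}
		return page, n, nil
	}
}

func main() {
	all, err := listAll(fakeZone(45), 20)
	fmt.Println(len(all), err)
}
```

With a page size of 20 and 45 records, the loop makes three calls and returns all 45 records instead of only the first page.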
Do not return when there is an error deleting a record. Continue attempting to delete the other records.
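One way to follow this advice is to collect the errors and keep going, then return them together at the end. This is a sketch only: `deleteAll` and `failOn` are hypothetical names, and `errors.Join` requires Go 1.20 or newer.

```go
package main

import (
	"errors"
	"fmt"
)

// deleteAll attempts every deletion, even after a failure, and returns the
// combined errors (or nil if every deletion succeeded).
func deleteAll(names []string, del func(string) error) error {
	var errs []error
	for _, name := range names {
		if err := del(name); err != nil {
			errs = append(errs, fmt.Errorf("deleting %s: %w", name, err))
		}
	}
	return errors.Join(errs...)
}

// failOn returns a hypothetical delete function that fails for one record
// name and counts how many times it was called.
func failOn(bad string, calls *int) func(string) error {
	return func(name string) error {
		*calls++
		if name == bad {
			return errors.New("simulated failure")
		}
		return nil
	}
}

func main() {
	calls := 0
	err := deleteAll([]string{"a", "b", "c"}, failOn("b", &calls))
	fmt.Println(calls, err)
}
```

All three deletions are attempted even though the second one fails, and the failure is still reported to the caller.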
Not something to change for this PR, but this approach of waiting for the DNS records to delete, blocking the destroyer from doing further work, is not ideal.
|
@bd233 Please update this PR. I believe this is a small fix. Thanks! |
Force-pushed from 0a85f51 to 6dfd1ad.
    - return "", errors.Wrap(err, fmt.Sprintf("matched to multiple private zones by clustedomain %q", clusterDomain))
    + return "", errors.Wrap(err, fmt.Sprintf("matched to multiple private zones by clusterdomain %q", clusterDomain))
    - baseDomain := strings.Join(strings.Split(o.ClusterDomain, ".")[1:], ".")
    - clusterName := strings.Split(o.ClusterDomain, ".")[0]
    + domainParts := strings.SplitN(o.ClusterDomain, ".", 2)
    + if len(domainParts) < 2 {
    +     return errors.New("could not determine cluster name from cluster domain")
    + }
    + clusterName := domainParts[0]
    + baseDomain := domainParts[1]
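A runnable sketch of this split, for reference. It uses `strings.SplitN` with a limit of 2, because `strings.Split` accepts only two arguments, so the cluster name is separated from the rest of the domain in one step. The function name `splitClusterDomain` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// splitClusterDomain splits "<cluster name>.<base domain>" at the first dot.
func splitClusterDomain(clusterDomain string) (clusterName, baseDomain string, err error) {
	parts := strings.SplitN(clusterDomain, ".", 2)
	if len(parts) < 2 {
		return "", "", errors.New("could not determine cluster name from cluster domain")
	}
	return parts[0], parts[1], nil
}

func main() {
	name, base, err := splitClusterDomain("test.example.com")
	fmt.Println(name, base, err)

	// A domain without a dot cannot be split into name and base domain.
	_, _, err = splitClusterDomain("localhost")
	fmt.Println(err)
}
```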
When there is an error deleting the record, remove the key from privateRecords so that we don't wait the full minute for the record to be deleted. But also capture that there was an error so that deleteDNSRecords returns an error later after the wait.
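A minimal sketch of that bookkeeping follows, under the assumption that the pending records are tracked in a map keyed by record name. `deleteAndTrack`, `deleteThenWait`, and `failFor` are hypothetical names, not the functions in this PR, and `errors.Join` requires Go 1.20 or newer.

```go
package main

import (
	"errors"
	"fmt"
)

// deleteAndTrack issues a delete for every pending record. A record whose
// delete call fails is dropped from pending, so a later wait does not sit out
// its full timeout for it. The failures are joined and returned.
func deleteAndTrack(pending map[string]struct{}, del func(string) error) error {
	var errs []error
	for name := range pending {
		if err := del(name); err != nil {
			delete(pending, name) // removing entries during range is allowed in Go
			errs = append(errs, fmt.Errorf("deleting %s: %w", name, err))
		}
	}
	return errors.Join(errs...)
}

// deleteThenWait waits only for the records whose delete request succeeded and
// reports the captured delete errors after the wait has finished.
func deleteThenWait(pending map[string]struct{}, del func(string) error, wait func(map[string]struct{})) error {
	deleteErr := deleteAndTrack(pending, del)
	wait(pending)
	return deleteErr
}

// failFor returns a hypothetical delete function that fails for one name.
func failFor(bad string) func(string) error {
	return func(name string) error {
		if name == bad {
			return errors.New("simulated failure")
		}
		return nil
	}
}

func main() {
	pending := map[string]struct{}{"api": {}, "*.apps": {}}
	err := deleteThenWait(pending, failFor("*.apps"), func(p map[string]struct{}) {
		fmt.Println("waiting on", len(p), "record(s)")
	})
	fmt.Println(err)
}
```

Here the wait step only sees the record whose delete request went through, while the failure for the other record is still returned once the wait is over.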
Have updated^^
Filter the DNS records of the base domain by `api.<cluster name>`; otherwise, all records in the zone would be deleted. Signed-off-by: sunhui <[email protected]>
Force-pushed from 6dfd1ad to e2bfffb.
|
/uncc |
staebler
left a comment
/lgtm
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: staebler. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
@bd233: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |