Alibaba: fix: destroy the records of the current cluster #5421

bd233 · 2021-11-26T02:01:37Z

Filter the resolution records of the base domain through api.<cluster name>,
otherwise it will cause all records to be released.

openshift-ci · 2021-11-26T02:02:14Z

Hi @bd233. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mtulio · 2021-11-26T17:49:03Z

pkg/destroy/alibabacloud/alibabacloud.go

You could consider using a structured logger here. See the example on the destroy flow for Route53 in AWS:
https://github.com/openshift/installer/blob/06e3fe64f3616f6d6baff95dde5cd3389cf785aa/pkg/destroy/aws/aws.go#L1824-L1832

The logger for this function is being used in every function that will call, tracking the id of hosted zone.

rr := fmt.Sprintf("api.%s", strings.Split(o.ClusterDomain, ".")[0])

@kwoodson @staebler I am also not sure if we should delete all RRs with a suffix of the o.ClusterDomain, rather than only the api.*, I guess we will have a leak of RR *.apps.(o.ClusterDomain) using this approach. wdyt?

@bd233 I tested this code and the code from the #5411.

The domain I was using was *.apps.test (created by the ingress controller) as well as api.test (created by the installer). I created another domain to verify that it was not removed (*.apps.test1 and api.test1). When calling the destroy cluster I verified that api.test was removed. This is an improvement but I would also like to see the record from the ingress removed *.apps.test. We do not want to leak these resource records.

I think we will want to remove all resource records with the o.ClusterDomain. That should be relatively safe since the cluster domain should be unique.

Not exactly. We cannot guarantee that the cluster domain is unique. What we do for AWS, which I recommend following here as well, is that we only delete the public DNS record if there is also a private DNS record for the same domain.
See

installer/pkg/destroy/aws/aws.go

Lines 1832 to 1865 in d3736c2

publicZoneID, err := getPublicHostedZone(ctx, client, id, logger)

if err != nil {

// In some cases AWS may return the zone in the list of tagged resources despite the fact

// it no longer exists.

if err.(awserr.Error).Code() == route53.ErrCodeNoSuchHostedZone {

return nil

}

return err

}

recordSetKey := func(recordSet *route53.ResourceRecordSet) string {

return fmt.Sprintf("%s %s", *recordSet.Type, *recordSet.Name)

}

publicEntries := map[string]*route53.ResourceRecordSet{}

if len(publicZoneID) != 0 {

err = client.ListResourceRecordSetsPagesWithContext(

ctx,

&route53.ListResourceRecordSetsInput{HostedZoneId: aws.String(publicZoneID)},

func(results *route53.ListResourceRecordSetsOutput, lastPage bool) bool {

for _, recordSet := range results.ResourceRecordSets {

key := recordSetKey(recordSet)

publicEntries[key] = recordSet

}

return !lastPage

},

)

if err != nil {

return err

}

} else {

logger.Debug("shared public zone not found")

}

Sorry, I don't understand why the cluster domain is not unique. Can you provide more details. Can users create two clusters with the same cluster name and base domain (I think this is not feasible)? Or consider that the user may manually add the records containing the cluster domain？

The installer does not know about any other clusters that exist. So the installer cannot directly verify that someone has not attempted to install a cluster with the exact same cluster domain. Moreover, the destroyer must be resilient to any changes that have been made in the cloud account in between when the installation attempt occurred and when the destroyer is called. The destroyer should always err on the side of caution because the worst thing that it can do is to delete a resource that did not actually belong to the cluster.

Let's look at an example. The user attempts to create a cluster with the cluster domain test.example.com. The installation failed at some point before creating the public DNS records. The next day, the user fixes whatever was causing the installation to fail the first time and tries another installation with the same cluster domain. This installation was successful. The next day, the user realizes that there are some leftover resources from the first installation attempt because the user forgot to destroy the cluster. The user runs the destroyer on the partial cluster from the first attempt. The destroyer sees that there is a *.apps.test.example.com public DNS record, so the destroyer deletes it. Now the user cannot access the apps on their working cluster because the public DNS record has been deleted.

mtulio · 2021-11-30T16:01:47Z

@bd233 ptal on those comments above?:

We should be using structured logs (as commented on slack)
We need to make sure that all RRs for the cluster must be removed when the cluster is destroyed. In this approach, the RR *.apps is leaking.

Thanks!

bd233 · 2021-12-01T09:40:05Z

@bd233 ptal on those comments above?:

We should be using structured logs (as commented on slack)

We need to make sure that all RRs for the cluster must be removed when the cluster is destroyed. In this approach, the RR *.apps is leaking.

Thanks!

Okey, thanks, I will update it as soon as possible.

bd233 · 2021-12-01T15:30:02Z

@staebler @mtulio I submitted the changes to formatting the log on #5435 PR. If it meets expectations, I will continue to modify all logs.

kwoodson · 2021-12-02T15:57:10Z

I'll test this PR this morning when removing my cluster.

patrickdillon · 2021-12-02T19:26:42Z

/ok-to-test

mtulio · 2021-12-03T19:10:16Z

@bd233 : @kwoodson and I tested this today and it's deleting all the records from the public zone with suffix of clusterName. It is not the expected behavior because we could have RRs created by any other component outside of OpenShift (example: manual) that is being deleted.

In my tests I create on the console one RR mrb.test on the public zone alicloud-dev.devcluster.openshift.com and it has been deleted when we triggered the destroy flow.

Please review the suggestion above of @staebler what is done in AWS in this comment.

Could you review it? Thanks!

bd233 · 2021-12-06T11:21:20Z

Please review the suggestion above of @staebler what is done in AWS in this comment.

@mtulio @staebler @kwoodson I have updated the deletion strategy according to the suggestion, please review it.

mtulio · 2021-12-08T18:14:42Z

@bd233 , I just test this PR and the DNS records created by the installer were successfully removed from the public DNS zone, alongside the private zone.

DEBUG Start to search DNS records    
DEBUG Start to search private zones                
DEBUG Start to delete DNS records                  
DEBUG Start to delete DNS record "734306694439286784" 
DEBUG Start to search private zones                
DEBUG Start to delete private zones                
DEBUG Start to delete private zone "550502d3a7eda57b049634b699a553f6"

In my tests, my cluster was not successfully installed, for that reason only one record was removed (the api.${clusterName} one). FYI @kwoodson

I also would like to mention that the output will be improved when the PR #5435 will be merged.
/lgtm

mtulio · 2021-12-08T18:16:31Z

/retest-required

mtulio · 2021-12-08T18:17:26Z

@kwoodson @staebler @patrickdillon ptal?

staebler

Shoot. Forgot to press the "Submit review" button. Sorry for the delay.

staebler · 2021-12-06T16:44:35Z

pkg/destroy/alibabacloud/alibabacloud.go

This should not return an error.

Thank you. At the same time, I realized a problem. Determine whether to delete the public analysis record through the analysis record of privateZone, but the privateZone or its analysis record may not exist (for example, the user manually deleted it). I am referring to AWS to try to get inspiration.

If the user manually deletes the private zone or the analysis record, then the public record will be leaked. There is no way around that. The same thing would happen with AWS.

OK...I have updated.

staebler · 2021-12-06T16:49:51Z

pkg/destroy/alibabacloud/alibabacloud.go

How does the alibaba sdk handle paging? Is there a default page size? Or will the sdk return all of the results unless paging is explicitly requested. There could be a lot of records in the base domain zone.

The default page size is 20, and the SDK will not return all results. If this has caused a problem, I can temporarily increase the page size, such as 50. This is not a good solution, I will create a new PR to add the function of automatically obtaining all paging to alibabacloud.client.

I do not think that the paging issue is limited to just this call. We need to handle the possibility of paging in all of the calls that we make to the SDK.

But, yes, let's move that to a separate PR.

staebler · 2021-12-06T16:56:16Z

pkg/destroy/alibabacloud/alibabacloud.go

Do not return when there is an error deleting a record. Continue attempting to delete the other records.

staebler · 2021-12-06T17:00:36Z

pkg/destroy/alibabacloud/alibabacloud.go

Not something to change for this PR, but this approach of waiting for the DNS records to delete, blocking the destroyer from doing further work, is not ideal.

mtulio · 2021-12-08T19:26:11Z

/lgtm cancel
@bd233 could you please take a look at the @staebler comments above? I can run new tests when it's ready. Thanks!

kwoodson · 2021-12-10T01:21:44Z

@bd233 Please update this PR. I believe this is a small fix. Thanks!

staebler · 2021-12-10T14:14:16Z

pkg/destroy/alibabacloud/alibabacloud.go

Suggested change

return "", errors.Wrap(err, fmt.Sprintf("matched to multiple private zones by clustedomain %q", clusterDomain))

return "", errors.Wrap(err, fmt.Sprintf("matched to multiple private zones by clusterdomain %q", clusterDomain))

staebler · 2021-12-10T14:19:34Z

pkg/destroy/alibabacloud/alibabacloud.go

Suggested change

baseDomain := strings.Join(strings.Split(o.ClusterDomain, ".")[1:], ".")

clusterName := strings.Split(o.ClusterDomain, ".")[0]

domainParts := strings.Split(o.ClusterDomain, ".", 1)

if len(domainParts) < 2 {

return errors.New("could not determine cluster name from cluster domain")

}

clusterName := domainParts[0]

baseDomain := domainParts[1]

staebler · 2021-12-10T14:24:57Z

pkg/destroy/alibabacloud/alibabacloud.go

When there is an error deleting the record, remove the key from privateRecords so that we don't wait the full minute for the record to be deleted. But also capture that there was an error so that deleteDNSRecords returns an error later after the wait.

Have updated^^

Filter the resolution records of the base domain through `api.<cluster name>`, otherwise it will cause all records to be released. Signed-off-by: sunhui <[email protected]>

jstuever · 2021-12-10T18:20:11Z

/uncc

staebler

/lgtm
/approve

openshift-ci · 2021-12-10T18:36:18Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [staebler]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2021-12-10T18:57:56Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-12-10T19:36:51Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

mtulio · 2021-12-10T20:25:04Z

/retest-required

openshift-bot · 2021-12-10T20:41:53Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-ci · 2021-12-10T20:52:56Z

@bd233: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-single-node	`e2bfffb`	link	false	`/test e2e-aws-single-node`
ci/prow/e2e-libvirt	`e2bfffb`	link	false	`/test e2e-libvirt`
ci/prow/e2e-aws-workers-rhel8	`e2bfffb`	link	false	`/test e2e-aws-workers-rhel8`
ci/prow/okd-e2e-aws	`e2bfffb`	link	false	`/test okd-e2e-aws`
ci/prow/e2e-aws-fips	`e2bfffb`	link	false	`/test e2e-aws-fips`
ci/prow/e2e-crc	`e2bfffb`	link	false	`/test e2e-crc`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 26, 2021

openshift-ci bot requested review from jhixson74 and jstuever November 26, 2021 02:02

mtulio reviewed Nov 26, 2021

View reviewed changes

kwoodson mentioned this pull request Nov 30, 2021

Alibaba: fix: use EXACT search mode #5411

Merged

bd233 force-pushed the alibabacloud-destroy-records branch from 06e3fe6 to e37e5b6 Compare December 1, 2021 13:22

openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 2, 2021

bd233 force-pushed the alibabacloud-destroy-records branch from 9f84080 to 0a85f51 Compare December 6, 2021 11:18

openshift-ci bot assigned mtulio Dec 8, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 8, 2021

staebler reviewed Dec 8, 2021

View reviewed changes

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 8, 2021

bd233 force-pushed the alibabacloud-destroy-records branch from 0a85f51 to 6dfd1ad Compare December 10, 2021 05:24

staebler reviewed Dec 10, 2021

View reviewed changes

Alibaba: fix: destroy the records of the current cluster

e2bfffb

Filter the resolution records of the base domain through `api.<cluster name>`, otherwise it will cause all records to be released. Signed-off-by: sunhui <[email protected]>

bd233 force-pushed the alibabacloud-destroy-records branch from 6dfd1ad to e2bfffb Compare December 10, 2021 18:12

openshift-ci bot removed the request for review from jstuever December 10, 2021 18:20

staebler reviewed Dec 10, 2021

View reviewed changes

openshift-ci bot assigned staebler Dec 10, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2021

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2021

openshift-merge-robot merged commit 73e4935 into openshift:master Dec 10, 2021

	publicZoneID, err := getPublicHostedZone(ctx, client, id, logger)
	if err != nil {
	// In some cases AWS may return the zone in the list of tagged resources despite the fact
	// it no longer exists.
	if err.(awserr.Error).Code() == route53.ErrCodeNoSuchHostedZone {
	return nil
	}
	return err
	}

	recordSetKey := func(recordSet *route53.ResourceRecordSet) string {
	return fmt.Sprintf("%s %s", recordSet.Type, recordSet.Name)
	}

	publicEntries := map[string]*route53.ResourceRecordSet{}
	if len(publicZoneID) != 0 {
	err = client.ListResourceRecordSetsPagesWithContext(
	ctx,
	&route53.ListResourceRecordSetsInput{HostedZoneId: aws.String(publicZoneID)},
	func(results *route53.ListResourceRecordSetsOutput, lastPage bool) bool {
	for _, recordSet := range results.ResourceRecordSets {
	key := recordSetKey(recordSet)
	publicEntries[key] = recordSet
	}

	return !lastPage
	},
	)
	if err != nil {
	return err
	}
	} else {
	logger.Debug("shared public zone not found")
	}

	return "", errors.Wrap(err, fmt.Sprintf("matched to multiple private zones by clustedomain %q", clusterDomain))
	return "", errors.Wrap(err, fmt.Sprintf("matched to multiple private zones by clusterdomain %q", clusterDomain))

-	baseDomain := strings.Join(strings.Split(o.ClusterDomain, ".")[1:], ".")
-	clusterName := strings.Split(o.ClusterDomain, ".")[0]
+	domainParts := strings.Split(o.ClusterDomain, ".", 1)
+	if len(domainParts) < 2 {
+		return errors.New("could not determine cluster name from cluster domain")
+	}
+	clusterName := domainParts[0]
+	baseDomain := domainParts[1]

Alibaba: fix: destroy the records of the current cluster #5421

Alibaba: fix: destroy the records of the current cluster #5421

Uh oh!

Conversation

bd233 commented Nov 26, 2021

Uh oh!

openshift-ci bot commented Nov 26, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mtulio commented Nov 30, 2021

Uh oh!

bd233 commented Dec 1, 2021

Uh oh!

bd233 commented Dec 1, 2021

Uh oh!

kwoodson commented Dec 2, 2021

Uh oh!

patrickdillon commented Dec 2, 2021

Uh oh!

mtulio commented Dec 3, 2021

Uh oh!

bd233 commented Dec 6, 2021

Uh oh!

mtulio commented Dec 8, 2021

Uh oh!

mtulio commented Dec 8, 2021

Uh oh!

mtulio commented Dec 8, 2021

Uh oh!

staebler left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mtulio commented Dec 8, 2021

Uh oh!

kwoodson commented Dec 10, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jstuever commented Dec 10, 2021

Uh oh!

staebler left a comment

Choose a reason for hiding this comment

Uh oh!