Skip to content

Conversation

@bd233
Copy link
Contributor

@bd233 bd233 commented Nov 26, 2021

Filter the resolution records of the base domain through api.<cluster name>,
otherwise it will cause all records to be released.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 26, 2021

Hi @bd233. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 26, 2021
@openshift-ci openshift-ci bot requested review from jhixson74 and jstuever November 26, 2021 02:02
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could consider using a structured logger here. See the example on the destroy flow for Route53 in AWS:
https://github.com/openshift/installer/blob/06e3fe64f3616f6d6baff95dde5cd3389cf785aa/pkg/destroy/aws/aws.go#L1824-L1832

The logger for this function is being used in every function that will call, tracking the id of hosted zone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rr := fmt.Sprintf("api.%s", strings.Split(o.ClusterDomain, ".")[0])

@kwoodson @staebler I am also not sure if we should delete all RRs with a suffix of the o.ClusterDomain, rather than only the api.*, I guess we will have a leak of RR *.apps.(o.ClusterDomain) using this approach. wdyt?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bd233 I tested this code and the code from the #5411.

The domain I was using was *.apps.test (created by the ingress controller) as well as api.test (created by the installer). I created another domain to verify that it was not removed (*.apps.test1 and api.test1). When calling the destroy cluster I verified that api.test was removed. This is an improvement but I would also like to see the record from the ingress removed *.apps.test. We do not want to leak these resource records.

I think we will want to remove all resource records with the o.ClusterDomain. That should be relatively safe since the cluster domain should be unique.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly. We cannot guarantee that the cluster domain is unique. What we do for AWS, which I recommend following here as well, is that we only delete the public DNS record if there is also a private DNS record for the same domain.
See

publicZoneID, err := getPublicHostedZone(ctx, client, id, logger)
if err != nil {
// In some cases AWS may return the zone in the list of tagged resources despite the fact
// it no longer exists.
if err.(awserr.Error).Code() == route53.ErrCodeNoSuchHostedZone {
return nil
}
return err
}
recordSetKey := func(recordSet *route53.ResourceRecordSet) string {
return fmt.Sprintf("%s %s", *recordSet.Type, *recordSet.Name)
}
publicEntries := map[string]*route53.ResourceRecordSet{}
if len(publicZoneID) != 0 {
err = client.ListResourceRecordSetsPagesWithContext(
ctx,
&route53.ListResourceRecordSetsInput{HostedZoneId: aws.String(publicZoneID)},
func(results *route53.ListResourceRecordSetsOutput, lastPage bool) bool {
for _, recordSet := range results.ResourceRecordSets {
key := recordSetKey(recordSet)
publicEntries[key] = recordSet
}
return !lastPage
},
)
if err != nil {
return err
}
} else {
logger.Debug("shared public zone not found")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand why the cluster domain is not unique. Can you provide more details. Can users create two clusters with the same cluster name and base domain (I think this is not feasible)? Or consider that the user may manually add the records containing the cluster domain?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The installer does not know about any other clusters that exist. So the installer cannot directly verify that someone has not attempted to install a cluster with the exact same cluster domain. Moreover, the destroyer must be resilient to any changes that have been made in the cloud account in between when the installation attempt occurred and when the destroyer is called. The destroyer should always err on the side of caution because the worst thing that it can do is to delete a resource that did not actually belong to the cluster.

Let's look at an example. The user attempts to create a cluster with the cluster domain test.example.com. The installation failed at some point before creating the public DNS records. The next day, the user fixes whatever was causing the installation to fail the first time and tries another installation with the same cluster domain. This installation was successful. The next day, the user realizes that there are some leftover resources from the first installation attempt because the user forgot to destroy the cluster. The user runs the destroyer on the partial cluster from the first attempt. The destroyer sees that there is a *.apps.test.example.com public DNS record, so the destroyer deletes it. Now the user cannot access the apps on their working cluster because the public DNS record has been deleted.

@mtulio
Copy link
Contributor

mtulio commented Nov 30, 2021

@bd233 ptal on those comments above?:

  1. We should be using structured logs (as commented on slack)
  2. We need to make sure that all RRs for the cluster must be removed when the cluster is destroyed. In this approach, the RR *.apps is leaking.

Thanks!

@bd233
Copy link
Contributor Author

bd233 commented Dec 1, 2021

@bd233 ptal on those comments above?:

  1. We should be using structured logs (as commented on slack)
  2. We need to make sure that all RRs for the cluster must be removed when the cluster is destroyed. In this approach, the RR *.apps is leaking.

Thanks!

Okey, thanks, I will update it as soon as possible.

@bd233 bd233 force-pushed the alibabacloud-destroy-records branch from 06e3fe6 to e37e5b6 Compare December 1, 2021 13:22
@bd233
Copy link
Contributor Author

bd233 commented Dec 1, 2021

@staebler @mtulio I submitted the changes to formatting the log on #5435 PR. If it meets expectations, I will continue to modify all logs.

@kwoodson
Copy link

kwoodson commented Dec 2, 2021

I'll test this PR this morning when removing my cluster.

@patrickdillon
Copy link
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 2, 2021
@mtulio
Copy link
Contributor

mtulio commented Dec 3, 2021

@bd233 : @kwoodson and I tested this today and it's deleting all the records from the public zone with suffix of clusterName. It is not the expected behavior because we could have RRs created by any other component outside of OpenShift (example: manual) that is being deleted.

In my tests I create on the console one RR mrb.test on the public zone alicloud-dev.devcluster.openshift.com and it has been deleted when we triggered the destroy flow.

Please review the suggestion above of @staebler what is done in AWS in this comment.

Could you review it? Thanks!

@bd233 bd233 force-pushed the alibabacloud-destroy-records branch from 9f84080 to 0a85f51 Compare December 6, 2021 11:18
@bd233
Copy link
Contributor Author

bd233 commented Dec 6, 2021

Please review the suggestion above of @staebler what is done in AWS in this comment.

@mtulio @staebler @kwoodson I have updated the deletion strategy according to the suggestion, please review it.

@mtulio
Copy link
Contributor

mtulio commented Dec 8, 2021

@bd233 , I just test this PR and the DNS records created by the installer were successfully removed from the public DNS zone, alongside the private zone.

DEBUG Start to search DNS records    
DEBUG Start to search private zones                
DEBUG Start to delete DNS records                  
DEBUG Start to delete DNS record "734306694439286784" 
DEBUG Start to search private zones                
DEBUG Start to delete private zones                
DEBUG Start to delete private zone "550502d3a7eda57b049634b699a553f6"

In my tests, my cluster was not successfully installed, for that reason only one record was removed (the api.${clusterName} one). FYI @kwoodson

I also would like to mention that the output will be improved when the PR #5435 will be merged.
/lgtm

@mtulio
Copy link
Contributor

mtulio commented Dec 8, 2021

/retest-required

@mtulio
Copy link
Contributor

mtulio commented Dec 8, 2021

@kwoodson @staebler @patrickdillon ptal?

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 8, 2021
Copy link
Contributor

@staebler staebler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shoot. Forgot to press the "Submit review" button. Sorry for the delay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not return an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. At the same time, I realized a problem. Determine whether to delete the public analysis record through the analysis record of privateZone, but the privateZone or its analysis record may not exist (for example, the user manually deleted it). I am referring to AWS to try to get inspiration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user manually deletes the private zone or the analysis record, then the public record will be leaked. There is no way around that. The same thing would happen with AWS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK...I have updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the alibaba sdk handle paging? Is there a default page size? Or will the sdk return all of the results unless paging is explicitly requested. There could be a lot of records in the base domain zone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default page size is 20, and the SDK will not return all results. If this has caused a problem, I can temporarily increase the page size, such as 50. This is not a good solution, I will create a new PR to add the function of automatically obtaining all paging to alibabacloud.client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that the paging issue is limited to just this call. We need to handle the possibility of paging in all of the calls that we make to the SDK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, yes, let's move that to a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not return when there is an error deleting a record. Continue attempting to delete the other records.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not something to change for this PR, but this approach of waiting for the DNS records to delete, blocking the destroyer from doing further work, is not ideal.

@mtulio
Copy link
Contributor

mtulio commented Dec 8, 2021

/lgtm cancel
@bd233 could you please take a look at the @staebler comments above? I can run new tests when it's ready. Thanks!

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 8, 2021
@kwoodson
Copy link

@bd233 Please update this PR. I believe this is a small fix. Thanks!

@bd233 bd233 force-pushed the alibabacloud-destroy-records branch from 0a85f51 to 6dfd1ad Compare December 10, 2021 05:24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return "", errors.Wrap(err, fmt.Sprintf("matched to multiple private zones by clustedomain %q", clusterDomain))
return "", errors.Wrap(err, fmt.Sprintf("matched to multiple private zones by clusterdomain %q", clusterDomain))

Comment on lines 1274 to 1275
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
baseDomain := strings.Join(strings.Split(o.ClusterDomain, ".")[1:], ".")
clusterName := strings.Split(o.ClusterDomain, ".")[0]
domainParts := strings.Split(o.ClusterDomain, ".", 1)
if len(domainParts) < 2 {
return errors.New("could not determine cluster name from cluster domain")
}
clusterName := domainParts[0]
baseDomain := domainParts[1]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there is an error deleting the record, remove the key from privateRecords so that we don't wait the full minute for the record to be deleted. But also capture that there was an error so that deleteDNSRecords returns an error later after the wait.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated^^

Filter the resolution records of the base domain through `api.<cluster name>`,
otherwise it will cause all records to be released.

Signed-off-by: sunhui <[email protected]>
@bd233 bd233 force-pushed the alibabacloud-destroy-records branch from 6dfd1ad to e2bfffb Compare December 10, 2021 18:12
@jstuever
Copy link
Contributor

/uncc

@openshift-ci openshift-ci bot removed the request for review from jstuever December 10, 2021 18:20
Copy link
Contributor

@staebler staebler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 10, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2021
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@mtulio
Copy link
Contributor

mtulio commented Dec 10, 2021

/retest-required

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 73e4935 into openshift:master Dec 10, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 10, 2021

@bd233: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-single-node e2bfffb link false /test e2e-aws-single-node
ci/prow/e2e-libvirt e2bfffb link false /test e2e-libvirt
ci/prow/e2e-aws-workers-rhel8 e2bfffb link false /test e2e-aws-workers-rhel8
ci/prow/okd-e2e-aws e2bfffb link false /test okd-e2e-aws
ci/prow/e2e-aws-fips e2bfffb link false /test e2e-aws-fips
ci/prow/e2e-crc e2bfffb link false /test e2e-crc

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants