@dgoodwin
Contributor

…ror.

It appears that in some situations AWS may return a Route 53 hosted zone
when queried for tagged resources, even though the zone no longer exists.

Update deprovision code to catch this error and skip route53 cleanup
when present.
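
The change described above can be pictured with a minimal sketch. It assumes the error exposes an AWS-style `Code()` method; `codedError`, `apiError`, `deleteHostedZone`, and the `lookup` callback are hypothetical stand-ins for illustration, not the installer's actual code. Only the `NoSuchHostedZone` code comes from the PR title.

```go
package main

import (
	"errors"
	"fmt"
)

// codedError is a stand-in for an AWS-style error that carries an error
// code. It is an assumption for this sketch, not the SDK's own type.
type codedError interface {
	error
	Code() string
}

// apiError is a hypothetical error type used only to simulate AWS responses.
type apiError struct{ code, msg string }

func (e *apiError) Error() string { return e.code + ": " + e.msg }
func (e *apiError) Code() string  { return e.code }

// noSuchHostedZone is the Route 53 error code named in the PR title.
const noSuchHostedZone = "NoSuchHostedZone"

// isNoSuchHostedZone reports whether err, or any error it wraps, carries
// the NoSuchHostedZone code.
func isNoSuchHostedZone(err error) bool {
	var ce codedError
	return errors.As(err, &ce) && ce.Code() == noSuchHostedZone
}

// deleteHostedZone is a hypothetical cleanup step. lookup stands in for
// the Route 53 call that fails when the tagged zone no longer exists.
func deleteHostedZone(id string, lookup func(string) error) error {
	if err := lookup(id); err != nil {
		if isNoSuchHostedZone(err) {
			// The zone is already gone, so treat it as deleted instead
			// of returning an error that would be retried forever.
			return nil
		}
		return fmt.Errorf("looking up hosted zone %s: %w", id, err)
	}
	// Record and zone deletion would happen here.
	return nil
}

func main() {
	stale := func(string) error { return &apiError{noSuchHostedZone, "zone not found"} }
	throttled := func(string) error { return &apiError{"Throttling", "rate exceeded"} }

	fmt.Println("stale zone:", deleteHostedZone("Z123", stale))
	fmt.Println("throttled:", deleteHostedZone("Z123", throttled))
}
```

Only the not-found code is swallowed; any other error still propagates, so genuine failures continue to be retried.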

@openshift-ci-robot
Contributor

@dgoodwin: This pull request references Bugzilla bug 1817201, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
Details

In response to this:

Bug 1817201: Fix intermittent deprovision loop on NoSuchHostedZone er…

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Mar 27, 2020
@dgoodwin
Contributor Author

This is showing up quite frequently in Hive deployments so we would like to get a fix out.

I cannot reproduce the problem so this is only an attempt to fix the issue.

We pull from master so no backports are needed on our behalf.

@dgoodwin
Contributor Author

/cc @staebler

Contributor

@staebler staebler left a comment

This looks good to me.

Contributor

@patrickdillon patrickdillon Mar 27, 2020

Should we return nil? I would think that if there is no sharedZone then the resources that are deleted after this would also not exist, but here:

https://github.com/openshift/installer/blob/5df5607169bb8b8c878b8b5cc108d1a6d0fec0d7/pkg/destroy/aws/aws.go#L1747

we proceed even if no sharedZone exists...

Contributor

The not-found error here is for the case where the private zone does not exist. If the private zone exists, but there is no public zone, then getSharedHostedZone will return an empty string and a nil error.

Contributor

Thanks for clarifying.

As mentioned, this seems like an AWS problem so it is beyond our control to fix.

/lgtm
/approve

Contributor Author

I am really unsure. sharedZoneID is used again on line 1761, so that will probably fail, and maybe line 1770 as well. I don't know this code or Route 53 very well, but it feels like it's game over if we can't get the shared zone ID.

Contributor Author

Oops, I was looking at a stale load of the tab. :)

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon, staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 27, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

6 similar comments

Contributor

Can you drop this warning? It will be repeated again and again on the user's stdout.

None of the other not-found handling code adds logging.

Contributor

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2020
@wking wking changed the title Bug 1817201: Fix intermittent deprovision loop on NoSuchHostedZone er… Bug 1817201: Fix intermittent deprovision loop on NoSuchHostedZone error Mar 28, 2020
@dgoodwin dgoodwin force-pushed the fix-NoSuchHostedZone-deprovision branch from 5df5607 to 81bf83d Compare March 30, 2020 11:27
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 30, 2020
@dgoodwin
Contributor Author

Updated.

@dgoodwin
Contributor Author

I reproduced the problem this morning and was able to test this fix successfully through Hive.

@abhinavdahiya can we get this hold lifted?

Interestingly it appears to be related to an initial failure to install. Sample log from the first failure ended with:

time="2020-03-31T13:21:57Z" level=debug msg="OpenShift Installer v4.4.0"
time="2020-03-31T13:21:57Z" level=debug msg="Built from commit ddafb2d87d4744e56bd5d857621245311f186261"
time="2020-03-31T13:21:57Z" level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.dgoodwin2.new-installer.openshift.com:6443..."
time="2020-03-31T13:21:58Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.dgoodwin2.new-installer.openshift.com:6443/version?timeout=32s: dial tcp: lookup api.dgoodwin2.new-installer.openshift.com on 10.96.0.10:53: server misbehaving"
time="2020-03-31T13:22:31Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.dgoodwin2.new-installer.openshift.com:6443/version?timeout=32s: dial tcp: lookup api.dgoodwin2.new-installer.openshift.com on 10.96.0.10:53: server misbehaving"
time="2020-03-31T13:22:57Z" level=error msg="error provisioning cluster" error="exit status 1" installID=n277xtfp
time="2020-03-31T13:22:57Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 1" installID=n277xtfp
time="2020-03-31T13:22:57Z" level=debug msg="OpenShift Installer v4.4.0"
time="2020-03-31T13:22:57Z" level=debug msg="Built from commit ddafb2d87d4744e56bd5d857621245311f186261"
time="2020-03-31T13:22:57Z" level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.dgoodwin2.new-installer.openshift.com:6443..."
time="2020-03-31T13:22:58Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.dgoodwin2.new-installer.openshift.com:6443/version?timeout=32s: dial tcp: lookup api.dgoodwin2.new-installer.openshift.com on 10.96.0.10:53: server misbehaving"
time="2020-03-31T13:23:32Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.dgoodwin2.new-installer.openshift.com:6443/version?timeout=32s: dial tcp: lookup api.dgoodwin2.new-installer.openshift.com on 10.96.0.10:53: server misbehaving"
time="2020-03-31T13:23:57Z" level=warning msg="unable to fetch logs from bootstrap node as SSH agent was not configured" installID=n277xtfp

@abhinavdahiya
Contributor

/hold cancel

/lgtm

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 31, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 31, 2020

@dgoodwin: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 81bf83d link /test e2e-libvirt

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit d200b24 into openshift:master Mar 31, 2020
@openshift-ci-robot
Contributor

@dgoodwin: All pull requests linked via external trackers have merged: openshift/installer#3359. Bugzilla bug 1817201 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1817201: Fix intermittent deprovision loop on NoSuchHostedZone error

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

7 participants