@hexfusion
Contributor

@hexfusion hexfusion commented May 30, 2019

The PR resolves a few issues.

  • ETCD_NAME is currently populated from hostname, but this is not always valid. For example, on AWS hostname returns the node's hostname, whereas on bare metal it returns the FQDN. Because we can no longer assume how the etcd name is formatted, etcd-member-recover.sh now requires it as a mandatory parameter. With the addition of the validate_etcd_name function, etcd-snapshot-restore.sh can now extract ETCD_NAME from ETCD_INITIAL_CLUSTER.

  • ETCD_CONNSTRING is now renamed to ETCD_INITIAL_CLUSTER to more accurately reflect its value.
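
To illustrate how a member name can be derived from an initial-cluster string, here is a minimal sketch. It assumes the usual etcd format of comma-separated `name=peer-url` pairs and that each peer URL includes a port. The function name `etcd_name_from_initial_cluster` and its DNS-name argument are hypothetical; this is not the PR's validate_etcd_name implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the PR's validate_etcd_name): print the name of the
# member whose peer URL in an initial-cluster string contains the given DNS name.
etcd_name_from_initial_cluster() {
  local initial_cluster=$1 dns_name=$2
  local entry name url
  local -a entries
  # Entries are assumed to be comma-separated "name=peer-url" pairs.
  IFS=',' read -ra entries <<< "$initial_cluster"
  for entry in "${entries[@]}"; do
    name=${entry%%=*}   # text before the first '='
    url=${entry#*=}     # text after the first '='
    # Assumes the URL has the form scheme://host:port.
    if [[ "$url" == *"://${dns_name}:"* ]]; then
      printf '%s\n' "$name"
      return 0
    fi
  done
  echo "no member with peer host ${dns_name} in initial cluster" >&2
  return 1
}
```

Called with an initial-cluster value and a node's DNS name, the function prints the matching member name or returns a non-zero status, so a caller never has to guess the name from the output of hostname.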

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1714457

/hold

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 30, 2019
@vrutkovs
Contributor

/test e2e-restore-cluster-state

@hexfusion
Contributor Author

@vrutkovs Looks like we fail to get bastion installed correctly?
...Still waiting for DNS...
nslookup: '' is not a legal name (unexpected end of input)
[....]
nslookup: '' is not a legal name (unexpected end of input)

@hexfusion
Contributor Author

/retest

@vrutkovs
Contributor

> @vrutkovs Looks like we fail to get bastion installed correctly?

Yep, that's Route53 not replying back after 30 attempts :/

@hexfusion
Contributor Author

@vrutkovs adding exit 1 to usage() gave us what we were expecting.

Path to snapshot and initial cluster are required: ./etcd-snapshot-restore.sh $path-to-snapshot $initial_cluster
exit code: 1
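
The behaviour described above can be sketched as follows. The message text is taken from the output quoted above; writing it to stderr and the `check_args` wrapper are assumptions made for illustration, not the script's actual code.

```shell
#!/usr/bin/env bash
# Sketch only: a usage() that ends with a non-zero exit status.
usage() {
  # Single quotes keep the $placeholders literal in the message.
  echo 'Path to snapshot and initial cluster are required: ./etcd-snapshot-restore.sh $path-to-snapshot $initial_cluster' >&2
  exit 1   # without this, the script would continue (or exit 0) after printing help
}

# Hypothetical wrapper: require two non-empty positional parameters.
check_args() {
  if [ "$#" -ne 2 ] || [ -z "$1" ] || [ -z "$2" ]; then
    usage
  fi
}
```

Because usage() exits with status 1, a test harness that invokes the script with missing arguments sees a failure rather than a silent success.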

@vrutkovs
Contributor

vrutkovs commented Jun 4, 2019

This is fixed in openshift/release#3919

@vrutkovs
Contributor

vrutkovs commented Jun 5, 2019

/test e2e-restore-cluster-state

@vrutkovs
Contributor

vrutkovs commented Jun 5, 2019

/test e2e-etcd-quorum-loss

@hexfusion
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2019
@hexfusion hexfusion changed the title from "DR: use params to populate etcd name instead of assumptions" to "DR: use param to populate etcd name" Jun 5, 2019
@vrutkovs
Contributor

vrutkovs commented Jun 5, 2019

I think I started etcd-quorum-loss too soon - it still fails with this error.

/test e2e-etcd-quorum-loss

@hexfusion
Contributor Author

/test e2e-restore-cluster-state

@hexfusion
Contributor Author

hexfusion commented Jun 5, 2019

> + echo 'Snapshot file not found, restore failed: /root/assets/backup/etcd/member/snap/db.'
> + exit 1

/cc @vrutkovs

/test e2e-etcd-quorum-loss

@vrutkovs
Contributor

vrutkovs commented Jun 5, 2019

Fix: openshift/release#3967

@hexfusion
Contributor Author

I think the issue was actually on me: I wanted to check early that the snapshot existed. But when we want to use the backup, that check ran too soon, because the backup did not exist yet.
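
A minimal sketch of the ordering problem, under stated assumptions: `create_backup` and `restore_snapshot` are hypothetical names, and the placeholder file stands in for real etcd data. The point is only that the existence check runs after the step that produces the backup.

```shell
#!/usr/bin/env bash
# Sketch only: function names and layout are hypothetical, not the PR's code.

# Stand-in for the step that produces the backup the restore later reads.
create_backup() {
  local backup_dir=$1
  mkdir -p "$backup_dir"
  : > "$backup_dir/db"   # placeholder file; a real script would copy etcd data
}

restore_snapshot() {
  local snapshot_file=$1 backup_dir=$2
  # Checking for the snapshot at this point would fail whenever the path
  # points into the backup directory, because the backup is not there yet.
  create_backup "$backup_dir"
  # Check only after the backup step has had a chance to create the file.
  if [ ! -f "$snapshot_file" ]; then
    echo "Snapshot file not found, restore failed: $snapshot_file." >&2
    return 1
  fi
  echo "Snapshot found: $snapshot_file"
}
```

With this ordering, a snapshot path inside the backup directory passes the check, while a path that is missing for any other reason still fails with the same message.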

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
@hexfusion
Contributor Author

all tests passed!

@hexfusion
Contributor Author

/cc @runcom

@openshift-ci-robot openshift-ci-robot requested a review from runcom June 6, 2019 01:26
@vrutkovs
Contributor

vrutkovs commented Jun 6, 2019

/test e2e-etcd-quorum-loss

@openshift-ci-robot
Contributor

@hexfusion: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-restore-cluster-state | b39e7e8f8c2f6b3fd1c5f0eeb1cc5f5837278a83 | link | /test e2e-restore-cluster-state |
| ci/prow/e2e-etcd-quorum-loss | b8994e0 | link | /test e2e-etcd-quorum-loss |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vrutkovs
Contributor

vrutkovs commented Jun 6, 2019

Failing tests:

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report less than two alerts in firing or pending state [Suite:openshift/conformance/parallel/minimal]

Known flake (although I assumed the fix had been merged by then).
/skip

@vrutkovs
Contributor

vrutkovs commented Jun 6, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 6, 2019
@hexfusion
Contributor Author

"Pod openshift-sdn/sdn-kd54v is not healthy: container sdn has restarted more than 5 times",

The tests are still flaky because of issues not directly related to this code. The changes here are needed for DR to work properly on bare metal, so we need to merge this and then solve the other problems.

@hexfusion
Contributor Author

@kikisdeliveryservice @cgwalters could I get an approval, please?

@runcom
Member

runcom commented Jun 6, 2019

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, runcom, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 6, 2019
@openshift-merge-robot openshift-merge-robot merged commit bfe0429 into openshift:master Jun 6, 2019
@hexfusion
Contributor Author

/cherrypick release-4.1

@openshift-cherrypick-robot

@hexfusion: #804 failed to apply on top of branch "release-4.1":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	templates/master/00-master/_base/files/usr-local-bin-etcd-member-recover-sh.yaml
M	templates/master/00-master/_base/files/usr-local-bin-etcd-snapshot-restore-sh.yaml
M	templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml
Falling back to patching base and 3-way merge...
Auto-merging templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml
CONFLICT (content): Merge conflict in templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml
Auto-merging templates/master/00-master/_base/files/usr-local-bin-etcd-snapshot-restore-sh.yaml
CONFLICT (content): Merge conflict in templates/master/00-master/_base/files/usr-local-bin-etcd-snapshot-restore-sh.yaml
Auto-merging templates/master/00-master/_base/files/usr-local-bin-etcd-member-recover-sh.yaml
CONFLICT (content): Merge conflict in templates/master/00-master/_base/files/usr-local-bin-etcd-member-recover-sh.yaml
Patch failed at 0001 DR: use param to populate etcd name for etcd-member-recover

In response to this:

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hexfusion hexfusion deleted the fx_assumptions branch June 6, 2019 18:36