DR: use param to populate etcd name #804

hexfusion · 2019-05-30T00:38:35Z

The PR resolves a few issues.

ETCD_NAME is currently populated from hostname, but this is not always valid. For example, on AWS hostname returns the hostname of the node, whereas on bare metal it returns the FQDN. Because we can no longer assume how the etcd name is formatted, etcd-member-recover.sh now requires it as a mandatory param. With the addition of the validate_etcd_name function, etcd-snapshot-restore.sh can now extract ETCD_NAME from ETCD_INITIAL_CLUSTER.
ETCD_CONNSTRING is renamed to ETCD_INITIAL_CLUSTER to reflect its value more accurately.
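
As an illustration of pulling a member name out of an initial-cluster string, here is a minimal sketch. It is not the PR's validate_etcd_name. The helper name `etcd_name_from_initial_cluster` is hypothetical, and matching on the DNS name in the peer URL is an assumption. The only thing taken as given is etcd's `name=peerURL,name=peerURL` format for the initial cluster value.

```shell
# Hypothetical helper (not the PR's validate_etcd_name): print the name of the
# member whose peer URL host equals the given DNS name.
#   $1  initial cluster string: "name1=https://host1:2380,name2=https://host2:2380"
#   $2  DNS name of the member as it appears in its peer URL (assumed matching rule)
etcd_name_from_initial_cluster() {
  local rest entry name host
  rest="$1"
  while [ -n "$rest" ]; do
    entry="${rest%%,*}"                 # first comma-separated entry
    if [ "$entry" = "$rest" ]; then
      rest=""                           # that was the last entry
    else
      rest="${rest#*,}"                 # drop the entry just consumed
    fi
    name="${entry%%=*}"                 # text before the first '='
    host="${entry#*=}"                  # peer URL after the first '='
    host="${host#*://}"                 # strip the scheme
    host="${host%%/*}"                  # strip any path
    host="${host%%:*}"                  # strip the port (assumes no IPv6 literal)
    if [ "$host" = "$2" ]; then
      printf '%s\n' "$name"
      return 0
    fi
  done
  echo "no member of the initial cluster matches $2" >&2
  return 1
}
```

Returning a non-zero status when nothing matches lets the caller stop before etcd is started with an empty name.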

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1714457

/hold

vrutkovs · 2019-05-30T13:17:56Z

/test e2e-restore-cluster-state

hexfusion · 2019-05-30T14:16:01Z

@vrutkovs Looks like we fail to get bastion installed correctly?
...Still waiting for DNS...
nslookup: '' is not a legal name (unexpected end of input)
[....]
nslookup: '' is not a legal name (unexpected end of input)

hexfusion · 2019-05-30T14:16:24Z

/retest

vrutkovs · 2019-05-30T14:17:46Z

> @vrutkovs Looks like we fail to get bastion installed correctly?

Yep, that's Route53 not replying back after 30 attempts :/

templates/master/00-master/_base/files/usr-local-bin-etcd-snapshot-restore-sh.yaml

hexfusion · 2019-05-30T20:40:08Z

@vrutkovs adding exit 1 to usage() gave us what we were expecting.

Path to snapshot and initial cluster are required: ./etcd-snapshot-restore.sh $path-to-snapshot $initial_cluster
exit code: 1
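
The behaviour reported above (a usage message followed by exit code 1) can be sketched as follows. This is an assumption about the shape of the change, not the script's actual code: the message text is taken from the output above, while `require_args` is a hypothetical name used only for illustration.

```shell
# Sketch of a usage helper that makes the script fail when arguments are missing.
usage() {
  # Single quotes keep $path-to-snapshot and $initial_cluster literal.
  echo 'Path to snapshot and initial cluster are required: ./etcd-snapshot-restore.sh $path-to-snapshot $initial_cluster'
  exit 1
}

# Hypothetical argument check: call usage when either argument is empty.
#   $1  path to the snapshot
#   $2  initial cluster string
require_args() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    usage
  fi
}
```

Without the `exit 1`, the script would print the message and then carry on, so a CI step calling it with missing arguments would not be marked as failed at that point.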

vrutkovs · 2019-06-04T09:48:22Z

This is fixed in openshift/release#3919

vrutkovs · 2019-06-05T15:47:01Z

/test e2e-restore-cluster-state

vrutkovs · 2019-06-05T15:47:09Z

/test e2e-etcd-quorum-loss

hexfusion · 2019-06-05T15:55:14Z

/hold cancel

vrutkovs · 2019-06-05T17:31:42Z

I think I started etcd-quorum-loss too soon - it still fails with this error.

/test e2e-etcd-quorum-loss

hexfusion · 2019-06-05T18:26:02Z

/test e2e-restore-cluster-state

hexfusion · 2019-06-05T19:59:33Z

> + echo 'Snapshot file not found, restore failed: /root/assets/backup/etcd/member/snap/db.'
> + exit 1

/cc @vrutkovs

/test e2e-etcd-quorum-loss

vrutkovs · 2019-06-05T20:26:11Z

Fix: openshift/release#3967

hexfusion · 2019-06-05T23:47:44Z

I think the issue was actually on me: I wanted to check early that the snapshot existed, but when we want to use the backup, that check ran too soon because the file did not exist yet.

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
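
The ordering problem described in this comment can be sketched like this. It is only an illustration under assumptions: `check_snapshot_exists` and `restore_from` are hypothetical names, the step that creates the file is passed in as a command, and the final `echo` is a placeholder for the real restore.

```shell
# Hypothetical check, meant to run just before the restore rather than at startup.
#   $1  path to the snapshot file
check_snapshot_exists() {
  if [ ! -f "$1" ]; then
    echo "Snapshot file not found, restore failed: $1."
    return 1
  fi
}

# Illustrative ordering: first run the step that produces the file, then check it.
#   $1  path to the snapshot file
#   $2  name of a command that creates the file (stand-in for the backup step)
restore_from() {
  local snapshot_file produce_snapshot
  snapshot_file="$1"
  produce_snapshot="$2"
  "$produce_snapshot" "$snapshot_file" || return 1
  check_snapshot_exists "$snapshot_file" || return 1
  echo "restoring from $snapshot_file"   # placeholder for the actual restore
}
```

Checking inside `restore_from`, after the producing step, avoids rejecting a path that is valid but not yet populated.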

hexfusion · 2019-06-06T01:25:08Z

all tests passed!

hexfusion · 2019-06-06T01:26:26Z

/cc @runcom

vrutkovs · 2019-06-06T06:52:24Z

/test e2e-etcd-quorum-loss

openshift-ci-robot · 2019-06-06T08:36:05Z

@hexfusion: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-restore-cluster-state | b39e7e8f8c2f6b3fd1c5f0eeb1cc5f5837278a83 | link | `/test e2e-restore-cluster-state` |
| ci/prow/e2e-etcd-quorum-loss | `b8994e0` | link | `/test e2e-etcd-quorum-loss` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

vrutkovs · 2019-06-06T09:13:27Z

Failing tests:

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report less than two alerts in firing or pending state [Suite:openshift/conformance/parallel/minimal]

Known flake (although I assumed the fix has been merged by that time).
/skip

vrutkovs · 2019-06-06T09:29:25Z

/lgtm

hexfusion · 2019-06-06T11:39:45Z

"Pod openshift-sdn/sdn-kd54v is not healthy: container sdn has restarted more than 5 times",

The tests are still flaky because of issues not directly related to this code. The changes here are needed for DR to work properly on bare metal, so we need to merge this and then solve the other problems.

hexfusion · 2019-06-06T11:41:51Z

@kikisdeliveryservice @cgwalters could I get an approval, please?

runcom · 2019-06-06T13:31:54Z

/approve

openshift-ci-robot · 2019-06-06T13:32:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, runcom, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hexfusion · 2019-06-06T18:36:27Z

/cherrypick release-4.1

openshift-cherrypick-robot · 2019-06-06T18:36:30Z

@hexfusion: #804 failed to apply on top of branch "release-4.1":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	templates/master/00-master/_base/files/usr-local-bin-etcd-member-recover-sh.yaml
M	templates/master/00-master/_base/files/usr-local-bin-etcd-snapshot-restore-sh.yaml
M	templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml
Falling back to patching base and 3-way merge...
Auto-merging templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml
CONFLICT (content): Merge conflict in templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml
Auto-merging templates/master/00-master/_base/files/usr-local-bin-etcd-snapshot-restore-sh.yaml
CONFLICT (content): Merge conflict in templates/master/00-master/_base/files/usr-local-bin-etcd-snapshot-restore-sh.yaml
Auto-merging templates/master/00-master/_base/files/usr-local-bin-etcd-member-recover-sh.yaml
CONFLICT (content): Merge conflict in templates/master/00-master/_base/files/usr-local-bin-etcd-member-recover-sh.yaml
Patch failed at 0001 DR: use param to populate etcd name for etcd-member-recover

Details

In response to this:

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the do-not-merge/hold and size/M labels May 30, 2019

openshift-ci-robot requested review from cgwalters and kikisdeliveryservice May 30, 2019 00:38

hexfusion force-pushed the fx_assumptions branch from ca2e5c5 to 69669c2 Compare May 30, 2019 13:11

hexfusion force-pushed the fx_assumptions branch from 69669c2 to b39e7e8 Compare May 30, 2019 17:39

sgreene570 reviewed May 30, 2019

templates/master/00-master/_base/files/usr-local-bin-etcd-snapshot-restore-sh.yaml

vrutkovs mentioned this pull request Jun 4, 2019

DR: always set etcd initial cluster connstring in quorum restore test openshift/release#3947

Merged

openshift-ci-robot removed the do-not-merge/hold label Jun 5, 2019

hexfusion changed the title ~~DR: use params to populate etcd name instead of assumptions~~ DR: use param to populate etcd name Jun 5, 2019

openshift-ci-robot requested a review from vrutkovs June 5, 2019 19:59

hexfusion force-pushed the fx_assumptions branch from b39e7e8 to 9e10d42 Compare June 5, 2019 23:46

hexfusion mentioned this pull request Jun 5, 2019

DR: use etcd DB location as a source for snapshot restore openshift/release#3967

Closed

DR: use param to populate etcd name for etcd-member-recover

b8994e0

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>

hexfusion force-pushed the fx_assumptions branch from 9e10d42 to b8994e0 Compare June 6, 2019 01:25

openshift-ci-robot requested a review from runcom June 6, 2019 01:26

openshift-ci-robot assigned vrutkovs Jun 6, 2019

openshift-ci-robot added the lgtm label Jun 6, 2019

openshift-ci-robot added the approved label Jun 6, 2019

openshift-merge-robot merged commit bfe0429 into openshift:master Jun 6, 2019

hexfusion deleted the fx_assumptions branch June 6, 2019 18:36

hexfusion mentioned this pull request Jun 7, 2019

[release-4.1] Bug 1746176: DR: backport fixes #834

Merged

7 participants
