installer template: add 'recover from etcd quorum loss' test #3572


Closed

vrutkovs wants to merge 2 commits into openshift:master from vrutkovs:disaster-recovery-test

Contributor

vrutkovs commented Apr 24, 2019 (edited)

Rework of #3495 - run etcd quorum loss as a dedicated test in installer template.

This adds a new optional test, e2e-etcd-quorum-loss, which simulates quorum loss. At this point it's expected to fail, as a single etcd node doesn't get restored yet.

This PR uses a forked version of the scripts served by MCO. This will be fixed in a separate PR so that we can work in parallel on various small tasks related to DR.

TODO:

- Rebase when #3595 (installer template: add a function to test cluster state restore) lands

openshift-ci-robot added do-not-merge/work-in-progress size/L labels

openshift-ci-robot requested review from aaronlevy and abhinavdahiya

April 24, 2019 09:49

vrutkovs mentioned this pull request

Add disaster recovery template #3495

Closed

1 task

vrutkovs force-pushed the disaster-recovery-test branch from 8da3189 to 0999570 Compare

April 24, 2019 11:51

wking reviewed

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml Outdated

vrutkovs force-pushed the disaster-recovery-test branch 3 times, most recently from 75352c6 to 8cec737 Compare

April 24, 2019 12:46

vrutkovs changed the title ~~WIP installer template: add 'recover from etcd quorum loss' test~~ installer template: add 'recover from etcd quorum loss' test

openshift-ci-robot removed the do-not-merge/work-in-progress label

vrutkovs force-pushed the disaster-recovery-test branch 6 times, most recently from 6806b0e to 604edbb Compare

April 25, 2019 15:17

vrutkovs force-pushed the disaster-recovery-test branch from 604edbb to 290ec0e Compare

May 6, 2019 08:17

openshift-ci-robot added size/XL and removed size/L labels

vrutkovs force-pushed the disaster-recovery-test branch 9 times, most recently from fdb57cc to d9d7513 Compare

May 13, 2019 16:02

vrutkovs force-pushed the disaster-recovery-test branch 5 times, most recently from 1d78d93 to 3e901a8 Compare

May 16, 2019 09:05

Contributor Author

vrutkovs commented May 16, 2019 (edited)

Failing tests: 
[Feature:Platform][Smoke] Managed cluster should ensure control plane operators do not make themselves unevictable [Suite:openshift/conformance/parallel] 
[Feature:Platform][Smoke] Managed cluster should ensure pods use images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]

Fixed in the latest commit: removed ssh-bastion, scaled etcd-quorum-guard to 3, and removed the etcd-signer pod.

vrutkovs force-pushed the disaster-recovery-test branch 5 times, most recently from 787556f to e1b523a Compare

May 16, 2019 12:33

hexfusion reviewed

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml Outdated

vrutkovs force-pushed the disaster-recovery-test branch 3 times, most recently from d5f3e39 to caface9 Compare

May 16, 2019 18:26

vrutkovs mentioned this pull request

installer template: add etcd quorum loss scenario #3842

Merged

2 tasks

wking reviewed


Member

wking left a comment

Looks good. Left a few nits inline. It's not clear to me how much of this code is staying vs. moving out to the machine-config operator. Is it just the here-docs moving out? If so, there's a bit of shared functionality between this and restore-cluster-state that could be pulled out into helper functions to stay DRY.

ci-operator/config/openshift/installer/openshift-installer-master.yaml Outdated

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml Outdated

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml Outdated

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml Outdated

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml Outdated

Member

wking May 21, 2019

What are we waiting for here? This seems brittle.

Contributor Author

vrutkovs May 22, 2019

The Machine API doesn't wait for the machine to be deleted during the `oc delete machine` call, so the API stays available for a few seconds after the 2 masters are removed.

This pause is necessary to confirm that the API is no longer responding.
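One way to address the brittleness concern is to poll with a deadline instead of sleeping for a fixed time. The following is a minimal Python sketch of that idea, not the code used in the template; `wait_until`, `api_is_down`, and the `probe` callable are hypothetical names introduced for illustration, and the probe is assumed to return True while the API still answers.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 300.0,
               interval: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    `sleep` and `clock` are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def api_is_down(probe: Callable[[], bool]) -> bool:
    """True when the (hypothetical) probe reports no response from the API.

    `probe` is assumed to return True if the API answered and False
    otherwise; connection-level errors also count as "not responding".
    """
    try:
        return not probe()
    except OSError:
        # Covers refused connections and socket timeouts.
        return True
```

In a test script the probe could wrap whatever health check is already in use; the loop then returns as soon as the API stops answering, and returns False on timeout rather than silently continuing.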

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml Outdated

Member

wking May 21, 2019

floating downloads 😭 😉

Contributor Author

vrutkovs May 22, 2019

It's a temporary measure while we debug MCO scripts.

This is fixed in #3842 which is the same PR + MCO scripts

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml Outdated

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml Outdated

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml Outdated

Member

wking May 21, 2019

Not part of the test framework that I can review, but if your etcd cluster is down to one node (because you killed two of three control-plane nodes), then how is the remaining etcd still functioning? I'd have expected it to be freaking out about having lost quorum and refusing to take possibly-split-brained actions. If there's some quick explanation for how this works, I'm very curious. If the explanation is longer, we should probably skip it to avoid distracting from the test script itself.

Contributor Author

vrutkovs May 22, 2019

> if your etcd cluster is down to one node (because you killed two of three control-plane nodes), then how is the remaining etcd still functioning?

Running etcd-snapshot-restore.sh without parameters restores this etcd member as a one-node cluster; see https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml#L141-L168

Added a better comment for this in b99911295.
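The quorum arithmetic behind this exchange can be sketched as follows. This is an illustrative Python snippet, not part of the test script; `quorum_size` and `has_quorum` are hypothetical helper names, and the only fact assumed is that etcd needs a majority of its configured members to commit writes.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if `healthy` members are enough to form a majority."""
    return healthy >= quorum_size(members)


# Original three-member cluster after two control-plane nodes are deleted:
print(has_quorum(members=3, healthy=1))  # False
# Survivor restored from a snapshot as a new single-member cluster:
print(has_quorum(members=1, healthy=1))  # True
```

The surviving member does not regain quorum in the old three-member cluster; the restore gives it a new membership list of size one, for which one healthy member is a majority.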

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml Outdated

vrutkovs force-pushed the disaster-recovery-test branch 2 times, most recently from 4126222 to 7eaf502 Compare

May 22, 2019 12:19

vrutkovs mentioned this pull request

tests: include several test files necessary for DR scenario tests openshift/origin#22887

Closed

vrutkovs force-pushed the disaster-recovery-test branch from 361822a to ade9148 Compare

May 22, 2019 15:07

Contributor Author

vrutkovs commented May 22, 2019 (edited)

Fixed all outstanding issues, rebased on master.

This will eventually be superseded by #3842; the changes to use MCO scripts are contained in ad5a69f.

Some scripts will land in the origin tests image; see openshift/origin#22887.

Contributor Author

vrutkovs commented May 22, 2019

fail [github.com/openshift/origin/test/extended/operators/cluster.go:118]: Expected
    <[]string | len:4, cap:4>: [
        "Pod openshift-apiserver-operator/openshift-apiserver-operator-56c559db6f-xfmhm is not healthy: container openshift-apiserver-operator has restarted more than 5 times",
        "Pod openshift-controller-manager-operator/openshift-controller-manager-operator-85cc57f9d7-qvgnm is not healthy: container operator has restarted more than 5 times",
        "Pod openshift-kube-apiserver-operator/kube-apiserver-operator-84c9f5b76c-rklfj is not healthy: container kube-apiserver-operator has restarted more than 5 times",
        "Pod openshift-kube-scheduler-operator/openshift-kube-scheduler-operator-6c677b65cc-rmw7r is not healthy: container kube-scheduler-operator-container has restarted more than 5 times",
    ]
to be empty
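For readers unfamiliar with this failure, the message implies a check along these lines: collect every container whose restart count exceeds a threshold and expect the resulting list to be empty. The Python sketch below only illustrates that logic and is not the Go code at cluster.go:118; `ContainerStatus`, `unhealthy_pods`, and the default threshold of 5 (taken from the message text) are assumptions.

```python
from typing import Iterable, List, NamedTuple


class ContainerStatus(NamedTuple):
    """Hypothetical flattened view of one container's status."""
    namespace: str
    pod: str
    container: str
    restart_count: int


def unhealthy_pods(statuses: Iterable[ContainerStatus],
                   max_restarts: int = 5) -> List[str]:
    """Return one message per container that restarted more than `max_restarts` times."""
    return [
        f"Pod {s.namespace}/{s.pod} is not healthy: "
        f"container {s.container} has restarted more than {max_restarts} times"
        for s in statuses
        if s.restart_count > max_restarts
    ]
```

After a quorum loss and restore, operators that crash-loop while the API is unavailable can exceed such a threshold even if the cluster recovers afterwards, which is consistent with the four operator pods listed above.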

/test pj-rehearse


Commit 9a737cc: installer template: add 'recover from etcd quorum loss' test

vrutkovs force-pushed the disaster-recovery-test branch from ade9148 to 9a737cc Compare

May 23, 2019 07:46


Commit 9db3c7f: install boto3 module via pip

vrutkovs force-pushed the disaster-recovery-test branch from 692d2df to 9db3c7f Compare

May 23, 2019 09:42

Contributor

openshift-ci-robot commented May 23, 2019

@vrutkovs: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/openshift/installer/master/e2e-vsphere | `75352c6446e2ffae369b56381c914245bf0aa37c` | link | `/test pj-rehearse` |
| ci/rehearse/openshift/builder/master/e2e-aws | `9db3c7f` | link | `/test pj-rehearse` |
| ci/rehearse/openshift/installer/master/e2e-etcd-quorum-loss | `9db3c7f` | link | `/test pj-rehearse` |
| ci/prow/pj-rehearse | `9db3c7f` | link | `/test pj-rehearse` |


Contributor Author

vrutkovs commented May 23, 2019

Superseded by #3842

vrutkovs closed this


Reviewers

aaronlevy Awaiting requested review from aaronlevy

abhinavdahiya Awaiting requested review from abhinavdahiya

2 more reviewers

wking wking left review comments

hexfusion hexfusion left review comments

Labels

sig/azure size/L