Skip to content

Conversation

@vrutkovs
Copy link
Contributor

@vrutkovs vrutkovs commented Apr 24, 2019

Rework of #3495 - run etcd quorum loss as a dedicated test in installer template.

This adds a new optional test e2e-etcd-quorum-loss which simulates the quorum loss. At this point its expected to fail as a single etcd node doesn't get restored yet.

This PR uses a forked version of scripts served by MCO. This would be fixed in a separate PR so that we could work in parallel on various small tasks related to DR

TODO:

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 24, 2019
@vrutkovs vrutkovs mentioned this pull request Apr 24, 2019
1 task
@vrutkovs vrutkovs force-pushed the disaster-recovery-test branch from 8da3189 to 0999570 Compare April 24, 2019 11:51
@vrutkovs vrutkovs force-pushed the disaster-recovery-test branch 3 times, most recently from 75352c6 to 8cec737 Compare April 24, 2019 12:46
@vrutkovs vrutkovs changed the title WIP installer template: add 'recover from etcd quorum loss' test installer template: add 'recover from etcd quorum loss' test Apr 24, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2019
@vrutkovs vrutkovs force-pushed the disaster-recovery-test branch 6 times, most recently from 6806b0e to 604edbb Compare April 25, 2019 15:17
@vrutkovs vrutkovs force-pushed the disaster-recovery-test branch from 604edbb to 290ec0e Compare May 6, 2019 08:17
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 6, 2019
@vrutkovs vrutkovs force-pushed the disaster-recovery-test branch 9 times, most recently from fdb57cc to d9d7513 Compare May 13, 2019 16:02
@vrutkovs vrutkovs force-pushed the disaster-recovery-test branch 5 times, most recently from 1d78d93 to 3e901a8 Compare May 16, 2019 09:05
@vrutkovs
Copy link
Contributor Author

vrutkovs commented May 16, 2019

Failing tests: 
[Feature:Platform][Smoke] Managed cluster should ensure control plane operators do not make themselves unevictable [Suite:openshift/conformance/parallel] 
[Feature:Platform][Smoke] Managed cluster should ensure pods use images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]

Fixed in the latest commit - remove ssh-bastion, scale etcd-quorum-guard to 3 and remove etcd-signer pod

@vrutkovs vrutkovs force-pushed the disaster-recovery-test branch 5 times, most recently from 787556f to e1b523a Compare May 16, 2019 12:33
@vrutkovs vrutkovs force-pushed the disaster-recovery-test branch 3 times, most recently from d5f3e39 to caface9 Compare May 16, 2019 18:26
Copy link
Member

@wking wking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Left a few nits inline. It's not clear to me how much of this code is staying vs. moving out the the machine-config operator. Is it just the here-docs moving out? If so, there's a bit of shared functionality between this and restore-cluster-state that could be pulled out into helper functions to stay DRY.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we waiting for here? This seems brittle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Machine API won't wait for machine to be deleted during oc delete machine call, so API is available for a few seconds after 2 masters were removed.

This pause is necessary to confirm API is no longer responding

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

floating downloads 😭 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a temporary measure while we debug MCO scripts.

This is fixed in #3842 which is the same PR + MCO scripts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of the test framework that I can review, but if your etcd cluster is down to one node (because you killed two of three control-plane nodes), then how is the remaining etcd still functioning? I'd have expected it to be freaking out about having lost quorum and refusing to take possibly-split-brained actions. If there's some quick explanation for how this works, I'm very curious. If the explanation is longer, we should probably skip it to avoid distracting from the test script itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if your etcd cluster is down to one node (because you killed two of three control-plane nodes), then how is the remaining etcd still functioning?

etcd-snapshot-restore.sh without params would restore this etcd forming a one node cluster - https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml#L141-L168.

Added a bit better comment for this in b99911295

@vrutkovs
Copy link
Contributor Author

vrutkovs commented May 22, 2019

Fixed all outstanding issues, rebased on master.

This would eventually be substituted by #3842, changes to use MCO are contained in ad5a69f.

Some scripts would land in origin tests image - see openshift/origin#22887

@vrutkovs
Copy link
Contributor Author

fail [github.com/openshift/origin/test/extended/operators/cluster.go:118]: Expected
    <[]string | len:4, cap:4>: [
        "Pod openshift-apiserver-operator/openshift-apiserver-operator-56c559db6f-xfmhm is not healthy: container openshift-apiserver-operator has restarted more than 5 times",
        "Pod openshift-controller-manager-operator/openshift-controller-manager-operator-85cc57f9d7-qvgnm is not healthy: container operator has restarted more than 5 times",
        "Pod openshift-kube-apiserver-operator/kube-apiserver-operator-84c9f5b76c-rklfj is not healthy: container kube-apiserver-operator has restarted more than 5 times",
        "Pod openshift-kube-scheduler-operator/openshift-kube-scheduler-operator-6c677b65cc-rmw7r is not healthy: container kube-scheduler-operator-container has restarted more than 5 times",
    ]
to be empty

/test pj-rehearse

@vrutkovs vrutkovs force-pushed the disaster-recovery-test branch from ade9148 to 9a737cc Compare May 23, 2019 07:46
@vrutkovs vrutkovs force-pushed the disaster-recovery-test branch from 692d2df to 9db3c7f Compare May 23, 2019 09:42
@openshift-ci-robot
Copy link
Contributor

@vrutkovs: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/rehearse/openshift/installer/master/e2e-vsphere 75352c6446e2ffae369b56381c914245bf0aa37c link /test pj-rehearse
ci/rehearse/openshift/builder/master/e2e-aws 9db3c7f link /test pj-rehearse
ci/rehearse/openshift/installer/master/e2e-etcd-quorum-loss 9db3c7f link /test pj-rehearse
ci/prow/pj-rehearse 9db3c7f link /test pj-rehearse

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vrutkovs
Copy link
Contributor Author

Superceded by #3842

@vrutkovs vrutkovs closed this May 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sig/azure Categorizes item related to Azure jobs size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants