installer template: add 'recover from etcd quorum loss' test #3572
Fixed in the latest commit: remove ssh-bastion, scale etcd-quorum-guard to 3, and remove the etcd-signer pod.
wking left a comment:
Looks good. Left a few nits inline. It's not clear to me how much of this code is staying vs. moving out to the machine-config operator. Is it just the here-docs moving out? If so, there's a bit of shared functionality between this and restore-cluster-state that could be pulled out into helper functions to stay DRY.
ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
What are we waiting for here? This seems brittle.
The Machine API doesn't wait for the machine to be deleted during the oc delete machine call, so the API stays available for a few seconds after the two masters are removed.
This pause is necessary to confirm that the API is no longer responding.
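One way to make this wait less dependent on timing is to poll until a probe command starts failing, with a deadline, instead of sleeping for a fixed period. The following is only a sketch under assumptions: the helper name wait_until_failing and its arguments are hypothetical and not part of the template, and the commented-out probe assumes oc is on PATH with KUBECONFIG pointing at the cluster.

```shell
# Hypothetical helper (not from the template): run a probe command repeatedly
# until it fails, or give up once the deadline has passed.
#   $1 = timeout in seconds, $2 = polling interval in seconds, rest = probe command
wait_until_failing() {
  timeout_s="$1"
  interval_s="$2"
  shift 2
  elapsed=0
  while "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1   # probe still succeeds: the API never stopped responding
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0       # probe failed: the API is no longer responding
}

# Assumed usage in the test script:
# wait_until_failing 300 5 oc get nodes --request-timeout=5s
```

The function returns 0 as soon as the probe fails and 1 if the probe is still succeeding at the deadline, so the test can fail early with a clear reason rather than proceed against a live API.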
floating downloads 😭 😉
It's a temporary measure while we debug the MCO scripts.
This is fixed in #3842, which is the same PR plus the MCO scripts.
ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
Not part of the test framework that I can review, but if your etcd cluster is down to one node (because you killed two of three control-plane nodes), then how is the remaining etcd still functioning? I'd have expected it to be freaking out about having lost quorum and refusing to take possibly-split-brained actions. If there's some quick explanation for how this works, I'm very curious. If the explanation is longer, we should probably skip it to avoid distracting from the test script itself.
> if your etcd cluster is down to one node (because you killed two of three control-plane nodes), then how is the remaining etcd still functioning?

Running etcd-snapshot-restore.sh without parameters restores this etcd member so that it forms a one-node cluster: https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml#L141-L168.
Added a slightly better comment about this in b99911295.
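The arithmetic behind this exchange can be sketched as follows. etcd needs a majority of its configured members to agree, so one survivor out of three cannot make progress, while a cluster restored with a single member has a quorum of one. The function names below are illustrative only and do not come from the recovery scripts.

```shell
# Majority quorum for a cluster with $1 configured members: floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds when $2 healthy members are enough for a cluster of $1 members.
has_quorum() {
  [ "$2" -ge "$(quorum "$1")" ]
}

quorum 3   # prints 2: a three-member cluster needs two healthy members
quorum 1   # prints 1: a one-member cluster needs only itself
```

With these definitions, has_quorum 3 1 fails (the state right after two masters are deleted) and has_quorum 1 1 succeeds (the state after the member is restored as a one-node cluster).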
Fixed all outstanding issues and rebased on master. This will eventually be replaced by #3842; the changes to use MCO are contained in ad5a69f. Some scripts will land in the origin tests image, see openshift/origin#22887.
/test pj-rehearse
@vrutkovs: The following tests failed.
Superseded by #3842
Rework of #3495 - run etcd quorum loss as a dedicated test in installer template.
This adds a new optional test, e2e-etcd-quorum-loss, which simulates the quorum loss. At this point it's expected to fail, as a single etcd node doesn't get restored yet.
This PR uses a forked version of the scripts served by MCO. This will be fixed in a separate PR so that we can work in parallel on various small tasks related to DR.
TODO: