Conversation

@vrutkovs
Contributor

@vrutkovs vrutkovs commented Jun 19, 2019

This adds new disruptive tests for DR scenarios: etcd snapshot restore (part of #23080) and etcd quorum restore.

TODO:

  • "No user exists for uid 1475100000"
  • Snapshot restore - first update won't complete in time
  • Snapshot restore - some cluster operators never became available: <nil>/marketplace, <nil>/operator-lifecycle-manager-packageserver
  • Fix the verify test
  • Other tests are failing after the DR scenario. SDN issue?
  • Make sure tests are consistently passing
  • Disable/skip the "Managed cluster should have no crashlooping pods in core namespaces over two minutes" test - it frequently fails on etcd quorum restore

@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 19, 2019
@vrutkovs
Contributor Author

/cc @deads2k @smarterclayton @hexfusion

@vrutkovs vrutkovs force-pushed the dr-quorum branch 2 times, most recently from 5ffc112 to f0d56f4 Compare June 19, 2019 11:35
@vrutkovs
Contributor Author

/test e2e-dr-quorum-tests

@vrutkovs
Contributor Author

/test e2e-dr-snapshot-tests

@vrutkovs
Contributor Author

/retest

o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(mapiPods.Items).NotTo(o.BeEmpty())

// The master hosting the machine-api-controller pod is the one that must survive.
survivingNodeName := mapiPods.Items[0].Spec.NodeName
Contributor

Just a question: how random is this list? Are we making any assumptions by always taking the first entry? I assume mapiPods is in no predetermined order, but I wanted to make sure.

Contributor Author

The test picks the first one, as there should be only one instance of the machine-api-controller pod running.

Contributor

OK, so in a sense we are not randomizing which master survives. I know we need to do this for some reason, but the customer does not get to make this choice. Do we need to fix this underlying problem?

Contributor

If we nuke machine-api-controller the operator should reschedule it to another master.

Contributor

Is this just a matter of waiting for that to happen? Can we force it?

Contributor Author

@vrutkovs vrutkovs Jun 20, 2019

Correct, it would be rescheduled to a different master. The issue is that the Machine API won't allow removing the master it is running on.

So the test finds out which node the machine controller is running on and removes the other two - unfortunately it can't kill random masters.

That should not affect customer scenarios: once a single etcd master is restored, machine-api-controller starts there, and the admin is able to create new masters via the Machine API.
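For readers without the diff in front of them, here is a minimal sketch of the selection logic described above - not the PR's actual code. It assumes a standard client-go clientset; the function name, namespace, and label selector are illustrative assumptions.

package dr

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// selectSurvivingMaster mirrors the logic discussed above: the master hosting
// the machine-api-controller pod must survive, because the Machine API cannot
// delete the node it is running on; every other master is a removal candidate.
func selectSurvivingMaster(ctx context.Context, c kubernetes.Interface) (survivor string, victims []string, err error) {
	pods, err := c.CoreV1().Pods("openshift-machine-api").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=controller", // assumed label of the machine-api-controller pod
	})
	if err != nil {
		return "", nil, err
	}
	if len(pods.Items) == 0 {
		return "", nil, fmt.Errorf("machine-api-controller pod not found")
	}
	// There should be exactly one machine-api-controller pod, so Items[0] is deterministic.
	survivor = pods.Items[0].Spec.NodeName

	masters, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/master",
	})
	if err != nil {
		return "", nil, err
	}
	for _, n := range masters.Items {
		if n.Name != survivor {
			victims = append(victims, n.Name)
		}
	}
	return survivor, victims, nil
}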

@vrutkovs vrutkovs force-pushed the dr-quorum branch 2 times, most recently from 875a713 to e5b0e12 Compare June 20, 2019 17:24
@vrutkovs vrutkovs force-pushed the dr-quorum branch 8 times, most recently from b337b58 to 36f2a41 Compare July 5, 2019 09:04
@vrutkovs
Contributor Author

@smarterclayton @deads2k should I keep the openshift-tests run-dr restore-snapshot / quorum-restore command line, or use TEST_FOCUS / TEST_SKIP to match a single test instead?

@smarterclayton
Contributor

smarterclayton commented Jul 15, 2019

So the most minimal thing we can do right now is: you remove the suite stuff (make sure you're in disruptive), and then we update the disruptive job to run the disruptive suite and then run the e2e suite after. That does two things - it guarantees the suites run in random order and the basic cluster stays up, and it keeps your post-condition checking for now.

That'll get the disruptive job up and running and then we can step back and evaluate a deeper integration.
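To make the suggestion concrete, here is a hedged Ginkgo sketch of what "make sure you're in disruptive" looks like from the test side; the tag strings and test names are assumptions, not text from this PR - the point is only that the disruptive suite can pick the test up by matching tags in its name, with no separate suite definition.

package dr

import (
	g "github.com/onsi/ginkgo"
)

// The [Disruptive] tag (plus an assumed feature tag) is what places this spec
// in the disruptive suite; no dedicated DR suite is registered.
var _ = g.Describe("[Feature:DisasterRecovery][Disruptive] etcd snapshot restore", func() {
	defer g.GinkgoRecover()

	g.It("should restore the cluster state from an etcd snapshot", func() {
		// restore procedure and post-condition checks would go here
	})
})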

@vrutkovs
Contributor Author

"you remove the suite stuff (make sure you're in disruptive)"

Done.

I'll work on updating CI to run each DR test in the meantime.

@smarterclayton
Contributor

smarterclayton commented Jul 15, 2019

Just update the disruptive suite job definition and remove your special suite job definitions.

defer g.GinkgoRecover()

- g.It("have no crashlooping pods in core namespaces over two minutes", func() {
+ g.It("have no crashlooping pods in core namespaces over two minutes [IgnoreInDR]", func() {
Contributor

Why are things crash looping after DR?

Contributor Author

At least openshift-sdn and openshift-apiserver are still running and attempting to reach the kube-apiserver.

Contributor

How long does it take to clear?

Contributor Author

It won't clear - this test fails if a pod has restarted more than 5 times. To clear that, the pod would have to be removed, and we'd lose its logs.

Contributor

No, we would just add a condition that makes this test tolerant of clusters that have been running for a while.
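A possible shape for that condition, sketched here only as an illustration (the helper name and the time-window approach are assumptions, not code from this PR or the existing test): only count restarts whose most recent crash happened inside the test window, so an older, long-running cluster isn't failed for historical restarts.

package dr

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// recentlyCrashlooping flags a pod only if it has exceeded the restart
// threshold AND its most recent crash is newer than windowStart.
func recentlyCrashlooping(pod corev1.Pod, windowStart time.Time, maxRestarts int32) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.RestartCount <= maxRestarts {
			continue
		}
		term := cs.LastTerminationState.Terminated
		if term != nil && term.FinishedAt.Time.After(windowStart) {
			return true
		}
	}
	return false
}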

runQueries(tests, oc, ns, execPodName, url, bearerToken)
})
- g.It("should report less than two alerts in firing or pending state", func() {
+ g.It("should report less than two alerts in firing or pending state [IgnoreInDR]", func() {
Contributor

Why are alerts firing after you recover?

Contributor Author

This one flakes if the test randomly starts at the beginning of the suite - usually it's either a crashlooping openshift-apiserver or a kubelet alert.

Contributor

I think in the jobs we'll just create a wait of 1-3 minutes before the e2es kick off. So you can remove this bit of code (excluding these tests) and remove the DR suite in cmd/openshift-tests.

Contributor Author

I'd rather do this in a follow-up PR - does that sound good?

Contributor

No, I don't think these labels should be added at all, and I'm trying to get you to remove your suite so you use the disruptive suite as we intended.

Contributor Author

Turns out it's a regression in 4.1.x (and in 4.2, as the test shows) - https://bugzilla.redhat.com/show_bug.cgi?id=1731827

@vrutkovs
Contributor Author

/retest

Contributor

@hexfusion hexfusion left a comment

/lgtm

great work @vrutkovs!

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 17, 2019
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2019
@openshift-ci-robot openshift-ci-robot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 29, 2019
@vrutkovs
Contributor Author

/test e2e-dr-quorum-tests
/test e2e-dr-snapshot-tests

@openshift-ci-robot

@vrutkovs: The following tests failed, say /retest to rerun them all:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-dr-quorum-tests    3a744e4  link     /test e2e-dr-quorum-tests
ci/prow/e2e-dr-snapshot-tests  3a744e4  link     /test e2e-dr-snapshot-tests

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vrutkovs
Contributor Author

Rebased and removed the /dr suite; @smarterclayton @hexfusion PTAL

Contributor

@smarterclayton smarterclayton left a comment

Fix the filenames and then I’ll merge this

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2019
These tests are equivalent to a bash function in CI's installer template.
They run in a dedicated subcommand and ensure cluster state is properly
restored after the etcd snapshot restore procedure.
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 30, 2019
@smarterclayton
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 30, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, hexfusion, smarterclayton, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vrutkovs
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 30, 2019
@openshift-merge-robot openshift-merge-robot merged commit b0c2355 into openshift:master Jul 30, 2019