Conversation

@vrutkovs
Contributor

@vrutkovs vrutkovs commented Jun 19, 2019

This adds new disruptive tests for DR scenarios: etcd snapshot restore (part of #23080) and etcd quorum restore.

TODO:

  • "No user exists for uid 1475100000"
  • Snapshot restore - first update won't complete in time
  • Snapshot restore - some cluster operators never became available: <nil>/marketplace, <nil>/operator-lifecycle-manager-packageserver
  • Fix the verify test
  • Other tests are failing after the DR scenario. SDN issue?
  • Make sure tests are consistently passing
  • Disable/skip the "Managed cluster should have no crashlooping pods in core namespaces over two minutes" test - it frequently fails on etcd quorum restore

@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 19, 2019
@vrutkovs
Contributor Author

/cc @deads2k @smarterclayton @hexfusion

@vrutkovs vrutkovs force-pushed the dr-quorum branch 2 times, most recently from 5ffc112 to f0d56f4 Compare June 19, 2019 11:35
@vrutkovs
Contributor Author

/test e2e-dr-quorum-tests

@vrutkovs
Contributor Author

/test e2e-dr-snapshot-tests

@vrutkovs
Contributor Author

/retest

o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(mapiPods.Items).NotTo(o.BeEmpty())

// The master hosting the machine-api-controller pod is the one that must survive.
survivingNodeName := mapiPods.Items[0].Spec.NodeName
Contributor

Just a question: how random is this list? Are we making any assumptions by always taking the first entry? I assume mapiPods is in no predetermined order, but I wanted to make sure.

Contributor Author

The test picks the first one, as there should be only one instance of the machine-api-controller pod running.

Contributor

OK, so in a sense we are not randomizing which master survives. I know we need to do this for some reason, but the customer does not get to make this choice. Do we need to fix this underlying problem?

Contributor

If we nuke machine-api-controller the operator should reschedule it to another master.

Contributor

Is this just a matter of waiting for that to happen? Can we force it?

Contributor Author

@vrutkovs vrutkovs Jun 20, 2019

Correct, it would be rescheduled to a different master. The issue is that the Machine API won't allow removing the master it is running on.

So the test finds out which node the machine controller is running on and removes the other two - unfortunately it can't kill random masters.

That should not affect customer scenarios: once a single etcd master is restored, machine-api-controller starts there, and the admin is able to create new masters via the Machine API.
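For readers without the diff in front of them, here is a minimal sketch of the selection logic described above - not the PR's actual code. It assumes a standard client-go clientset; the function name, namespace, and label selector are illustrative assumptions.

package dr

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// selectSurvivingMaster mirrors the logic discussed above: the master hosting
// the machine-api-controller pod must survive, because the Machine API cannot
// delete the node it is running on; every other master is a removal candidate.
func selectSurvivingMaster(ctx context.Context, c kubernetes.Interface) (survivor string, victims []string, err error) {
	pods, err := c.CoreV1().Pods("openshift-machine-api").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=controller", // assumed label of the machine-api-controller pod
	})
	if err != nil {
		return "", nil, err
	}
	if len(pods.Items) == 0 {
		return "", nil, fmt.Errorf("machine-api-controller pod not found")
	}
	// There should be exactly one machine-api-controller pod, so Items[0] is deterministic.
	survivor = pods.Items[0].Spec.NodeName

	masters, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/master",
	})
	if err != nil {
		return "", nil, err
	}
	for _, n := range masters.Items {
		if n.Name != survivor {
			victims = append(victims, n.Name)
		}
	}
	return survivor, victims, nil
}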

@vrutkovs vrutkovs force-pushed the dr-quorum branch 2 times, most recently from 875a713 to e5b0e12 Compare June 20, 2019 17:24
@vrutkovs vrutkovs force-pushed the dr-quorum branch 8 times, most recently from b337b58 to 36f2a41 Compare July 5, 2019 09:04
@vrutkovs
Contributor Author

@smarterclayton @deads2k should I keep the openshift-tests run-dr restore-snapshot / quorum-restore command line, or use TEST_FOCUS / TEST_SKIP to match a single test instead?

@smarterclayton
Contributor

smarterclayton commented Jul 15, 2019

So the most minimal thing we can do right now is: you remove the suite stuff (make sure you're in disruptive), and then we update the disruptive job to run the disruptive suite and then run the e2e suite after. That does two things - it guarantees the suites run in random order and the basic cluster stays up, and it keeps your post-condition checking for now.

That'll get the disruptive job up and running and then we can step back and evaluate a deeper integration.
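To make the suggestion concrete, here is a hedged Ginkgo sketch of what "make sure you're in disruptive" looks like from the test side; the tag strings and test names are assumptions, not text from this PR - the point is only that the disruptive suite can pick the test up by matching tags in its name, with no separate suite definition.

package dr

import (
	g "github.com/onsi/ginkgo"
)

// The [Disruptive] tag (plus an assumed feature tag) is what places this spec
// in the disruptive suite; no dedicated DR suite is registered.
var _ = g.Describe("[Feature:DisasterRecovery][Disruptive] etcd snapshot restore", func() {
	defer g.GinkgoRecover()

	g.It("should restore the cluster state from an etcd snapshot", func() {
		// restore procedure and post-condition checks would go here
	})
})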

@vrutkovs
Contributor Author

"you remove the suite stuff (make sure you're in disruptive)"

Done.

I'll work on updating CI to run each DR test in the meantime.

@smarterclayton
Contributor

smarterclayton commented Jul 15, 2019

Just update the disruptive suite job definition and remove your special suite job definitions.

defer g.GinkgoRecover()

- g.It("have no crashlooping pods in core namespaces over two minutes", func() {
+ g.It("have no crashlooping pods in core namespaces over two minutes [IgnoreInDR]", func() {
Contributor

Why are things crash looping after DR?

Contributor Author

At least openshift-sdn and openshift-apiserver are still running and attempting to reach the kube-apiserver.

Contributor

How long does it take to clear?

Contributor Author

It won't clear - this test fails if a pod has restarted more than 5 times. To clear that, the pod would have to be removed, and we'd lose its logs.

Contributor

No, we would just add a condition that makes this test tolerant of clusters that have been running for a while.
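A possible shape for that condition, sketched here only as an illustration (the helper name and the time-window approach are assumptions, not code from this PR or the existing test): only count restarts whose most recent crash happened inside the test window, so an older, long-running cluster isn't failed for historical restarts.

package dr

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// recentlyCrashlooping flags a pod only if it has exceeded the restart
// threshold AND its most recent crash is newer than windowStart.
func recentlyCrashlooping(pod corev1.Pod, windowStart time.Time, maxRestarts int32) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.RestartCount <= maxRestarts {
			continue
		}
		term := cs.LastTerminationState.Terminated
		if term != nil && term.FinishedAt.Time.After(windowStart) {
			return true
		}
	}
	return false
}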

runQueries(tests, oc, ns, execPodName, url, bearerToken)
})
- g.It("should report less than two alerts in firing or pending state", func() {
+ g.It("should report less than two alerts in firing or pending state [IgnoreInDR]", func() {
Contributor

Why are alerts firing after you recover?

Contributor Author

This one flakes if the test randomly starts at the beginning of the suite - usually it's either a crashlooping openshift-apiserver or a kubelet alert.

Contributor

I think in the jobs we'll just create a wait of 1-3 minutes before the e2es kick off. So you can remove this bit of code (excluding these tests) and remove the DR suite in cmd/openshift-tests.

Contributor Author

I'd rather do this in a follow-up PR - does that sound good?

Contributor

No, I don't think these labels should be added at all, and I'm trying to get you to remove your suite so you use the disruptive suite as we intended.

Contributor Author

Turns out it's a regression in 4.1.x (and in 4.2, as the test shows) - https://bugzilla.redhat.com/show_bug.cgi?id=1731827

@vrutkovs
Contributor Author

/retest

Contributor

@hexfusion hexfusion left a comment

/lgtm

great work @vrutkovs!

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 17, 2019
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2019
@openshift-ci-robot openshift-ci-robot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 29, 2019
@vrutkovs
Contributor Author

/test e2e-dr-quorum-tests
/test e2e-dr-snapshot-tests

@openshift-ci-robot

@vrutkovs: The following tests failed, say /retest to rerun them all:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-dr-quorum-tests    3a744e4  link     /test e2e-dr-quorum-tests
ci/prow/e2e-dr-snapshot-tests  3a744e4  link     /test e2e-dr-snapshot-tests

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vrutkovs
Contributor Author

Rebased and removed the /dr suite; @smarterclayton @hexfusion PTAL

Contributor

@smarterclayton smarterclayton left a comment

Fix the filenames and then I’ll merge this

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2019
These tests are equivalent to a bash function in CI's installer template.
They run in a dedicated subcommand and ensure cluster state is properly
restored after the etcd snapshot restore procedure.
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 30, 2019
@smarterclayton
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 30, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, hexfusion, smarterclayton, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vrutkovs
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 30, 2019
@openshift-merge-robot openshift-merge-robot merged commit b0c2355 into openshift:master Jul 30, 2019