tests: add e2e tests to verify DR scenarios #23208
Conversation
5ffc112 to f0d56f4 (compare)
/test e2e-dr-quorum-tests

/test e2e-dr-snapshot-tests

/retest
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(mapiPods.Items).NotTo(o.BeEmpty())

survivingNodeName := mapiPods.Items[0].Spec.NodeName
Just a question: how random is the list here? Are we making any assumptions by always taking the first item in the list? I assume mapiPods is in no predetermined order, but I wanted to make sure.
The test picks the first one, as there should be just one instance of the machine-api-controller pod running.
OK, so in a sense we are not randomizing what survives. I know we need to do this for some reason, but the customer does not get to make this choice. Do we need to fix this core problem?
If we nuke machine-api-controller the operator should reschedule it to another master.
Is this just a matter of waiting for that to happen? Can we force it?
Correct, it would be rescheduled to a different master. The issue is that the Machine API won't allow removing the master it's running on.
So the test finds out which node the machine controller is running on and removes the other two - unfortunately, it can't kill random masters.
That should not affect customer scenarios: once a single etcd master is restored, machine-api-controller starts there, and the admin can create new masters via the Machine API.
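For illustration, here is a minimal sketch of the selection logic being discussed, assuming a client-go clientset and the openshift-machine-api namespace; the label selector and the helper name are hypothetical and do not reproduce the PR's actual code:

```go
package dr

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// survivingMaster returns the node hosting the machine-api-controller pod.
// The test keeps this master and removes the others, since the Machine API
// cannot delete the master its own controller is running on.
func survivingMaster(ctx context.Context, kc kubernetes.Interface) (string, error) {
	// The label selector is an assumption; the real test may locate the pod differently.
	pods, err := kc.CoreV1().Pods("openshift-machine-api").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=controller",
	})
	if err != nil {
		return "", err
	}
	if len(pods.Items) == 0 {
		return "", fmt.Errorf("no machine-api-controller pods found")
	}
	// A single controller replica is expected, so taking Items[0] is not a
	// random choice in practice.
	return pods.Items[0].Spec.NodeName, nil
}
```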
875a713 to e5b0e12 (compare)
b337b58 to 36f2a41 (compare)
@smarterclayton @deads2k should I keep
So the most minimal thing we can do right now is: you remove the suite stuff (make sure you're in disruptive), and then we update the disruptive job to run the disruptive suite and then run the e2e suite after. That does two things - it guarantees the suites run in random order and the basic cluster stays up, and it keeps your post-condition checking for now. That'll get the disruptive job up and running, and then we can step back and evaluate a deeper integration.
Done. I'll work on updating CI to run each DR test in the meantime.
Just update the disruptive suite job definition and remove your special suites job definitions |
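For context, suite selection in this harness comes down to matching Ginkgo test names against labels such as [Disruptive]. Below is a self-contained sketch of that idea; the type, its fields, and the suite name are illustrative and do not reproduce the actual cmd/openshift-tests definitions:

```go
package suites

import "strings"

// testSuite is an illustrative stand-in for the suite definitions kept in
// cmd/openshift-tests; the real type and its fields differ.
type testSuite struct {
	name    string
	matches func(testName string) bool
}

// disruptiveSuite selects only tests labelled [Disruptive], such as the DR
// scenarios in this PR, so they run separately from the ordinary e2e suite.
var disruptiveSuite = testSuite{
	name: "disruptive",
	matches: func(testName string) bool {
		return strings.Contains(testName, "[Disruptive]")
	},
}
```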
test/extended/operators/cluster.go (Outdated)
defer g.GinkgoRecover()

- g.It("have no crashlooping pods in core namespaces over two minutes", func() {
+ g.It("have no crashlooping pods in core namespaces over two minutes [IgnoreInDR]", func() {
Why are things crash looping after DR?
At least openshift-sdn and openshift-apiserver are still running and attempting to reach kube apiserver
How long does it take to clear?
It won't clear - this test fails if the pod has restarted more than 5 times. To clear that, the pod would have to be removed - and we'd lose its logs.
No, we would just add a condition that makes this test tolerant of clusters that have been running for a while.
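A rough sketch of the kind of tolerance being suggested, not the actual check in cluster.go; the base threshold of 5 comes from the discussion above, while the age cutoff and the extra allowance are assumptions:

```go
package operators

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// restartBudget returns how many container restarts the test tolerates. The
// base of 5 matches the existing behaviour described above; the relaxed
// budget for long-running clusters is the hypothetical condition suggested
// in review.
func restartBudget(clusterAge time.Duration) int32 {
	const base int32 = 5
	if clusterAge > 24*time.Hour {
		// A cluster that has been up for a while (e.g. through a DR drill)
		// may legitimately have accumulated a few extra restarts.
		return base + 5
	}
	return base
}

// exceedsRestartBudget reports whether any container in the pod restarted
// more times than the tolerated budget.
func exceedsRestartBudget(pod corev1.Pod, clusterAge time.Duration) bool {
	budget := restartBudget(clusterAge)
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.RestartCount > budget {
			return true
		}
	}
	return false
}
```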
runQueries(tests, oc, ns, execPodName, url, bearerToken)
})
- g.It("should report less than two alerts in firing or pending state", func() {
+ g.It("should report less than two alerts in firing or pending state [IgnoreInDR]", func() {
Why are alerts firing after you recover?
This one flakes if the test happens to run at the beginning of the suite - usually it's either a crashlooping openshift-apiserver or a kubelet alert.
I think in the jobs we'll just create a wait of 1-3 minutes before the e2es kick off. So you can remove this bit of code (excluding these tests) and remove the DR suite in cmd/openshift-tests.
I'd rather do this in a follow-up PR, does that sound good?
No, I don't think these labels should be added at all, and I'm trying to get you to remove your suite so you use the disruptive suite as we intended.
Turns out it's a regression in 4.1.x (and in 4.2, as the test shows) - https://bugzilla.redhat.com/show_bug.cgi?id=1731827
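A minimal sketch of the settle period mentioned above, assuming the job (or a pre-check) can count pending/firing alerts through some helper; the countAlerts callback, the 10-second interval, and the 3-minute timeout are assumptions based on the suggested 1-3 minute wait:

```go
package dr

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForAlertsToSettle polls a caller-supplied alert counter until fewer
// than two alerts are pending or firing, or the timeout expires. countAlerts
// is a hypothetical helper (e.g. a thin wrapper around a Prometheus query).
func waitForAlertsToSettle(countAlerts func() (int, error)) error {
	return wait.PollImmediate(10*time.Second, 3*time.Minute, func() (bool, error) {
		n, err := countAlerts()
		if err != nil {
			// Transient query errors right after recovery are expected; keep polling.
			fmt.Printf("alert query failed, retrying: %v\n", err)
			return false, nil
		}
		return n < 2, nil
	})
}
```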
/retest
hexfusion left a comment
/lgtm
great work @vrutkovs!
/test e2e-dr-quorum-tests
@vrutkovs: The following tests failed, say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Rebased and removed
smarterclayton left a comment
Fix the filenames and then I’ll merge this
These tests are equivalent to the bash functions in CI's installer template. They run in a dedicated subcommand and ensure that cluster state is properly restored after the etcd snapshot restore procedure.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, hexfusion, smarterclayton, vrutkovs
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold cancel
This adds new disruptive tests for DR scenarios: etcd snapshot restore (part of #23080) and etcd quorum restore.
TODO:
- Some cluster operators never became available: <nil>/marketplace, <nil>/operator-lifecycle-manager-packageserver
- verify the "Managed cluster should have no crashlooping pods in core namespaces over two minutes" tests - frequently fails on etcd quorum restore