DR snapshot restore: use scripts provided by MCO #3828
91ef3d0 to 3eac0c5
oh, that's not good: |
1ce5def to 0d9bc25
4b86017 to 1aac106
Seems we need to add a two-minute pause to ensure this test passes. Other test failures are flakes.
826929a to f729a7e
/hold
Waiting for the MCO bugfix to land to make sure all tests pass.
f729a7e to f3f5594
4f99974 to a2e03ae
ssh bastion didn't start
/test pj-rehearse
a2e03ae to e0fd47b
/test pj-rehearse |
Looks good, but I will give time for @hexfusion to take a look. |
5717efe to d062872
/approve |
d062872 to 3f7b01c
3f7b01c to 0ccd9d9
PR topic references openshift/machine-config-operator#791. Looks like that's been closed in favor of the still-open openshift/machine-config-operator#793? Are we still waiting for that to land? [Edit: sounds like the plan is to land this first to help debug the MCO PR] |
@@ -182,7 +182,7 @@ presubmits:
 secretName: sentry-dsn
 trigger: '(?m)^/test (?:.*? )?e2e-aws-upgrade(?: .*?)?$'
 - agent: kubernetes
- always_run: false
+ always_run: true
I think we want a longer track-record of success before we do this (although it's really up to the MCO team). Currently, the past 24 hours have three failures and no success for this job.
We can't have passing tests before MCO scripts are debugged - and we can't properly test those without having a dedicated test for DR scenarios (e.g. openshift/machine-config-operator#793 (comment))
You can /test e2e-restore-cluster-state in that PR and it will run (and rerun after each bump) to help you debug that PR. No need to run this in all other MCO PRs while you debug that one.
It's not clear which MCO PR would break DR scenarios. Also, if this test is misbehaving, it can be skipped with /skip since it's optional.
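For readers unfamiliar with how a `/test <job>` comment selects a job: the `trigger` field in the hunk above is a regular expression matched against the comment body. The sketch below is only an illustration. `matches_trigger` is a hypothetical helper, and the pattern is an extended-regex translation of the `e2e-aws-upgrade` trigger (greedy instead of lazy groups, with `(?m)` implied because `grep` works line by line). The trigger for `e2e-restore-cluster-state` is not shown in this PR, so assuming it has the same shape is a guess.

```shell
#!/bin/sh
# Sketch: check whether a PR comment would match a presubmit trigger.
# matches_trigger is a hypothetical helper, not part of Prow or this PR.
# Pattern shape is translated from the e2e-aws-upgrade trigger:
#   (?m)^/test (?:.*? )?e2e-aws-upgrade(?: .*?)?$
matches_trigger() {
  job="$1"
  comment="$2"
  # grep matches each line of the comment separately, which plays the role
  # of the (?m) multi-line flag in the original pattern.
  printf '%s\n' "${comment}" | grep -Eq "^/test (.* )?${job}( .*)?\$"
}
```

Under this pattern the command has to start a line of the comment, which is why a comment such as `ssh bastion didn't start /test pj-rehearse` would need the command on its own line to have any effect.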
...s/openshift/machine-config-operator/openshift-machine-config-operator-master-presubmits.yaml
echo "Remove existing openshift-apiserver pods"
# This would ensure "Pod 'openshift-apiserver/apiserver-xxx' is not healthy: container openshift-apiserver has restarted more than 5 times" test won't fail
oc delete pod --all -n openshift-apiserver
hmm, this probably also blows away our logs for those pods? Maybe we want to pull down their logs into the shared artifacts volume before doing this?
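One way the suggestion above could look, as a rough sketch rather than what the PR actually does: `save_pod_logs` is a hypothetical helper, and the destination directory stands in for wherever the shared artifacts volume is mounted. It saves the current logs of each pod, and the logs of the previous (restarted) container instances where they exist, before the pods are deleted.

```shell
#!/bin/sh
# Sketch: save pod logs from a namespace before its pods are deleted.
# save_pod_logs and the directory layout are hypothetical, not from the PR.
save_pod_logs() {
  namespace="$1"
  dest="$2/pods/${namespace}"
  mkdir -p "${dest}"
  # `oc get pods -o name` prints one "pod/<name>" entry per line.
  for pod in $(oc get pods -n "${namespace}" -o name); do
    name="${pod#pod/}"
    # Logs of the currently running containers.
    oc logs -n "${namespace}" "${pod}" --all-containers \
      > "${dest}/${name}.log" 2>&1 || true
    # Logs of the previous container instances; this fails harmlessly
    # when a container has never restarted.
    oc logs -n "${namespace}" "${pod}" --all-containers --previous \
      > "${dest}/${name}_previous.log" 2>&1 || true
  done
}
```

It would be called as `save_pod_logs openshift-apiserver "${ARTIFACT_DIR}"` right before `oc delete pod --all -n openshift-apiserver`, where `ARTIFACT_DIR` is an assumed variable name for the artifacts directory.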
This commit updates the `restore-cluster-state` function used for DR tests. It leverages scripts that MCO deploys on the masters. This change also makes all MCO PRs run this test so that we can fix the scripts if necessary.
07d04f2 to a156af2
7358884 to a156af2
@vrutkovs: The following test failed, say
|
/approve |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: abhinavdahiya, hexfusion, runcom, vrutkovs
|
@vrutkovs: Updated the following 3 configmaps:
Most etcd scripts are now controlled by MCO, so the `restore-cluster-state` function now uses those instead of vendored scripts.
TODO:
The last node doesn't complete the recovery process, and a rerun does nothing, so the last node doesn't get rebooted correctly.