@vrutkovs
Contributor

@vrutkovs vrutkovs commented Apr 16, 2019

Create a new template that tests cluster disaster recovery when two masters are down.

TODO:

  • Use the AWS CLI to remove masters. Apparently the Machine API won't let me destroy two masters for some reason; this seems to be an etcd issue.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 16, 2019
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 16, 2019
@sdodson sdodson changed the title WIP Add discovery recovery template WIP Add disaster recovery template Apr 16, 2019
@sdodson
Member

sdodson commented Apr 16, 2019

Nice start. Knowing that we'll have this and a cluster state rollback, should we name the file/job so that it's clear this is specific to the infrastructure replacement DR scenario? Otherwise, if it works as advertised, I think we've completed the first step for the first scenario.

@vrutkovs vrutkovs force-pushed the upi-disaster-recovery-template branch from ebc11c1 to 80c72c2 Compare April 17, 2019 11:40
@vrutkovs
Contributor Author

Renamed to cluster-disaster-control-plane-ipi to indicate which scenario and install type are being tested.

@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 17, 2019
@vrutkovs vrutkovs changed the title WIP Add disaster recovery template Add disaster recovery template Apr 17, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2019
@sdodson
Member

sdodson commented Apr 22, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sdodson, vrutkovs
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: smarterclayton

If they are not already assigned, you can assign the PR to them by writing /assign @smarterclayton in a comment when ready.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdodson
Member

sdodson commented Apr 22, 2019

@wking or @staebler, can we get an approval on this?

The goal here is to add a CI job that causes quorum loss; other teams will then contribute tasks to restore the cluster. This job is not currently tied to any repo or scheduled periodically.

Member

Nit on the filename: I think we can assume IPI and only call out UPI in UPI templates (which is our current pattern), so I'd prefer cluster-disaster-control-plane.yaml.

@sdodson sdodson force-pushed the upi-disaster-recovery-template branch from aecb327 to 2bb056c Compare April 22, 2019 18:03
@openshift-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2019
@sdodson
Member

sdodson commented Apr 22, 2019

Renamed and squashed down.

done

echo "Destroy two masters"
oc --request-timeout=5s -n openshift-machine-api delete machines ${MASTER_MACHINES_TO_REMOVE[@]}
Member

@wking wking Apr 22, 2019

One way to address this without a new template would be to drop all of this new stuff into a shell function, and jobs that wanted to trigger it could use:

- name: TEST_COMMAND
  value: |
    destroy-some-control-plane-machines
    TEST_SUITE=openshift/conformance/parallel run-tests

Contributor

When you say drop this stuff into a new shell function, are you saying the new shell function would be added to an already existing template, like cluster-launch-installer-e2e.yaml or another template? That makes sense, but I wanted to make sure that is your point.

Member

> When you say drop this stuff into a new shell function, are you saying the new shell function would be added to an already existing template, like cluster-launch-installer-e2e.yaml or another template?

Yup, right here, with:

function destroy-some-control-plane-machines() {
  ...bla bla...
}
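
For illustration only, here is a minimal sketch of what the body of such a function could look like. It is not the code from this PR or from #3572. The optional count argument, the label selector used to find master machines, and the use of `-o name` output are all assumptions; the `oc` flags and namespace mirror the delete command quoted above.

```shell
#!/bin/bash
# Hypothetical sketch: delete the first N control-plane machines so that
# etcd loses quorum. The count argument (default 2) and the label selector
# below are assumptions, not taken from the PR.
function destroy-some-control-plane-machines() {
  local count="${1:-2}"
  local machines=()
  local name

  # Collect master machine names, one per line, keeping only the first N.
  while IFS= read -r name; do
    [[ -n "${name}" ]] && machines+=("${name}")
  done < <(
    oc --request-timeout=5s -n openshift-machine-api get machines \
      -l machine.openshift.io/cluster-api-machine-role=master -o name |
      head -n "${count}"
  )

  # Refuse to continue if fewer machines were found than requested.
  if (( ${#machines[@]} < count )); then
    echo "Expected at least ${count} master machines, found ${#machines[@]}" >&2
    return 1
  fi

  echo "Destroy ${count} masters"
  # Quote the array expansion so each machine name stays a single argument.
  oc --request-timeout=5s -n openshift-machine-api delete "${machines[@]}"
}
```

A job would then call `destroy-some-control-plane-machines` from its TEST_COMMAND before running the test suite, as in the snippet earlier in this thread.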

Contributor Author

Created #3572 to play with this idea

@vrutkovs
Contributor Author

Closed in favor of #3572

@vrutkovs vrutkovs closed this Apr 26, 2019
@vrutkovs vrutkovs deleted the upi-disaster-recovery-template branch January 27, 2020 11:40