Skip to content

Conversation

@hexfusion
Copy link
Contributor

@hexfusion hexfusion commented May 15, 2019

This PR adds disaster recovery tools used for folks to restore a cluster from failure. We are also providing detailed documentation on how to use these scripts.

/cc @sdodson @patrickdillon @runcom @vrutkovs

Code is pulled from

https://github.com/hexfusion/openshift-recovery/tree/unify

This PR is required to resolve: https://jira.coreos.com/browse/PROD-1027

@openshift-ci-robot openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 15, 2019
Copy link
Contributor

@vrutkovs vrutkovs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cgwalters
Copy link
Member

I thought we'd agreed to ship these in a container?

@cgwalters
Copy link
Member

If we are going to ship them here, we should probably add infrastructure to take literal files from git rather than wrapped in template fragments, as the template goop makes directly reusing them more annoying and confuses editors (e.g. we want editors to detect them as shell, not YAML).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick question: is my understanding correct that this tar file together with the manifest files which are pulled from other github repos will be baked inside the MCO image when a new release is cut and as such will all (including other operators) be wrapped in the CVO ?

Copy link
Contributor Author

@hexfusion hexfusion May 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I am following completely. When this function is called [1] it will download a local copy of etcdctl into the callers local ~./assets/bin this binary will be used by various other functions for disaster recovery. So the tar itself is not baked in only the function to download it. The tar and etcdctl binary will not live on the cluster unlessdl_etcdctl function is called. IN an upgrade $ETCD_VERSION may iterate. We will document that the asset dir should be archived after the restore is complete.

[1] https://github.com/openshift/machine-config-operator/blob/3b266bed1fbb31f5f608c33ad87c8b7d3ccf1472/templates/master/00-master/_base/files/usr-local-bin-etcd-member-recover-sh.yaml#L52

Copy link
Contributor

@DanyC97 DanyC97 May 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hexfusion thanks for taking the time to respond 👍

I'm still new, trying to get my head around how the whole solution is built and packaged, hence apologize for silly questions

The full context for my ask is around: how will this work in a air-gap/ disconnected environment w/o internet access ?

if is not in scope then please ignore me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The full context for my ask is around: how will this work in a air-gap/ disconnected environment w/o internet access ?

In general, this is the first iteration of disaster recovery and we will improve on this concept until we get to a point where an operator handles these tasks autonomously. So yes I expect there to be areas for improvement and corner cases to solve. But we wanted to provide some tooling to help in the restoration process. Airgap DR would be a good use case to solve for in 4.2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair point, thanks for feedback @hexfusion (i don't have full picture and so excuse my silly questions)

@hexfusion
Copy link
Contributor Author

/retest

@vrutkovs
Copy link
Contributor

we should probably add infrastructure to take literal files from git rather than wrapped in template fragments

MCO can template this script, right? SETUP_ETCD_ENVIRONMENT here is <release>:setup-etcd-environment container. Its worth fixing later on, once we integrate CI for these files

@hexfusion
Copy link
Contributor Author

MCO can template this script, right? SETUP_ETCD_ENVIRONMENT here is :setup-etcd-environment container. Its worth fixing later on, once we integrate CI for these files

That is a great idea and then it gets updated on upgrade right?

@hexfusion
Copy link
Contributor Author

/retest

@hexfusion
Copy link
Contributor Author

/retest

1 similar comment
@runcom
Copy link
Member

runcom commented May 16, 2019

/retest

@vrutkovs
Copy link
Contributor

That is a great idea and then it gets updated on upgrade right?

Yep, it would take care of the upgrade. I'd prefer to setup CI for MCO first and then template this var

@hexfusion hexfusion changed the title templates/master/00-master: add disaster recovery tools BUG 1679215 : templates/master/00-master: add disaster recovery tools May 16, 2019
@hexfusion hexfusion changed the title BUG 1679215 : templates/master/00-master: add disaster recovery tools JIRA PROD-1027 : templates/master/00-master: add disaster recovery tools May 16, 2019
@runcom
Copy link
Member

runcom commented May 16, 2019

/approve
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, runcom, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 16, 2019
@hexfusion
Copy link
Contributor Author

/hold

Going to run a manual test on this then will let er fly.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 16, 2019
@cgwalters
Copy link
Member

Don't consider this blocking I guess but: I struggle to understand how it's either easier or better to inject them into the host than to ship a container. If they're on the host, people are going to be surprised if we ever delete them and move them to a container.

Having them a container is just infinitely more flexible - and remember one can directly podman run in scenarios where Kube isn't up.

@cgwalters
Copy link
Member

We could even just ship them in our etcd container right?

@vrutkovs
Copy link
Contributor

This container would have to be built by ART and included in the payload. Seems like a good idea for 4.2, but too late for 4.1 probably

@runcom
Copy link
Member

runcom commented May 16, 2019

agreed on shipping this here in the MCO and move those to containers in the Z/4.2 timeframe with a deprecation/both-supported window till we fully remove them from here.

@hexfusion
Copy link
Contributor Author

Manual DR testing against these MCO deliverables was a success.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 16, 2019
@openshift-merge-robot openshift-merge-robot merged commit 99d419d into openshift:master May 16, 2019
@runcom
Copy link
Member

runcom commented May 16, 2019

we're not backporting this to 4.1 just yet right @hexfusion ?

@hexfusion
Copy link
Contributor Author

@runcom lets have QE sign off?

@runcom
Copy link
Member

runcom commented May 16, 2019

@runcom lets have QE sign off?

sounds good to me, just wanted to understand if we need 4.1 code by today or not for this

@hexfusion hexfusion deleted the add_dr branch May 16, 2019 17:24
@hexfusion
Copy link
Contributor Author

My understanding is 4.1 is EOD Friday.

@hexfusion
Copy link
Contributor Author

/cherrypick release-4.1

@openshift-cherrypick-robot

@hexfusion: new pull request created: #769

Details

In response to this:

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants