JIRA PROD-1027 : templates/master/00-master: add disaster recovery tools #757
vrutkovs
left a comment
LGTM

I thought we'd agreed to ship these in a container?

If we are going to ship them here, we should probably add infrastructure to take literal files from git rather than wrapped in template fragments, as the template goop makes directly reusing them more annoying and confuses editors (e.g. we want editors to detect them as shell, not YAML).

Quick question: is my understanding correct that this tar file, together with the manifest files pulled from other GitHub repos, will be baked into the MCO image when a new release is cut, and as such everything (including other operators) will be wrapped in the CVO?
Not sure I am following completely. When this function is called [1], it downloads a local copy of etcdctl into the caller's local ./assets/bin; various other disaster recovery functions then use this binary. So the tar itself is not baked in, only the function that downloads it. The tar and the etcdctl binary will not live on the cluster unless the dl_etcdctl function is called. In an upgrade, $ETCD_VERSION may change. We will document that the asset dir should be archived after the restore is complete.
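
For readers following along, here is a rough sketch of what a download-on-demand helper like the one described above could look like. It is not the code from this PR: the `etcd_release_url` helper, the `ASSET_DIR` default, the fallback `ETCD_VERSION`, and the use of the upstream etcd release tarball are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Illustrative sketch only, not the script shipped in this PR.
# ASSET_DIR, the default ETCD_VERSION, and etcd_release_url are assumptions.

ASSET_DIR="${ASSET_DIR:-./assets}"
ETCD_VERSION="${ETCD_VERSION:-v3.3.10}"

# Build the upstream release tarball URL for a given etcd version.
etcd_release_url() {
  local version="$1"
  echo "https://github.com/etcd-io/etcd/releases/download/${version}/etcd-${version}-linux-amd64.tar.gz"
}

# Download etcdctl into $ASSET_DIR/bin unless an executable copy is already there.
dl_etcdctl() {
  local bin_dir="${ASSET_DIR}/bin"
  local tmp_dir="${ASSET_DIR}/tmp"

  if [ -x "${bin_dir}/etcdctl" ]; then
    echo "etcdctl is already installed"
    return 0
  fi

  mkdir -p "${bin_dir}" "${tmp_dir}"
  # Fetch the tarball; fail early if the host has no route to the release server.
  curl -fsSL "$(etcd_release_url "${ETCD_VERSION}")" -o "${tmp_dir}/etcd.tar.gz" || return 1
  # The release tarball nests its files under one top-level directory; strip it.
  tar -xzf "${tmp_dir}/etcd.tar.gz" -C "${tmp_dir}" --strip-components=1 || return 1
  mv "${tmp_dir}/etcdctl" "${bin_dir}/etcdctl"
}
```

Nothing is downloaded until `dl_etcdctl` is actually called, which matches the point above that neither the tar nor the binary lives on the cluster beforehand. The early return when the binary already exists also means a pre-populated asset dir would skip the network fetch entirely.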
@hexfusion thanks for taking the time to respond 👍
I'm still new and trying to get my head around how the whole solution is built and packaged, so I apologize for the silly questions.
The full context for my ask is around: how will this work in an air-gapped/disconnected environment without internet access?
If this is not in scope, then please ignore me.
> The full context for my ask is around: how will this work in an air-gapped/disconnected environment without internet access?

In general, this is the first iteration of disaster recovery and we will improve on this concept until we get to a point where an operator handles these tasks autonomously. So yes I expect there to be areas for improvement and corner cases to solve. But we wanted to provide some tooling to help in the restoration process. Airgap DR would be a good use case to solve for in 4.2.
Fair point, thanks for the feedback @hexfusion (I don't have the full picture, so excuse my silly questions).

/retest

MCO can template this script, right? SETUP_ETCD_ENVIRONMENT here is

That is a great idea and then it gets updated on upgrade right?

/retest

...e/files/usr-local-share-openshift-recovery-template-kube-etcd-cert-signer-yaml-template.yaml
Signed-off-by: Sam Batschelet <[email protected]>

/retest

Yep, it would take care of the upgrade. I'd prefer to set up CI for MCO first and then template this var.

/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, runcom, vrutkovs

/hold
Going to run a manual test on this, then will let 'er fly.

Don't consider this blocking, I guess, but: I struggle to understand how it's either easier or better to inject them into the host than to ship a container. If they're on the host, people are going to be surprised if we ever delete them and move them to a container. Having them in a container is just infinitely more flexible - and remember one can directly

We could even just ship them in our etcd container, right?

This container would have to be built by ART and included in the payload. Seems like a good idea for 4.2, but probably too late for 4.1.

Agreed on shipping this here in the MCO and moving those to containers in the Z/4.2 timeframe, with a deprecation/both-supported window until we fully remove them from here.

Manual DR testing against these MCO deliverables was a success.
/hold cancel

We're not backporting this to 4.1 just yet, right @hexfusion?

@runcom let's have QE sign off?

Sounds good to me, just wanted to understand if we need 4.1 code by today or not for this.

My understanding is 4.1 is EOD Friday.

/cherrypick release-4.1

@hexfusion: new pull request created: #769

This PR adds disaster recovery tools that let users restore a cluster after a failure. We are also providing detailed documentation on how to use these scripts.
/cc @sdodson @patrickdillon @runcom @vrutkovs
Code is pulled from https://github.com/hexfusion/openshift-recovery/tree/unify
This PR is required to resolve: https://jira.coreos.com/browse/PROD-1027