This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Documentation: Disaster recovery scenarios #432

Closed
aaronlevy opened this issue Apr 14, 2017 · 6 comments
Labels
kind/documentation Categorizes issue or PR as related to documentation. priority/P1

Comments

@aaronlevy
Contributor

We should document recovery of:

  • partial control-plane loss
  • loss of all api-servers
    • Recover from a backup of the api-server checkpoint data (document the backup process as well)
  • manual recovery from etcd backup (external etcd)
    • Extracting api-server manifest / secret data from etcd, then recovering using temporary static pods.
  • manual recovery from etcd backup (self-hosted etcd).
    • Extracting etcd manifest & api-server manifests, then recovering using temporary static pods.
@aaronlevy
Contributor Author

aaronlevy commented Apr 18, 2017

To expand on some of this a bit, and to include a strawman discussion of building a tool to handle some of the recovery mechanics:

Partial loss of control plane:

We are assuming we still have a running apiserver + etcd, and we just need to recover other control-plane components that are no longer scheduled for some reason. More specific discussion is in #112 -- but the pseudo process / UX might be:

bootkube recover pod-from deployment/kube-scheduler --target=node

or maybe

bootkube recover pod-from deployment/kube-scheduler --target-label=node-role.kubernetes.io/master
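
Either way, roughly the manual equivalent of what such a tool would automate might look like the following (a sketch only; it assumes the apiserver is still reachable, and the file names / paths are illustrative):

# Dump the self-hosted scheduler deployment for reference (apiserver must still be reachable):
kubectl -n kube-system get deployment kube-scheduler -o yaml > /tmp/kube-scheduler-deployment.yaml

# Hand-convert the pod template into a standalone static pod manifest
# (kind: Pod + the template's spec), then drop it into the kubelet's
# static manifest directory on a master node:
cp /tmp/kube-scheduler-pod.yaml /etc/kubernetes/manifests/kube-scheduler.yaml

# Once the self-hosted scheduler is healthy again, remove the temporary static pod:
rm /etc/kubernetes/manifests/kube-scheduler.yaml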

Loss of all api-servers

We have two main options here. One option would assume that we are also backing up checkpoints of "critical" pods (e.g. apiserver + anything it relies on like secrets). This seems relatively reasonable - but does make the backup/restore story have a few more moving pieces.

Another option would be to only expect access to etcd - and attempt recovery from there. This would be my initial preference because technically all needed state should exist directly in etcd.

The process would essentially be like a partial bootkube start, but you're pointing it at an existing etcd cluster that already has state that you want to extract:

All pseudo UX:

  • Extract the api-server manifest from etcd directly
bootkube recover pod-from daemonset/kube-apiserver --etcd-server=http://foo --target=/etc/kubernetes/manifests/kube-apiserver.yaml

This would:

  1. Extract apiserver daemonset manifest + all secrets / configMaps directly from etcd
  2. Place assets on disk (e.g. secrets/configmaps) then place the static pod at target
  3. Wait for self-hosted apiserver to recover, then delete assets off disk?
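
For reference, the raw state such a tool would read lives under the default /registry prefix in etcd. A minimal inspection sketch (assuming etcd3; with the default protobuf storage the values are not directly readable as YAML, so a real tool would decode them with the Kubernetes codecs):

# Sketch of inspecting the state the tool would extract (default /registry prefix assumed):
ETCDCTL_API=3 etcdctl --endpoints=http://foo get /registry/daemonsets/kube-system/kube-apiserver
ETCDCTL_API=3 etcdctl --endpoints=http://foo get /registry/secrets/kube-system/ --prefix --keys-only
ETCDCTL_API=3 etcdctl --endpoints=http://foo get /registry/configmaps/kube-system/ --prefix --keys-only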

The above UI might be trying to overload functionality too much -- we could also make this more explicit rather than generic:

bootkube recover apiserver --etcd-server=http://foo --target=/etc/kubernetes/manifests

Or maybe even be a new option in bootkube-start (instead of --asset-dir):

bootkube start --from-etcd-server=http://foo

Recovery from etcd backup (external etcd)

This would be a similar process to the one above, but following normal etcd-recovery documentation (e.g. starting a new etcd cluster from a backup). The change might be that we need to modify the apiserver manifest if the network addressability of the etcd cluster has changed (this could essentially just be documented).
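
A rough sketch of that flow, assuming an etcd3 snapshot (paths, names, and addresses below are illustrative):

# Restore the snapshot into a fresh data dir on the external etcd host:
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name etcd0 \
  --initial-cluster etcd0=http://10.0.0.5:2380 \
  --initial-advertise-peer-urls http://10.0.0.5:2380 \
  --data-dir /var/lib/etcd

# Start etcd against the restored data dir, then, if the cluster's address changed,
# update the --etcd-servers flag in the kube-apiserver manifest, e.g.:
#   --etcd-servers=http://10.0.0.5:2379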

Recovery from etcd backup (self-hosted etcd)

  1. On a node with the backup, start a single etcd static manifest (similar to the bootstrap etcd pod we run with bootkube start); see the sketch after this list.
  2. Once etcd is running, we extract an api-server manifest (directly from etcd) and start the temporary api-server pod.
  3. Now we have an api-server to interact with (don't need to directly interact with etcd)
  4. We likely need to remove the existing etcd-cluster TPR, then create a new one setting our temporary etcd pod as the seed.
  5. Let the etcd operator pivot the seed into a self-hosted etcd cluster
  6. Recover scheduler / controller-manager similarly to above.
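
For step 1, a sketch of what the temporary single-member etcd static pod might look like (image tag, file names, and ports are illustrative; --force-new-cluster drops the old membership so the restored data dir comes up as a single-member cluster):

# Sketch of step 1: temporary single-member etcd static pod pointed at the restored data dir.
cat <<EOF > /etc/kubernetes/manifests/etcd-recovery.yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd-recovery
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: etcd
    image: quay.io/coreos/etcd:v3.1.8
    command:
    - /usr/local/bin/etcd
    - --name=recovery
    - --data-dir=/var/lib/etcd
    - --listen-client-urls=http://0.0.0.0:2379
    - --advertise-client-urls=http://127.0.0.1:2379
    - --force-new-cluster
    volumeMounts:
    - name: data
      mountPath: /var/lib/etcd
  volumes:
  - name: data
    hostPath:
      path: /var/lib/etcd
EOF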

Pseudo UX:

bootkube recover from-etcd-backup --target=/etc/kubernetes/manifests --datadir=/var/lib/etcd

Or again, could be a new bootkube start option:

bootkube start --etcd-data-dir=/foo

More related discussion: #333 (comment)

@diegs diegs self-assigned this Apr 19, 2017
@coresolve
Contributor

this is an awesome awesome idea!

@aaronlevy
Contributor Author

Adding another failure case to consider:

From @dghubble in #498

Simply deleting the etcd pod in a single master setup (most common) seems to not reconcile and create a new pod, which causes a cluster outage. This is less resilient than the default on-host setup and other control plane pods, which can recover from pod failures. In the on-host case, one can repair any etcd issues (or systemd restarts them) as long as data is still present on disk. Longer discussion: coreos/tectonic-installer#601

Not sure what the best option is here. If the pod is deleted, you're essentially saying it should no longer run on the node.

As I mentioned in the original PR:

the checkpointer looks and says "I was able to contact an api-server, and looks like this pod is no longer scheduled -- it should be removed". Essentially GC (otherwise a checkpoint would exist forever -- even if the parent has successfully moved elsewhere).

We might be able to do something like add a mode for "don't remove even if unscheduled" -- but that could get weird (when should it be removed)?

@hongchaodeng
Contributor

hongchaodeng commented May 10, 2017

Simply deleting the etcd pod in a single master setup

What @dghubble described doesn't make any sense. In other words, is there any real scenario in which that happens?

All system pods would be managed by our control plane. Our bottom line is that whatever error happens, we can still recover everything. But we should not optimize for such an extreme, unrealistic scenario.

@dghubble
Contributor

dghubble commented May 10, 2017

Just because it hasn't been considered doesn't mean it won't happen. This WILL happen to users somewhere, somehow, and there will need to be an answer for how to recover. The only thing you can count on in real-world scenarios is the unexpected. Let's not dismiss this as extreme.

Plenty of resiliency tests simulate a failure by deleting a pod to ensure it recovers. People demo recovery this way as well. And accidents happen. This was nearly the first thing I tried when assessing the resiliency of self-hosted etcd.

Perhaps the better questions here are whether a checkpointed deployment per pod or a StatefulSet could help, or how recovery from the data that exists on disk could be done manually or automatically.

@aaronlevy
Contributor Author

This WILL happen to users somewhere, somehow and there will need to be an answer about how to recover.

Overall I agree, and the recovery tool will support this scenario (point it at an etcd data dir). I think the open question is whether there is some way we might headlessly recover from this scenario (without a user running a recovery step).

One option might be to simply delay the checkpoint garbage collection decision. In this scenario the checkpointer saw that the etcd pod had been removed from api (so GC'd the local checkpoint) - but the etcd pod was still running at that moment (otherwise it wouldn't have been able to contact the api). If we delayed the decision to GC, then we wouldn't remove the etcd checkpoint (because the apiserver would shortly become unavailable).

Essentially we're saying "delay the GC until some window of 'still working'".

whether a checkpointed deployment per pod or a StatefulSet could be of help

Can you expand on this? I don't follow what you mean.
