This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Documentation: Disaster recovery scenarios #432

Closed
aaronlevy opened this issue Apr 14, 2017 · 6 comments
Labels
kind/documentation Categorizes issue or PR as related to documentation. priority/P1

Comments

@aaronlevy
Contributor

We should document recovery of:

  • partial control-plane loss
  • loss of all api-servers
    • Recover from a backup of the api-server checkpoint data (document the backup process as well)
  • manual recovery from etcd backup (external etcd)
    • Extracting api-server manifest / secret data from etcd, then recovering using temporary static pods.
  • manual recovery from etcd backup (self-hosted etcd).
    • Extracting etcd manifest & api-server manifests, then recovering using temporary static pods.
@aaronlevy
Contributor Author

aaronlevy commented Apr 18, 2017

To expand on some of this a bit, and to include a strawman discussion of building a tool to handle some of the recovery mechanics:

Partial loss of control plane:

We are assuming we still have a running apiserver + etcd, and we just need to recover other control-plane components that are no longer scheduled for some reason. More specific discussion is in #112 -- but the pseudo process / UX might be:

bootkube recover pod-from deployment/kube-scheduler --target=node

or maybe

bootkube recover pod-from deployment/kube-scheduler --target-label=node-role.kubernetes.io/master
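
Either way, roughly the manual equivalent of what such a tool would automate might look like the following (a sketch only; it assumes the apiserver is still reachable, and the file names / paths are illustrative):

# Dump the self-hosted scheduler deployment for reference (apiserver must still be reachable):
kubectl -n kube-system get deployment kube-scheduler -o yaml > /tmp/kube-scheduler-deployment.yaml

# Hand-convert the pod template into a standalone static pod manifest
# (kind: Pod + the template's spec), then drop it into the kubelet's
# static manifest directory on a master node:
cp /tmp/kube-scheduler-pod.yaml /etc/kubernetes/manifests/kube-scheduler.yaml

# Once the self-hosted scheduler is healthy again, remove the temporary static pod:
rm /etc/kubernetes/manifests/kube-scheduler.yaml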

Loss of all api-servers

We have two main options here. One option would assume that we are also backing up checkpoints of "critical" pods (e.g. apiserver + anything it relies on like secrets). This seems relatively reasonable - but does make the backup/restore story have a few more moving pieces.

Another option would be to only expect access to etcd - and attempt recovery from there. This would be my initial preference because technically all needed state should exist directly in etcd.

The process would essentially be like a partial bootkube start, but you're pointing it at an existing etcd cluster that already has state that you want to extract:

All pseudo UX:

  • Extract the api-server manifest from etcd directly
bootkube recover pod-from daemonset/kube-apiserver --etcd-server=http://foo --target=/etc/kubernetes/manifests/kube-apiserver.yaml

This would:

  1. Extract apiserver daemonset manifest + all secrets / configMaps directly from etcd
  2. Place assets on disk (e.g. secrets/configmaps) then place the static pod at target
  3. Wait for self-hosted apiserver to recover, then delete assets off disk?
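
For reference, the raw state such a tool would read lives under the default /registry prefix in etcd. A minimal inspection sketch (assuming etcd3; with the default protobuf storage the values are not directly readable as YAML, so a real tool would decode them with the Kubernetes codecs):

# Sketch of inspecting the state the tool would extract (default /registry prefix assumed):
ETCDCTL_API=3 etcdctl --endpoints=http://foo get /registry/daemonsets/kube-system/kube-apiserver
ETCDCTL_API=3 etcdctl --endpoints=http://foo get /registry/secrets/kube-system/ --prefix --keys-only
ETCDCTL_API=3 etcdctl --endpoints=http://foo get /registry/configmaps/kube-system/ --prefix --keys-only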

The above UI might be trying to overload functionality too much -- we could also make this more explicit rather than generic:

bootkube recover apiserver --etcd-server=http://foo --target=/etc/kubernetes/manifests

Or maybe even be a new option in bootkube-start (instead of --asset-dir):

bootkube start --from-etcd-server=http://foo

Recovery from etcd backup (external etcd)

This would be a similar process to the one above, but following normal etcd-recovery documentation (e.g. starting a new etcd cluster from a backup). The change might be that we need to modify the apiserver manifest if the network addressability of the etcd cluster has changed (this could essentially just be documented).
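
A rough sketch of that flow, assuming an etcd3 snapshot (paths, names, and addresses below are illustrative):

# Restore the snapshot into a fresh data dir on the external etcd host:
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name etcd0 \
  --initial-cluster etcd0=http://10.0.0.5:2380 \
  --initial-advertise-peer-urls http://10.0.0.5:2380 \
  --data-dir /var/lib/etcd

# Start etcd against the restored data dir, then, if the cluster's address changed,
# update the --etcd-servers flag in the kube-apiserver manifest, e.g.:
#   --etcd-servers=http://10.0.0.5:2379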

Recovery from etcd backup (self-hosted etcd)

  1. On a node with the backup, start a single etcd static manifest (similar to the bootstrap etcd pod we run with bootkube start); see the sketch after this list.
  2. Once etcd is running, we extract an api-server manifest (directly from etcd) and start the temporary api-server pod.
  3. Now we have an api-server to interact with (don't need to directly interact with etcd)
  4. We likely need to remove the existing etcd-cluster TPR, then create a new one setting our temporary etcd pod as the seed.
  5. Let the etcd operator pivot the seed into a self-hosted etcd cluster
  6. Recover scheduler / controller-manager similarly to above.
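
For step 1, a sketch of what the temporary single-member etcd static pod might look like (image tag, file names, and ports are illustrative; --force-new-cluster drops the old membership so the restored data dir comes up as a single-member cluster):

# Sketch of step 1: temporary single-member etcd static pod pointed at the restored data dir.
cat <<EOF > /etc/kubernetes/manifests/etcd-recovery.yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd-recovery
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: etcd
    image: quay.io/coreos/etcd:v3.1.8
    command:
    - /usr/local/bin/etcd
    - --name=recovery
    - --data-dir=/var/lib/etcd
    - --listen-client-urls=http://0.0.0.0:2379
    - --advertise-client-urls=http://127.0.0.1:2379
    - --force-new-cluster
    volumeMounts:
    - name: data
      mountPath: /var/lib/etcd
  volumes:
  - name: data
    hostPath:
      path: /var/lib/etcd
EOF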

Pseudo UX:

bootkube recover from-etcd-backup --target=/etc/kubernetes/manifests --datadir=/var/lib/etcd

Or again, could be a new bootkube start option:

bootkube start --etcd-data-dir=/foo

More related discussion: #333 (comment)

@diegs diegs self-assigned this Apr 19, 2017
@coresolve
Contributor

this is an awesome awesome idea!

@aaronlevy
Contributor Author

Adding another failure case to consider:

From @dghubble in #498

Simply deleting the etcd pod in a single master setup (most common) seems to not reconcile and create a new pod, which causes a cluster outage. This is less resilient than the default on-host setup and other control plane pods, which can recover from pod failures. In the on-host case, one can repair any etcd issues (or systemd restarts them) as long as data is still present on disk. Longer discussion: coreos/tectonic-installer#601

Not sure what the best option is here. If the pod is deleted, you're essentially saying it should no longer run on the node.

As I mentioned in the original PR:

the checkpointer looks and says "I was able to contact an api-server, and looks like this pod is no longer scheduled -- it should be removed". Essentially GC (otherwise a checkpoint would exist forever -- even if the parent has successfully moved elsewhere).

We might be able to do something like add a mode for "don't remove even if unscheduled" -- but that could get weird (when should it be removed)?

@hongchaodeng
Contributor

hongchaodeng commented May 10, 2017

Simply deleting the etcd pod in a single master setup

What @dghubble described doesn't make any sense. In other words, is there any real scenario in which that happens?

All system pods would be managed by our control plane. Our bottom line is that whatever error happens, we can still recover everything. But we should not optimize for such an extreme, unrealistic scenario.

@dghubble
Contributor

dghubble commented May 10, 2017

Just because it hasn't been considered doesn't mean it won't happen. This WILL happen to users somewhere, somehow, and there will need to be an answer for how to recover. The only thing you can count on in real-world scenarios is the unexpected. Let's not dismiss this as extreme.

Plenty of resiliency tests simulate a failure by deleting a pod to ensure it recovers. People demo recovery this way as well. And accidents happen. This was nearly the first thing I tried when assessing the resiliency of self-hosted etcd.

Perhaps the better questions here are whether a checkpointed deployment per pod or a StatefulSet could help, or how recovery from the data that exists on disk could be done manually or automatically.

@aaronlevy
Contributor Author

This WILL happen to users somewhere, somehow and there will need to be an answer about how to recover.

Overall I agree, and the recovery tool will support this scenario (point it at an etcd data dir). I think the open question is whether there is some way we might headlessly recover from this scenario (without a user running a recovery step).

One option might be to simply delay the checkpoint garbage collection decision. In this scenario the checkpointer saw that the etcd pod had been removed from api (so GC'd the local checkpoint) - but the etcd pod was still running at that moment (otherwise it wouldn't have been able to contact the api). If we delayed the decision to GC, then we wouldn't remove the etcd checkpoint (because the apiserver would shortly become unavailable).

Essentially we're saying "delay the GC until some window of 'still working'".

whether a checkpointed deployment per pod or a StatefulSet could be of help

Can you expand on this? I don't follow what you mean.
