
Documentation: enumerate self-hosted etcd operator failure scenarios #257

Closed · philips opened this issue Apr 18, 2017 · 10 comments

philips commented Apr 18, 2017

Currently we can bring up self-hosted etcd operator setups, but we don't document failure or recovery scenarios even though we handle many of them.

  • master cluster down and power-on
  • master cluster API server failure
  • disk loss failure of entire master cluster
  • pod checkpoint checkpoints bad versions

cc @jbeda @justinsb @luxas

philips commented Apr 18, 2017

@justinsb what others am I missing?

philips commented Apr 18, 2017

xref kubernetes-retired/bootkube#432

justinsb commented

That's a good start. A few more, off the top of my head:

  • power-off recovery during or soon after etcd2 -> etcd3 upgrade (or any major failure), where "bootstrap" version is older than local version

  • etcd running but not responding to queries (e.g. disk full?)

  • apiserver runs but does not make progress after etcd operation

  • k-c-m runs but does not make progress after etcd operation

  • kube-scheduler runs but does not make progress after etcd operation

  • kubelet upgrade runs but does not make progress after etcd operation

(These are the standard gotchas of self-hosting, but I guess they are particularly likely scenarios around etcd upgrades.)

  • understanding recovery semantics in terms of data loss from catastrophic failure scenarios, and giving users a choice as to whether they prefer downtime or data loss, or at least defining what choice has been made

  • recovery from SSL key expiry

philips commented May 16, 2017

There are recovery tools being built in bootkube now: https://github.com/kubernetes-incubator/bootkube#recover-a-downed-cluster

philips commented May 16, 2017

Quick notes on places to start on writing these docs:

  • master cluster down and power-on: uses pod checkpointing and network checkpointing
  • master cluster API server failure: if non-HA, you need to recover the load balancer or the pods via static pods
  • disk loss failure of entire master cluster: you need to recover from backups, see bootkube recovery
  • pod checkpoint checkpoints bad versions: you need to manually fix the static manifests checkpointed in /etc/kubernetes/inactive-manifests (a rough sketch of inspecting these follows below)
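
A minimal sketch of what that inspection could look like, in Go. The directory path comes straight from the note above, but everything else is an assumption for illustration: the file extensions, the idea of grepping manifests for image references, and the program itself are not part of bootkube, the checkpointer, or the etcd operator.

```go
// inspect-checkpoints.go: hypothetical helper for auditing checkpointed pod
// manifests so an operator can spot a checkpoint that pins a bad version.
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Assumed location of checkpointed manifests, per the list item above.
const checkpointDir = "/etc/kubernetes/inactive-manifests"

func main() {
	err := filepath.Walk(checkpointDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		// Only look at manifest files (the extension filter is an assumption).
		if info.IsDir() || !(strings.HasSuffix(path, ".yaml") || strings.HasSuffix(path, ".json")) {
			return nil
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()

		fmt.Println(path)
		scanner := bufio.NewScanner(f)
		for scanner.Scan() {
			line := strings.TrimSpace(scanner.Text())
			// Surface container image references so a stale or broken version
			// is easy to spot before hand-editing or removing the manifest.
			if strings.Contains(line, "image:") || strings.Contains(line, `"image"`) {
				fmt.Println("   ", line)
			}
		}
		return scanner.Err()
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```

The actual fix is still manual: edit or delete the offending checkpointed manifest. This only makes it easier to see which versions are pinned on disk.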

xiang90 commented May 16, 2017

@philips

I agree it is a good idea to write docs about handling the failure cases for self-hosted etcd. The items you listed are a good start! I believe the doc should focus on the differences between self-hosted and external etcd, and highlight the potential risks self-hosted etcd might introduce and how we solve them.

@justinsb

A lot of the items you listed are not really specific to self-hosted etcd, in my opinion:

  • etcd running but not responding to queries (e.g. disk full?)
  • apiserver runs but does not make progress after etcd operation
  • k-c-m runs but does not make progress after etcd operation
  • kube-scheduler runs but does not make progress after etcd operation
  • kubelet upgrade runs but does not make progress after etcd operation
  • recovery from SSL key expiry

If you manually operate etcd, you might have these issues too. They are not introduced by self-hosted etcd.

  • power-off recovery during or soon after etcd2 -> etcd3 upgrade (or any major failure), where "bootstrap" version is older than local version

  • understanding recovery semantics in terms of data loss from catastrophic failure scenarios, and giving users a choice as to whether they prefer downtime or data loss, or at least defining what choice has been made

These two are relevant. We will cover them when writing the doc. But I would suggest you give self-hosted etcd a try if you are interested, so we can discuss in more depth.

@sym3tri sym3tri modified the milestones: Sprint 2: Overall cleanup and stability, Sprint 3: Continued Test Automation May 23, 2017
zbwright commented Jun 6, 2017

@radhikapc is working on etcd docs now. Assigning to her, @xiang90.

@zbwright zbwright assigned radhikapc and xiang90 and unassigned xiang90 and zbwright Jun 6, 2017
Quentin-M commented

Hi @radhikapc, any update on that one?

xiang90 commented Jun 21, 2017

@Quentin-M Hongchao or I need to write something similar to https://github.com/coreos/etcd/blob/master/Documentation/op-guide/failures.md. Then @radhikapc can start to help clean things up. We will get started after finishing up the TLS work.

radhikapc pushed a commit to radhikapc/tectonic-installer that referenced this issue Jun 28, 2017
@sym3tri sym3tri modified the milestones: Sprint 4, Sprint 5 Jun 30, 2017
@sym3tri sym3tri removed this from the Sprint 5 milestone Aug 23, 2017
@sym3tri sym3tri closed this as completed Sep 5, 2017

justinsb commented Sep 5, 2017

Where was this moved to?
