
Handle volume failures gracefully #339

Open
hakman opened this issue Aug 26, 2020 · 0 comments

hakman commented Aug 26, 2020

In most cases, when a control plane node dies, reboots, or is replaced, the EBS volumes used by etcd-manager are simply detached and attached to the new instance. This is the happy scenario.

There are, however, cases where one of the volumes fails or is deleted (by mistake or during a failure simulation). In that case, etcd-manager does not know what to do and goes into a failure loop. Even adding a new blank volume won't help, because it doesn't know what to do with it.

Reproduce (a CLI sketch follows these steps):

  1. create a cluster with 3 control plane nodes
  2. force delete the EBS volumes of one of the control plane nodes and terminate the node
  3. wait for the replacement node to come up (ASG will do that)
  4. check the error logs of the etcd-manager pods
  5. run kops update cluster --yes to create new blank volumes for etcd
  6. check the error logs to see that the new volumes are attached and formatted, but etcd-manager is still lost
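
A hedged sketch of steps 2 and 5 using the AWS CLI, kops, and kubectl. The instance ID, volume IDs, cluster name, and pod name below are placeholders, and the exact tags on the etcd volumes depend on the cluster configuration.

    # Placeholders -- substitute the real instance ID, volume IDs and cluster name.
    # List the EBS volumes attached to the control plane node being killed.
    aws ec2 describe-volumes \
      --filters "Name=attachment.instance-id,Values=i-0123456789abcdef0" \
      --query 'Volumes[].VolumeId'

    # Force-detach and delete the etcd volumes (each volume must finish
    # detaching before it can be deleted), then terminate the node.
    aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force
    aws ec2 delete-volume --volume-id vol-0123456789abcdef0
    aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

    # After the ASG brings up the replacement node, recreate blank etcd volumes (step 5).
    kops update cluster --name test.my.clusters --yes

    # Inspect the etcd-manager logs on the replacement node (steps 4 and 6).
    kubectl -n kube-system logs etcd-manager-main-<replacement-node-name>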

Workaround (see the command sketch after these steps):

  1. Delete s3://<bucket>/<cluster>/backups/etcd/main/control/etcd-cluster-created
  2. Wait for the main etcd cluster to reset
  3. Restore latest main backup:
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main restore-backup
  4. Delete s3://<bucket>/<cluster>/backups/etcd/events/control/etcd-cluster-created
  5. Wait for the events etcd cluster to reset
  6. Restore latest events backup:
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/events restore-backup
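
A hedged sketch of the workaround as shell commands, assuming the example bucket and cluster name used above (my.clusters / test.my.clusters); the backup name passed to restore-backup is a placeholder taken from the list-backups output.

    # 1.-3. Main cluster: remove the "cluster created" marker, wait for the reset,
    # then restore the most recent backup.
    aws s3 rm s3://my.clusters/test.my.clusters/backups/etcd/main/control/etcd-cluster-created
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main list-backups
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main restore-backup 2020-08-26T00:00:00Z-000001

    # 4.-6. Events cluster: same sequence against the events backup store.
    aws s3 rm s3://my.clusters/test.my.clusters/backups/etcd/events/control/etcd-cluster-created
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/events list-backups
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/events restore-backup 2020-08-26T00:00:00Z-000001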