
Handle volume failures gracefully #339

Open
hakman opened this issue Aug 26, 2020 · 0 comments

hakman commented Aug 26, 2020

In most cases, when a control plane node dies, reboots, or is replaced, the EBS volumes used by etcd-manager are simply detached and attached to the new instance. This is the happy scenario.

There are, however, cases where one of the volumes fails or is deleted (by mistake or during a failure simulation). In that case, etcd-manager does not know what to do and goes into a failure loop. Even adding a new blank volume won't help, because it doesn't know what to do with it.

Reproduce (a CLI sketch follows these steps):

  1. create a cluster with 3 control plane nodes
  2. force delete the EBS volumes of one of the control plane nodes and terminate the node
  3. wait for the replacement node to come up (ASG will do that)
  4. check the error logs of the etcd-manager pods
  5. run kops update cluster --yes to create new blank volumes for etcd
  6. check the error logs to see that the new volumes are attached and formatted, but etcd-manager is still lost
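
A hedged sketch of steps 2 and 5 using the AWS CLI, kops, and kubectl. The instance ID, volume IDs, cluster name, and pod name below are placeholders, and the exact tags on the etcd volumes depend on the cluster configuration.

    # Placeholders -- substitute the real instance ID, volume IDs and cluster name.
    # List the EBS volumes attached to the control plane node being killed.
    aws ec2 describe-volumes \
      --filters "Name=attachment.instance-id,Values=i-0123456789abcdef0" \
      --query 'Volumes[].VolumeId'

    # Force-detach and delete the etcd volumes (each volume must finish
    # detaching before it can be deleted), then terminate the node.
    aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force
    aws ec2 delete-volume --volume-id vol-0123456789abcdef0
    aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

    # After the ASG brings up the replacement node, recreate blank etcd volumes (step 5).
    kops update cluster --name test.my.clusters --yes

    # Inspect the etcd-manager logs on the replacement node (steps 4 and 6).
    kubectl -n kube-system logs etcd-manager-main-<replacement-node-name>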

Workaround (see the command sketch after these steps):

  1. Delete s3://<bucket>/<cluster>/backups/etcd/main/control/etcd-cluster-created
  2. Wait for the main etcd cluster to reset
  3. Restore latest main backup:
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main restore-backup
  4. Delete s3://<bucket>/<cluster>/backups/etcd/events/control/etcd-cluster-created
  5. Wait for the events etcd cluster to reset
  6. Restore latest events backup:
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/events restore-backup
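
A hedged sketch of the workaround as shell commands, assuming the example bucket and cluster name used above (my.clusters / test.my.clusters); the backup name passed to restore-backup is a placeholder taken from the list-backups output.

    # 1.-3. Main cluster: remove the "cluster created" marker, wait for the reset,
    # then restore the most recent backup.
    aws s3 rm s3://my.clusters/test.my.clusters/backups/etcd/main/control/etcd-cluster-created
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main list-backups
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main restore-backup 2020-08-26T00:00:00Z-000001

    # 4.-6. Events cluster: same sequence against the events backup store.
    aws s3 rm s3://my.clusters/test.my.clusters/backups/etcd/events/control/etcd-cluster-created
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/events list-backups
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/events restore-backup 2020-08-26T00:00:00Z-000001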