This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

Persistent/Durable etcd cluster #1323

Open
sgotti opened this issue Jul 25, 2017 · 23 comments · May be fixed by #2097

Comments

@sgotti
Member

sgotti commented Jul 25, 2017

I'd like to be able to use etcd-operator to run a persistent/durable etcd cluster. I'd like to avoid, as much as possible, having to restore from a backup just because a majority of members died at the same time (which can happen for many unpredictable reasons); restoring should only be necessary when I know the members will never come back.
If I know they will come back (rescheduled by etcd-operator), I'd rather lose availability for a while and wait for them to return.

Right now it looks like etcd-operator cannot achieve this. From the README.md:

If the majority of etcd members crash, but at least one backup exists for the cluster, the etcd operator can restore the entire cluster from the backup.

I did a little analysis on how etcd-operator works in this post https://sgotti.me/post/kubernetes-persistent-etcd/

Today etcd-operator schedules pods directly (acting as a custom controller) and uses a k8s emptyDir volume for member data. Every time a pod dies (or is manually deleted), a new replacement pod is created and the old member is removed from the etcd cluster. When a majority of pods dies at the same time, the cluster cannot be recovered for two primary reasons:

  • the etcd member data is not available anymore
  • the replacement pods have different IP addresses, and you cannot update the peer member URLs in a cluster that has lost quorum.

To fix these points I can see the solutions below, although they will probably require a big change in the current etcd-operator architecture:

Use "persistent" etcd member data.

  • Use persistent volumes so the etcd member data is not lost after a pod dies (or is deleted).
  • When a pod crashes or is deleted, etcd-operator should create a replacement pod that reuses the PV previously associated with the dead pod (see the sketch below).
  • As an additional note, using persistent volumes will require at-most-once scheduling as done by a k8s statefulset (see Proposal: Pod Safety Guarantees kubernetes/community#124) to avoid possible data corruption when the PV provider does not itself guarantee that a PV is mounted on only one node at a time.
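
For illustration, a minimal Go sketch of what "reusing the PV of the dead pod" could look like, assuming a stable claim name derived from a per-member ordinal (the naming scheme and the volume name "etcd-data" are assumptions, not what etcd-operator does today):

```go
// Sketch: give each member a stable PVC and mount it instead of an emptyDir,
// so a replacement pod for the same member reuses the same data directory.
package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// dataClaimName returns a stable claim name per member ordinal, e.g. "etcd-0-data".
func dataClaimName(clusterName string, ordinal int) string {
	return fmt.Sprintf("%s-%d-data", clusterName, ordinal)
}

// dataVolume builds the pod volume that would replace the current emptyDir volume.
func dataVolume(clusterName string, ordinal int) corev1.Volume {
	return corev1.Volume{
		Name: "etcd-data",
		VolumeSource: corev1.VolumeSource{
			PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
				ClaimName: dataClaimName(clusterName, ordinal),
			},
		},
	}
}
```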

Persistent peer URL addresses.

The cluster peerURLs contain the pod names, so the list changes every time a replacement pod is created (pods are numbered with increasing values starting from 0000).
When a replacement pod is created it gets a new FQDN, so etcd-operator removes the old etcd member and adds a new one. This only works when a minority of the etcd members is lost; if you lose a majority, the cluster has no quorum and won't accept a member update. Using stable peer URL addresses instead (something like etcd-$i with i from 0 to cluster size - 1) avoids this problem.
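
As a hedged sketch of what stable peer URLs could look like (the headless service name, namespace and DNS scheme below are assumptions borrowed from the statefulset convention, not existing etcd-operator behaviour):

```go
// Sketch: build a stable --initial-cluster value from fixed member names
// (etcd-0 .. etcd-<size-1>) resolved through a headless service, so peer URLs
// survive pod replacement.
package sketch

import (
	"fmt"
	"strings"
)

// initialCluster returns something like
// "etcd-0=http://etcd-0.etcd.default.svc:2380,etcd-1=http://etcd-1.etcd.default.svc:2380,...".
func initialCluster(clusterName, namespace string, size int) string {
	peers := make([]string, 0, size)
	for i := 0; i < size; i++ {
		member := fmt.Sprintf("%s-%d", clusterName, i)
		peers = append(peers,
			fmt.Sprintf("%s=http://%s.%s.%s.svc:2380", member, member, clusterName, namespace))
	}
	return strings.Join(peers, ",")
}
```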

Some options for "persistent" network names could be:

  • use a headless service with a selector, as etcd-operator already does, but change the pod naming to be persistent (not a number that increases with every new pod)
  • use a headless service without a selector, keep the current increasing pod naming logic, and manually create endpoints that resolve fixed hostnames (etcd-0, etcd-1, etc.)

About this last point: before etcd-io/etcd#6336 you were able to provide a peer URL containing a domain name. So instead of "persistent" peer IP addresses you could use "persistent" network names, for example by creating a headless service with an endpoint for every member pod that resolves a fixed member name (etcd-0, etcd-1, etc.) to the IP of the current pod (as a k8s statefulset does).

But since etcd-io/etcd#6336 a peer URL only accepts an IP address, so the solution above won't work. Another solution would be to define a service with a cluster IP for every member pod, with a label selector pointing to just that pod, and use these service IPs in the cluster peer list; they never change (except when resizing the cluster).
Some possible downsides are that the etcd member packets now need to pass through kube-proxy (though with the default iptables-based kube-proxy the overhead should be negligible) and that pod1 -> service -> pod1 packets require enabling kubelet hairpin mode (see a better explanation in the post above).
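
A rough sketch of the per-member ClusterIP service idea, assuming a hypothetical "etcd_member" label to select exactly one pod (not the labels etcd-operator actually sets):

```go
// Sketch: one ClusterIP Service per member; the service IP stays fixed for the
// member's lifetime, so it can be used in peer URLs even when the backing pod
// (and its pod IP) is replaced.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func memberService(namespace, memberName string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: memberName, Namespace: namespace},
		Spec: corev1.ServiceSpec{
			// Selector matches exactly one member pod carrying this label.
			Selector: map[string]string{"etcd_member": memberName},
			Ports: []corev1.ServicePort{
				{Name: "client", Port: 2379},
				{Name: "peer", Port: 2380},
			},
		},
	}
}
```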

If these solutions look impractical or unclean, another option would be to add a way in etcd to force a peerURL update even when the cluster has no quorum (to be done when all members are stopped?), which etcd-operator could use when it has to replace a majority of pods.

@xiang90
Collaborator

xiang90 commented Jul 25, 2017

@sgotti

(this could happen a lot of times for a lot of unpredictable reasons).

Frequent majority loss MUST be avoided. Or you should NOT use etcd at all. etcd is not designed for handling frequent majority loss.

Also when etcd loses majority, etcd operator will try to seed from one existing member's data if possible. Or it will try to recover from existing backups if any. An on-going effort is to support continuous backup.

Remember that a PV is just another cloud-managed backup for etcd. It is not magic. I am fine with adding PV support; it should not be a hard feature to add. A PV just makes things a little easier to reason about and makes management easier. However, it is not the best way to run etcd.

etcd definitely supports domain addresses, and in etcd-operator we use FQDNs all over the place. The issue you pointed out has nothing to do with peerURL updates; listen URLs and advertise URLs are different things. I do not think you understand it well.

@xiang90
Collaborator

xiang90 commented Jul 25, 2017

@sgotti

Do you mind explaining your use case? When people want a PV, usually creating a one-member etcd deployment in k8s with a PV is good enough for them. Clustering might not even be needed.

@sgotti
Member Author

sgotti commented Jul 25, 2017

Frequent majority loss MUST be avoided. Or you should NOT use etcd at all. etcd is not designed for handling frequent majority loss.

I didn't mean frequent as in N times a week, but over a year of running production infrastructure this could happen a few times. If my DC has a major problem, or during an upgrade someone shuts down some k8s nodes hosting most of the etcd pods, I'd like my services to restart cleanly once everything is back up, without needing to recover from a backup.

Also when etcd loses majority, etcd operator will try to seed from one existing member's data if possible

That's a good thing, but I haven't noticed this behavior. I'll test this more carefully (and look deeper at the code), since the README doesn't say this (is it documented somewhere?).

An on-going effort is to support continuous backup.

This would be a great thing. I imagine it will be continuous but asynchronous, right?

etcd definitely supports domain addresses, and in etcd-operator we use FQDNs all over the place. The issue you pointed out has nothing to do with peerURL updates; listen URLs and advertise URLs are different things. I do not think you understand it well.

D'oh! It looks like I was wrong and overlooked etcd-io/etcd#6336: I got errors when putting an FQDN in --initial-advertise-peer-urls, but it was probably the domain name in the listen peer URL that caused them...

So this will make it easier to implement the second part of my proposal (I updated it), since you can provide persistent domain names in the peerURLs.

Do you mind explaining your use case?

stolon saves the cluster state (the main information being the current pg primary/master instance) inside etcd. Restoring the etcd cluster from a backup means restoring an old cluster state; if a new master was elected between the backup and the disaster, this causes problems.

I'll try another example: consider the etcd cluster backing the k8s API. What would you do if all 3 nodes restart? I'd personally just wait for them to come back: the cluster keeps working (you just can't make changes), the etcd members have persistent data, and when they come back the cluster becomes functional again (assuming I haven't permanently destroyed a majority of the nodes). I wouldn't create a new cluster and restore a backup if I could avoid it, since I could end up in ugly situations if changes were made between the backup and the reboot.

I'd like to achieve the same with an etcd cluster inside k8s using the etcd-operator.

When people want a PV, usually creating a one-member etcd deployment in k8s with a PV is good enough for them

I can already run a multi-member etcd cluster inside k8s without etcd-operator, using one or more statefulsets instead. I was just proposing to make etcd-operator able to achieve the same goal while keeping all the other great etcd-operator features.

Regarding the single-member etcd cluster suggestion: unfortunately, with the current state of k8s, using a single member with a PV means that if the k8s node dies you have to wait several minutes for it to be declared dead, for the PV to be detached from that node (if using a block-device-based PV like AWS EBS), and for it to be attached to a new node. In addition, if detaching fails for some reason you'll need to wait even longer or perform manual operations (there are many other possible cases depending on how you deploy your k8s cluster and the underlying storage used for PVs).

@xiang90
Collaborator

xiang90 commented Jul 25, 2017

I didn't mean frequent as in N times a week, but over a year of running production infrastructure this could happen a few times. If my DC has a major problem, or during an upgrade someone shuts down some k8s nodes hosting most of the etcd pods, I'd like my services to restart cleanly once everything is back up, without needing to recover from a backup.

Affinity is the way to solve this.

I'll try another example: consider the etcd cluster backing the k8s API. What would you do if all 3 nodes restart? I'd personally just wait for them to come back: the cluster keeps working (you just can't make changes), the etcd members have persistent data, and when they come back the cluster becomes functional again

This is a totally different story. etcd-operator does not have to remove the pod immediately; it is about how long we want to wait for the pod to come back. There is no perfect solution for this. After adding local volume support, we can let users configure how long they want to wait.

I can already run a multi-member etcd cluster inside k8s without etcd-operator, using one or more statefulsets instead. I was just proposing to make etcd-operator able to achieve the same goal while keeping all the other great etcd-operator features.

There are tons of things your setup does not handle, I assume, or it would become another etcd operator. The complexity is not really about the initial deployment; it is about the ongoing maintenance, failure detection, and failure handling. For example, how do you add a member to scale up the cluster? How do you back up the cluster to recover from a bad state? A statefulset is not flexible enough to achieve quite a few of these things easily, and the benefits it brings right now are not significant.

if the k8s node dies you have to wait several minutes for it to be declared dead

If you want to cycle it faster, you can write a simple monitoring script to do it. k8s will still handle the PV for you. If you do not trust PVs, then 3 nodes with PVs won't help either.

After reading through your opinions and use case, I feel all you want is PV support. I am fine with adding this feature; all we need to do is change the pod restart policy to Always and change emptyDir to a PV initially. @sgotti it would be great if you could work on it.

@sgotti
Member Author

sgotti commented Jul 28, 2017

There are tons of things your setup does not handle, I assume, or it would become another etcd operator.

Right. That's the reason I would like to improve etcd-operator to handle cases where you lose a majority of members without being forced to restore from a backup 😄

After reading through your opinions and use case, I feel all you want is PV support. I am fine with adding this feature; all we need to do is change the pod restart policy to Always and change emptyDir to a PV initially. @sgotti it would be great if you could work on it.

Correct me if I'm wrong, but I'm not sure that just adding PV support on top of the current etcd-operator pod management logic will be enough; that's why my proposal feels a bit invasive, as it tries to change the etcd-operator pod management logic to something similar to that of statefulsets. Let me explain:

If we keep the current operator logic (even with restartPolicy Always): if node01, which is running etcd-0000, dies or is partitioned for some minutes, the node controller starts pod eviction, marking all its pods for deletion (the deletion is blocked while node01 is considered partitioned). When node01 comes back or is permanently removed from the cluster (or we force-delete the pod etcd-0000 with grace-period=0), the pod is deleted, so etcd-operator will schedule a new pod (etcd-0004). (To be sure, I just tried this and I see this behavior on k8s 1.7.)

If we just add the ability to define PVs (I think using something like a PVC template so we can handle dynamic provisioning, multi-zone, etc.) that get attached to the etcd pods (say etcd-0000 to etcd-0003), then in the example above etcd-0004 will get a new PV, not the one previously attached to etcd-0000. So I don't see the difference or the gain in using a PV this way.

With the statefulset logic, instead, you always have pods with fixed names and fixed PVs. If a pod is deleted from a node, the statefulset controller recreates a pod with the same name and the same PV, which can be scheduled on another node.
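
To make the comparison concrete, here is a hedged sketch of that statefulset-like reconcile loop: member and claim names are fixed by ordinal, so a recreated pod always gets the same identity and the same PV (podExists, claimExists, createClaim and createMemberPod are hypothetical helpers, not etcd-operator functions):

```go
// Sketch: statefulset-style reconciliation with fixed names and fixed PVCs.
package sketch

import "fmt"

func reconcileMembers(clusterName string, size int) error {
	for i := 0; i < size; i++ {
		member := fmt.Sprintf("%s-%d", clusterName, i)
		claim := member + "-data"
		if !claimExists(claim) {
			if err := createClaim(claim); err != nil {
				return err
			}
		}
		if !podExists(member) {
			// Recreate the pod with the same name, mounting the same claim:
			// the member's data and peer URL stay unchanged.
			if err := createMemberPod(member, claim); err != nil {
				return err
			}
		}
	}
	return nil
}

// Hypothetical stand-ins for calls into the Kubernetes API.
func podExists(name string) bool              { return false }
func claimExists(name string) bool            { return false }
func createClaim(name string) error           { return nil }
func createMemberPod(pod, claim string) error { return nil }
```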

This is a totally different story. etcd operator does not have to remove the pod immediately. It is about how long we want to wait the pod to come back.

As above: when a node dies or is partitioned (for whatever reason), its pods are automatically marked for deletion by the node controller, so etcd-operator, currently, will create a new pod with a different name.

@xiang90
Collaborator

xiang90 commented Jul 28, 2017

@sgotti

I do not think it is as complicated as you described. There are only four cases:

  1. pod restart - still use previous PV
  2. pod movement - detach the PV from the pod being moved (or dead) and reattach that PV to the new pod
  3. pod addition due to scale up - create a new PV
  4. pod deletion due to scale down - delete the old PV

We can start with 1 as I mentioned:

all we need to do is change the pod restart policy to Always and change emptyDir to a PV initially

Some membership updates might be involved for 2, 3 and 4, but I don't think any of them are complicated.
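
For case 1, a minimal sketch of the change described above, assuming the data volume is named "etcd-data" (the volume and claim names are illustrative assumptions):

```go
// Sketch: keep the pod where it is, switch its restart policy to Always, and
// replace the emptyDir data volume with a PVC reference.
package sketch

import corev1 "k8s.io/api/core/v1"

func usePersistentData(spec *corev1.PodSpec, claimName string) {
	spec.RestartPolicy = corev1.RestartPolicyAlways
	for i := range spec.Volumes {
		if spec.Volumes[i].Name == "etcd-data" {
			spec.Volumes[i].VolumeSource = corev1.VolumeSource{
				PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
					ClaimName: claimName,
				},
			}
		}
	}
}
```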

@xiang90
Collaborator

xiang90 commented Jul 28, 2017

@sgotti

As above: when a node dies or is partitioned (for whatever reason), its pods are automatically marked for deletion by the node controller, so etcd-operator, currently, will create a new pod with a different name.

This is really different from what you originally described. And, yes, I am aware of this. I do not think it changes anything I described; this is just another case of a "forced" pod movement.

@sgotti
Member Author

sgotti commented Jul 31, 2017

I do not think it is as complicated as you described. There are only four cases:

  1. pod restart - still use previous PV
  2. pod movement - detach the PV from the pod being moved (or dead) and reattach that PV to the new pod
  3. pod addition due to scale up - create a new PV
  4. pod deletion due to scale down - delete the old PV

Yeah, I think these cases will help achieve the goal. I started implementing 1, plus the changes needed in the reconcile logic for 2, 3 and 4, just to check that everything fits together. I'll open an RFC PR for 1 in the next days (or next week), since I'd like to see if you agree with some of the choices.

@xiang90
Collaborator

xiang90 commented Jul 31, 2017

@sgotti That is great! Thanks!

@hongchaodeng
Member

hongchaodeng commented Nov 21, 2017

xref: #1434

@hongchaodeng
Member

hongchaodeng commented Nov 24, 2017

Plan:

Let me clarify the four points in the plan:

    1. Add a PVC for each etcd pod. This is basically what is achieved in Add PersistVolumne Support #1861.
    2. When a node restart happens, the current etcd pods would fail and won't come up anymore. Besides changing the restart policy to "Always", the following cases need to be addressed (see the sketch below):
    • 2.1. The kubelet can't report to the master and the node becomes "unready". etcd-operator would need to tolerate this case for some period of time. If the node restarts, the etcd pods restart, and everything returns to a healthy state, then etcd-operator wouldn't do anything. Otherwise, etcd-operator needs to treat the node, and thus the etcd pods on it, as unhealthy and do healing.
    • 2.2. The previous case assumes the etcd pods still exist. If not, e.g. they were evicted, we handle it like case 4.
    3. If a node partition happens, it could hit case 2.1 (the node becomes "unready") or case 2.2 (the etcd pods get deleted). The same solutions apply, but we need to think about and test this case.
    4. Node eviction happens and the etcd pods get deleted. Let's assume the PVs are still there. In this case, etcd-operator should have knowledge of the current membership, reschedule the "failed" members onto healthy nodes, and mount their corresponding PVs back; then everything should be fine because network identity and storage are still the same.
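
A hedged sketch of the tolerance window in case 2.1 (the configurable tolerance is an assumption; etcd-operator has no such knob today):

```go
// Sketch: treat a member on a NotReady node as failed only after a
// configurable tolerance has elapsed, instead of healing immediately.
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// memberNeedsHealing reports whether the member's node has been not ready for
// longer than tolerance, measured at time now.
func memberNeedsHealing(node *corev1.Node, tolerance time.Duration, now time.Time) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
			return now.Sub(cond.LastTransitionTime.Time) > tolerance
		}
	}
	return false
}
```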

@hongchaodeng hongchaodeng modified the milestones: 0.7.1, 0.7.2 Dec 6, 2017
@hongchaodeng hongchaodeng modified the milestones: 0.8.0, 0.8.1 Jan 3, 2018
@alexandrem

I'm very interested in this feature for more resiliency under extreme outages.

I'd rather lose some availability and have everything come back automatically without losing any data than have to load backups manually.

I tested the patch in @hongchaodeng's branch merged with master yesterday. All seems good so far.

I'm interested in how volumes are handled for pod replacement during the restart/partition/eviction cases described above.

Is the milestone 0.8.1 still accurate?

@hongchaodeng
Member

Note
Previously, we couldn't set a non-root user for the container because the persistent volume would belong to the root user. See #806.

A new finding suggests that we can use fsGroup and access the mounted dir as a non-root user. See https://kubernetes.io/docs/tasks/configure-pod-container/security-context/.
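
For illustration, a minimal sketch of that security context (the UID/GID values are arbitrary examples):

```go
// Sketch: run etcd as a non-root user and rely on fsGroup so the kubelet makes
// the mounted PV group-writable for that GID.
package sketch

import corev1 "k8s.io/api/core/v1"

func nonRootSecurityContext() *corev1.PodSecurityContext {
	uid := int64(1000)
	gid := int64(1000)
	nonRoot := true
	return &corev1.PodSecurityContext{
		RunAsUser:    &uid,
		RunAsNonRoot: &nonRoot,
		// Volumes that support ownership management are chgrp'ed to FSGroup,
		// so the non-root etcd process can write to the mounted data dir.
		FSGroup: &gid,
	}
}
```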

@hongchaodeng
Member

@alexandrem
Sorry that I overlooked your comment in the flood of notifications.

I would definitely encourage you to test those failure scenarios in order to prove that PVs fix those issues. If any functionality is missing, please let us know and we will be more than glad to merge any fix.

I would expect this feature to be a stable release blocker.

@alexandrem

PVs helped a bit under those stress tests, but didn't solve the problem entirely.

Another thing that is required when using the PV feature is to change the restartPolicy of the member pods to Always. Otherwise, the operator will attempt to replace the pods, and obviously we don't have the logic to move volumes around yet.

I decided to actually replace the etcd-operator with a statefulset implementation for my specific use case.

I still believe that PV is an interesting addition to the operator though.

@hongchaodeng
Member

hongchaodeng commented Jan 20, 2018

Another thing that is required when using the PV feature is to change the restartPolicy of the member pods to Always.

Yes. We are aware of this. See #1861 (comment).
There is more logic to be implemented in order to manage PVs/local PVs to best fit the etcd lifecycle.

I decided to actually replace the etcd-operator with a statefulset implementation for my specific use case.

What's your specific reason for switching to a statefulset? What advantages does a statefulset provide? We would like to have issues to track that.

@alexandrem

My principal use case is hosting Kubernetes control plane components on a Kubernetes cluster (kubeception-like).

We want to offer managed Kubernetes clusters on demand to different teams inside our company. I have built a solution that performs lifecycle operations on Kubernetes cluster resources, hosted on a global cluster, via an API. We don't have a global shared Kubernetes cluster for everyone; we instead want to offer dedicated clusters on demand, something similar to GKE.

One of those master components is obviously the etcd cluster. We need persistent and very resilient etcd clusters. We can't afford to lose data; it would be catastrophic for users to have an outage and have all their member pods removed, since they would lose their Kubernetes cluster entirely. Obviously we need backups, but we also need to automate the whole thing as much as possible. I could end up hosting hundreds of clusters and cannot afford to manually restore the cluster state for each of them if something goes wrong.

A second use case is hosting the Quay docker registry on-prem using a PostgreSQL database on Kubernetes. There is a database proxy system, via stolon, that uses an etcd cluster to do leader election and route to the master database member. If the etcd cluster is unavailable, no access to the database is possible and this creates a global disruption of service.

Lately, we suffered a networking outage which impacted the etcd clusters managed by the operator for both of those use cases. We are currently hosting this in a private OpenStack cloud, and a bad configuration in the networking layer on the hypervisors was pushed for a short time. This created a global outage for both Kubernetes and the etcd-operator clusters hosted on top of it.

When the etcd cluster loses its quorum, bad things ensue and the etcd member pods get deleted after a few minutes. At this point, nothing recovers the cluster automatically.

This was tested on 0.5+ up to 0.7.x.

Fortunately, no production Kubernetes clusters were impacted (all of this is very experimental so far), but there was a big outage on the docker registry instance and manual intervention was required to recover it from backups.

I have found that using a statefulset with PVs is more resilient: it always recovered automatically following either partial or global networking outages.

I think there is a lot of room for improvement in the way etcd-operator handles split-brain scenarios.

Losing a quorum of members translates into the cluster being considered dead, and then member deletions happen. I believe another problem can arise if only the etcd-operator is separated by a network split. If it cannot communicate with the kube-apiserver, I think it might fall into a logic loop where it considers the cluster dead, attempts to issue a few pod delete operations, and then, when kube-apiserver communication is restored, still proceeds to delete the cluster regardless. I would need to double-check this particular case in the code.

I believe it would help to introduce configurable strategies for handling split-brain scenarios in the operator (new fields in the cluster resource definition). For instance, if members are disrupted, one strategy could be to not delete pods beyond the quorum size.

Akka has implemented and documented these strategies; maybe it's something we could draw inspiration from. My strategy above is essentially what they call "static quorum".

https://developer.lightbend.com/docs/akka-commercial-addons/current/split-brain-resolver.html
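
As a hedged sketch of that "static quorum" strategy (an illustration only, not etcd-operator behaviour):

```go
// Sketch: refuse to delete a member pod if doing so would drop the number of
// healthy members below the quorum of the configured cluster size.
package sketch

func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// mayDeleteMember reports whether removing one of the currently healthy
// members still leaves the cluster at or above quorum.
func mayDeleteMember(clusterSize, healthyMembers int) bool {
	return healthyMembers-1 >= quorum(clusterSize)
}
```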

@hongchaodeng
Member

@alexandrem

I think your use cases fit what etcd-operator is designed to do. I'm sorry that you ran into trouble and we didn't resolve it in time. Last week we made a step forward in #1323 (comment). The use cases and issues you described are in progress right now. Please keep following and share feedback; I would love to hear it all.

I believe another problem can arise if only the etcd-operator is separated by a network split.

When it loses its connection to the apiserver, etcd-operator won't do anything. Once connected again, it compares the current state with the desired state and reconciles.

I believe it would help to introduce configurable strategies for handling split-brain scenarios in the operator (new fields in the cluster resource definition). For instance, if members are disrupted, one strategy could be to not delete pods beyond the quorum size.

I don't understand what split-brain issues you found; it sounds orthogonal to this issue. Could you open a new issue and describe it in more detail?

@raoofm

raoofm commented Feb 21, 2018

@hongchaodeng @xiang90 just wanted to track this. Is it on the internal roadmap with a timeline?

@akauppi

akauppi commented Jun 27, 2018

@xiang90 wrote 11 months ago:

After adding local volume support, we can let users configure how long they want to wait.

Did this happen?

@hongchaodeng You wrote:

Last week we just made a step forward in #1323 (comment).

...but this is #1323. Do you remember from January whether you intended to refer to some other issue/PR?

@JohnStrunk

I would like to express interest in seeing etcd w/ durability.
The glusterd2 project and associated operator will be using an etcd cluster (via the etcd operator) for maintaining the configuration state of the Gluster storage system when deployed in kube environments. As such, if etcd loses the storage cluster's configuration state, user data may not be recoverable.

cc: @robszumski @atinmu @kshlm

@davinchia

davinchia commented Sep 19, 2018

Can someone provide a brief update on the status? I've been reading through the comments and linked issues and am confused about where things stand now. It seems the PV changes have been merged but not included in a stable release, and the logic for handling all three failure cases doesn't seem to be finished. I'm trying to get a sense of stability to better assess whether etcd-operator is a good fit for our use case.

@rusenask

rusenask commented Nov 9, 2018

Not really usable in production. It doesn't do much beyond deploying etcd; if something goes wrong (and it will), it won't do anything about it. I initially thought the idea behind having an operator was that it would try to bring the cluster back online.

xiwenc added a commit to xiwenc/etcd-operator that referenced this issue Jun 20, 2019
This is a simple fix that addresses Case C from
https://github.com/coreos/etcd-operator/blob/master/doc/design/persistent_volumes_etcd_data.md

It makes the etcd cluster with PVC able to recover from full k8s cluster
outage. This fixes coreos#1323 inspired by coreos#1323 (comment)
xiwenc added a commit to xiwenc/etcd-operator that referenced this issue Jun 20, 2019
This is a simple fix that addresses Case C from
https://github.com/coreos/etcd-operator/blob/master/doc/design/persistent_volumes_etcd_data.md

It makes the etcd cluster with PVC able to recover from full k8s cluster
outage. Inspired by coreos#1323 (comment)

Fixes coreos#1323