This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

[WIP] *: Add initial PersistentVolume feature #1349

Closed
wants to merge 1 commit

Conversation

sgotti
Member

@sgotti sgotti commented Aug 7, 2017

This patch adds an initial persistent volume feature for etcd data.

When the usePV option is enabled in the etcd cluster spec, every
created pod will get a related persistent volume claim and the pod
restart policy will be set to Always.

This initial patch only covers the case where a pod crashes (or a node
restarts before the related pod is evicted by the k8s node
controller). When using a persistent volume, instead of deleting the pod
and creating a new one, the pod will be restarted by k8s.

A future patch will change the reconcile logic to also handle cases
where one or more pods are deleted (manually or by k8s), recreating the
same pod using the same pvc.

The persistent volume claim will use the same pv provisioner defined for the
backup pvc.

The etcd data pvc will be garbage collected like the other cluster
resources.
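
For illustration, a minimal sketch (not taken from the patch itself) of what this description amounts to: a member pod whose etcd data dir is backed by a per-member PVC and whose restart policy lets the kubelet restart it in place. The helper name, mount path, and image tag here are assumptions.

package k8sutil

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// etcdPodWithPV sketches a member pod whose data dir is backed by a PVC and
// that is restarted in place by the kubelet instead of being deleted and
// recreated by the operator.
func etcdPodWithPV(memberName, pvcName, namespace string) *corev1.Pod {
    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: memberName, Namespace: namespace},
        Spec: corev1.PodSpec{
            // Restart on crash or node reboot without operator intervention.
            RestartPolicy: corev1.RestartPolicyAlways,
            Containers: []corev1.Container{{
                Name:  "etcd",
                Image: "quay.io/coreos/etcd:v3.1.8",
                VolumeMounts: []corev1.VolumeMount{{
                    Name:      "etcd-data",
                    MountPath: "/var/etcd", // assumed data dir
                }},
            }},
            Volumes: []corev1.Volume{{
                Name: "etcd-data",
                VolumeSource: corev1.VolumeSource{
                    PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
                        ClaimName: pvcName,
                    },
                },
            }},
        },
    }
}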

Questions:

  • I'm using the same backup storage class (the name isn't great since it starts with etcd-backup-). Should we define different storage classes using the same provisioner, or permit different provisioners for backup and data?

  • I'll add some questions inline in the code.

TODO:

  • Check all the tests on your infrastructure (I'm testing on a local k8s cluster, disabling tests that need S3 and updating the generated storage class with another pv provisioner).

@@ -28,6 +28,8 @@ import (
const (
defaultBaseImage = "quay.io/coreos/etcd"
defaultVersion = "3.1.8"

minPVSize = 512 // 512MiB
Member Author

Is this a sane min size default?

Collaborator

seems ok
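
For context, a tiny sketch of how a minimum like this would typically be applied when the requested size is below it; the helper name is an assumption, and the field follows the pvPolicy proposal discussed later in this PR.

const minPVSize = 512 // 512MiB

// defaultPVSizeInMB clamps a requested data-volume size to the minimum.
// A zero value (field left unset) also falls back to the default.
func defaultPVSizeInMB(requested int) int {
    if requested < minPVSize {
        return minPVSize
    }
    return requested
}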

},
},
Spec: v1.PersistentVolumeClaimSpec{
StorageClassName: &storageClassName,
Member Author

Here I'm using Spec.StorageClassName instead of the old annotation volume.beta.kubernetes.io/storage-class that's used in the backup pvc: https://github.com/coreos/etcd-operator/blob/master/pkg/util/k8sutil/backup.go#L74 . Should this be updated?
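
For reference, a sketch contrasting the two ways of selecting a storage class that this comment mentions. It uses the current k8s.io/api import paths rather than the client-go paths vendored by the operator at the time, and normally only one of the two styles would be set on a given claim.

package k8sutil

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func pvcWithStorageClass(name, storageClassName string) *corev1.PersistentVolumeClaim {
    return &corev1.PersistentVolumeClaim{
        ObjectMeta: metav1.ObjectMeta{
            Name: name,
            // Older style, still used by the backup pvc: the beta annotation.
            Annotations: map[string]string{
                "volume.beta.kubernetes.io/storage-class": storageClassName,
            },
        },
        Spec: corev1.PersistentVolumeClaimSpec{
            // Newer style used in this patch: the first-class spec field.
            StorageClassName: &storageClassName,
        },
    }
}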

# Run tests with PV support disabled
go test -v "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
# Run tests with PV support enabled
PV_TEST=true go test -v "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
Member Author

Instead of changing all the test functions to use a common function that accepts a usePV bool (and adding a parent function that calls it once with true and once with false), I just (for speed) added an env variable that changes the behavior of NewCluster. Let me know your preference.
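
Roughly what that switch looks like; the helper name and package are assumptions, and the real change simply consults the variable inside the existing NewCluster test helper.

package e2e

import "os"

// pvTestEnabled reports whether the e2e run was started with PV_TEST=true,
// so the same test functions exercise both the PV and non-PV code paths.
func pvTestEnabled() bool {
    return os.Getenv("PV_TEST") == "true"
}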

pod:
  usePV: true
  pvPolicy:
    volumeSizeInMB: 1024
Member Author

This is just a proposed spec. I put it under the PodPolicy but I'm not sure about its location. I'm also not sure about the usePV bool name and the related pvPolicy. I'm open to any better ideas.
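
For reference, the YAML above would map onto spec types roughly like the following. UsePV matches the diff further down; the PVPolicy type, field name, and json tags are assumptions for the sketch.

// PVPolicy describes how the etcd data persistent volume is provisioned.
type PVPolicy struct {
    // VolumeSizeInMB is the requested size of each member's data volume.
    VolumeSizeInMB int `json:"volumeSizeInMB"`
}

// PodPolicy fragment showing where the proposed fields would sit.
type PodPolicy struct {
    // UsePV enables a PVC-backed data dir for every etcd pod.
    UsePV bool `json:"usePV,omitempty"`
    // PV configures the data volume; nil means "use defaults".
    PV *PVPolicy `json:"pvPolicy,omitempty"`
}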

@hongchaodeng
Member

hongchaodeng commented Aug 7, 2017

Please hold on.
I'm still not sure if we are going to use PVC for etcd storage.
Instead, we would like to go the path of local PV: #1201

@xiang90
Collaborator

xiang90 commented Aug 7, 2017

Please hold on.
I'm still not sure if we are going to use PVC for etcd storage.
Instead, we would like to go the path of local PV: #1201

Local PV and PVC are different. I am fine with adding PV as long as it does not interleave too much with the current logic. Or we have to refactor the code before adding it.

@hongchaodeng
Member

Local PV and PVC are different.

Well. They are somehow the same. They are both PV abstractions. They can share the same code path for tolerating etcd downtime, GC, and management code in the etcd operator. At least, we should have settled on what local PV looks like before adding an option for non-local PV. Local PV should take higher priority.

@@ -154,6 +156,14 @@ type PodPolicy struct {
// bootstrap the cluster (for example `--initial-cluster` flag).
// This field cannot be updated.
EtcdEnv []v1.EnvVar `json:"etcdEnv,omitempty"`

UsePV bool `json:"usePV,omitempty"`
Collaborator

kill this bool? we can check if PVPolicy is empty.

Member Author

Yeah, that was just one idea: use a bool to enable PV usage, and the user could leave a nil PVPolicy (which will use the default VolumeSizeInMB value). Removing it will require users to always set PVPolicy.VolumeSizeInMB, but I think it's not so complicated from the user's perspective.
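
A sketch of the trade-off being discussed, reusing the hypothetical PVPolicy/PodPolicy fields and minPVSize default from the sketches above: dropping the bool means "pvPolicy present" is what enables PV support, and defaults are applied when its fields are zero.

// pvEnabled is the "kill the bool" variant: PV support is on iff pvPolicy is set.
func pvEnabled(p *PodPolicy) bool {
    return p != nil && p.PV != nil
}

// effectiveVolumeSizeInMB applies the default minimum when the user left the
// size unset or below the minimum.
func effectiveVolumeSizeInMB(p *PodPolicy) int {
    if !pvEnabled(p) {
        return 0 // no PV requested
    }
    if p.PV.VolumeSizeInMB < minPVSize {
        return minPVSize
    }
    return p.PV.VolumeSizeInMB
}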

@hongchaodeng
Member

Actually it is fine to have PVC support. But it is very important to make sure the abstractions are right so that we can share the code path with local PV as well.

@@ -188,3 +192,11 @@ func clusterNameFromMemberName(mn string) string {
}
return mn[:i]
}

func MemberNameFromPVCName(pn string) string {
Collaborator

doc string.
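
For example, the doc string could spell out the naming convention the function inverts. The "pvc-" prefix below is only an assumption for this sketch; the real scheme is whatever the patch uses to name the data PVCs.

import "strings"

// MemberNameFromPVCName returns the etcd member name encoded in a data PVC's
// name. It is the inverse of the PVC naming helper; this sketch assumes a PVC
// is named by prefixing the member name with "pvc-".
func MemberNameFromPVCName(pn string) string {
    return strings.TrimPrefix(pn, "pvc-")
}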

return pvc
}

func NewEtcdPod(m *etcdutil.Member, initialCluster []string, clusterName, state, token string, cs spec.ClusterSpec, usePVC bool, owner metav1.OwnerReference) *v1.Pod {
Collaborator

can we infer PVC from cluster spec?

Member Author

In this PR it could be removed because it's not needed.

It's here because in the second PR, where I implemented pod recreation using an existing PVC, I wanted to handle the case where the user has disabled PVC in the cluster spec but existing PVCs should still be used for pod recreation. But now that I think about it, another implementation could just ignore existing PVCs when PV support is disabled, so I'll remove it.
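
For completeness, the reviewer's suggestion of inferring the flag from the cluster spec instead of threading a usePVC bool through NewEtcdPod could look roughly like this, assuming ClusterSpec.Pod is the PodPolicy pointer and UsePV is the field from the diff above.

// podUsesPVC derives the PV behavior from the cluster spec so NewEtcdPod
// does not need an extra parameter.
func podUsesPVC(cs spec.ClusterSpec) bool {
    return cs.Pod != nil && cs.Pod.UsePV
}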

@sgotti
Member Author

sgotti commented Aug 8, 2017

Well. They are somehow the same. They are both PV abstractions. They can share the same code path for tolerating etcd downtime, GC, and management code in the etcd operator. At least, we should have settled on what local PV looks like before adding an option for non-local PV. Local PV should take higher priority.

@hongchaodeng @xiang90 My opinion is that they are both PVs and both can be abstracted using a PVC. I'll always use a PVC instead of directly creating a PV, since a PVC adds some goodies for PV management (the storage class => provisioning, and volume node affinity).
The main difference between them is node affinity. One is "local" (available only on one node) and the others (EBS, GCE PD, etc.) are "global" (available on more than one node, though only one node at a time), even if they can be available only on a subset of the k8s cluster nodes (like EBS volumes, which are per availability zone).

This difference in node affinity can lead to some differences in the "reconcile" logic. Taking from #1323 (comment):

  1. pod restart - still use previous PV
  2. pod movement - detach the PV from the pod to be moved (or that is dead), and reattach that PV to the new pod

(NOTE: this PR implements 1).

Point 2, if intended as pod movement between different nodes, can be done only with "global" PV.

Another feature (one of the main reasons I'm trying to implement PV support) is that, instead of doing "disaster recovery" when there are fewer than N/2+1 RUNNING etcd pods due to some k8s nodes going down, one could just wait for the nodes to come back (a node rejoining the k8s cluster will restart the pod since the restart policy is Always), or wait for the pods to be deleted from the API (by the node controller plus manual intervention or a fence controller) and recreate them using the previous PVs, which will be scheduled on other nodes.

This is the second part of this saga. I already implemented part of it (but with a lot of assumptions, since with PVs you have multiple choices), so if you want I'll be happy to open another RFC PR if it helps explain the overall idea.

@hongchaodeng
Member

hongchaodeng commented Aug 8, 2017

@sgotti
So I think the purpose of this PR is to add a PVC for each etcd pod and change restartPolicy to Always, right? Let's not handle too much stuff at once and do it step by step:

  • Assume the etcd pod will always restart on the same node for now
  • For the next PR, it will handle toleration, pod safety guarantees, disaster recovery, etc. Needs discussion in Persistent/Durable etcd cluster #1323.

So this PR will handle:

  • create a PVC when scaling up etcd (see the sketch after this list)
  • delete the PVC when scaling down etcd
  • wait for etcd to restart instead of removing it
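
As noted in the list above, the PVC lifecycle simply follows the member lifecycle; a minimal, operator-agnostic sketch of that bookkeeping (the names and map-based representation are assumptions).

// reconcilePVCs pairs the PVC lifecycle with the member lifecycle: create a
// claim for every newly desired member, delete claims whose member is gone.
func reconcilePVCs(desired, existing map[string]bool) (toCreate, toDelete []string) {
    for name := range desired {
        if !existing[name] {
            toCreate = append(toCreate, name) // scale up: new member needs a PVC
        }
    }
    for name := range existing {
        if !desired[name] {
            toDelete = append(toDelete, name) // scale down: GC the member's PVC
        }
    }
    return toCreate, toDelete
}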

For this to move forward, this is what we should do:

  • A PR for only spec change. How would user tell us it expects each etcd pod to uses a PVC, and what kind of PV it would use?
  • A PR for documentation, how could users use it.
  • A PR for internal changes, some heavy-lifting refactoring of the code to make sure good design
  • A PR for testing.

@xiang90
Collaborator

xiang90 commented Aug 8, 2017

@sgotti

I think the direction is good in general. As @hongchaodeng mentions, can we split this effort into multiple PRs? Thank you!

@xiang90
Collaborator

xiang90 commented Aug 16, 2017

@sgotti kindly pinging on this one. Is there anything we could do to help move this forward?

@sgotti
Member Author

sgotti commented Aug 16, 2017

@xiang90 @hongchaodeng sorry for being late, I'm on vacation until next week. The plan is OK for me; I'm not sure if you already have in mind which refactoring will be needed, and I'll keep the implementation and tests in the same PR for obvious reasons.

As a first step I'd like to propose some possible changes to the way volume management is handled in the etcd-operator. Is it ok if I open a new dedicated issue (since I have different options in mind, with different pros and cons)?

@xiang90
Collaborator

xiang90 commented Aug 16, 2017

Is it ok if I open a new dedicated issue (since I have different options in mind, with different pros and cons)?

sure.

@georgekuruvillak

@sgotti I had a query regarding keeping restartPolicy set to Always. I would like to know if a data corruption in the etcd data volume would result in the pod getting restarted unnecessarily. Would it not be better if we handled it using the reconciliation logic? We could know the memberId that's down, and in turn the associated pvc. Could we not create the pod with the same memberId and pvc attached?

While replacing a failed member, it's mentioned that we remove the instance and add a new instance. Just restarting the pod does not conform to what is mentioned in this documentation.

@sgotti
Member Author

sgotti commented Aug 29, 2017

Is it ok if I open a new dedicated issue (since I have different options in mind, with different pros and cons)?
sure.

I wanted to propose directly using a StorageClass in the spec instead of the --pv-provisioner option, but I noticed this has already been accepted, while I was on vacation, in #1361. That's really good. So I just went ahead and opened PR #1373 to discuss just the spec changes.

@sgotti
Member Author

sgotti commented Aug 29, 2017

@sgotti I had a query regarding keeping restartPolicy set to Always. I would like to know if a data corruption in the etcd data volume would result in the pod getting restarted unnecessarily.

I agree that putting etcd data on a persistent volume adds the requirement to also handle a data corruption case, which previously wasn't needed.

Would it not be better if we handled it using the reconciliation logic? We could know the memberId that's down, and in turn the associated pvc. Could we not create the pod with the same memberId and pvc attached?

I think that restartPolicy Always is a fast way (as proposed by @xiang90) to use k8s features to restart a pod if it crashes or the node reboots, without adding this logic to the reconcile loop. Another way would be, as you proposed, to do it inside the operator reconcile loop.

But I don't see how this could help distinguish corrupted data from pod/etcd not starting for other reasons. I'm not sure what the best way to detect a corruption is.

My initial idea was to just remove the member and add a new one (with a new empty PV) if it's not healthy for some time. This also works with restartPolicy set to Always, and could be done only when the etcd cluster is healthy, so it won't create problems if all the etcd pods (or nodes) go down at the same time, which achieves the goal of this PR.

Quite probably @hongchaodeng @xiang90 have better ideas.

@georgekuruvillak

@sgotti, @xiang90 Is the first part of the PR, without PVC reuse, going to be incorporated upstream? When can it roughly be expected to be merged?

@sgotti
Member Author

sgotti commented Sep 8, 2017

Closing this in favor of multiple PRs (the first is #1373).

@sgotti sgotti closed this Sep 8, 2017