[WIP] *: Add initial PersistentVolume feature #1349
Conversation
This patch adds an initial persistent volume feature for etcd data. When the usePV option is enabled in the etcd cluster spec, every created pod gets a related persistent volume claim and the pod's restart policy is set to RestartAlways. This initial patch only covers the case where a pod crashes (or a node restarts before the related pod is evicted by the k8s node controller): when a persistent volume is in use, instead of deleting the pod and creating a new one, the pod is simply restarted by k8s. A future patch will change the reconcile logic to also handle cases where one or more pods are deleted (manually or by k8s), recreating the same pod with the same PVC. The persistent volume claim uses the same PV provisioner defined for the backup PVC. The etcd data PVC is garbage collected like the other cluster resources.

Questions:

- I'm using the same backup storage class (the name isn't great since it starts with etcd-backup-). Should we define different storage classes using the same provisioner, or permit different provisioners for backup and data?
- I'll add some questions inline in the code.
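To make the mechanism described above concrete, here is a minimal sketch of "each member gets a PVC and the pod is restarted in place". This is not the PR's actual code: the helper names, the "-pvc" naming convention, the mount path, the access mode, and the import paths (modern k8s.io/api paths, with the PVC resources field as it existed in the client-go vintage this PR targets) are all assumptions.

```go
package k8sutil

import (
	"fmt"

	"k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newDataPVC sketches the per-member data claim: one PVC per etcd pod,
// sized from the (proposed) pvPolicy.volumeSizeInMB setting.
func newDataPVC(memberName, storageClassName string, sizeInMB int) *v1.PersistentVolumeClaim {
	return &v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name: fmt.Sprintf("%s-pvc", memberName), // assumed naming convention
		},
		Spec: v1.PersistentVolumeClaimSpec{
			StorageClassName: &storageClassName,
			AccessModes:      []v1.PersistentVolumeAccessMode{v1.ReadWriteOnce},
			Resources: v1.ResourceRequirements{
				Requests: v1.ResourceList{
					v1.ResourceStorage: resource.MustParse(fmt.Sprintf("%dMi", sizeInMB)),
				},
			},
		},
	}
}

// applyPVSettings sketches the pod-side changes: attach the claim as a pod
// volume, mount it into the etcd container, and let the kubelet restart the
// container in place instead of the operator deleting and recreating the pod.
func applyPVSettings(pod *v1.Pod, claimName string) {
	pod.Spec.RestartPolicy = v1.RestartPolicyAlways
	pod.Spec.Volumes = append(pod.Spec.Volumes, v1.Volume{
		Name: "etcd-data",
		VolumeSource: v1.VolumeSource{
			PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{ClaimName: claimName},
		},
	})
	if len(pod.Spec.Containers) > 0 {
		pod.Spec.Containers[0].VolumeMounts = append(pod.Spec.Containers[0].VolumeMounts, v1.VolumeMount{
			Name:      "etcd-data",
			MountPath: "/var/etcd/data", // assumed etcd data directory
		})
	}
}
```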
```diff
@@ -28,6 +28,8 @@ import (
 const (
 	defaultBaseImage = "quay.io/coreos/etcd"
 	defaultVersion   = "3.1.8"
+
+	minPVSize = 512 // 512MiB
```
Is this a sane min size default?
seems ok
```go
		},
	},
	Spec: v1.PersistentVolumeClaimSpec{
		StorageClassName: &storageClassName,
```
Here I'm using Spec.StorageClassName instead of the old volume.beta.kubernetes.io/storage-class annotation, which is what the backup PVC uses: https://github.com/coreos/etcd-operator/blob/master/pkg/util/k8sutil/backup.go#L74. Should this be updated?
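For comparison, the older mechanism selects the class through an annotation rather than the spec field. A minimal sketch (the annotation key is the real beta key; the function and claim names are illustrative, not the backup.go code):

```go
package k8sutil

import (
	"k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newAnnotationClassPVC shows the pre-1.6, annotation-based way of selecting
// a storage class, as the backup PVC code does; Spec.StorageClassName in the
// diff above is the field-based replacement.
func newAnnotationClassPVC(name, storageClassName string) *v1.PersistentVolumeClaim {
	return &v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name: name,
			Annotations: map[string]string{
				"volume.beta.kubernetes.io/storage-class": storageClassName,
			},
		},
	}
}
```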
```sh
# Run tests with PV support disabled
go test -v "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
# Run tests with PV support enabled
PV_TEST=true go test -v "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
```
Instead of changing all the test functions to call a common function that accepts a usePV bool, and adding a parent function that invokes it once with true and once with false, I just (for speed) added an environment variable that changes the behavior of NewCluster. Let me know your preference.
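A rough sketch of such an environment-variable switch in a test helper (the package and function names here are illustrative, not the actual e2e code):

```go
package e2eutil

import "os"

// pvTestEnabled reports whether the e2e run was started with PV_TEST=true,
// so the same test functions can exercise NewCluster with and without PV
// support instead of threading a usePV bool through every helper.
func pvTestEnabled() bool {
	return os.Getenv("PV_TEST") == "true"
}
```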
```yaml
pod:
  usePV: true
  pvPolicy:
    volumeSizeInMB: 1024
```
This is just a proposed spec. I put it under PodPolicy, but I'm not sure about its location. I'm also not sure about the usePV bool name and the related pvPolicy. I'm open to better ideas.
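As a rough mapping of that YAML onto Go types (the PVPolicy struct is not part of the excerpted diff, so its exact shape is an assumption, and PodPolicy is shown trimmed down):

```go
package spec

// PVPolicy describes how the persistent volume for etcd data is provisioned.
// Only the size field is sketched here; the actual PR may define more.
type PVPolicy struct {
	// VolumeSizeInMB is the requested size of the etcd data volume, in MiB.
	VolumeSizeInMB int `json:"volumeSizeInMB"`
}

// The proposed additions to PodPolicy that the YAML above would decode into
// (the real struct has many more fields).
type PodPolicy struct {
	UsePV    bool      `json:"usePV,omitempty"`
	PVPolicy *PVPolicy `json:"pvPolicy,omitempty"`
}
```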
Please hold on.
Local PV and PVC are different. I am fine with adding PV as long as it does not interleave too much with the current logic; otherwise we have to refactor the code before adding it.
Well, they are somewhat the same: they are both PV abstractions, and they can share the same code path for tolerating etcd downtime, GC, and management code in the etcd operator. At the very least, we should nail down what local PV looks like before adding an option for non-local PV. Local PV should take higher priority.
```diff
@@ -154,6 +156,14 @@ type PodPolicy struct {
 	// bootstrap the cluster (for example `--initial-cluster` flag).
 	// This field cannot be updated.
 	EtcdEnv []v1.EnvVar `json:"etcdEnv,omitempty"`
+
+	UsePV bool `json:"usePV,omitempty"`
```
kill this bool? we can check if PVPolicy is empty.
Yeah, that was just one idea: use a bool to enable PV usage so the user can leave PVPolicy nil (which would fall back to the default VolumeSizeInMB value). Removing it would require users to always set PVPolicy.VolumeSizeInMB, but I think that's not too complicated from the user's perspective.
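A sketch of the bool-less alternative being discussed; the helper names, the default value, and the trimmed type repeats (kept so the snippet stands alone) are all assumptions:

```go
package spec

// Trimmed repeats of the types sketched earlier so this snippet stands alone.
type PVPolicy struct {
	VolumeSizeInMB int `json:"volumeSizeInMB,omitempty"`
}

type PodPolicy struct {
	PVPolicy *PVPolicy `json:"pvPolicy,omitempty"`
}

// Assumed default size (MiB) applied when PVPolicy is set without a size.
const defaultVolumeSizeInMB = 512

// IsPVEnabled replaces the usePV bool: a non-nil PVPolicy becomes the opt-in
// signal for persistent volumes.
func (p *PodPolicy) IsPVEnabled() bool {
	return p.PVPolicy != nil
}

// VolumeSize lets users enable PVs without setting a size explicitly by
// falling back to the default.
func (p *PVPolicy) VolumeSize() int {
	if p == nil || p.VolumeSizeInMB <= 0 {
		return defaultVolumeSizeInMB
	}
	return p.VolumeSizeInMB
}
```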
Actually it is fine to have PVC support. But it is very important to make sure the abstractions are right so that we can share the code path with local PV as well.
```diff
@@ -188,3 +192,11 @@ func clusterNameFromMemberName(mn string) string {
 	}
 	return mn[:i]
 }
+
+func MemberNameFromPVCName(pn string) string {
```
doc string.
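Something along these lines, perhaps (only the signature appears in the diff; the package placement, the "-pvc" suffix convention, and the body are assumptions):

```go
package etcdutil

import "strings"

// MemberNameFromPVCName returns the name of the etcd member that owns the
// given data PVC. It assumes the PVC is named "<member-name>-pvc" and
// mirrors clusterNameFromMemberName above.
func MemberNameFromPVCName(pn string) string {
	return strings.TrimSuffix(pn, "-pvc")
}
```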
```go
	return pvc
}

func NewEtcdPod(m *etcdutil.Member, initialCluster []string, clusterName, state, token string, cs spec.ClusterSpec, usePVC bool, owner metav1.OwnerReference) *v1.Pod {
```
can we infer PVC from cluster spec?
In this PR it could be removed, since it's not strictly needed. It's here because, in the second PR where I implement pod recreation using an existing PVC, I wanted to handle the case where the user disables PVC in the cluster spec but we keep using the existing PVCs for pod recreation. Now that I think about it, another implementation could simply ignore existing PVCs when PV support is disabled, so I'll remove it.
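A sketch of what inferring it from the cluster spec could look like (IsPVEnabled is the hypothetical helper sketched earlier; the import path and whether ClusterSpec.Pod is a pointer are assumptions):

```go
package k8sutil

import "github.com/coreos/etcd-operator/pkg/spec"

// podUsesPVC infers PVC usage from the cluster spec instead of threading a
// usePVC argument through NewEtcdPod: a non-nil PVPolicy on the pod policy
// means PV support is on.
func podUsesPVC(cs spec.ClusterSpec) bool {
	return cs.Pod != nil && cs.Pod.IsPVEnabled()
}
```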
@hongchaodeng @xiang90 My opinion is that they are both PVs, and both can be abstracted using a PVC. I'd always use a PVC instead of directly creating a PV, since a PVC adds some goodies for PV management (the storage class, i.e. provisioning, and volume node affinity). This difference in node affinity can lead to some differences in the "reconcile" logic. Quoting from #1323 (comment):
(NOTE: this PR implements 1). Point 2, if intended as pod movement between different nodes, can be done only with a "global" PV. Another feature (and one of the main reasons I'm trying to implement PV support) is that, instead of doing "disaster recovery" when there are fewer than N/2+1 RUNNING etcd pods because some k8s nodes went down, one could just wait for them to come back (the node rejoins the k8s cluster and restarts the pod, since the policy is RestartAlways), or wait for the pods to be deleted from the API (by the node controller plus manual intervention or a fence controller) and recreate them using the previous PV, which will then be scheduled on another node. This is the second part of this saga. I already implemented part of it (but with a lot of assumptions, since with PV you have multiple choices), so if you want I'll be happy to open another RFC PR if that helps explain the overall idea.
@sgotti
So this PR will handle:
For this to move forward, this is what we should do:
I think the direction is good in general. As @hongchaodeng mentions, can we split this effort into multiple PRs? Thank you!
@sgotti Kindly pinging on this one. Is there anything we can do to help move this forward?
@xiang90 @hongchaodeng Sorry for the delay, I'm on vacation until next week. The plan is OK with me; I'm not sure whether you already have in mind which refactoring will be needed, and I'll keep the implementation and tests in the same PR for obvious reasons. As a first step I'd like to propose some possible changes to the way volume management is handled in the etcd-operator. Is it OK if I open a new dedicated issue (since I have in mind different options with different pros and cons)?
Sure.
@sgotti I had a query about keeping restartPolicy set to Always. I would like to know whether data corruption in the etcd data volume would result in the pod getting restarted unnecessarily. Would it not be better to handle this in the reconciliation logic? We could determine the member ID that's down and, in turn, the associated PVC. Could we not create the pod with the same member ID and PVC attached? While replacing a failed member, it's mentioned that we remove the instance and add a new instance; just restarting the pod does not conform to what is mentioned in this documentation.
I wanted to propose to directly use StorageClass in the spec instead of the
I agree that putting etcd data on a persistent volume adds the requirement to also handle a data corruption case that previously didn't exist.
I think that restartPolicy Always is a fast way (as proposed by @xiang90) to use k8s features to restart a pod if it crashes or the node reboots, without adding this logic to the reconcile loop. Another way would be, as you proposed, to do it inside the operator reconcile loop, but I don't see how that would help distinguish corrupted data from a pod/etcd that fails to start for other reasons. I'm not sure what the best way to detect corruption is. My initial idea was to just remove the member and add a new one (with a new, empty PV) if it stays unhealthy for some time. This would also work with restartPolicy set to Always, and it could be done only when the etcd cluster is healthy, so it won't cause problems if all the etcd pods (or nodes) go down at the same time, which achieves the goal of this PR. Quite probably @hongchaodeng and @xiang90 have better ideas.
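A sketch of what that remove-and-replace fallback might look like in the reconcile loop; all names, the trimmed member type, and the grace-period mechanism are hypothetical, not the operator's actual bookkeeping:

```go
package cluster

import "time"

// memberStatus is a trimmed stand-in for the operator's member bookkeeping.
type memberStatus struct {
	healthy        bool
	unhealthySince time.Time
}

// shouldReplaceMember sketches the fallback discussed above: only while the
// etcd cluster as a whole is healthy, and a member has stayed unhealthy past
// a grace period, is it removed (together with its data PVC) and replaced by
// a fresh member that starts from a new, empty volume.
func shouldReplaceMember(m memberStatus, clusterHealthy bool, grace time.Duration, now time.Time) bool {
	if !clusterHealthy || m.healthy {
		return false
	}
	return now.Sub(m.unhealthySince) >= grace
}
```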
Closing this in favor of multiple PRs (the first is #1373).