This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

[WIP] *: Add initial PersistentVolume feature #1349

Closed
wants to merge 1 commit

Conversation

sgotti
Member

@sgotti sgotti commented Aug 7, 2017

This patch adds an initial persistent volume feature for etcd data.

When the usePV option is enabled in the etcd cluster spec, every
created pod will get a related persistent volume claim and the pod
restart policy will be set to Always.

This initial patch only covers the case where a pod crashes (or a node
restarts before the related pod is evicted by the k8s node
controller). When using a persistent volume, instead of deleting the pod
and creating a new one, the pod will be restarted by k8s.

A future patch will change the reconcile logic to also handle cases
where one or more pods are deleted (manually or by k8s), recreating the
same pod using the same pvc.

The persistent volume claim will use the same pv provisioner defined for the
backup pvc.

The etcd data pvc will be garbage collected like the other cluster
resources.
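
For illustration, a minimal sketch (not taken from the patch itself) of what this description amounts to: a member pod whose etcd data dir is backed by a per-member PVC and whose restart policy lets the kubelet restart it in place. The helper name, mount path, and image tag here are assumptions.

package k8sutil

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// etcdPodWithPV sketches a member pod whose data dir is backed by a PVC and
// that is restarted in place by the kubelet instead of being deleted and
// recreated by the operator.
func etcdPodWithPV(memberName, pvcName, namespace string) *corev1.Pod {
    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: memberName, Namespace: namespace},
        Spec: corev1.PodSpec{
            // Restart on crash or node reboot without operator intervention.
            RestartPolicy: corev1.RestartPolicyAlways,
            Containers: []corev1.Container{{
                Name:  "etcd",
                Image: "quay.io/coreos/etcd:v3.1.8",
                VolumeMounts: []corev1.VolumeMount{{
                    Name:      "etcd-data",
                    MountPath: "/var/etcd", // assumed data dir
                }},
            }},
            Volumes: []corev1.Volume{{
                Name: "etcd-data",
                VolumeSource: corev1.VolumeSource{
                    PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
                        ClaimName: pvcName,
                    },
                },
            }},
        },
    }
}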

Questions:

  • I'm using the same backup storage class (the name isn't great since it starts with etcd-backup-). Should we define different storage classes using the same provisioner, or permit different provisioners for backup and data?

  • I'll add some questions inline in the code.

TODO:

  • Check all the tests on your infrastructure (I'm testing on a local k8s cluster, disabling tests that need S3 and updating the generated storage class with another pv provisioner).

@@ -28,6 +28,8 @@ import (
const (
defaultBaseImage = "quay.io/coreos/etcd"
defaultVersion = "3.1.8"

minPVSize = 512 // 512MiB
Member Author

Is this a sane min size default?

Collaborator

seems ok
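
For context, a tiny sketch of how a minimum like this would typically be applied when the requested size is below it; the helper name is an assumption, and the field follows the pvPolicy proposal discussed later in this PR.

const minPVSize = 512 // 512MiB

// defaultPVSizeInMB clamps a requested data-volume size to the minimum.
// A zero value (field left unset) also falls back to the default.
func defaultPVSizeInMB(requested int) int {
    if requested < minPVSize {
        return minPVSize
    }
    return requested
}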

},
},
Spec: v1.PersistentVolumeClaimSpec{
StorageClassName: &storageClassName,
Member Author

Here I'm using Spec.StorageClassName instead of the old annotation volume.beta.kubernetes.io/storage-class that's used in the backup pvc: https://github.com/coreos/etcd-operator/blob/master/pkg/util/k8sutil/backup.go#L74 . Should this be updated?
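
For reference, a sketch contrasting the two ways of selecting a storage class that this comment mentions. It uses the current k8s.io/api import paths rather than the client-go paths vendored by the operator at the time, and normally only one of the two styles would be set on a given claim.

package k8sutil

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func pvcWithStorageClass(name, storageClassName string) *corev1.PersistentVolumeClaim {
    return &corev1.PersistentVolumeClaim{
        ObjectMeta: metav1.ObjectMeta{
            Name: name,
            // Older style, still used by the backup pvc: the beta annotation.
            Annotations: map[string]string{
                "volume.beta.kubernetes.io/storage-class": storageClassName,
            },
        },
        Spec: corev1.PersistentVolumeClaimSpec{
            // Newer style used in this patch: the first-class spec field.
            StorageClassName: &storageClassName,
        },
    }
}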

# Run tests with PV support disabled
go test -v "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
# Run tests with PV support enabled
PV_TEST=true go test -v "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
Member Author

Instead of changing all the test functions to use a common function that accepts a usePV bool (and adding a parent function that calls it once with true and once with false), I just (for speed) added an env variable that changes the behavior of NewCluster. Let me know your preference.
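
Roughly what that switch looks like; the helper name and package are assumptions, and the real change simply consults the variable inside the existing NewCluster test helper.

package e2e

import "os"

// pvTestEnabled reports whether the e2e run was started with PV_TEST=true,
// so the same test functions exercise both the PV and non-PV code paths.
func pvTestEnabled() bool {
    return os.Getenv("PV_TEST") == "true"
}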

pod:
  usePV: true
  pvPolicy:
    volumeSizeInMB: 1024
Member Author

This is just a proposed spec. I put it under the PodPolicy but I'm not sure about its location. I'm also not sure about the usePV bool name and the related pvPolicy. I'm open to any better ideas.
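
For reference, the YAML above would map onto spec types roughly like the following. UsePV matches the diff further down; the PVPolicy type, field name, and json tags are assumptions for the sketch.

// PVPolicy describes how the etcd data persistent volume is provisioned.
type PVPolicy struct {
    // VolumeSizeInMB is the requested size of each member's data volume.
    VolumeSizeInMB int `json:"volumeSizeInMB"`
}

// PodPolicy fragment showing where the proposed fields would sit.
type PodPolicy struct {
    // UsePV enables a PVC-backed data dir for every etcd pod.
    UsePV bool `json:"usePV,omitempty"`
    // PV configures the data volume; nil means "use defaults".
    PV *PVPolicy `json:"pvPolicy,omitempty"`
}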

@hongchaodeng
Member

hongchaodeng commented Aug 7, 2017

Please hold on.
I'm still not sure if we are going to use PVC for etcd storage.
Instead, we would like to go the path of local PV: #1201

@xiang90
Collaborator

xiang90 commented Aug 7, 2017

Please hold on.
I'm still not sure if we are going to use PVC for etcd storage.
Instead, we would like to go the path of local PV: #1201

Local PV and PVC are different. I am fine with adding PV as long as it does not interleave too much with the current logic. Or we have to refactor the code before adding it.

@hongchaodeng
Member

Local PV and PVC are different.

Well. They are somehow the same. They are both PV abstractions. They can share the same code path for tolerating etcd downtime, GC, and management code in the etcd operator. At least, we should have settled on what local PV looks like before adding an option for non-local PV. Local PV should take higher priority.

@@ -154,6 +156,14 @@ type PodPolicy struct {
// bootstrap the cluster (for example `--initial-cluster` flag).
// This field cannot be updated.
EtcdEnv []v1.EnvVar `json:"etcdEnv,omitempty"`

UsePV bool `json:"usePV,omitempty"`
Collaborator

kill this bool? we can check if PVPolicy is empty.

Member Author

Yeah, that was just one idea: use a bool to enable PV usage, and the user could leave a nil PVPolicy (which will use the default VolumeSizeInMB value). Removing it will require users to always set PVPolicy.VolumeSizeInMB, but I think it's not so complicated from the user's perspective.
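
A sketch of the trade-off being discussed, reusing the hypothetical PVPolicy/PodPolicy fields and minPVSize default from the sketches above: dropping the bool means "pvPolicy present" is what enables PV support, and defaults are applied when its fields are zero.

// pvEnabled is the "kill the bool" variant: PV support is on iff pvPolicy is set.
func pvEnabled(p *PodPolicy) bool {
    return p != nil && p.PV != nil
}

// effectiveVolumeSizeInMB applies the default minimum when the user left the
// size unset or below the minimum.
func effectiveVolumeSizeInMB(p *PodPolicy) int {
    if !pvEnabled(p) {
        return 0 // no PV requested
    }
    if p.PV.VolumeSizeInMB < minPVSize {
        return minPVSize
    }
    return p.PV.VolumeSizeInMB
}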

@hongchaodeng
Member

Actually it is fine to have PVC support. But it is very important to make sure the abstractions are right so that we can share the code path with local PV as well.

@@ -188,3 +192,11 @@ func clusterNameFromMemberName(mn string) string {
}
return mn[:i]
}

func MemberNameFromPVCName(pn string) string {
Collaborator

doc string.
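
For example, the doc string could spell out the naming convention the function inverts. The "pvc-" prefix below is only an assumption for this sketch; the real scheme is whatever the patch uses to name the data PVCs.

import "strings"

// MemberNameFromPVCName returns the etcd member name encoded in a data PVC's
// name. It is the inverse of the PVC naming helper; this sketch assumes a PVC
// is named by prefixing the member name with "pvc-".
func MemberNameFromPVCName(pn string) string {
    return strings.TrimPrefix(pn, "pvc-")
}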

return pvc
}

func NewEtcdPod(m *etcdutil.Member, initialCluster []string, clusterName, state, token string, cs spec.ClusterSpec, usePVC bool, owner metav1.OwnerReference) *v1.Pod {
Collaborator

can we infer PVC from cluster spec?

Member Author

In this PR it could be removed because it's not needed.

It's here because in the second PR, where I implemented pod recreation using an existing PVC, I wanted to handle the case where the user has disabled PVC in the cluster spec but existing PVCs should still be used for pod recreation. But now that I think about it, another implementation could just ignore existing PVCs when PV support is disabled, so I'll remove it.
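
For completeness, the reviewer's suggestion of inferring the flag from the cluster spec instead of threading a usePVC bool through NewEtcdPod could look roughly like this, assuming ClusterSpec.Pod is the PodPolicy pointer and UsePV is the field from the diff above.

// podUsesPVC derives the PV behavior from the cluster spec so NewEtcdPod
// does not need an extra parameter.
func podUsesPVC(cs spec.ClusterSpec) bool {
    return cs.Pod != nil && cs.Pod.UsePV
}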

@sgotti
Member Author

sgotti commented Aug 8, 2017

Well. They are somehow the same. They are both PV abstractions. They can share the same code path for tolerating etcd downtime, GC, and management code in the etcd operator. At least, we should have settled on what local PV looks like before adding an option for non-local PV. Local PV should take higher priority.

@hongchaodeng @xiang90 My opinion is that they are both PVs and both can be abstracted using a PVC. I'll always use a PVC instead of directly creating a PV, since a PVC adds some goodies for PV management (the storage class => provisioning, and volume node affinity).
The main difference between them is node affinity. One is "local" (available only on one node) and the others (EBS, GCE PD, etc.) are "global" (available on more than one node, though only one node at a time), even if they can be available only on a subset of the k8s cluster nodes (like EBS volumes, which are per availability zone).

This difference in node affinity can lead to some differences in the "reconcile" logic. Taking from #1323 (comment):

  1. pod restart - still use previous PV
  2. pod movement - detach the PV from the pod to be moved (or that is dead), and reattach that PV to the new pod

(NOTE: this PR implements 1).

Point 2, if intended as pod movement between different nodes, can be done only with "global" PV.

Another feature (one of the main reasons I'm trying to implement PV support) is that, instead of doing "disaster recovery" when there are fewer than N/2+1 RUNNING etcd pods due to some k8s nodes going down, one could just wait for the nodes to come back (a node rejoining the k8s cluster will restart the pod since the restart policy is Always), or wait for the pods to be deleted from the API (by the node controller plus manual intervention or a fence controller) and recreate them using the previous PVs, which will be scheduled on other nodes.

This is the second part of this saga. I already implemented part of it (but with a lot of assumptions, since with PVs you have multiple choices), so if you want I'll be happy to open another RFC PR if it helps explain the overall idea.

@hongchaodeng
Member

hongchaodeng commented Aug 8, 2017

@sgotti
So I think the purpose of this PR is to add a PVC for each etcd pod and change restartPolicy to Always, right? Let's not handle too much stuff at once and do it step by step:

  • Assume the etcd pod will always restart on the same node for now
  • For the next PR, it will handle toleration, pod safety guarantees, disaster recovery, etc. Needs discussion in Persistent/Durable etcd cluster #1323.

So this PR will handle:

  • create a PVC when scaling up etcd (see the sketch after this list)
  • delete the PVC when scaling down etcd
  • wait for etcd to restart instead of removing it
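
As noted in the list above, the PVC lifecycle simply follows the member lifecycle; a minimal, operator-agnostic sketch of that bookkeeping (the names and map-based representation are assumptions).

// reconcilePVCs pairs the PVC lifecycle with the member lifecycle: create a
// claim for every newly desired member, delete claims whose member is gone.
func reconcilePVCs(desired, existing map[string]bool) (toCreate, toDelete []string) {
    for name := range desired {
        if !existing[name] {
            toCreate = append(toCreate, name) // scale up: new member needs a PVC
        }
    }
    for name := range existing {
        if !desired[name] {
            toDelete = append(toDelete, name) // scale down: GC the member's PVC
        }
    }
    return toCreate, toDelete
}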

For this to move forward, this is what we should do:

  • A PR for only spec change. How would user tell us it expects each etcd pod to uses a PVC, and what kind of PV it would use?
  • A PR for documentation, how could users use it.
  • A PR for internal changes, some heavy-lifting refactoring of the code to make sure good design
  • A PR for testing.

@xiang90
Collaborator

xiang90 commented Aug 8, 2017

@sgotti

I think the direction is good in general. As @hongchaodeng mentions, can we split this effort into multiple PRs? Thank you!

@xiang90
Collaborator

xiang90 commented Aug 16, 2017

@sgotti kindly pinging on this one. Is there anything we could do to help move this forward?

@sgotti
Member Author

sgotti commented Aug 16, 2017

@xiang90 @hongchaodeng sorry for being late, I'm on vacation until next week. The plan is OK for me; I'm not sure if you already have in mind which refactoring will be needed, and I'll keep the implementation and tests in the same PR for obvious reasons.

As a first step I'd like to propose some possible changes to the way volume management is handled in the etcd-operator. Is it ok if I open a new dedicated issue (since I have different options in mind, with different pros and cons)?

@xiang90
Collaborator

xiang90 commented Aug 16, 2017

Is it ok if I open a new dedicated issue (since I have different options in mind, with different pros and cons)?

sure.

@georgekuruvillak

@sgotti I had a query regarding keeping restartPolicy set to Always. I would like to know if a data corruption in the etcd data volume would result in the pod getting restarted unnecessarily. Would it not be better if we handled it using the reconciliation logic? We could know the memberId that's down, and in turn the associated pvc. Could we not create the pod with the same memberId and pvc attached?

While replacing a failed member, it's mentioned that we remove the instance and add a new instance. Just restarting the pod does not conform to what is mentioned in this documentation.

@sgotti
Member Author

sgotti commented Aug 29, 2017

Is it ok if I open a new dedicated issue (since I have different options in mind, with different pros and cons)?
sure.

I wanted to propose directly using a StorageClass in the spec instead of the --pv-provisioner option, but I noticed this has already been accepted, while I was on vacation, in #1361. That's really good. So I just went ahead and opened PR #1373 to discuss just the spec changes.

@sgotti
Member Author

sgotti commented Aug 29, 2017

@sgotti I had a query regarding keeping restartPolicy set to Always. I would like to know if a data corruption in the etcd data volume would result in the pod getting restarted unnecessarily.

I agree that putting etcd data on a persistent volume adds the requirement to also handle a data corruption case, which previously wasn't needed.

Would it not be better if we handled it using the reconciliation logic? We could know the memberId that's down, and in turn the associated pvc. Could we not create the pod with the same memberId and pvc attached?

I think that restartPolicy Always is a fast way (as proposed by @xiang90) to use k8s features to restart a pod if it crashes or the node reboots, without adding this logic to the reconcile loop. Another way would be, as you proposed, to do it inside the operator reconcile loop.

But I don't see how this could help distinguish corrupted data from pod/etcd not starting for other reasons. I'm not sure what the best way to detect a corruption is.

My initial idea was to just remove the member and add a new one (with a new empty PV) if it's not healthy for some time. This also works with restartPolicy set to Always, and could be done only when the etcd cluster is healthy, so it won't create problems if all the etcd pods (or nodes) go down at the same time, which achieves the goal of this PR.

Quite probably @hongchaodeng @xiang90 have better ideas.

@georgekuruvillak

@sgotti, @xiang90 Is the first part of the PR, without PVC reuse, going to be incorporated upstream? When can it roughly be expected to be merged?

@sgotti
Member Author

sgotti commented Sep 8, 2017

Closing this in favor of multiple PRs (the first is #1373).

@sgotti sgotti closed this Sep 8, 2017