
KEP-1847: Auto remove PVCs created by StatefulSet #1915

Closed
wants to merge 3 commits

Conversation

@kk-src commented Jul 30, 2020

Auto remove PVCs created by StatefulSet
Issue: #1847

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 30, 2020
@k8s-ci-robot (Contributor) commented:

Hi @kk-src. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 30, 2020
@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Jul 30, 2020
@kk-src (Author) commented Jul 30, 2020:

/cc @mattcary @dsu-igeek

@k8s-ci-robot (Contributor) commented:

@kk-src: GitHub didn't allow me to request PR reviews from the following users: dsu-igeek.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @mattcary @dsu-igeek
/cc @msau42 for review.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kk-src kk-src marked this pull request as draft July 30, 2020 15:59
@kk-src kk-src changed the title [WIP] KEP-1847: Auto remove PVCs created by StatefulSet KEP-1847: Auto remove PVCs created by StatefulSet Jul 30, 2020
@kk-src (Author) commented Jul 30, 2020:

/cc @msau42

@msau42 (Member) commented Jul 30, 2020:

/assign @msau42 @kow3ns @janetkuo

@mattcary (Contributor) left a comment:

Looks in good shape to me. A couple of minor comments only.

Thanks!

the StatefulSet is deleted. As can be seen by the discussion in the issue
[55045](https://github.com/kubernetes/kubernetes/issues/55045) there are several use
cases where the PVCs which are automatically created are deleted as well. There is
merit to the suggestion that the PVCs should not be deleted since they have a different
Contributor:

nit: Probably more precise to say something like "In many StatefulSet use cases, the PVCs have a different lifecycle than the pods of the StatefulSet, and should not be deleted at the same time. Because of this, PVC deletion will be opt-in for users."

If `VolumeReclaimPolicy` is set to `RemoveOnPodDeletion` Pod is set as the owner of the PVC.
When a Pod is deleted, the PVC corresponding to the Pod also gets deleted. During scale up, a
check is made to see if the ownership is already set and if the Pod UID is that of the current
Pod in consideration. In case of a mismatch the scale up will exit out of the current reconcile
Contributor:

This is a little unclear to me. Is "Pod UID" the pod UID in the owner reference of the PVC? and what is "the current Pod in consideration"---wouldn't the controller be looking at all pods & pvcs at once?

I think it might be more clear to just state the behavior rather than saying what goes wrong with different behavior, something like:

When a Pod is deleted, the PVC owned by the Pod is also deleted. During scale-up, for each Pod of the StatefulSet, the PVCs are gathered. If a PVC has an OwnerRef that does not match this Pod, the PVC must not have yet been deleted and the controller will exit the current reconcile loop and attempt to reconcile in the next iteration. This avoids a race with PVC deletion.

PVCs previously in use before scale down will be used again when the scale up occurs. The PVC deletion
should happen only after the Pod gets deleted. Since the Pod ownership has `blockOwnerDeletion` set to
`true` pods will get deleted before the StatefulSet is deleted. The `blockOwnerDeletion` for PVCs will
be set to `false` which ensures that PVC deletion happens only after the StatefulSet is deleted. This
Contributor:

That's very elegant, nice!
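
For reference, a minimal sketch of the ownership chain described in the hunk above, assuming the PVC's owner is the StatefulSet itself; the names (web, www-web-0) and the UID placeholders are hypothetical, not from the KEP:

apiVersion: v1
kind: Pod
metadata:
  name: web-0
  ownerReferences:
  - apiVersion: apps/v1
    kind: StatefulSet
    name: web
    uid: "<statefulset-uid>"     # placeholder
    controller: true
    blockOwnerDeletion: true     # Pods must be gone before the StatefulSet can go
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: www-web-0
  ownerReferences:
  - apiVersion: apps/v1
    kind: StatefulSet
    name: web
    uid: "<statefulset-uid>"     # placeholder
    blockOwnerDeletion: false    # PVC deletion may trail StatefulSet deletion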

-->

## Design Details

Contributor:

Should specify that ownerRefs on PVCs are only set for those that are created for volumes provisioned as part of the statefulset, ie in VolumeClaimTemplates with a StorageClass specified, and no dataSource, etc.

Member:

Can you tell? Someone could manually create a dynamically provisioned PVC with the same name that the StatefulSet controller expects.

Contributor:

That's a good point. Does that mean we should add an annotation when a PVC is created by the SS controller for a pod?

We don't want to add an owner reference I think as that would make it harder to not change behavior for a retain policy.

It seems to me that an annotation would work well here but maybe there is an idiomatic alternative.

Author:

Have added text indicating the static naming convention used by the statefulset will be reused here to identify the PVCs in question.

* Retain - this is the default policy and is considered in cases where no policy is specified.
This would be the existing behaviour - when a StatefulSet is deleted, no action is taken with
respect to the PVCs created by the StatefulSet.
* RemoveOnPodDeletion - Whenever a pod is deleted, the corresponding PVC is deleted as well.
@pohly (Contributor) commented Aug 3, 2020:

RemoveOnPodDeletion is very similar to "generic ephemeral inline volumes" (https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1698-generic-ephemeral-volumes): volumes get created for a pod and deleted together with it. Just the mechanism is different.

I'm undecided whether it makes sense to have two different approaches for the same use case. On the one hand, it fits into the proposed API. On the other hand, additional code will be needed to implement it and that code will be more complex than the one for RemoveOnStatefulSetDeletion (like dealing with slow PVC removal).

Contributor:

Does that mean we want to change the semantics so that the PVC is only removed on StatefulSet scaledown? (and change the name to RemoveOnScaledown or similar)

Contributor:

Using a string for the reclaim policy is better than a boolean, I would just remove the RemoveOnPodDeletion option unless there's a good reason for having it.

Contributor:

I mean change the semantics so that the three options are:

  • Retain (current behavior)
  • RemoveOnScaledown (delete PVs on scaledown but not if pod is deleted/restarted for another reason such as being unhealthy)
  • RemoveOnStatefulSetDeletion (as currently described)

This would fit the use cases of customer feedback that we've heard.
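
As a concrete sketch of the opt-in, a StatefulSet using the middle option might look like this (the field name and value spellings were still in flux in this thread, so treat both as assumptions):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  serviceName: web
  volumeReclaimPolicy: RemoveOnScaledown   # proposed field; omitting it (Retain) keeps today's behavior
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi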

@kk-src (Author) commented Aug 14, 2020:

Thanks @pohly @mattcary .

We do need a way to ensure that the PVCs are deleted on pod deletion during scale down. RemoveOnScaledown would fit in.

@krmayankk commented Aug 31, 2020:

OK, so we are saying this KEP is mainly for the scale-down scenario and not any pod-deletion scenario? It would help to make this clear here and to completely remove RemoveOnPodDeletion.

Contributor:

I think that makes sense. The only options will be Retain, RemoveOnScaledown, RemoveOnStatefulSetDeletion.


The following changes are required:

1. Add `VolumeReclaimPolicy` entry into StatefulSet spec in order to make this feature an opt-in.


The word VolumeReclaimPolicy is confusing; it seems to conflict in meaning with reclaimPolicy in PV. Should we make it clearer, e.g. pvcReclaimPolicy?

Contributor:

I think that's a good point.

Member:

+1

is deleted as part of a scale down. Since the subsequent scale up needs to create fresh PVCs, it will
be slower than scale ups relying on existing PVCs (from earlier scale ups).

User would set the VolumeReclaimPolicy as 'RemoveOnPodDeletion' ensuring PVCs are deleted when corresponding


Can you confirm: you don't need RemoveOnPodDeletion, just RemoveOnScaleDown?


If someone deletes the pod manually (`kubectl delete pod`), do you expect the PVC to be deleted?

Contributor:

I think that makes sense. If a pod is manually deleted, the SS will recreate it so there won't be a problem of a PVC to clean up. Only on scaledown (or SS deletion) are there PVCs that currently need to be manually removed.

The two options RemoveOnScaleDown and RemoveOnStatefulSetDeletion allow the operator to distinguish between the case where scaledowns and scaleups will be frequent (and PVCs should only be deleted at the end, when the whole SS is deleted) and the case when they are infrequent and it's more important to conserve PV resources than have rapid scaleup.

At least that's where I've been coming from with this.

Member:

Would someone want both behaviors?

Contributor:

Good point. I think I implicitly assumed deletion of the stateful set is equivalent to scaling down to zero, so that RemoveOnScaleDown contains RemoveOnStatefulSetDeletion.

Probably I'm being imprecise and we should call that out explicitly.

Author:

Looks like manual deletion needs more consideration. When a Pod is deleted, the PVC still has owner references to the deleted Pod. The StatefulSet controller will bring up the Pod again, but the reconcile logic would exit since the Pod owner is different. Eventually the GC would delete the older PVC and the next reconcile will construct new PVC and add reference to the newly created Pod. This might not be the desired behavior as we are currently talking about only deletions during scale down not during the Pod deletion.

Contributor:

Good point.

I think the behavior we want is that if the SS controller is deleting a pod it knows if it's scaling down and can delete the PVC appropriately.

Otherwise the SS controller will bring the pod back up (eg if it fails) and in that case we want it to reuse the PVC even on RemoveOnScaleDown.

Will it be enough for the SS controller to only put the ownerref when it is deleting a pod for scaledown? Precisely, when the SS controller wants to remove a pod for scaledown, it first puts an ownerref on the PVC and then deletes the pod (rather than putting on the ownerref at PVC creation as we currently have it).

I don't know if this means an extra reconcile loop, where first we confirm the ownerref has been placed and only then we delete the pod?
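
Concretely, that would amount to a patch like the following on the condemned Pod's associated PVC, applied just before the Pod is deleted (a sketch under the assumptions in this comment; the names and UID placeholder are hypothetical):

# Merge patch for PVC www-web-2 when scaling StatefulSet web from 3 to 2 replicas:
metadata:
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: web-2
    uid: "<pod-uid>"            # must match the live Pod so a recreated Pod never becomes the owner
    blockOwnerDeletion: false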

@mittachaitu commented Sep 26, 2020:

I learned a few things from discussions and thanks for this KEP.

Will it be enough for the SS controller to only put the ownerref when it is deleting a pod for scaledown? Precisely, when the SS controller wants to remove a pod for scaledown, it first puts an ownerref on the PVC and then deletes the pod (rather than putting on the ownerref at PVC creation as we currently have it).

Good one, but when we are patching with the ownerref I think we have to set blockOwnerDeletion to false. If we follow this approach there will be cases where the PVC remains in the cluster even after the sts is scaled down.
Ex: the user deletes an sts pod and, before the sts controller reconciles, also performs a scale down; in this case the PVC will remain (since the ownerref is not yet set). Will this case be marked as a known limitation? Correct me if my assumption is wrong in the above case.

Contributor:

@mittachaitu Good point, I'd have to look at the statefulset controller but you're probably right.

Contributor:

@kk-src and I looked through the current SS controller code and came to the conclusion that if a manual deletion races with the scaledown, the controller will not do the scaledown and instead wait for all the pods to come back. So I don't think we need to change the proposed logic.

If the race happens in the other direction (user deleted after we start scaling down), it's okay as we're only setting the owner refs for PVCs with pods that are going to be deleted anyway.

At least that's how I reasoned it through, LMK if I missed something.

@kk-src kk-src marked this pull request as ready for review September 10, 2020 15:57
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2020
@kikisdeliveryservice (Member) commented:
Hi @kk-src

Enhancements Lead here. Will there be any work for this in 1.20?

Also, as a reminder to be included in a release:

- The KEP must be merged in an implementable state
- The KEP must have test plans
- The KEP must have graduation criteria

Thanks
Kirsten

@kk-src (Author) commented Sep 24, 2020:

@kikisdeliveryservice - Yes, the current plan is to make it by enhancement freeze and work through the implementation in the 1.20 time frame.



This might be a good place to talk about core concepts and how they relate.
-->

### Risks and Mitigations
Member:

A discussion of durability risks with the changed behavior (even though it is opt-in) should be added here.

Contributor:

@kk-src Maybe something like the following?

If the PVCReclaimPolicy is changed from its default of "Retain", then PVs will be deleted on SS scaledown or deletion. A user will lose data on those PVs if proactive effort is not taken to replicate that data. However, PVs associated with the StatefulSet will be more durable than ephemeral volumes would be, as they are only deleted on scaledown or StatefulSet deletion, and not on other pod lifecycle events like being rescheduled to a new node, even with the new retain policies.

User would set the VolumeReclaimPolicy as 'RemoveOnScaledown' ensuring PVCs are deleted when corresponding
Pods are deleted. New Pods created during scale up followed by a scaledown will wait for freshly created PVCs.

### Notes/Constraints/Caveats (optional)
Member:

Perhaps add information on applicability of the feature across configurations. Does the feature work with Local PVCs for instance?

Contributor:

@kk-src Maybe something like the following?

This feature applies to PVs which are dynamically provisioned from the volumeClaimTemplate of a StatefulSet. Any PVC and PV provisioned from this mechanism will function with this feature.

Author:

Thank you @kow3ns
@mattcary - Thank you, added to the text. Will come out in the next commit.

Member:

Does it have to be dynamically provisioned from the StatefulSet controller? Since this behavior is opt-in, the user should understand what they're getting into?

For example, today, you can manually create a PV object, and set Delete reclaim policy. The provisioner will delete the volume even though it didn't provision it.

Contributor:

But what if pods have multiple PVCs? Should all of them be deleted? I don't think that would make sense, there might be a shared volume.

There's a 1:1 PVC:PV relationship, so that doesn't come up in that case.

I think it's much simpler to define reasonable behavior if we scope to the volumeClaimTemplates only.

Member:

That's a fair point. Scoping the behavior seems reasonable to me, I would just clarify that this is about PVC objects being created by the StatefulSet controller, and not about PV objects being dynamically provisioned. You can have StatefulSet create PVCs that can be bound to pre-created PVs and those should still be in scope for the Statefulset reclaim policy.

Contributor:

+1 In the design section we are talking about the PVCs created from the VolumeClaimTemplate in the StatefulSet. That's exactly the set we're interested in.

eg, if a user created a PVC that matched the naming convention of the volumeClaimTemplate, a pod created for the statefulset would attach to it, so it seems unsurprising that the new reclaim policy would cause that PVC to be deleted.

Member:

Oh I think I misunderstood then. I thought you wanted PVCs manually created by the user with the same naming convention to NOT be part of the reclaim policy. But it sounds like we do want it part of the reclaim policy.

And PVCs referenced outside of volumeClaimTemplates, i.e. in volumes, are out of scope.

Contributor:

Yes, sorry, there was a discussion that only partially made it into the comments. I think that the version down in "design details" is correct: if the statefulset has a volumeClaimTemplate, the static naming scheme defines a PVC for each pod which is called here the associated PVC of a pod. These associated PVCs are the ones that are deleted or not according to the policy.
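
Concretely, the static naming scheme pairs each ordinal's Pod with a PVC named <claim-template-name>-<statefulset-name>-<ordinal>; for a hypothetical StatefulSet web with a claim template www and three replicas, the associated PVCs are:

Pod      Associated PVC
web-0    www-web-0
web-1    www-web-1
web-2    www-web-2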

A new field named `VolumeReclaimPolicy` of the type `StatefulSetVolumeReclaimPolicy`
will be added to the StatefulSet. This field represents the user indication on whether the
associated PVCs can be deleted or not. If this field is not set, the default behaviour is
to leave the PVCs as is during the StatefulSet deletion.
Contributor:

Maybe just say that the default policy is Retain.


If `VolumeReclaimPolicy` is set to `RemoveOnScaledown`, Pod is set as the owner of the PVCs created
from the `VolumeClaimTemplates` just before the scale down is performed by the statefulset controller.
When a Pod is deleted, the PVC owned by the Pod is also deleted.
Contributor:

Add here that if the SS is deleted the behavior is as in RemoveOnStatefulSetDeletion.


### Volume reclaim policy for the StatefulSet created PVCs
A new field named `VolumeReclaimPolicy` of the type `StatefulSetVolumeReclaimPolicy`
will be added to the StatefulSet. This field represents the user indication on whether the
Contributor:

Can this field be patched on an already existing stateful set? I think it can be, as long as we don't make any guarantees about what happens if a scaleup/down/delete is in progress during the patching.

Author:

Added a note to ensure that in alpha we won't allow updating this field after creation, to avoid any confusion. Based on user feedback we can change this in beta.

@kikisdeliveryservice (Member) left a comment:

A few comments on the required fields.

@kk-src kk-src force-pushed the autoremove-pvc-statefulset branch 2 times, most recently from 925442d to 6e39193 Compare October 7, 2020 00:20
@kikisdeliveryservice (Member) left a comment:

Thanks for the updates @kk-src

This looks good from an enhancements reqs perspective.

@msau42 (Member) left a comment:

For beta, I think we'll have to think more carefully on the disablement/rollback scenario.


* **How can this feature be enabled / disabled in a live cluster?**
- [x] Other
- The feature is enabled only if a user explicitly fills up the `PersistentVolumeClaimReclaimPolicy`
Member:

The new field needs to be protected by a Kubernetes alpha feature gate. It is the feature gate that controls if the feature is enabled/disabled in a Kubernetes cluster.
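
For example, the controller would only act on the new field when the relevant components run with the gate enabled; a sketch of what that could look like on the kube-controller-manager (the gate name StatefulSetAutoDeletePVC is an assumption here, since this revision of the KEP does not fix one):

apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: k8s.gcr.io/kube-controller-manager:v1.20.0
    command:
    - kube-controller-manager
    - --feature-gates=StatefulSetAutoDeletePVC=true   # hypothetical alpha gate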

* **What happens if we reenable the feature if it was previously rolled back?**
For Statefulset PVCs which were previously set to a non-Retain `PersistentVolumeClaimReclaimPolicy`
the Statefulset is going to come back with VolumeClaimTemplates set to default, but
the experience would be that of non-retain policy - either get deleted on scale down or
Member:

Should this be reconciling so that we remove any ownerrefs if they change the policy back? Although it will still be racy

Contributor:

We're only adding the ownerrefs while a scaledown is in progress, so I think it will have to be reconciling given idempotency? Anytime we see desired replicas < actual replicas we'll set ownerrefs etc, which should regenerate any that have been manually removed.

Member:

There's also the StatefulSetDelete policy where we add the ownerref on the statefulset object at creation time right?

be slower than scale ups relying on existing PVCs (from earlier scale ups).

User would set the `PersistentVolumeClaimReclaimPolicy` as 'RemoveOnScaledown' ensuring PVCs are deleted when corresponding
Pods are deleted. New Pods created during sclae up followed by a scaledown will wait for freshly created PVCs.
Contributor:

typo: sclae

will be added to the StatefulSet. This field represents the user indication on whether the
associated PVCs can be deleted or not. The default policy would be `Retain`.

If `PersistentVolumeClaimReclaimPolicy` is set to `RemoveOnScaledown`, Pod is set as the owner of the PVCs created
Contributor:

it may be more clear here to say explicitly that an ownerref is created rather than just "pod is set as the owner"

Contributor:

I think this could be made more clear by rearranging like this:

A PersistentVolumeClaimReclaimPolicy of RemoveOnScaledown affects both changing the number of replicas in the StatefulSet as well as StatefulSet deletion. When the controller finds that the number of replicas is being reduced, it identifies the pods that are being deleted as well as any PVCs created from the VolumeClaimTemplate of those pods. The OwnerRef of those PVCs is set to the associated pod. When a Pod is deleted, this causes the associated PVC to be deleted as well. When the RemoveOnScaledown policy is set etc (keep what you have written).


- Add unit, functional, upgrade and downgrade tests to automated k8s test.


<!--
Contributor:

Remove the comment block.

@@ -0,0 +1,590 @@
<!--
Contributor:

Remove this comment block, and the others.

There is a new field getting added to the StatefulSet. The upgrade will not
change the previously expected behaviour of existing Statefulset.

If the statefulset had been set with the RemoveOnStatefulSetDeletion
Contributor:

Maybe say

On a downgrade, the PersistentVolumeClaimReclaimPolicy field will be removed from any StatefulSets. If a scaledown or delete is in process when the downgrade happens, any existing OwnerRefs on PVCs will not be removed so that in most cases when the scaledown completes, unused PVCs will be deleted. However, there may be edge cases causing only some of the unused PVCs to be deleted. As unused PVCs remaining after a scaledown is the expected behavior of the downgraded clusters no further effort will be made to remove them.

Hence no user-visible behaviour change by default.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?**
Yes, but with side effects for users who already started using the feature by means of
Contributor:

I think it will be sufficient to say yes here, and refer to the downgrade section above on what happens on rollback.

I think that disabling the feature would also mean removing the new policy field?

I'm not sure what you mean by annotating the PVCs here.

feature flag was disabled, the PVCs could get deleted.

* **What happens if we reenable the feature if it was previously rolled back?**
The reconcile loop which removes references on disablement will not come into action. Since the
Contributor:

This should have no effect, as the policy fields were removed during the rollback, so when reenabled all StatefulSets in the cluster will be using the retain policy.

Member:

The way feature disabling/rollback works is that the field remains stored in etcd, however controllers will not process the field (because it's protected by a feature gate). Then if you re-enable the feature again, then you can start processing the field again.

Contributor:

Thanks.

I think that means that the RemoveOnDeletion needs to be reconciled, that is make sure associated PVCs have ownerrefs to the statefulset.

I don't see this as too much of a problem if things aren't exactly consistent. For example, one might set RemoveOnScaledown on a 10 replica statefulset, disable the feature, scale down to 6 pods, then re-enable the feature; PVCs 7-10 are going to not have been deleted and we can't be expected to delete them. So if the disabling or re-enabling happens during a scaledown and some PVCs are missed or deleted, the behavior is almost the same as before when we were okay with the orphaned PVCs 7-10.

The main thing I think is to make sure "future" scaledown/deletions work correctly, and as long as we get the statefulset ownerref for RemoveOnDeletion that will be covered.

should happen only after the Pod gets deleted. Since the Pod ownership has `blockOwnerDeletion` set to
`true` pods will get deleted before the StatefulSet is deleted. The `blockOwnerDeletion` for PVCs will
be set to `false` which ensures that PVC deletion happens only after the StatefulSet is deleted. This
chain of ownership ensures that Pod deletion occurs before the PVCs are deleted.
Member:

What would happen if I delete a RemoveOnStatefulSetDeletion StatefulSet with orphan policy? Will the PVCs be gone while the Pods stay?

Contributor:

By the orphan policy do you mean kubectl delete --cascade=false?

In that case I think nothing should be deleted other than the statefulset resource, the semantics for cascade seem clear. That would mean removing any ownership IIUC.

At any rate, even if the PVCs were deleted, the protection controller would prevent them from being finalized until the pods were deleted. But it might be unexpected to have the PVCs disappear when the pods are eventually deleted.
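
For context, the orphan policy discussed here is Orphan propagation on delete (kubectl's --cascade=false at the time); expressed as an API delete-options body it would be roughly this sketch:

apiVersion: v1
kind: DeleteOptions
propagationPolicy: Orphan   # dependents (Pods and, under this KEP, PVCs) are left in place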

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 9, 2020
be slower than scale ups relying on existing PVCs (from earlier scale ups).

User would set the `PersistentVolumeClaimDeletePolicy` as 'RemoveOnScaledown' ensuring PVCs are deleted when corresponding
Pods are deleted. New Pods created during scale up followed by a scaledown will wait for freshly created PVCs.


Is there a real-world example of such an application? If the PVC should be deleted when the pod is deleted, how is this different from a pod using an emptyDir? I would vote for some more concrete use cases here.

Contributor:

The volume is retained if the pod is rescheduled, eg if the node goes down or is upgraded.

+1 to adding that explicitly so it's clear.

Member:

Or statefulset rolling update

Author:

Will update the pod rescheduling note shortly.

@mattcary (Contributor) left a comment:

@kk-src

kk-src is swamped and has agreed to let me clone this PR and try to get it in for the enhancements freeze. I'll do that shortly and we can continue the discussion there. I'll apply all the suggestions in comments on this doc, or continue the conversation, on the new PR.


@mattcary (Contributor) commented Feb 6, 2021:

I have created #2440. Let's move the party over there. Thanks!

pebrc added a commit to elastic/cloud-on-k8s that referenced this pull request Mar 2, 2021
…4050)

Approach inspired by kubernetes/enhancements#1915

Adds a new volumeClaimDeletePolicy to the Elasticsearch Spec at the cluster level (not per NodeSet)

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es
spec:
  version: 7.10.1
  volumeClaimDeletePolicy: DeleteOnScaledownAndClusterDeletion
  nodeSets:
  - name: default
    count: 2

Possible values are DeleteOnScaledownAndClusterDeletion (default), DeleteOnScaledownOnly.

DeleteOnScaledownAndClusterDeletion relies on an owner reference pointing to the Elasticsearch resource to garbage collect PVCs once the Elasticsearch cluster has been deleted (existing behaviour).

It also runs additional garbage collection to remove PVCs on each reconciliation that are no longer in use because either the whole node set has been removed or individual nodes have been scaled down (existing behaviour).

DeleteOnScaledownOnly means the PVCs are kept around after the cluster has been deleted. This is implemented by removing the owner reference. Removal of PVCs on scale down happens as before.

Switching from one strategy to the other is allowed and is implemented by avoiding the StatefulSet templating mechanism. This is mainly because PVC templates in StatefulSets are considered immutable, and it would require StatefulSets to be recreated in order to change the PVC ownership. Instead the operator edits the PVCs after they have been created by the StatefulSet controller.
@mattcary (Contributor) commented Mar 9, 2021:

/close

@k8s-ci-robot (Contributor):

@mattcary: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels: cncf-cla: yes · kind/kep · needs-ok-to-test · sig/architecture · sig/storage · size/L