Add support for VolumeClaimDeletePolicies for Elasticsearch clusters #4050
Conversation
@sebgl I think this is ready for review, with one exception: I am thinking about adding an e2e test following roughly this playbook:
There's one thing I don't understand well. One important use case for this feature, as I understood it, is to be able to do this combination:
- delete PVCs on scale down AND do not delete PVCs on cluster deletion
As in: if I scale my cluster down I don't care about old PVCs of old nodes (likely no data left in there anyway, since we migrate it away in the downscale process), but if I remove the Elasticsearch resource entirely I want to be able to restore it as it was later.
Can I achieve this with one of the 3 options?
```go
}

updated := func(pvc corev1.PersistentVolumeClaim) corev1.PersistentVolumeClaim {
	pvc.ResourceVersion = "1000" // fake client starts at 999
```
We have comparison.Equal(a, b) (pkg/controller/common/comparison) to compare while ignoring ResourceVersion. The content of that file should maybe be moved to pkg/utils/compare with more explicit function names 🤔
I forgot that we had that. It works on single RuntimeObjects though, so we need a version that takes slices.
should we add it there then, so we don't keep that special resource version logic isolated in this test only?
I think its usefulness would be somewhat limited by the fact that most (all?) k8s API objects implement runtime.Object with pointer receivers, so callers will typically have to take pointers of the elements of the slice. But I removed the custom ResourceVersion code and reused comparison.AssertEqual.
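For illustration, a slice-aware variant could look roughly like the sketch below; pvcSlicesEqual is a made-up name (not part of pkg/controller/common/comparison), it just blanks ResourceVersion before a deep comparison:

```go
package comparison

import (
	"reflect"

	corev1 "k8s.io/api/core/v1"
)

// pvcSlicesEqual is a hypothetical slice-aware counterpart to comparison.Equal:
// it compares two lists of PVCs element by element while ignoring ResourceVersion.
func pvcSlicesEqual(a, b []corev1.PersistentVolumeClaim) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		x, y := a[i].DeepCopy(), b[i].DeepCopy()
		x.ResourceVersion = "" // ignore ResourceVersion, as comparison.Equal does
		y.ResourceVersion = ""
		if !reflect.DeepEqual(x, y) {
			return false
		}
	}
	return true
}
```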
No, that's not possible with this implementation; the closest you get is using one of the existing options. I tried to keep the behaviour aligned with the k8s KEP, but if we think the use case you mentioned is strong enough we could add another option.
I left a comment on kubernetes/enhancements#2440 to advocate for the "preserve on deletion, remove on downscale" use case.
I think our naming is different enough from the KEP to allow us to deviate in behaviour as well. It seems that for Elasticsearch there is indeed little benefit in keeping the volume around on scale down. If anything, this might cause trouble when rejoining if > 500 index deletions have happened while the node was offline (thanks to @DaveCTurner for pointing this out). The indices will be considered dangling (index metadata on the old node is enough for that, so master-eligible nodes are affected by this). We could drop the option that keeps volumes around on scale down, though this will make it a little bit harder to integrate with the KEP feature once it becomes available.
```diff
@@ -460,6 +460,7 @@ e2e-local: LOCAL_E2E_CTX := /tmp/e2e-local.json
 e2e-local:
 	@go run test/e2e/cmd/main.go run \
 		--test-run-name=e2e \
+		--operator-image=$(OPERATOR_IMAGE) \
```
I think the API is a bit confusing. Using RetainOnClusterDeletion instead of RemoveOnScaleDown somewhat suggests we don't remove on scale down, but IIUC we do? It's strange to offer RemoveOnScaleDown while the other options are actually a superset of it (rather than something different).
I'm wondering if we should move further away from the StatefulSet KEP, and propose something a bit simpler that fits our needs:
```yaml
spec:
  # retainVolumeClaims specifies whether ECK should retain PersistentVolumeClaims when the
  # Elasticsearch resource is removed, allowing to recreate a cluster from existing volumes.
  # If false, ECK deletes PersistentVolumeClaims upon Elasticsearch resource deletion.
  # Defaults to false.
  retainVolumeClaimsOnDeletion: false
```
I think the only argument against it is that a boolean flag is closed to extension, whereas the enum-style attribute is open to extension by simply adding values. Now we could say we simply do not see another use case that would require us to add another policy encoding different behaviour, ever.
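For concreteness, the boolean variant would roughly amount to a spec field like the sketch below (using the proposed name from the comment above; this is not the API that was ultimately chosen):

```go
package v1

// Sketch of the boolean alternative discussed above; RetainVolumeClaimsOnDeletion
// is the proposed field name, not an existing ECK field.
type ElasticsearchSpec struct {
	// RetainVolumeClaimsOnDeletion specifies whether ECK should retain
	// PersistentVolumeClaims when the Elasticsearch resource is removed,
	// allowing a cluster to be recreated from existing volumes. Defaults to false.
	RetainVolumeClaimsOnDeletion bool `json:"retainVolumeClaimsOnDeletion,omitempty"`
}
```

The enum-style VolumeClaimDeletePolicy shown further down keeps the door open for additional policies by simply adding values.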
Seb's proposal to have "PVCs automatically deleted on StatefulSet scaledown, but not on StatefulSet deletion" led to the update of the KEP (which has already received 2 LGTMs).
I find these names rather clear and they cover all the use cases.
```go
type VolumeClaimDeletePolicy string

const (
	// DeleteOnScaledownAndClusterDeletionPolicy remove PersistentVolumeClaims when the corresponding Elasticsearch node is removed.
```
Suggested change:

```diff
-// DeleteOnScaledownAndClusterDeletionPolicy remove PersistentVolumeClaims when the corresponding Elasticsearch node is removed.
+// DeleteOnScaledownAndClusterDeletionPolicy removes PersistentVolumeClaims when the corresponding Elasticsearch node is removed.
```
I did some manual testing (including backward compatibility) and it all seems to work as expected 👍
I left a few nits; also, let's not forget about user-facing docs on the website. Other than that, LGTM :)
should we add it there then, so we don't keep that special resource version logic isolated in this test only?
I raised #4287 as a reminder.
First draft of a possible approach inspired by kubernetes/enhancements#1915.

Adds a new volumeClaimDeletePolicy to the Elasticsearch spec at the cluster level (not per NodeSet). Possible values are DeleteOnScaledownAndClusterDeletion (default) and DeleteOnScaledownOnly.

DeleteOnScaledownAndClusterDeletion relies on an owner reference pointing to the Elasticsearch resource to garbage collect PVCs once the Elasticsearch cluster has been deleted (existing behaviour). It also runs additional garbage collection on each reconciliation to remove PVCs that are no longer in use, either because the whole node set has been removed or because individual nodes have been scaled down (existing behaviour).

DeleteOnScaledownOnly means the PVCs are kept around after the cluster has been deleted. This is implemented by removing the owner reference; removal of PVCs on scale down happens as before.

Switching from one strategy to the other is allowed and is implemented by avoiding the StatefulSet templating mechanism. This is mainly because the PVC templates in StatefulSets are considered immutable, and changing PVC ownership through them would require the StatefulSets to be recreated. Instead, the operator edits the PVCs after they have been created by the StatefulSet controller.
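As a rough illustration of the owner-reference mechanism described above, here is a minimal sketch; the package, function, and constant names are made up for this example and do not reflect ECK's actual code:

```go
package pvcpolicy

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Policy values as described in this PR (illustrative constants).
const (
	DeleteOnScaledownAndClusterDeletion = "DeleteOnScaledownAndClusterDeletion"
	DeleteOnScaledownOnly               = "DeleteOnScaledownOnly"
)

// applyVolumeClaimDeletePolicy adds or drops the owner reference tying a PVC to
// its Elasticsearch resource. With the reference present, Kubernetes garbage
// collection deletes the PVC together with the Elasticsearch resource; without
// it, the PVC survives cluster deletion (DeleteOnScaledownOnly).
func applyVolumeClaimDeletePolicy(pvc *corev1.PersistentVolumeClaim, esOwnerRef metav1.OwnerReference, policy string) {
	// Remove any existing owner reference pointing to the Elasticsearch resource.
	kept := pvc.OwnerReferences[:0]
	for _, ref := range pvc.OwnerReferences {
		if ref.UID != esOwnerRef.UID {
			kept = append(kept, ref)
		}
	}
	pvc.OwnerReferences = kept

	// Re-add it only if the PVC should be garbage collected with the cluster.
	if policy == DeleteOnScaledownAndClusterDeletion {
		pvc.OwnerReferences = append(pvc.OwnerReferences, esOwnerRef)
	}
}
```

Because the volumeClaimTemplates of a StatefulSet are immutable, something like this has to run against the live PVC objects after the StatefulSet controller has created them, which matches the approach described above.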
Why not a per NodeSet setting? I initially started with a setting per NodeSet, but it becomes tricky when users remove a whole node set as we then have no trace of the chosen policy anymore. We would need to add the policy choice to the PVC as an annotation or persist the policy choice in some other place so that we can prevent garbage collection if so desired by the user.
Fixes #2328