[OCPCLOUD-1244] Add proposal for etcd protection mechanism during control plane scaling #943
Conversation
hasbro17
left a comment
Looks pretty good and very clearly written 👍
Still going through the additional details section but I just did a first pass and had a small note for @hexfusion
| ## Summary | ||
| To enable automation of Control Plane scaling activities, in paritcular vertical scaling of the Control Plane Machines, we must implement a mechanism that protects Etcd quorum and ensures the smoothest possible transition as new Etcd members are added and old members removed from the Etcd cluster. |
Nit
| To enable automation of Control Plane scaling activities, in paritcular vertical scaling of the Control Plane Machines, we must implement a mechanism that protects Etcd quorum and ensures the smoothest possible transition as new Etcd members are added and old members removed from the Etcd cluster. | |
| To enable automation of Control Plane scaling activities, in particular vertical scaling of the Control Plane Machines, we must implement a mechanism that protects Etcd quorum and ensures the smoothest possible transition as new Etcd members are added and old members removed from the Etcd cluster. |
| #### Requirements for Etcd safety | ||
| * The number of voting members should equal the desired number of Control Plane Machines (and this should be an odd number) | ||
| * We may only deviate from this to add a replacement member |
Nit: As noted before we will always deviate from this during scale up as a learner gets promoted before we remove the older voting member.
| * We may only deviate from this to add a replacement member | |
| * We will only deviate from this to add a replacement member |
| It will only remove the hook once it has identifed that the Machine resource is being deleted and a replacement member has been created. The removal of this hook will allow the Machine API to drain and terminate the Machine as it would normally do. | ||
| The Etcd Operator will ensure, based on the desired Control Plane replica count in <@Sam where do we get this from today?>, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member. |
| The Etcd Operator will ensure, based on the desired Control Plane replica count in <@Sam where do we get this from today?>, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member. | |
| The Etcd Operator will ensure, based on the desired Control Plane replica count in the InstallConfig, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member. |
I believe the etcd-operator gets the desired control plane replica count from the install config's controlPlane.replicas field, read from the kube-system/cluster-config-v1 ConfigMap.
#920 (comment)
| 1. Etcd Operator identifies the voting members of the Etcd cluster and maps these to the Control Plane Machines that host them | ||
| 1. Etcd Operator adds a deletion hook to each of the Machines hosting a voting member of the Etcd cluster | ||
| ##### Operational flow during a resize operation |
Probably a nit:
| ##### Operational flow during a resize operation | |
| ##### Operational flow during a node replace operation |
IIUC this flow describes a node replacement operation (via scale up then down) rather than a resize operation as the final cluster size is unchanged.
Although I guess this could be a resize if you only add or remove a machine.
The context behind this was a resize of the underlying machine, i.e., vertical scaling of the host. But I can see how that can be confusing from an etcd perspective, so I'll adjust the wording.
| 1. The Etcd Operator promotes the new Etcd member to a voting member | ||
| 1. The Etcd Operator demotes the old Etcd member, removing it from the cluster |
@hexfusion I brought this up before in an earlier version of this proposal but just wanted to touch on the order here again:
Would we ever be vertically scaling/replacing machines if we already have an unhealthy member?
I'm guessing yes, since that would be a prime usecase for replacing an unhealthy machine/member.
In that case, would the order of promoting a new member before removing the unhealthy one, open us up to quorum loss as described in https://etcd.io/docs/v3.5/faq/#should-i-add-a-member-before-removing-an-unhealthy-member
i.e. the learner going through the following states: promotable -> promoted/voting -> immediately unhealthy, which now leaves us with 2/4 unhealthy voting members and quorum loss.
Apologies, I must have missed your previous comments when pushing this.
Would we ever be vertically scaling/replacing machines if we already have an unhealthy member?
I suspect yes as well
In that case, would the order of promoting a new member before removing the unhealthy one, open us up to quorum loss as described in https://etcd.io/docs/v3.5/faq/#should-i-add-a-member-before-removing-an-unhealthy-member
Reading this, I think you're correct, we should reverse the ordering here. I would appreciate Sam's ack on this as well, so I won't swap the order just yet so that we don't lose the conversation.
Likewise, I forgot to add my update to this, but Sam confirmed offline that in the case of an unhealthy/degraded etcd cluster (e.g. 1/3 unhealthy members) the etcd-operator will not vertically scale or replace the unhealthy member automatically with the learner.
So we can consider scaling an unhealthy cluster as out of scope for this proposal, and defer to the cluster admin to get it back to a healthy state before expecting the etcd-operator to scale up with new machines.
And since the recommendation of Remove-then-Add/Promote a member applies to replacing unhealthy members, I would think we can probably keep the ordering of Add/Promote-then-Remove here given that the learner is all caught up but I would defer to @hexfusion on this.
So perhaps noting down the scaling of degraded clusters as out of scope might be good here.
Side note for @hexfusion: I know currently the clustermember controller will just error out and not scale if it sees a degraded cluster, but in your POC it seems like we do promote before doing the degraded check. I need to test this out to confirm but please correct me if that's not the case.
I've added a non-goal of "recovering unhealthy etcd clusters" and a note in the overview that we won't be doing anything when the cluster is already degraded.
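For reference, the add-then-promote-then-remove flow discussed above maps roughly onto the following etcd-level operations. This is a hedged sketch using etcdctl (etcd 3.4+); the member name, peer URL, and member IDs are placeholders, and the operator would perform the equivalent calls through the etcd client API rather than the CLI.

```sh
# Add the replacement member as a non-voting learner first.
etcdctl member add etcd-new --learner --peer-urls=https://10.0.0.4:2380

# Once the learner has caught up, promote it to a voting member.
etcdctl member promote 8e9e05c52164694d

# Only then remove the old member that is being replaced.
etcdctl member remove 91bc3c398fb3c146
```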
| * Provide the Etcd Operator with the ability to control when a Control Plane Machine is removed from the cluster | ||
| * Allow the Etcd Operator to prevent removal of Etcd members until a replacement member has been promoted to a voting member | ||
| * Allow the Etcd Operator to remove an Etcd member from the Etcd cluster before the Machine is terminated to prevent a degraded Etcd cluster | ||
| * Allow an escape hatch from the protection when surging the capacity with new Control Plane Machines is unavailable (for example in metal environments with limited capacit) |
nit
| * Allow an escape hatch from the protection when surging the capacity with new Control Plane Machines is unavailable (for example in metal environments with limited capacit) | |
| * Allow an escape hatch from the protection when surging the capacity with new Control Plane Machines is unavailable (for example in metal environments with limited capacity) |
|
I am not sure why I cannot respond to comments directly, so I will do that inline here.
In general, learner testing is still fairly light, so we can reconsider if more aggressive measures are required.
Correct, the assumption would be that the cluster is healthy. In the case of a failed member it is always required to scale down before scaling up. In this case we do scale up to 4, which increases consensus to 3 members briefly. The details around that are highlighted in the etcd docs. [1] https://etcd.io/docs/v3.5/faq/#what-is-failure-tolerance |
hexfusion
left a comment
A few notes; I hope to take a deeper dive this week.
| ### Non-Goals | ||
| * Automation of scaling operations on Machines | ||
| * Horizontal scaling of the Etcd cluster and Control Plane |
+1
| pre-drain.delete.hook.machine.cluster-api.x-k8s.io/etcd-quorum-member: openshift-etcd-operator | ||
| ``` | ||
| The Etcd Operator will apply the hook to a Machine resource once it identifies that the Machine hosts an Etcd quorum member. |
why not as soon as possible? I forget the reasoning here.
IIRC, we decided this would be the case so that the deletion protection only comes in once we know that machine needs protection. Though thinking about it now, I think we should add the hook before we promote for the following reasons:
- If we promote before adding the hook, there's a small window before the protection kicks in
- There should be no harm in having the hook in place while the member is still a learner: when the etcd operator notices the machine is marked for deletion, it should remove the member from the etcd cluster (voting or learner) and then remove the hook. I think that should still be safe
I'll amend this and add context
| It will only remove the hook once it has identifed that the Machine resource is being deleted and a replacement member has been created. The removal of this hook will allow the Machine API to drain and terminate the Machine as it would normally do. | ||
| The Etcd Operator will ensure, based on the desired Control Plane replica count in <@Sam where do we get this from today?>, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member. |
Today we read from the embedded install-config, which is located in the ConfigMap kube-system/cluster-config-v1.
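As an illustration, the embedded install-config can be inspected directly; this is a hedged sketch that assumes the ConfigMap's data key is install-config and the standard install-config layout with controlPlane.replicas.

```sh
# Dump the embedded install-config and show the controlPlane stanza,
# which contains the desired control plane replica count.
oc get configmap cluster-config-v1 -n kube-system \
  -o jsonpath='{.data.install-config}' | grep -A4 '^controlPlane:'
```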
|
@hasbro17 @hexfusion please check the last few commits where I've addressed the various pieces of feedback so far. I will squash the commits once we are happy with the state of the document |
hasbro17
left a comment
Still LGTM. I imagine this will need a slight update to use the structured spec format of deletion hooks, but this is good enough for me to proceed with implementing etcd-operator's expected scaling workflow.
b919f96 to 57d8fac
|
I've updated this PR to include the structured format for the Machine Deletion Hooks, which was updated recently. I've also squashed the previous feedback commits. There was no additional feedback left to be addressed, so there are no other changes in the latest push other than the structured example of a preDrain hook. If the etcd team is in agreement, I think we should aim to merge this PR within the next week or so. |
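For context, a hedged sketch of what the structured preDrain hook could look like on a Machine resource. The lifecycleHooks field layout follows the Machine Deletion Hooks design; the hook name and owner values here are illustrative rather than the exact ones the etcd operator registers.

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  name: example-master-0
  namespace: openshift-machine-api
spec:
  lifecycleHooks:
    preDrain:
      # Illustrative name/owner; the real values are chosen by the etcd operator.
      - name: EtcdQuorumOperator
        owner: clusteroperator/etcd
```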
|
/lgtm |
|
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close. /lifecycle stale |
|
We had already agreed upon this design last year and it is now being implemented as described. As Sam is no longer on the etcd team, who is the relevant approver for this enhancement? /remove-lifecycle stale |
|
/remove-lifecycle stale |
|
@mfojtik As the staff engineer looking after the etcd team, would you be able to review this and help get it merged? |
| * The number of voting members should equal the desired number of Control Plane Machines (and this should be an odd | ||
| number) | ||
| * We will only deviate from this to add a replacement member | ||
| * Once the new member is added, we should remove old members as soon as possible to reduce the risk of degrading the |
old member or members?
Yeah this should be singular, good catch
| * This also ensures that we keep the full etcd data on disk on a minimum of the desired number of Control Plane | ||
| Machines at all times | ||
| #### Protecting etcd during scaling operations |
Do we need to protect from an accidental etcd rollout (like somebody changing the etcd config or hitting the upgrade button during this surgery)?
This feels like it could be tricky to co-ordinate, because the operation could be either a single machine being replaced or many.
If the etcd operator blocked rolling out config changes while a Machine was marked as deleted, that could be safe.
We expect that when a new machine is being added, the old one is deleted signalling the etcd member that will eventually be removed. In the case of removing multiple machines, we have designed the mechanism so that we can delete multiple machines at once, and the mechanism will roll out the new etcd members 1 by 1, so this should also be safe.
Not sure if it's strictly necessary though; I'm not sure what effect changing the config can have on the etcd cluster. I assume we restrict the configuration changes that end users can make for the etcd instances? If we are in the learning phase, a config change sounds like it shouldn't be too detrimental?
From my understanding users really can't change the etcd config very much since the etcd-operator controls the env vars that are rolled out to the etcd static pods:
https://github.com/openshift/cluster-etcd-operator/blob/d8c023e4d217aae762b99322923c68101c4b3ffd/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L194-L203
https://github.com/openshift/cluster-etcd-operator/blob/d8c023e4d217aae762b99322923c68101c4b3ffd/pkg/etcdenvvar/etcd_env.go#L71-L84
They can however force a redeployment of all the etcd static-pods and trigger a roll-out to the next revision (even if it's without any changes) by updating the etcd/cluster CR, e.g.:
oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "foo"}}' --type=merge
All that will do (via the revision controller) is essentially restart the member pod on each node serially.
Since that rollout doesn't change the cluster membership I don't think it would impact an ongoing scaling operation.
At worst it might delay the scaling by a little bit if the operator is attempting to promote a learner member whose pod is being recreated with the new revision.
The scaling operation itself would actually cause multiple revision rollouts as it updates the cluster membership when adding/removing members.
@mfojtik So that should be okay unless you had another case in mind.
or hit upgrade button during this surgery
Upgrades should be blocked via the quorum guard during an ongoing scaling operation. That's currently outlined in the proposal:
While a scaling operation is occurring, the etcd quorum guard will prevent the draining of any of the Control Plane
Machines, until the new etcd member has been promoted and the old etcd member removed from the cluster.
This in turn means that updates caused during upgrades (for example changes to MachineConfig) will be blocked while the
scaling operation occurs.
This will delay the upgrade process, but should not block it indefinitely unless an issue occurs.
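As an aside, the effect of the quorum guard can be observed through its PodDisruptionBudget; a hedged example (resource names may vary between releases):

```sh
# While a member is unhealthy or being replaced, ALLOWED DISRUPTIONS drops to 0,
# which is what blocks drains (and therefore MCO-driven reboots) of control plane nodes.
oc get poddisruptionbudget -n openshift-etcd
```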
| The etcd Operator will apply the hook to a Machine resource once it identifies that the Machine hosts an etcd member. | ||
| The hook should be added before the member is promoted to ensure that there is no period where the member is a voting | ||
| member, while the machine is not protected by the deletion mechanism. |
Does the etcd operator today reflect the etcd member status (learner/voter/leader/etc.) in some CR status?
I don't believe it does on the etcd/cluster CR, although that info is available from a member list or endpoint status call to the cluster.
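For completeness, the learner/voting state is visible via a member list call; a hedged example using etcdctl from one of the etcd pods (the usual TLS flags are omitted):

```sh
# The IS LEARNER column in the table output distinguishes learners from voting members.
etcdctl member list -w table
```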
57d8fac to c9129aa
|
Amended the |
| ## Motivation | ||
| As Red Hat expands its managed services offerings, the ability to safely increase and decrease the capacity of an |
The motivation sounds very much like scaling up to 5 masters and more is the goal. It is not (I hope); rather, this enhancement is about safely replacing nodes with bigger machines.
Fair point, I'll clarify to make it more obvious that we mean vertical scaling of the control plane.
| #### Protecting etcd during scaling operations | ||
| To ensure that the safety requirements described above are maintained during scaling operations, | ||
| the etcd Operator will manage the etcd cluster in clusters with a |
| the etcd Operator will manage the etcd cluster in clusters with a | |
| the etcd operator will manage the etcd cluster in clusters with a |
| 1. A Control Plane Machine is marked for deletion and a new Control Plane Machine is created | ||
| a. The order of these operations does not matter | ||
| 1. The etcd Operator notices the new Machine and adjust the etcd quorum guard appropriately |
| 1. The etcd Operator notices the new Machine and adjust the etcd quorum guard appropriately | |
| 1. The etcd Operator notices the new Machine and adjusts the etcd quorum guard appropriately |
| Typically in UPI clusters, Machines and MachineSets are not present. This would represent a non-functional Machine API. | ||
| Typically in IPI clusters, Machines and MachineSets are created and after bootstrap, there are 6 "Running" Machines, 3 |
you mentioned somewhere above a "Running" state. This is missing here.
| To ensure that we do not block users from replacing Control Plane Machines in these scenarios, we must allow a user to | ||
| remove the etcd quorum hook without it being replaced by the etcd Operator. | ||
| When a Machine is marked for deletion, if the hook is removed, the etcd Operator will not replace it. This will allow |
the etcd Operator will not replace it
what does this mean? It will not re-add the hook? So there must be a check in the hook-add controller that it only adds hooks to unmarked machines. This must be added somewhere above, at least as a note.
Is "mark for deletion" the same as setting the metadata.deleteTimestamp? is it possible to mutate a deleted, but not removed object?
what does this mean? It will not re-add the hook? So there must be a check in the hook-add controller that it only adds hooks to unmarked machines.
Yep, that's what we are trying to say here; I will see if there's an appropriate place to add a note above. This is called out in more detail in the machine deletion hooks enhancement though, which is why I didn't go into that level of detail here.
Is "mark for deletion" the same as setting the metadata.deleteTimestamp? is it possible to mutate a deleted, but not removed object?
Yes, I mean someone has called a DELETE operation on the resource so now the metadata.deletionTimestamp is set, not sure if there's a more commonly used verbiage for this than "marked for deletion"?
It is possible to mutate an object that has not been deleted, the API server doesn't prevent this. We can mutate it all the way up until the GC removes it, but that will only happen when the Machine API removes its finaliser. I've tested this manually and the deletion hooks have gone through QE.
sounds good, just clarify here a bit.
|
@sttts I've clarified the areas of wording you highlighted in a separate commit for now; if you're happy with the updates I'll squash it in again |
| If these GitOps systems do not correctly handle server side changes (like additional annotations), | ||
| then they may remove the hooks after the etcd Operator has added them. | ||
| In this case we expect a hot loop where the etcd and GitOps operators fight to add and remove the hook. | ||
| We must test this scenario and ensure that external systems aren't going to interfere with the mechanism. |
the machine types are wrong, compare https://github.com/openshift/api/blob/master/machine/v1beta1/types_machine.go#L208. The list-type must be map. Otherwise, atomic is assumed. This means that GitOps tools will replace the whole list if they just want to change one hook.
This wasn't picked up in the initial API review; this could cause an issue to change so close to the code freeze.
Is there any way for a CRD to have merge by default on a list or does the merging need to be handled by the operator updating them?
Edit: We need to open a BZ to add the list-type to the field
Perhaps we need to implement webhooks to prevent addition/removal of more than one hook at a time; that would then match the behaviour of finalizers and hopefully prevent errors from third party controllers.
For a map, would it need to be something along the lines of map[string]string or is there precedent for doing map[string]struct{...} in kube?
I haven't tested this yet, but I think to resolve this comment we need https://github.com/openshift/api/compare/master...JoelSpeed:set-list-type-hooks?expand=1
this could cause an issue to change so close to the code freeze
Both upstream and downstream are not very strict with considering list-type changes as breaking API changes. We will even do them for types that we have had for 5y as stable v1. So I would not worry about that.
Is there any way for a CRD to have merge by default on a list or does the merging need to be handled by the operator updating them?
list-type :-)
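For reference, the fix being discussed amounts to adding list-type markers to the hook slices so that server-side merges operate per entry (keyed by name) instead of replacing the whole list. A hedged sketch of the markers, using the standard kubebuilder/openapi-gen annotations; the field layout is an approximation of the openshift/api types rather than a verbatim copy:

```go
package v1beta1

// LifecycleHooks allow users to pause operations on the machine at certain
// predefined points within the machine lifecycle.
type LifecycleHooks struct {
	// PreDrain hooks prevent the machine from being drained.
	// +listType=map
	// +listMapKey=name
	// +optional
	PreDrain []LifecycleHook `json:"preDrain,omitempty"`

	// PreTerminate hooks prevent the machine from being terminated.
	// +listType=map
	// +listMapKey=name
	// +optional
	PreTerminate []LifecycleHook `json:"preTerminate,omitempty"`
}

// LifecycleHook represents a single instance of a lifecycle hook.
type LifecycleHook struct {
	// Name is the unique identifier for the hook and acts as the map key.
	Name string `json:"name"`

	// Owner identifies the controller or person that registered the hook.
	Owner string `json:"owner"`
}
```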
| To ensure that the safety requirements described above are maintained during scaling operations, | ||
| the etcd Operator will manage the etcd cluster in clusters with a | ||
| the etcd operator will manage the etcd cluster in clusters with a |
nit: Operator is written with capital O throughout the doc. It just hurts my eyes a little :-)
Will fixup :)
I remember now, this was an artifact of me not knowing how etcd should be written. I had a variation on Etcd Operator which looked ok to me, but then I was told etcd had to be all lower case, which I fixed, but forgot the O 😅
|
@JoelSpeed changes look good. |
|
General comment: https://docs.google.com/document/d/1Ef8AeG6JdE11E-3fjdFA3kwOUrBb1zJzwBAPDbUR3XQ/edit lists #920 as a related enhancement describing how to add learners. I think we need to write this down. That other enhancement is closed and not up-to-date according to @hasbro17. Does it make sense to add the important bits here? |
85d9ea8 to b6c2cdd
IMO, the two topics are different enough that they warrant a separate document. This document is very much focused on using the learner feature and the machine deletion hook feature together to protect quorum; both of those features are sufficiently large that I'd expect to just link to a full document describing how they work rather than adding it in here. @hasbro17 Any thoughts about getting the learner enhancement updated/revived? |
b6c2cdd to 8ed85c3
|
I would concur with @JoelSpeed on keeping the learner scaling enhancement separate from this one focusing on quorum protection. I will try to revive and update the other enhancement to reflect the discoveries/changes made in the current implementation. |
| has been created. The removal of this hook will allow the Machine API to drain and terminate the Machine as it would | ||
| normally do. | ||
| In the case that a Machine is deleted before the member is promoted, the etcd operator is expected to not promote the |
It sounds like we would need some sort of locking to ensure a new member is not promoted when a machine was deleted.
I have just realized this bit refers to deleting a new learner member. Nevertheless, this is a potential race.
also, what happens when a newly promoted machine gets deleted? Do we have to guarantee it won't be replaced on the next run?
I don't think we actually need to do anything special here.
The deletion hook should already be on the Machine hosting the member before it is promoted.
If we promote a member but the Machine was deleted, all this does is create extra toil.
When we next reconcile, it will just look like the Machine was deleted; the promotion doesn't actually affect this, so we just treat it as a voting member being removed.
This is very much an optimisation rather than a necessity and the design of the hooks and the particular points at which they are added and removed ensure we should be safe.
| normally do. | ||
| In the case that a Machine is deleted before the member is promoted, the etcd operator is expected to not promote the | ||
| new member, and remove the deletion hook to allow the Machine to be removed from the cluster. |
and remove the deletion hook to allow the Machine to be removed from the cluster
are you saying the hook should also be removed from the Machine that represents the new member?
yes, you are :)
Yes, so, if a learner member's machine is deleted, and we notice that before we promote it, the idea is that we don't actually promote it; instead we remove it from the etcd cluster. Once it's removed from the etcd cluster, the Machine has no impact on etcd anymore, so it's safe to allow it to go away by removing the pre-drain hook.
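To make that flow concrete, here is a hedged sketch of the per-member decision described above. It is illustrative only (not the actual cluster-etcd-operator code); the struct and function names are invented for the example.

```go
package main

import "fmt"

// memberState captures the inputs to the decision; field names are invented.
type memberState struct {
	isLearner        bool // member has not yet been promoted to a voter
	machineDeleted   bool // the hosting Machine has a deletionTimestamp
	replacementReady bool // a replacement member has been promoted to a voter
}

// nextAction mirrors the flow discussed in this thread.
func nextAction(m memberState) string {
	switch {
	case m.machineDeleted && m.isLearner:
		// Never promote a learner whose Machine is going away; remove the
		// learner from the etcd cluster, then drop the pre-drain hook.
		return "remove learner member, then remove pre-drain hook"
	case m.machineDeleted && m.replacementReady:
		// Safe to let the Machine go once the replacement is a voter.
		return "remove old voting member, then remove pre-drain hook"
	case m.machineDeleted:
		// Voting member with no promoted replacement yet: keep the hook so the
		// Machine API cannot drain or terminate the Machine.
		return "hold pre-drain hook until a replacement is promoted"
	default:
		return "ensure pre-drain hook is present"
	}
}

func main() {
	fmt.Println(nextAction(memberState{isLearner: true, machineDeleted: true}))
}
```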
| this prevents draining of a healthy etcd member until a new member becomes healthy. This operation will prevent other | ||
| components in the cluster (eg. MCO) from disrupting the quorum of etcd during this operation. | ||
| Note: When the cluster is already degraded, the etcd operator is expected to report as degraded for admin intervention. |
Is there a way to determine whether a cluster is in a degraded condition?
We do have health checks that determine when members are unhealthy
https://github.com/openshift/cluster-etcd-operator/blob/14fecbd21d47fe19ab0219f59a10a0bbd4dc67ea/pkg/etcdcli/health.go#L42
And today we abort membership changes when that's the case
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/operator/clustermembercontroller/clustermembercontroller.go#L87-L94
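From the admin side, a hedged way to check whether the operator currently considers the membership healthy (the exact condition names live in the etcd CR status and are not reproduced here):

```sh
# The etcd ClusterOperator reports Degraded when members are unhealthy, and the
# operator aborts membership changes in that state (see the links above).
oc get clusteroperator etcd
# Member-level health is reflected in the etcd CR's status conditions.
oc get etcd cluster -o jsonpath='{.status.conditions}'
```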
| Once a Machine has been marked for deletion, if the hook is removed by some other entity, the etcd operator | ||
| is expected not to re-add the hook. This allows an escape hatch when manual intervention is required. | ||
| The etcd operator will ensure, based on the desired Control Plane replica count in the cluster `InstallConfig` |
Is this the cluster-config-v1 configmap in kube-system namespace?
Yes it is. AFAIK this is the only source of truth for how big etcd should be in the cluster at the moment.
|
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close. /lifecycle stale |
|
/remove-lifecycle stale I spoke with @hasbro17 recently and it seems this enhancement is ready to merge; the project is being implemented as described and the only remaining questions are to do with the implementation details. @sttts @mfojtik @p0lyn0mial Did any of you have further concerns, or are you happy for this to merge now? |
|
/lgtm |
|
Given the absence of further reviews/objections I'm going to approve and merge this proposal. /approve |
|
Given @hasbro17 doesn't have approve rights on this repo, but is the approver for this enhancement, I will add the label on his behalf |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hasbro17, JoelSpeed. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment |
|
@JoelSpeed: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
This proposal describes how the etcd operator will leverage machine deletion hooks to ensure that, when a control plane machine with an active etcd member is being removed from the cluster, the etcd data is protected and the machine is not removed from the cluster until it is no longer required for etcd quorum