Conversation

@JoelSpeed
Contributor

This proposal describes how the etcd operator will leverage machine deletion hooks to ensure that, when a control plane machine with an active etcd member is being removed from the cluster, the etcd data is protected and the machine is not removed until it is no longer required for etcd quorum.

Contributor

@hasbro17 hasbro17 left a comment


Looks pretty good and very clearly written 👍

Still going through the additional details section, but I just did a first pass and had a small note for @hexfusion


## Summary

To enable automation of Control Plane scaling activities, in paritcular vertical scaling of the Control Plane Machines, we must implement a mechanism that protects Etcd quorum and ensures the smoothest possible transition as new Etcd members are added and old members removed from the Etcd cluster.
Contributor

Nit

Suggested change
To enable automation of Control Plane scaling activities, in paritcular vertical scaling of the Control Plane Machines, we must implement a mechanism that protects Etcd quorum and ensures the smoothest possible transition as new Etcd members are added and old members removed from the Etcd cluster.
To enable automation of Control Plane scaling activities, in particular vertical scaling of the Control Plane Machines, we must implement a mechanism that protects Etcd quorum and ensures the smoothest possible transition as new Etcd members are added and old members removed from the Etcd cluster.

#### Requirements for Etcd safety

* The number of voting members should equal the desired number of Control Plane Machines (and this should be an odd number)
* We may only deviate from this to add a replacement member
Contributor

Nit: As noted before, we will always deviate from this during scale-up, as a learner gets promoted before we remove the older voting member.

Suggested change
* We may only deviate from this to add a replacement member
* We will only deviate from this to add a replacement member

It will only remove the hook once it has identifed that the Machine resource is being deleted and a replacement member has been created. The removal of this hook will allow the Machine API to drain and terminate the Machine as it would normally do.
The Etcd Operator will ensure, based on the desired Control Plane replica count in <@Sam where do we get this from today?>, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member.
Contributor

Suggested change
The Etcd Operator will ensure, based on the desired Control Plane replica count in <@Sam where do we get this from today?>, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member.
The Etcd Operator will ensure, based on the desired Control Plane replica count in the InstallConfig, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member.

I believe the etcd-operator gets the desired control plane replica count from the install config's controlPlane.replicas field, read from the kube-system/cluster-config-v1 ConfigMap.
#920 (comment)
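
For reference, the embedded install-config can be inspected directly; a rough sketch (assumes `yq` is available and that the count lives under `controlPlane.replicas`):

```sh
# Dump the install-config embedded in the kube-system/cluster-config-v1 ConfigMap
# and read the desired control plane replica count (field path illustrative).
oc get configmap cluster-config-v1 -n kube-system \
  -o jsonpath='{.data.install-config}' | yq '.controlPlane.replicas'
```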

1. Etcd Operator identifies the voting members of the Etcd cluster and maps these to the Control Plane Machines that host them
1. Etcd Operator adds a deletion hook to each of the Machines hosting a voting member of the Etcd cluster

##### Operational flow during a resize operation
Contributor

Probably a nit:

Suggested change
##### Operational flow during a resize operation
##### Operational flow during a node replace operation

IIUC this flow describes a node replacement operation (via scale up then down) rather than a resize operation, as the final cluster size is unchanged.
Although I guess this could be a resize if you only add or remove a machine.

Contributor Author

The context behind this was a resize of the underlying machine, i.e., vertical scaling of the host. But I can see how that can be confusing from an etcd perspective, so I'll adjust the wording.

Comment on lines 122 to 123
1. The Etcd Operator promotes the new Etcd member to a voting member
1. The Etcd Operator demotes the old Etcd member, removing it from the cluster
Contributor

@hexfusion I brought this up before in an earlier version of this proposal but just wanted to touch on the order here again:

Would we ever be vertically scaling/replacing machines if we already have an unhealthy member?
I'm guessing yes, since that would be a prime use case for replacing an unhealthy machine/member.

In that case, would the order of promoting a new member before removing the unhealthy one open us up to quorum loss, as described in https://etcd.io/docs/v3.5/faq/#should-i-add-a-member-before-removing-an-unhealthy-member

i.e. the learner going through the following states: promotable -> promoted/voting -> immediately unhealthy, which now leaves us with 2/4 unhealthy voting members and quorum loss.

Contributor Author

Apologies, I must have missed your previous comments when pushing this.

Would we ever be vertically scaling/replacing machines if we already have an unhealthy member?

I suspect yes as well

In that case, would the order of promoting a new member before removing the unhealthy one, open us up to quorum loss as described in https://etcd.io/docs/v3.5/faq/#should-i-add-a-member-before-removing-an-unhealthy-member

Reading this, I think you're correct, we should reverse the ordering here. I would appreciate Sam's ack on this as well, so I won't swap just yet so that we don't lose the conversation.

Contributor

Likewise, I forgot to add my update to this, but Sam confirmed offline that in the case of an unhealthy/degraded etcd cluster (e.g. 1/3 unhealthy members) the etcd-operator will not vertically scale or replace the unhealthy member automatically with the learner.
So we can consider scaling an unhealthy cluster as out of scope for this proposal, and defer to the cluster admin to get it back to a healthy state before expecting the etcd-operator to scale up with new machines.

And since the recommendation of Remove-then-Add/Promote a member applies to replacing unhealthy members, I would think we can probably keep the ordering of Add/Promote-then-Remove here given that the learner is all caught up, but I would defer to @hexfusion on this.

So perhaps noting down the scaling of degraded clusters as out of scope might be good here.
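
Concretely, the Add/Promote-then-Remove ordering discussed above maps to an etcd membership sequence along these lines (member IDs and URLs are illustrative):

```sh
# Add the replacement member as a non-voting learner
etcdctl member add etcd-new --learner --peer-urls=https://10.0.0.7:2380

# Promote it to a voting member once its log has caught up
etcdctl member promote 8e9e05c52164694d

# Only then remove the old voting member
etcdctl member remove 91bc3c398fb3c146
```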

Contributor

@hasbro17 hasbro17 Nov 3, 2021

Side note for @hexfusion: I know currently the clustermember controller will just error out and not scale if it sees a degraded cluster, but in your POC it seems like we do promote before doing the degraded check. I need to test this out to confirm but please correct me if that's not the case.

Contributor Author

I've added a non-goal of "recovering unhealthy etcd clusters" and a note in the overview that we won't be doing anything when the cluster is already degraded.

* Provide the Etcd Operator with the ability to control when a Control Plane Machine is removed from the cluster
* Allow the Etcd Operator to prevent removal of Etcd members until a replacement member has been promoted to a voting member
* Allow the Etcd Operator to remove an Etcd member from the Etcd cluster before the Machine is terminated to prevent a degraded Etcd cluster
* Allow an escape hatch from the protection when surging the capacity with new Control Plane Machines is unavailable (for example in metal environments with limited capacit)
Member

nit

Suggested change
* Allow an escape hatch from the protection when surging the capacity with new Control Plane Machines is unavailable (for example in metal environments with limited capacit)
* Allow an escape hatch from the protection when surging the capacity with new Control Plane Machines is unavailable (for example in metal environments with limited capacity)

@hexfusion
Contributor

I am not sure why I cannot respond to comments directly, so I will do that inline here.

Side note for @hexfusion: I know currently the clustermember controller will just error out and not scale if it sees a degraded cluster, but in your POC it seems like we do promote before doing the degraded check. I need to test this out to confirm but please correct me if that's not the case.

In general StrictReconfigCheck (enabled by default) should always protect the cluster from membership changes that could result in quorum loss. The client error would look like:

etcdserver: unhealthy cluster

Learner testing is still fairly light so we can reconsider if more aggressive measures are required.

So we can consider scaling an unhealthy cluster as out of scope for this proposal, and defer to the cluster admin to get it back to a healthy state before expecting the etcd-operator scale up with new machines.

Correct, the assumption would be that the cluster is healthy. In the case of a failed member it is always required to scale down before scaling up. In this case we do scale up to 4, which briefly increases the consensus requirement to 3 members. The details around that are highlighted in the etcd docs [1].

[1] https://etcd.io/docs/v3.5/faq/#what-is-failure-tolerance

Contributor

@hexfusion hexfusion left a comment

A few notes; I hope to take a deeper dive this week.

### Non-Goals

* Automation of scaling operations on Machines
* Horizontal scaling of the Etcd cluster and Control Plane
Contributor

+1

```
pre-drain.delete.hook.machine.cluster-api.x-k8s.io/etcd-quorum-member: openshift-etcd-operator
```
The Etcd Operator will apply the hook to a Machine resource once it identifies that the Machine hosts an Etcd quorum member.
Contributor

why not as soon as possible? I forget the reasoning here.

Contributor Author

IIRC, we decided this would be the case so that the deletion protection only comes in once we know the machine needs protection. Though thinking about it now, I think we should add the hook before we promote for the following reasons:

* If we promote before adding the hook, there's a small window before the protection kicks in
* There should be no harm in having the hook in place while the member is still a learner: when the etcd operator notices the machine is marked for deletion, it should remove the member from the etcd cluster (voting or learner) and then remove the hook. I think that should still be safe

I'll amend this and add context

It will only remove the hook once it has identifed that the Machine resource is being deleted and a replacement member has been created. The removal of this hook will allow the Machine API to drain and terminate the Machine as it would normally do.
The Etcd Operator will ensure, based on the desired Control Plane replica count in <@Sam where do we get this from today?>, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member.
Contributor

Today we read from the embedded install-config, which is located in the kube-system/cluster-config-v1 ConfigMap.

@JoelSpeed
Contributor Author

@hasbro17 @hexfusion please check the last few commits where I've addressed the various pieces of feedback so far. I will squash the commits once we are happy with the state of the document

Contributor

@hasbro17 hasbro17 left a comment

Still LGTM. I imagine this will need a slight update to use the structured spec format of deletion hooks, but this is good enough for me to proceed with implementing etcd-operator's expected scaling workflow.

@JoelSpeed
Contributor Author

I've updated this PR to include the structured format for the Machine Deletion Hooks, which was updated recently. I've also squashed the previous feedback commits. There was no additional feedback left to be addressed, so there are no changes in the latest push other than the structured example of a preDrain hook.

If the etcd team are in agreement, I think we should aim to merge this PR within the next week or so
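
For illustration, the structured preDrain hook on a Machine would look roughly like this; the hook name and owner values below are examples, not necessarily the ones the operator will finally use:

```sh
oc patch machine <control-plane-machine> -n openshift-machine-api --type merge \
  -p '{"spec":{"lifecycleHooks":{"preDrain":[{"name":"EtcdQuorumOperator","owner":"clusteroperator/etcd"}]}}}'
```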

@hasbro17
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2021
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2022
@JoelSpeed
Contributor Author

We had already agreed upon this design last year and it is now being implemented as described. As Sam is no longer on the etcd team, who is the relevant approver for this enhancement?

/remote-lifecycle stale

@JoelSpeed
Contributor Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2022
@JoelSpeed
Contributor Author

@mfojtik As the staff engineer looking after the etcd team, would you be able to review this and help get it merged?

* The number of voting members should equal the desired number of Control Plane Machines (and this should be an odd
number)
* We will only deviate from this to add a replacement member
* Once the new member is added, we should remove old members as soon as possible to reduce the risk of degrading the
Contributor

old member or members?

Contributor Author

Yeah this should be singular, good catch

* This also ensures that we keep the full etcd data on disk on a minimum of the desired number of Control Plane
Machines at all times

#### Protecting etcd during scaling operations
Contributor

Do we need to protect from an accidental etcd rollout (like somebody changing the etcd config or hitting the upgrade button during this surgery)?

Contributor Author

This feels like it could be tricky to co-ordinate, because the operation could be either a single machine being replaced or many.

If the etcd operator blocked rolling out config changes while a Machine was marked as deleted, that could be safe.

We expect that when a new machine is being added, the old one is deleted, signalling the etcd member that will eventually be removed. In the case of removing multiple machines, we have designed the mechanism so that we can delete multiple machines at once, and the mechanism will roll out the new etcd members one by one, so this should also be safe.

I'm not sure it's strictly necessary though; I'm not sure what effect changing the config can have on the etcd cluster. I assume we restrict the configuration changes that end users can make for the etcd instances? If we are in the learning phase, a config change sounds like it shouldn't be too detrimental.

Contributor

@hasbro17 hasbro17 Jan 12, 2022

From my understanding users really can't change the etcd config very much since the etcd-operator controls the env vars that are rolled out to the etcd static pods:
https://github.com/openshift/cluster-etcd-operator/blob/d8c023e4d217aae762b99322923c68101c4b3ffd/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L194-L203
https://github.com/openshift/cluster-etcd-operator/blob/d8c023e4d217aae762b99322923c68101c4b3ffd/pkg/etcdenvvar/etcd_env.go#L71-L84

They can, however, force a redeployment of all the etcd static pods and trigger a roll-out to the next revision (even if it's without any changes) by updating the etcd/cluster CR, e.g.:

oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "foo"}}' --type=merge

All that will do (via the revision controller) is essentially restart the member pod on each node serially.

Since that rollout doesn't change the cluster membership I don't think it would impact an ongoing scaling operation.
At worst it might delay the scaling by a little bit if the operator is attempting to promote a learner member whose pod is being recreated with the new revision.

The scaling operation itself would actually cause multiple revision rollouts as it updates the cluster membership when adding/removing members.
@mfojtik So that should be okay unless you had another case in mind.

Contributor

or hit upgrade button during this surgery

Upgrades should be blocked via the quorum guard during an ongoing scaling operation. That's currently outlined in the proposal:

While a scaling operation is occurring, the etcd quorum guard will prevent the draining of any of the Control Plane
Machines, until the new etcd member has been promoted and the old etcd member removed from the cluster.
This in turn means that updates caused during upgrades (for example changes to MachineConfig) will be blocked while the
scaling operation occurs.
This will delay the upgrade process, but should not block it indefinitely unless an issue occurs.
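
For context, the quorum guard (to my understanding) enforces this through a PodDisruptionBudget, which can be located with something like the following; the exact name and namespace vary by release:

```sh
oc get poddisruptionbudget --all-namespaces | grep -i quorum-guard
oc describe poddisruptionbudget <name> -n <namespace>   # shows Max unavailable / Allowed disruptions
```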

Comment on lines 124 to 126
The etcd Operator will apply the hook to a Machine resource once it identifies that the Machine hosts an etcd member.
The hook should be added before the member is promoted to ensure that there is no period where the member is a voting
member, while the machine is not protected by the deletion mechanism.
Contributor

Does the etcd operator today reflect the etcd member status (learner/voter/leader/etc.) in some CR status?

Contributor

I don't believe it does on the etcd/cluster CR, although that info is available from a member list or endpoint status call to the cluster.
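
For example, both calls expose the learner/leader state directly:

```sh
etcdctl member list -w table                # IS LEARNER column marks learner members
etcdctl endpoint status -w table --cluster  # per-member leader flag and raft index
```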

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 12, 2022
@JoelSpeed
Contributor Author

Amended "remove old members" to "remove the old member", no other changes; waiting on @mfojtik's response to the replies above before proceeding to make any further changes.


## Motivation

As Red Hat expands its managed services offerings, the ability to safely increase and decrease the capacity of an
Contributor

The motivation sounds very much like scaling up to 5 or more masters is the goal. It is not (I hope); rather, this enhancement is about safely replacing nodes with bigger machines.

Contributor Author

Fair point, I'll clarify to make it more obvious that we mean vertical scaling of the control plane.

#### Protecting etcd during scaling operations

To ensure that the safety requirements described above are maintained during scaling operations,
the etcd Operator will manage the etcd cluster in clusters with a
Contributor

Suggested change
the etcd Operator will manage the etcd cluster in clusters with a
the etcd operator will manage the etcd cluster in clusters with a


1. A Control Plane Machine is marked for deletion and a new Control Plane Machine is created
a. The order of these operations does not matter
1. The etcd Operator notices the new Machine and adjust the etcd quorum guard appropriately
Contributor

Suggested change
1. The etcd Operator notices the new Machine and adjust the etcd quorum guard appropriately
1. The etcd Operator notices the new Machine and adjusts the etcd quorum guard appropriately


Typically in UPI clusters, Machines and MachineSets are not present. This would represent a non-functional Machine API.

Typically in IPI clusters, Machines and MachineSets are created and after bootstrap, there are 6 "Running" Machines, 3
Contributor

You mentioned a "Running" state somewhere above. This is missing here.


To ensure that we do not block users from replacing Control Plane Machines in these scenarios, we must allow a user to
remove the etcd quorum hook without it being replaced by the etcd Operator.
When a Machine is marked for deletion, if the hook is removed, the etcd Operator will not replace it. This will allow
Contributor

the etcd Operator will not replace it

What does this mean? It will not re-add the hook? So there must be a check in the hook-add controller that it only adds hooks to unmarked machines. This must be added somewhere above, at least as a note.

Contributor

Is "mark for deletion" the same as setting the metadata.deleteTimestamp? is it possible to mutate a deleted, but not removed object?

Contributor Author

What does this mean? It will not re-add the hook? So there must be a check in the hook-add controller that it only adds hooks to unmarked machines.

Yep, that's what we are trying to say here; I'll see if there's an appropriate place to add a note above. This is called out in more detail in the machine deletion hooks enhancement though, which is why I didn't go into that level of detail here.

Is "mark for deletion" the same as setting the metadata.deleteTimestamp? is it possible to mutate a deleted, but not removed object?

Yes, I mean someone has called a DELETE operation on the resource, so now the metadata.deletionTimestamp is set; I'm not sure if there's more commonly used verbiage for this than "marked for deletion"?

It is possible to mutate an object that has been deleted but not yet removed; the API server doesn't prevent this. We can mutate it all the way up until the GC removes it, but that will only happen when the Machine API removes its finaliser. I've tested this manually and the deletion hooks have gone through QE.
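
A rough sketch of that escape-hatch flow, assuming the structured lifecycleHooks field (paths and index are illustrative):

```sh
# A Machine "marked for deletion" has deletionTimestamp set but is still mutable
# until the Machine API removes its finalizer.
oc get machine <name> -n openshift-machine-api \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

# Manually dropping the pre-drain hook lets the deletion proceed without the
# etcd operator re-adding it.
oc patch machine <name> -n openshift-machine-api --type json \
  -p '[{"op":"remove","path":"/spec/lifecycleHooks/preDrain/0"}]'
```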

Contributor

sounds good, just clarify here a bit.

@JoelSpeed
Contributor Author

@sttts I've clarified the areas of wording you highlighted in a separate commit for now; if you're happy with the updates I'll squash it in again.

If these GitOps systems do not correctly handle server side changes (like additional annotations),
then they may remove the hooks after the etcd Operator has added them.
In this case we expect a hot loop where the etcd and GitOps operators fight to add and remove the hook.
We must test this scenario and ensure that external systems aren't going to interfere with the mechanism.
Contributor

The machine types are wrong; compare https://github.com/openshift/api/blob/master/machine/v1beta1/types_machine.go#L208. The list-type must be map. Otherwise, atomic is assumed. This means that GitOps tools will replace the whole list if they just want to change one hook.

Contributor Author

@JoelSpeed JoelSpeed Jan 18, 2022

This wasn't picked up in the initial API review; this could cause an issue to change so close to the code freeze.

Is there any way for a CRD to have merge by default on a list or does the merging need to be handled by the operator updating them?

Edit: We need to open a BZ to add the list-type to the field

Perhaps we need to implement webhooks to prevent addition/removal of more than one hook at a time; that would then match the behaviour of finalizers and hopefully prevent errors from third-party controllers.

For a map, would it need to be something along the lines of map[string]string, or is there precedent for doing map[string]struct{...} in kube?

Contributor

this could cause an issue to change so close to the code freeze

Both upstream and downstream are not very strict with considering list-type changes as a breaking API change. We will even do them for types that we have had for 5 years as stable v1. So I would not worry about that.

Is there any way for a CRD to have merge by default on a list or does the merging need to be handled by the operator updating them?

list-type :-)


To ensure that the safety requirements described above are maintained during scaling operations,
the etcd Operator will manage the etcd cluster in clusters with a
the etcd operator will manage the etcd cluster in clusters with a
Contributor

nit: Operator is written with capital O throughout the doc. It just hurts my eyes a little :-)

Contributor Author

Will fixup :)

Contributor Author

I remember now, this was an artifact of me not knowing how etcd should be written. I had a variation on "Etcd Operator" which looked OK to me, but then I was told etcd had to be all lowercase, which I fixed, but forgot the O 😅

@sttts
Contributor

sttts commented Jan 19, 2022

@JoelSpeed changes look good.

@sttts
Contributor

sttts commented Jan 19, 2022

General comment: https://docs.google.com/document/d/1Ef8AeG6JdE11E-3fjdFA3kwOUrBb1zJzwBAPDbUR3XQ/edit lists #920 as a related enhancement describing how to add learners. I think we need to write this down. That other enhancement is closed and not up-to-date according to @hasbro17. Does it make sense to add the important bits here?

@JoelSpeed
Contributor Author

Does it make sense to add the important bits here?

IMO, the two topics are different enough that they warrant a separate document. This document is very much focused on using the learner feature and the machine deletion hook feature together to protect quorum; both of those features are sufficiently large that I'd expect to just link to a full document describing how they work rather than adding it in here.

@hasbro17 Any thoughts about getting the learner enhancement updated/revived?

@hasbro17
Contributor

I would concur with @JoelSpeed on keeping the learner scaling enhancement separate from this one focusing on quorum protection. I will try to revive and update the other enhancement to reflect the discoveries/changes made in the current implementation.

has been created. The removal of this hook will allow the Machine API to drain and terminate the Machine as it would
normally do.
In the case that a Machine is deleted before the member is promoted, the etcd operator is expected to not promote the
Contributor

It sounds like we would need some sort of locking to ensure a new member is not promoted when a machine was deleted.

Contributor

@p0lyn0mial p0lyn0mial Jan 24, 2022

I have just realized this bit refers to deleting a new learner member. Nevertheless, this is a potential race.

Contributor

Also, what happens when the machine of a newly promoted member gets deleted? Do we have to guarantee it won't be replaced on the next run?

Contributor Author

I don't think we actually need to do anything special here.
The deletion hook should already be on the member before it is promoted.

If we promote a member but the Machine was deleted, all this does is create extra toil.
When we next reconcile it will just look like the Machine was deleted; the promotion element doesn't actually affect this, so we just treat it as a voting member being removed.

This is very much an optimisation rather than a necessity and the design of the hooks and the particular points at which they are added and removed ensure we should be safe.

normally do.
In the case that a Machine is deleted before the member is promoted, the etcd operator is expected to not promote the
new member, and remove the deletion hook to allow the Machine to be removed from the cluster.
Contributor

and remove the deletion hook to allow the Machine to be removed from the cluster

Are you saying the hook should also be removed from the Machine that represents the new member?

Contributor

yes, you are :)

Contributor Author

Yes, so, if a learner member's machine is deleted, and we notice that before we promote it, the idea is that we don't actually promote it but instead remove it from the etcd cluster. Once it's removed from the etcd cluster, the Machine has no impact on etcd anymore, so it's safe to allow it to go away by removing the pre-drain hook.

this prevents draining of a healthy etcd member until a new member becomes healthy. This operation will prevent other
components in the cluster (eg. MCO) from disrupting the quorum of etcd during this operation.

Note: When the cluster is already degraded, the etcd operator is expected to report as degraded for admin intervention.
Contributor

Is there a way to determine that a cluster is in a degraded condition?
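
One way to surface this today, assuming the operator reports through its ClusterOperator and etcd CR conditions as the note above describes, would be something like:

```sh
oc get clusteroperator etcd   # DEGRADED column reflects the operator-reported state
oc get etcd cluster -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | grep -i degraded
```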


Once a Machine has been marked for deletion, if the hook is removed by some other entity, the etcd operator
is expected not to re-add the hook. This allows an escape hatch when manual intervention is required.
The etcd operator will ensure, based on the desired Control Plane replica count in the cluster `InstallConfig`
Contributor

Is this the cluster-config-v1 configmap in kube-system namespace?

Contributor Author

Yes it is. AFAIK this is the only source of truth for how big etcd should be in the cluster at the moment.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2022
@JoelSpeed
Contributor Author

/remove-lifecycle stale

I spoke with @hasbro17 recently and it seems this enhancement is ready to merge; the project is being implemented as described, and the only remaining questions are to do with implementation details.

@sttts @mfojtik @p0lyn0mial Did any of you have further concerns or are you happy for this to merge now?

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2022
@hasbro17
Contributor

hasbro17 commented Apr 5, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 5, 2022
@hasbro17
Contributor

hasbro17 commented Apr 6, 2022

Given the absence of further reviews/objections I'm going to approve and merge this proposal.
The PRs for implementing the feature are still based on the design outlined here, and if there is any divergence (e.g. on the use of the install-config) we can circle back and update the proposal.

/approve

@JoelSpeed
Contributor Author

Given @hasbro17 doesn't have approve rights on this repo, but is the approver for this enhancement, I will add the label on his behalf.
/approve

@openshift-ci
Contributor

openshift-ci bot commented Apr 7, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17, JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 7, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 7, 2022

@JoelSpeed: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 4f8aa36 into openshift:master Apr 7, 2022
@JoelSpeed JoelSpeed deleted the etcd-protection branch April 7, 2022 12:22