Conversation

@JoelSpeed
Contributor

This proposal describes how the etcd operator will leverage machine deletion hooks to ensure that, when a control plane machine with an active etcd member is being removed from the cluster, the etcd data is protected and the machine is not removed until it is no longer required for etcd quorum.

Contributor

@hasbro17 hasbro17 left a comment


Looks pretty good and very clearly written 👍

Still going through the additional details section, but I just did a first pass and had a small note for @hexfusion


## Summary

To enable automation of Control Plane scaling activities, in paritcular vertical scaling of the Control Plane Machines, we must implement a mechanism that protects Etcd quorum and ensures the smoothest possible transition as new Etcd members are added and old members removed from the Etcd cluster.
Contributor

Nit

Suggested change
To enable automation of Control Plane scaling activities, in paritcular vertical scaling of the Control Plane Machines, we must implement a mechanism that protects Etcd quorum and ensures the smoothest possible transition as new Etcd members are added and old members removed from the Etcd cluster.
To enable automation of Control Plane scaling activities, in particular vertical scaling of the Control Plane Machines, we must implement a mechanism that protects Etcd quorum and ensures the smoothest possible transition as new Etcd members are added and old members removed from the Etcd cluster.

#### Requirements for Etcd safety

* The number of voting members should equal the desired number of Control Plane Machines (and this should be an odd number)
* We may only deviate from this to add a replacement member
Contributor

Nit: As noted before, we will always deviate from this during scale-up, as a learner gets promoted before we remove the older voting member.

Suggested change
* We may only deviate from this to add a replacement member
* We will only deviate from this to add a replacement member

It will only remove the hook once it has identifed that the Machine resource is being deleted and a replacement member has been created. The removal of this hook will allow the Machine API to drain and terminate the Machine as it would normally do.
The Etcd Operator will ensure, based on the desired Control Plane replica count in <@Sam where do we get this from today?>, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member.
Contributor

Suggested change
The Etcd Operator will ensure, based on the desired Control Plane replica count in <@Sam where do we get this from today?>, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member.
The Etcd Operator will ensure, based on the desired Control Plane replica count in the InstallConfig, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member.

I believe the etcd-operator gets the desired control plane replica count from the install config's controlPlane.replicas field, read from the kube-system/cluster-config-v1 ConfigMap.
#920 (comment)
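
For reference, the embedded install-config can be inspected directly; a rough sketch (assumes `yq` is available and that the count lives under `controlPlane.replicas`):

```sh
# Dump the install-config embedded in the kube-system/cluster-config-v1 ConfigMap
# and read the desired control plane replica count (field path illustrative).
oc get configmap cluster-config-v1 -n kube-system \
  -o jsonpath='{.data.install-config}' | yq '.controlPlane.replicas'
```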

1. Etcd Operator identifies the voting members of the Etcd cluster and maps these to the Control Plane Machines that host them
1. Etcd Operator adds a deletion hook to each of the Machines hosting a voting member of the Etcd cluster

##### Operational flow during a resize operation
Contributor

Probably a nit:

Suggested change
##### Operational flow during a resize operation
##### Operational flow during a node replace operation

IIUC this flow describes a node replacement operation (via scale up then down) rather than a resize operation, as the final cluster size is unchanged.
Although I guess this could be a resize if you only add or remove a machine.

Contributor Author

The context behind this was a resize of the underlying machine, i.e., vertical scaling of the host. But I can see how that can be confusing from an etcd perspective, so I'll adjust the wording.

Comment on lines 122 to 123
1. The Etcd Operator promotes the new Etcd member to a voting member
1. The Etcd Operator demotes the old Etcd member, removing it from the cluster
Contributor

@hexfusion I brought this up before in an earlier version of this proposal but just wanted to touch on the order here again:

Would we ever be vertically scaling/replacing machines if we already have an unhealthy member?
I'm guessing yes, since that would be a prime use case for replacing an unhealthy machine/member.

In that case, would the order of promoting a new member before removing the unhealthy one open us up to quorum loss, as described in https://etcd.io/docs/v3.5/faq/#should-i-add-a-member-before-removing-an-unhealthy-member

i.e. the learner going through the following states: promotable -> promoted/voting -> immediately unhealthy, which now leaves us with 2/4 unhealthy voting members and quorum loss.

Contributor Author

Apologies, I must have missed your previous comments when pushing this.

Would we ever be vertically scaling/replacing machines if we already have an unhealthy member?

I suspect yes as well

In that case, would the order of promoting a new member before removing the unhealthy one, open us up to quorum loss as described in https://etcd.io/docs/v3.5/faq/#should-i-add-a-member-before-removing-an-unhealthy-member

Reading this, I think you're correct, we should reverse the ordering here. I would appreciate Sam's ack on this as well, so I won't swap just yet so that we don't lose the conversation.

Contributor

Likewise, I forgot to add my update to this, but Sam confirmed offline that in the case of an unhealthy/degraded etcd cluster (e.g. 1/3 unhealthy members) the etcd-operator will not vertically scale or replace the unhealthy member automatically with the learner.
So we can consider scaling an unhealthy cluster as out of scope for this proposal, and defer to the cluster admin to get it back to a healthy state before expecting the etcd-operator to scale up with new machines.

And since the recommendation of Remove-then-Add/Promote a member applies to replacing unhealthy members, I would think we can probably keep the ordering of Add/Promote-then-Remove here given that the learner is all caught up, but I would defer to @hexfusion on this.

So perhaps noting down the scaling of degraded clusters as out of scope might be good here.
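
Concretely, the Add/Promote-then-Remove ordering discussed above maps to an etcd membership sequence along these lines (member IDs and URLs are illustrative):

```sh
# Add the replacement member as a non-voting learner
etcdctl member add etcd-new --learner --peer-urls=https://10.0.0.7:2380

# Promote it to a voting member once its log has caught up
etcdctl member promote 8e9e05c52164694d

# Only then remove the old voting member
etcdctl member remove 91bc3c398fb3c146
```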

Contributor

@hasbro17 hasbro17 Nov 3, 2021

Side note for @hexfusion: I know currently the clustermember controller will just error out and not scale if it sees a degraded cluster, but in your POC it seems like we do promote before doing the degraded check. I need to test this out to confirm but please correct me if that's not the case.

Contributor Author

I've added a non-goal of "recovering unhealthy etcd clusters" and a note in the overview that we won't be doing anything when the cluster is already degraded.

* Provide the Etcd Operator with the ability to control when a Control Plane Machine is removed from the cluster
* Allow the Etcd Operator to prevent removal of Etcd members until a replacement member has been promoted to a voting member
* Allow the Etcd Operator to remove an Etcd member from the Etcd cluster before the Machine is terminated to prevent a degraded Etcd cluster
* Allow an escape hatch from the protection when surging the capacity with new Control Plane Machines is unavailable (for example in metal environments with limited capacit)
Member

nit

Suggested change
* Allow an escape hatch from the protection when surging the capacity with new Control Plane Machines is unavailable (for example in metal environments with limited capacit)
* Allow an escape hatch from the protection when surging the capacity with new Control Plane Machines is unavailable (for example in metal environments with limited capacity)

@hexfusion
Contributor

I am not sure why I cannot respond to comments directly, so I will do that inline here.

Side note for @hexfusion: I know currently the clustermember controller will just error out and not scale if it sees a degraded cluster, but in your POC it seems like we do promote before doing the degraded check. I need to test this out to confirm but please correct me if that's not the case.

In general StrictReconfigCheck (enabled by default) should always protect the cluster from membership changes that could result in quorum loss. The client error would look like:

etcdserver: unhealthy cluster

Learner testing is still fairly light so we can reconsider if more aggressive measures are required.

So we can consider scaling an unhealthy cluster as out of scope for this proposal, and defer to the cluster admin to get it back to a healthy state before expecting the etcd-operator scale up with new machines.

Correct, the assumption would be that the cluster is healthy. In the case of a failed member it is always required to scale down before scaling up. In this case we do scale up to 4, which briefly increases the consensus requirement to 3 members. The details around that are highlighted in the etcd docs [1].

[1] https://etcd.io/docs/v3.5/faq/#what-is-failure-tolerance

Contributor

@hexfusion hexfusion left a comment

A few notes; I hope to take a deeper dive this week.

### Non-Goals

* Automation of scaling operations on Machines
* Horizontal scaling of the Etcd cluster and Control Plane
Contributor

+1

```
pre-drain.delete.hook.machine.cluster-api.x-k8s.io/etcd-quorum-member: openshift-etcd-operator
```
The Etcd Operator will apply the hook to a Machine resource once it identifies that the Machine hosts an Etcd quorum member.
Contributor

why not as soon as possible? I forget the reasoning here.

Contributor Author

IIRC, we decided this would be the case so that the deletion protection only comes in once we know the machine needs protection. Though thinking about it now, I think we should add the hook before we promote for the following reasons:

* If we promote before adding the hook, there's a small window before the protection kicks in
* There should be no harm in having the hook in place while the member is still a learner: when the etcd operator notices the machine is marked for deletion, it should remove the member from the etcd cluster (voting or learner) and then remove the hook. I think that should still be safe

I'll amend this and add context

It will only remove the hook once it has identifed that the Machine resource is being deleted and a replacement member has been created. The removal of this hook will allow the Machine API to drain and terminate the Machine as it would normally do.
The Etcd Operator will ensure, based on the desired Control Plane replica count in <@Sam where do we get this from today?>, that the Etcd cluster has either the exact desired count of voting members, or during scaling/replacement operations, at most 1 extra voting member.
Contributor

Today we read from the embedded install-config, which is located in the kube-system/cluster-config-v1 ConfigMap.

@JoelSpeed
Contributor Author

@hasbro17 @hexfusion please check the last few commits where I've addressed the various pieces of feedback so far. I will squash the commits once we are happy with the state of the document

Contributor

@hasbro17 hasbro17 left a comment

Still LGTM. I imagine this will need a slight update to use the structured spec format of deletion hooks, but this is good enough for me to proceed with implementing etcd-operator's expected scaling workflow.

@JoelSpeed
Contributor Author

I've updated this PR to include the structured format for the Machine Deletion Hooks, which was updated recently. I've also squashed the previous feedback commits. There was no additional feedback left to be addressed, so there are no changes in the latest push other than the structured example of a preDrain hook.

If the etcd team are in agreement, I think we should aim to merge this PR within the next week or so
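
For illustration, the structured preDrain hook on a Machine would look roughly like this; the hook name and owner values below are examples, not necessarily the ones the operator will finally use:

```sh
oc patch machine <control-plane-machine> -n openshift-machine-api --type merge \
  -p '{"spec":{"lifecycleHooks":{"preDrain":[{"name":"EtcdQuorumOperator","owner":"clusteroperator/etcd"}]}}}'
```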

@hasbro17
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2021
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2022
@JoelSpeed
Contributor Author

We had already agreed upon this design last year and it is now being implemented as described. As Sam is no longer on the etcd team, who is the relevant approver for this enhancement?

/remote-lifecycle stale

@JoelSpeed
Contributor Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2022
@JoelSpeed
Contributor Author

@mfojtik As the staff engineer looking after the etcd team, would you be able to review this and help get it merged?

* The number of voting members should equal the desired number of Control Plane Machines (and this should be an odd
number)
* We will only deviate from this to add a replacement member
* Once the new member is added, we should remove old members as soon as possible to reduce the risk of degrading the
Contributor

old member or members?

Contributor Author

Yeah this should be singular, good catch

* This also ensures that we keep the full etcd data on disk on a minimum of the desired number of Control Plane
Machines at all times

#### Protecting etcd during scaling operations
Contributor

Do we need to protect from an accidental etcd rollout (like somebody changing the etcd config or hitting the upgrade button during this surgery)?

Contributor Author

This feels like it could be tricky to co-ordinate, because the operation could be either a single machine being replaced or many.

If the etcd operator blocked rolling out config changes while a Machine was marked as deleted, that could be safe.

We expect that when a new machine is being added, the old one is deleted, signalling the etcd member that will eventually be removed. In the case of removing multiple machines, we have designed the mechanism so that we can delete multiple machines at once, and the mechanism will roll out the new etcd members one by one, so this should also be safe.

I'm not sure it's strictly necessary though; I'm not sure what effect changing the config can have on the etcd cluster. I assume we restrict the configuration changes that end users can make for the etcd instances? If we are in the learning phase, a config change sounds like it shouldn't be too detrimental.

Contributor

@hasbro17 hasbro17 Jan 12, 2022

From my understanding users really can't change the etcd config very much since the etcd-operator controls the env vars that are rolled out to the etcd static pods:
https://github.com/openshift/cluster-etcd-operator/blob/d8c023e4d217aae762b99322923c68101c4b3ffd/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L194-L203
https://github.com/openshift/cluster-etcd-operator/blob/d8c023e4d217aae762b99322923c68101c4b3ffd/pkg/etcdenvvar/etcd_env.go#L71-L84

They can, however, force a redeployment of all the etcd static pods and trigger a roll-out to the next revision (even if it's without any changes) by updating the etcd/cluster CR, e.g.:

oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "foo"}}' --type=merge

All that will do (via the revision controller) is essentially restart the member pod on each node serially.

Since that rollout doesn't change the cluster membership I don't think it would impact an ongoing scaling operation.
At worst it might delay the scaling by a little bit if the operator is attempting to promote a learner member whose pod is being recreated with the new revision.

The scaling operation itself would actually cause multiple revision rollouts as it updates the cluster membership when adding/removing members.
@mfojtik So that should be okay unless you had another case in mind.

Contributor

or hit upgrade button during this surgery

Upgrades should be blocked via the quorum guard during an ongoing scaling operation. That's currently outlined in the proposal:

While a scaling operation is occurring, the etcd quorum guard will prevent the draining of any of the Control Plane
Machines, until the new etcd member has been promoted and the old etcd member removed from the cluster.
This in turn means that updates caused during upgrades (for example changes to MachineConfig) will be blocked while the
scaling operation occurs.
This will delay the upgrade process, but should not block it indefinitely unless an issue occurs.
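
For context, the quorum guard (to my understanding) enforces this through a PodDisruptionBudget, which can be located with something like the following; the exact name and namespace vary by release:

```sh
oc get poddisruptionbudget --all-namespaces | grep -i quorum-guard
oc describe poddisruptionbudget <name> -n <namespace>   # shows Max unavailable / Allowed disruptions
```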

Comment on lines 124 to 126
The etcd Operator will apply the hook to a Machine resource once it identifies that the Machine hosts an etcd member.
The hook should be added before the member is promoted to ensure that there is no period where the member is a voting
member, while the machine is not protected by the deletion mechanism.
Contributor

Does the etcd operator today reflect the etcd member status (learner/voter/leader/etc.) in some CR status?

Contributor

I don't believe it does on the etcd/cluster CR, although that info is available from a member list or endpoint status call to the cluster.
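
For example, both calls expose the learner/leader state directly:

```sh
etcdctl member list -w table                # IS LEARNER column marks learner members
etcdctl endpoint status -w table --cluster  # per-member leader flag and raft index
```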

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 12, 2022
@JoelSpeed
Contributor Author

Amended "remove old members" to "remove the old member", no other changes; waiting on @mfojtik's response to the replies above before proceeding to make any further changes.


## Motivation

As Red Hat expands its managed services offerings, the ability to safely increase and decrease the capacity of an
Contributor

The motivation sounds very much like scaling up to 5 or more masters is the goal. It is not (I hope); rather, this enhancement is about safely replacing nodes with bigger machines.

Contributor Author

Fair point, I'll clarify to make it more obvious that we mean vertical scaling of the control plane.

#### Protecting etcd during scaling operations

To ensure that the safety requirements described above are maintained during scaling operations,
the etcd Operator will manage the etcd cluster in clusters with a
Contributor

Suggested change
the etcd Operator will manage the etcd cluster in clusters with a
the etcd operator will manage the etcd cluster in clusters with a


1. A Control Plane Machine is marked for deletion and a new Control Plane Machine is created
a. The order of these operations does not matter
1. The etcd Operator notices the new Machine and adjust the etcd quorum guard appropriately
Contributor

Suggested change
1. The etcd Operator notices the new Machine and adjust the etcd quorum guard appropriately
1. The etcd Operator notices the new Machine and adjusts the etcd quorum guard appropriately


Typically in UPI clusters, Machines and MachineSets are not present. This would represent a non-functional Machine API.

Typically in IPI clusters, Machines and MachineSets are created and after bootstrap, there are 6 "Running" Machines, 3
Contributor

You mentioned a "Running" state somewhere above. This is missing here.


To ensure that we do not block users from replacing Control Plane Machines in these scenarios, we must allow a user to
remove the etcd quorum hook without it being replaced by the etcd Operator.
When a Machine is marked for deletion, if the hook is removed, the etcd Operator will not replace it. This will allow
Contributor

the etcd Operator will not replace it

What does this mean? It will not re-add the hook? So there must be a check in the hook-add controller that it only adds hooks to unmarked machines. This must be added somewhere above, at least as a note.

Contributor

Is "mark for deletion" the same as setting the metadata.deleteTimestamp? is it possible to mutate a deleted, but not removed object?

Contributor Author

What does this mean? It will not re-add the hook? So there must be a check in the hook-add controller that it only adds hooks to unmarked machines.

Yep, that's what we are trying to say here; I'll see if there's an appropriate place to add a note above. This is called out in more detail in the machine deletion hooks enhancement though, which is why I didn't go into that level of detail here.

Is "mark for deletion" the same as setting the metadata.deleteTimestamp? is it possible to mutate a deleted, but not removed object?

Yes, I mean someone has called a DELETE operation on the resource, so now the metadata.deletionTimestamp is set; I'm not sure if there's more commonly used verbiage for this than "marked for deletion"?

It is possible to mutate an object that has been deleted but not yet removed; the API server doesn't prevent this. We can mutate it all the way up until the GC removes it, but that will only happen when the Machine API removes its finaliser. I've tested this manually and the deletion hooks have gone through QE.
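
A rough sketch of that escape-hatch flow, assuming the structured lifecycleHooks field (paths and index are illustrative):

```sh
# A Machine "marked for deletion" has deletionTimestamp set but is still mutable
# until the Machine API removes its finalizer.
oc get machine <name> -n openshift-machine-api \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

# Manually dropping the pre-drain hook lets the deletion proceed without the
# etcd operator re-adding it.
oc patch machine <name> -n openshift-machine-api --type json \
  -p '[{"op":"remove","path":"/spec/lifecycleHooks/preDrain/0"}]'
```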

Contributor

sounds good, just clarify here a bit.

@JoelSpeed
Contributor Author

@sttts I've clarified the areas of wording you highlighted in a separate commit for now; if you're happy with the updates I'll squash it in again.

If these GitOps systems do not correctly handle server side changes (like additional annotations),
then they may remove the hooks after the etcd Operator has added them.
In this case we expect a hot loop where the etcd and GitOps operators fight to add and remove the hook.
We must test this scenario and ensure that external systems aren't going to interfere with the mechanism.
Contributor

The machine types are wrong; compare https://github.com/openshift/api/blob/master/machine/v1beta1/types_machine.go#L208. The list-type must be map. Otherwise, atomic is assumed. This means that GitOps tools will replace the whole list if they just want to change one hook.

Contributor Author

@JoelSpeed JoelSpeed Jan 18, 2022

This wasn't picked up in the initial API review; this could cause an issue to change so close to the code freeze.

Is there any way for a CRD to have merge by default on a list or does the merging need to be handled by the operator updating them?

Edit: We need to open a BZ to add the list-type to the field

Perhaps we need to implement webhooks to prevent addition/removal of more than one hook at a time; that would then match the behaviour of finalizers and hopefully prevent errors from third-party controllers.

For a map, would it need to be something along the lines of map[string]string, or is there precedent for doing map[string]struct{...} in kube?

Contributor

this could cause an issue to change so close to the code freeze

Both upstream and downstream are not very strict with considering list-type changes as a breaking API change. We will even do them for types that we have had for 5 years as stable v1. So I would not worry about that.

Is there any way for a CRD to have merge by default on a list or does the merging need to be handled by the operator updating them?

list-type :-)


To ensure that the safety requirements described above are maintained during scaling operations,
the etcd Operator will manage the etcd cluster in clusters with a
the etcd operator will manage the etcd cluster in clusters with a
Contributor

nit: Operator is written with capital O throughout the doc. It just hurts my eyes a little :-)

Contributor Author

Will fixup :)

Contributor Author

I remember now, this was an artifact of me not knowing how etcd should be written. I had a variation on "Etcd Operator" which looked OK to me, but then I was told etcd had to be all lowercase, which I fixed, but forgot the O 😅

@sttts
Contributor

sttts commented Jan 19, 2022

@JoelSpeed changes look good.

@sttts
Contributor

sttts commented Jan 19, 2022

General comment: https://docs.google.com/document/d/1Ef8AeG6JdE11E-3fjdFA3kwOUrBb1zJzwBAPDbUR3XQ/edit lists #920 as a related enhancement describing how to add learners. I think we need to write this down. That other enhancement is closed and not up-to-date according to @hasbro17. Does it make sense to add the important bits here?

@JoelSpeed
Contributor Author

Does it make sense to add the important bits here?

IMO, the two topics are different enough that they warrant a separate document. This document is very much focused on using the learner feature and the machine deletion hook feature together to protect quorum; both of those features are sufficiently large that I'd expect to just link to a full document describing how they work rather than adding it in here.

@hasbro17 Any thoughts about getting the learner enhancement updated/revived?

@hasbro17
Contributor

I would concur with @JoelSpeed on keeping the learner scaling enhancement separate from this one focusing on quorum protection. I will try to revive and update the other enhancement to reflect the discoveries/changes made in the current implementation.

has been created. The removal of this hook will allow the Machine API to drain and terminate the Machine as it would
normally do.
In the case that a Machine is deleted before the member is promoted, the etcd operator is expected to not promote the
Contributor

It sounds like we would need some sort of locking to ensure a new member is not promoted when a machine was deleted.

Contributor

@p0lyn0mial p0lyn0mial Jan 24, 2022

I have just realized this bit refers to deleting a new learner member. Nevertheless, this is a potential race.

Contributor

Also, what happens when the machine of a newly promoted member gets deleted? Do we have to guarantee it won't be replaced on the next run?

Contributor Author

I don't think we actually need to do anything special here.
The deletion hook should already be on the member before it is promoted.

If we promote a member but the Machine was deleted, all this does is create extra toil.
When we next reconcile it will just look like the Machine was deleted; the promotion element doesn't actually affect this, so we just treat it as a voting member being removed.

This is very much an optimisation rather than a necessity and the design of the hooks and the particular points at which they are added and removed ensure we should be safe.

normally do.
In the case that a Machine is deleted before the member is promoted, the etcd operator is expected to not promote the
new member, and remove the deletion hook to allow the Machine to be removed from the cluster.
Contributor

and remove the deletion hook to allow the Machine to be removed from the cluster

Are you saying the hook should also be removed from the Machine that represents the new member?

Contributor

yes, you are :)

Contributor Author

Yes, so, if a learner member's machine is deleted, and we notice that before we promote it, the idea is that we don't actually promote it but instead remove it from the etcd cluster. Once it's removed from the etcd cluster, the Machine has no impact on etcd anymore, so it's safe to allow it to go away by removing the pre-drain hook.

this prevents draining of a healthy etcd member until a new member becomes healthy. This operation will prevent other
components in the cluster (eg. MCO) from disrupting the quorum of etcd during this operation.

Note: When the cluster is already degraded, the etcd operator is expected to report as degraded for admin intervention.
Contributor

Is there a way to determine that a cluster is in a degraded condition?
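
One way to surface this today, assuming the operator reports through its ClusterOperator and etcd CR conditions as the note above describes, would be something like:

```sh
oc get clusteroperator etcd   # DEGRADED column reflects the operator-reported state
oc get etcd cluster -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | grep -i degraded
```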


Once a Machine has been marked for deletion, if the hook is removed by some other entity, the etcd operator
is expected not to re-add the hook. This allows an escape hatch when manual intervention is required.
The etcd operator will ensure, based on the desired Control Plane replica count in the cluster `InstallConfig`
Contributor

Is this the cluster-config-v1 configmap in kube-system namespace?

Contributor Author

Yes it is. AFAIK this is the only source of truth for how big etcd should be in the cluster at the moment.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2022
@JoelSpeed
Contributor Author

/remove-lifecycle stale

I spoke with @hasbro17 recently and it seems this enhancement is ready to merge; the project is being implemented as described, and the only remaining questions are to do with implementation details.

@sttts @mfojtik @p0lyn0mial Did any of you have further concerns or are you happy for this to merge now?

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2022
@hasbro17
Contributor

hasbro17 commented Apr 5, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 5, 2022
@hasbro17
Contributor

hasbro17 commented Apr 6, 2022

Given the absence of further reviews/objections I'm going to approve and merge this proposal.
The PRs for implementing the feature are still based on the design outlined here, and if there is any divergence (e.g. on the use of the install-config) we can circle back and update the proposal.

/approve

@JoelSpeed
Contributor Author

Given @hasbro17 doesn't have approve rights on this repo, but is the approver for this enhancement, I will add the label on his behalf.
/approve

@openshift-ci
Contributor

openshift-ci bot commented Apr 7, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17, JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 7, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 7, 2022

@JoelSpeed: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 4f8aa36 into openshift:master Apr 7, 2022
@JoelSpeed JoelSpeed deleted the etcd-protection branch April 7, 2022 12:22