diff --git a/enhancements/etcd/protecting-etcd-quorum-during-control-plane-scaling.md b/enhancements/etcd/protecting-etcd-quorum-during-control-plane-scaling.md new file mode 100644 index 0000000000..658fb6cb74 --- /dev/null +++ b/enhancements/etcd/protecting-etcd-quorum-during-control-plane-scaling.md @@ -0,0 +1,452 @@ +--- +title: protecting-etcd-quorum-during-control-plane-scaling +authors: + - "@JoelSpeed" +reviewers: + - "@hexfusion" + - "@hasbro17" +approvers: + - "@hexfusion" +creation-date: 2021-08-20 +last-updated: 2021-11-18 +status: implementable +see-also: + - "[Machine Deletion Hooks](https://github.com/openshift/enhancements/pull/862)" +--- + +# Protecting etcd Quorum During Control Plane Scaling + +## Release Signoff Checklist + +- [x] Enhancement is `implementable` +- [ ] Design details are appropriately documented from clear requirements +- [ ] Test plan is defined +- [ ] Operational readiness criteria is defined +- [ ] Graduation criteria for dev preview, tech preview, GA +- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +## Summary + +To enable automation of Control Plane scaling activities, in particular vertical scaling of the Control Plane Machines, +we must implement a mechanism that protects etcd quorum and ensures the smoothest possible transition as new etcd +members are added and old members removed from the etcd cluster. + +## Motivation + +As Red Hat expands its managed services offerings, the ability to safely vertically scale the capacity of an +OpenShift Control Plane in some automated manner becomes imperative. + +Currently, when a cluster starts to hit capacity limits on the Control Plane, a very involved manual process is +required to not only add new Machines to the cluster, but monitor and manage the etcd cluster to ensure that the quorum +is preserved throughout the operation. 
+ +This process is not sustainable and we must provide safety mechanisms on top of the existing etcd quorum guard to make +this process both easier and safer. + +### Goals + +* Provide the etcd operator with the ability to control when a Control Plane Machine is removed from the cluster +* Allow the etcd operator to prevent removal of etcd members until a replacement member has been promoted to a voting + member +* Allow the etcd operator to remove an etcd member from the etcd cluster before the Machine is terminated to prevent a + degraded etcd cluster +* Allow an escape hatch from the protection when surging the capacity with new Control Plane Machines is unavailable + (for example in metal environments with limited capacity) + + +### Non-Goals + +* Automation of scaling operations on Machines +* Horizontal scaling of the etcd cluster and Control Plane +* Providing these protection mechanisms when the Machine API is [not functional](#When-is-Machine-API-Functional) +* Recovering unhealthy etcd clusters + +## Proposal + +### User Stories + +#### Story 1 + +As an operator of a managed OpenShift cluster, I want to be able to automate the scaling operations of Control Plane +Machines without having to manually ensure the safety of the etcd cluster. + +#### Story 2 + +As a developer of OpenShift, I want to implement safety mechanisms that adhere to best practices for etcd scaling +operations to protect end users from potential quorum losses and data losses. + +#### Story 3 + +As an end user of OpenShift, I want to be able to increase the size of my Control Plane without having to know the +intricacies of etcd and protecting its quorum. + +### API Extensions + +This enhancement does not introduce any new API extensions. 
+ +### Implementation Details/Notes/Constraints + +#### Requirements for etcd safety + +* The number of voting members should equal the desired number of Control Plane Machines (and this should be an odd + number) + * We will only deviate from this to add a replacement member + * Once the new member is added, we should remove the old member as soon as possible to reduce the risk of degrading the + cluster while having an even number of voting members +* Existing voting members should not be removed until their replacements have been promoted to voting members + * By starting new etcd members as [learners](https://etcd.io/docs/v3.3/learning/learner/#raft-learner), we can ensure + that the new member is fully "caught-up" and promotable to a full voting member before we start the removal process + of the old member + * This protects etcd from potential inconsistencies in its data if the cluster were to have some interruption shortly + after a new member joins the cluster + * This also ensures that we keep the full etcd data on disk on a minimum of the desired number of Control Plane + Machines at all times + +#### Protecting etcd during scaling operations + +To ensure that the safety requirements described above are maintained during scaling operations, +the etcd operator will manage the etcd cluster in clusters with a +[functional Machine API](#When-is-Machine-API-Functional), +by leveraging [Machine Deletion Hooks](#What-are-Machine-Deletion-Hooks) to coordinate with the Machine API when it is, +and is not, safe to remove Machines with voting members of the etcd cluster running on them. + +##### Overview of protection mechanism + +To ensure the safety of the etcd cluster quorum, the etcd operator will leverage a pre-drain [Machine Deletion Hook](#What-are-Machine-Deletion-Hooks) to prevent the removal of any Control Plane Machine hosting a voting member of the etcd quorum. 
+ +```yaml +lifecycleHooks: + preDrain: + - name: EtcdQuorumOperator + owner: clusteroperator/etcd +``` + +The etcd operator will apply the hook to a Machine resource once it identifies that the Machine hosts an etcd member. +The hook should be added before the member is promoted to ensure that there is no period where the member is a voting +member, while the machine is not protected by the deletion mechanism. + +It will only remove the hook once it has identified that the Machine resource is being deleted and a replacement member +has been created. The removal of this hook will allow the Machine API to drain and terminate the Machine as it would +normally do. + +In the case that a Machine is deleted before the member is promoted, the etcd operator is expected to not promote the +new member, and remove the deletion hook to allow the Machine to be removed from the cluster. +Once a Machine has been marked for deletion, if the hook is removed by some other entity, the etcd operator +is expected not to re-add the hook. This allows an escape hatch when manual intervention is required. + +The etcd operator will ensure, based on the desired Control Plane replica count in the cluster `InstallConfig` +resource, that the etcd cluster has either the exact desired count of voting members, or during scaling/replacement +operations, at most 1 extra voting member. + +The etcd operator will also leverage the etcd quorum guard to prevent voluntary disruptions of etcd members during the +process. By ensuring that the quorum guard PDB always has `minAvailable: (Num Current Control Plane Machines) - 1`, +this prevents draining of a healthy etcd member until a new member becomes healthy. This operation will prevent other +components in the cluster (eg. MCO) from disrupting the quorum of etcd during this operation. + +Note: When the cluster is already degraded, the etcd operator is expected to report as degraded for admin intervention. 
+The etcd operator will not attempt to recover the cluster using the methods described in this enhancement. + +##### Adoption of an existing Control Plane into the new mechanism + +Note: This flow is the expectation for when a cluster is upgraded from 4.N-1 to 4.N where 4.N is the version where this +mechanism is introduced. + +1. etcd operator fetches all Control Plane Machines +1. etcd operator identifies if the Control Plane Machines are in the `Running` phase + a. If no Control Plane Machines are `Running`, assume that Machine API is non-functional and stop here +1. etcd operator identifies the voting members of the etcd cluster and maps these to the Control Plane Machines that + host them +1. etcd operator adds a deletion hook to each of the Machines hosting a voting member of the etcd cluster + +##### Operational flow during a node replacement (vertical scaling) operation + +1. A Control Plane Machine is marked for deletion and a new Control Plane Machine is created + a. The order of these operations does not matter +1. The etcd operator notices the new Machine and adjusts the etcd quorum guard appropriately +1. The Machine API creates the new host and the new Node joins the cluster +1. A new etcd member starts on the newly created Node + a. This member is initially started as a learner member +1. The new etcd member syncs the full etcd state and becomes promotable +1. The etcd operator adds a deletion hook to the new Machine +1. The etcd operator promotes the new etcd member to a voting member +1. The etcd operator demotes the old etcd member, removing it from the cluster +1. The etcd operator removes the deletion hook from the old Machine +1. The Machine API now drains and removes the old Machine +1. etcd operator notices the removed Machine and adjusts the etcd quorum guard appropriately + +##### Operational flow if a Control Plane Machine is deleted by a user + +1. User deletes the Machine object + a. 
This may also be done by some other component, for example a MachineHealthCheck +1. Machine API observes the etcd quorum pre-drain hook and waits for this to be removed before proceeding with the + Machine removal +1. etcd operator determines that removal of the etcd member on the deleted Machine would violate the desired replica + count, takes no action +1. At some point, some user (or operator) creates a new Control Plane Machine +1. At this point, the remaining flow is as above, go to step 2 of + [Operational flow during a node replacement (vertical scaling) operation](#Operational-flow-during-a-node-replacement-vertical-scaling-operation) + +##### Interaction with upgrades + +While no scaling operations are occurring within the Control Plane set of Machines (ie. the number of Control Plane +Machines matches the desired count and none are in the process of being removed), the mechanism described in this +proposal will not interfere with the upgrade process and upgrades will proceed as normal. + +While a scaling operation is occurring, the etcd quorum guard will prevent the draining of any of the Control Plane +Machines, until the new etcd member has been promoted and the old etcd member removed from the cluster. +This in turn means that updates caused during upgrades (for example changes to MachineConfig) will be blocked while the +scaling operation occurs. +This will delay the upgrade process, but should not block it indefinitely unless an issue occurs. + +#### Additional Details + +##### What are Machine Deletion Hooks? + +[Machine Deletion Hooks](https://github.com/openshift/enhancements/pull/862) are a mechanism within the Machine API +that allows other operators to pause the Machine lifecycle at various points. +For example, once a Machine is marked for deletion, an operator may use a hook to prevent the Machine API from draining +a Node. 
+ +For the use case described in this document, we will leverage a pre-drain hook to pause the Machine removal until the +etcd member present on the Machine has been removed from the etcd quorum. + +Once the member is removed from quorum, the etcd operator will remove the pre-drain hook, +which will signal to the Machine API that it is now safe to drain and terminate the instance as normal. + +##### When is Machine API Functional? + +We often refer to clusters in OpenShift as UPI and IPI. +However, apart from during the install process, there is no clear distinction between these two types of cluster. +Importantly, there should be no way to tell, after the cluster was installed, whether the cluster was created using UPI +or IPI. + +Since in a UPI cluster, Control Plane Machines are typically unmanaged, the mechanisms described in this document will +not work in a UPI cluster. As there is no way for the etcd operator to determine whether a cluster is UPI or IPI, it +must instead determine whether or not the Machine API is functional. + +For the purposes of this document, we define a functional Machine API as one which is configured correctly such that if +required, it could create a new Machine. + +Typically in UPI clusters, Machines and MachineSets are not present. This would represent a non-functional Machine API. + +Typically in IPI clusters, Machines and MachineSets are created and after bootstrap, there are 6 Machines in the +`Running` phase, 3 Control Plane, 3 Worker. This would represent a functional Machine API. + +###### Functional Machine API Scenarios + +- The cluster was created using the IPI installation method. + - The Control Plane Machine objects are created by the installer and linked to the existing hosts. The customer can + use Machine API to create a new Control Plane host if required +- The cluster was created using the UPI installation method. 
The customer missed the instruction to remove the Machines + and MachineSets from the manifests directory. The Control Plane Machines somehow became Running. + - We have no evidence to suggest this is actually possible + - Typically the installer creates Machines with specific names/tags. These are used by Machine API to identify the + host and link it to the Machine. In UPI scenarios these names aren't mentioned in the documentation and the + customer has free choice over the naming of their hosts + - The Machine spec in this case will be half complete as the installer cannot fulfill all of the infrastructure + information before the cluster is created + - In this case, we have no way to determine whether creating new Machines from this spec will work + - Assuming that the safety mechanism in this proposal is working as expected, there should be no risk to the Control + Plane, but there may be additional work for the cluster administrator should they need to perform maintenance on + the Control Plane. + - We expect in this scenario that the administrator was not intending to use Machine API and as such, wouldn't try to + use Machine API during this maintenance window, and as such, wouldn't actually run into any issues +- The cluster was created using the UPI installation method. The customer then configures Machine resources for their + Control Plane Hosts + - This process is undocumented, but theoretically possible + - We expect in this case that the customer would configure the providerSpec to be accurate and test that new Machines + can be created that work with their clusters. + - In this scenario, the Machine API represents the configuration of an IPI cluster. Machines will be ready and we + should be able to manage the Control Plane Machines as with an IPI cluster + +In each of these scenarios, the presence of Machines in the `Running` phase signals that the Machine API is functional. 
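
As a rough illustration of the signal described above, the check could be sketched in Python as follows. This is not the etcd operator's implementation: the function name and the simplified dictionary shape (only `status.phase` of each Control Plane Machine is inspected) are assumptions made for this example.

```python
from typing import Iterable, Mapping

RUNNING_PHASE = "Running"


def machine_api_is_functional(control_plane_machines: Iterable[Mapping]) -> bool:
    """Return True if at least one Control Plane Machine is in the Running phase.

    Hypothetical helper: each item is a simplified Machine resource, of which
    only status.phase is read.
    """
    return any(
        machine.get("status", {}).get("phase") == RUNNING_PHASE
        for machine in control_plane_machines
    )


# Illustrative inputs only; real Machine objects carry far more fields.
ipi_cluster = [{"status": {"phase": "Running"}} for _ in range(3)]
upi_with_stale_manifests = [{"status": {"phase": "Provisioning"}} for _ in range(3)]
upi_without_machines = []

print(machine_api_is_functional(ipi_cluster))               # True
print(machine_api_is_functional(upi_with_stale_manifests))  # False
print(machine_api_is_functional(upi_without_machines))      # False
```

An empty Machine list and a list of Machines stuck in `Provisioning` both evaluate to non-functional, so the sketch needs no knowledge of how the cluster was installed.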
+ +###### Non-Functional Machine API Scenarios + +- The cluster was created using the UPI installation method. The customer correctly removed the Machines and + MachineSets before installation. + - There are no Machines in the cluster so Machine API cannot be functional +- The cluster was created using the UPI installation method. The customer missed the instruction to remove the Machines + and MachineSets from the manifests directory. + - This was a common scenario when vSphere was introduced as a new Machine API provider. The instruction was missing + from the UPI documentation + - In this scenario, the Machines for the Control Plane all sit in the `Provisioning` phase + - The Machine objects created do not have a full configuration and as such, the Machine API fails to provision new + Control Plane Machines +- The cluster is created using either the UPI or IPI installation method, but to an unsupported platform (eg. platform + None) + - In this case, as Machine API does not support the platform, the installer will not have generated any Machines or + MachineSets + +In each of these scenarios, either there are no Machines in the cluster or the Machines will be stuck in the +`Provisioning` phase. In particular the absence of any `Running` Machine signals that the Machine API is non-functional. + +##### New metrics to export via the telemeter + +TODO: Consult with SREs to identify metrics they might find useful for us to export + +#### Example of vertically scaling a Control Plane Machine + +To use this mechanism to vertically scale a Control Plane, the following procedure must be carried out by some user or +operator. + +1. Identify a Control Plane Machine for replacement (eg. `my-cluster-master-0`) +1. Determine a new name for the new Control Plane Machine that will not conflict with existing Control Plane Machines + (eg. `my-cluster-master-3`) +1. 
Take a copy of the Machine resource: + `oc get machine -n openshift-machine-api my-cluster-master-0 -o yaml > my-cluster-master-3.yaml` +1. Modify the Machine YAML to update the `name` and the size of the instance (eg. changing + `spec.providerSpec.value.instanceType` from `c5.xlarge` to `c5.2xlarge`) +1. Create the new Machine: `oc create -f my-cluster-master-3.yaml` +1. Delete the old Machine: `oc delete machine -n openshift-machine-api my-cluster-master-0` +1. Wait until the process described in [Operational flow during a node replacement (vertical scaling) operation](#Operational-flow-during-a-node-replacement-vertical-scaling-operation) + has been completed. The end result of this is that the Machine `my-cluster-master-0` will be removed + +### Risks and Mitigations + +#### Blocking removal of Machines in environments with restricted capacity + +This proposal assumes that, when a Machine is to be replaced, there is additional capacity available to allow the +new Machine to be created before the old Machine is removed. This may not be true in all environments, for example if +there is no quota left or in bare-metal environments with limited hardware available. + +To ensure that we do not block users from replacing Control Plane Machines in these scenarios, we must allow a user to +remove the etcd quorum hook without it being re-added by the etcd operator. +When a Machine is marked for deletion, if the hook is removed, the etcd operator will not re-add it. This will allow +users to override the mechanism and continue with their replacement in these scenarios. + +In the future, if a new `ControlPlane` CRD is introduced, the etcd operator could observe the upgrade strategy on this +CR within the cluster and take appropriate action based on this. +We could allow users via this CRD to signal that they do not have capacity to burst during scaling operations and in +this case the etcd operator can remove the hooks as appropriate. 
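
The hook handling rules in this proposal, including the escape hatch described in this section, can be summarised as a small decision function. The Python sketch below is illustrative only and is not the etcd operator's code: `HookAction`, `desired_hook_action` and the three-valued `member_state` (`"none"`, `"learner"`, `"voting"`) are hypothetical names introduced for this example.

```python
from enum import Enum


class HookAction(Enum):
    ADD = "add"
    REMOVE = "remove"
    NONE = "none"


def desired_hook_action(marked_for_deletion: bool, hook_present: bool, member_state: str) -> HookAction:
    """Decide what to do with the pre-drain hook on one Control Plane Machine.

    member_state is a simplification assumed for this sketch:
    "none" (no etcd member on the Machine), "learner" or "voting".
    """
    if not marked_for_deletion:
        # Protect the Machine as soon as it hosts a member, before promotion.
        if member_state in ("learner", "voting") and not hook_present:
            return HookAction.ADD
        return HookAction.NONE

    if not hook_present:
        # Escape hatch: once the Machine is being deleted, a hook removed by
        # someone else is never re-added.
        return HookAction.NONE

    if member_state != "voting":
        # Either the member was never promoted, or it has already been
        # removed from the etcd cluster; the Machine may now be drained.
        return HookAction.REMOVE

    # Still a voting member: keep the hook until the member has been removed.
    return HookAction.NONE


print(desired_hook_action(False, False, "learner"))  # HookAction.ADD
print(desired_hook_action(True, False, "voting"))    # HookAction.NONE
print(desired_hook_action(True, True, "voting"))     # HookAction.NONE
print(desired_hook_action(True, True, "none"))       # HookAction.REMOVE
```

The second call is the restricted-capacity case: a user has removed the hook from a Machine that is being deleted, and the function returns `NONE` even though the Machine still hosts a voting member.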
+ +#### Other controllers may interfere with the mechanism + +We know that some OpenShift users leverage GitOps mechanisms to manage Machines within OpenShift. +If these GitOps systems do not correctly handle server side changes (like additional annotations), +then they may remove the hooks after the etcd operator has added them. +In this case we expect a hot loop where the etcd and GitOps operators fight to add and remove the hook. +We must test this scenario and ensure that external systems aren't going to interfere with the mechanism. + +## Design Details + +### Open Questions + +- Do we want to explicitly block upgrades while scaling operations are happening? What could go wrong if we don't? +- What metrics would be useful to expose from etcd operator and Machine API to help monitor the progress of these + operations? +- What metrics are we likely to want to pull back from customer clusters into CCX? + +### Test Plan + +We will need to develop a new E2E suite that exercises the replacement process outlined in this proposal. + +In particular, we will need to write the following into a test case: +- Bring up a new cluster and check that all is healthy +- Delete a control plane machine and check that it does not get removed +- Check that etcd quorum is still intact - etcd operator should degrade when the cluster is degraded +- Create a new control plane machine to replace the deleted machine +- Monitor etcd to ensure that it does not degrade during the replacement procedure +- Wait until the old control plane machine is removed from the cluster + +### Graduation Criteria + +#### Dev Preview -> Tech Preview + +TBD + +#### Tech Preview -> GA + +TBD + +#### Removing a deprecated feature + +This proposal introduces a new internal OpenShift safety mechanism. +No features will be deprecated or removed during the implementation of this proposal. 
+ +### Upgrade / Downgrade Strategy + +Note: For the purpose of this section, assume version 4.N is the version in which this feature is introduced. + +#### Upgrading version 4.N-1 to version 4.N + +When upgrading from version 4.N-1 to version 4.N, the etcd operator will introduce the new hooks onto the Machines. +These hooks take the form of annotations and as such can be added straight away without any new API rollout. + +Once in place, the new hooks will be observed by the Machine API Controllers and the new mechanism will be active. + +The process is described in more detail above in the [Adoption of an existing Control Plane into the new mechanism](#Adoption-of-an-existing-Control-Plane-into-the-new-mechanism) section. + +#### Downgrading version 4.N to version 4.N-1 + +On downgrades, the hooks added by the etcd operator will persist on the Machine object. +However, as soon as the Machine API Controllers are downgraded, the hooks will no longer be enforced. + +We will time the release of the [Machine Deletion Hooks](https://github.com/openshift/enhancements/pull/862) feature in +Machine API such that it is only active from version 4.N, preventing the need for specific downgrade logic as part of +this proposal. + +### Version Skew Strategy + +We do not expect users to resize their Control Planes during upgrade operations and as such we should see no version skews. + +The mechanism will only be effective once the Machine API Controllers and etcd operator are upgraded, +however, neither depends on the other to be operational and as such, no issues should occur during the introduction of +this feature. + +### Operational Aspects of API Extensions + +This enhancement does not introduce any new API extensions. +Therefore no operational details are required. + +#### Failure Modes + +N/A + +#### Support Procedures + +N/A + +## Implementation History + +No implementation of this proposal currently exists. 
+ +## Drawbacks + +- This design means that the etcd operator has to understand a new API type, which adds complexity to the etcd operator + and ties it to the Machine API. + This may mean additional complexity if this mechanism were to be needed in the Centrally Managed Infrastructure + project or if OpenShift were to migrate to Cluster API. + +## Alternatives + +### Use a separate component for this mechanism + +Rather than embedding the described mechanism within the etcd operator itself, we could create a new component within +OpenShift that focuses explicitly on the coordination of the etcd cluster lifecycle and the Machine lifecycle. +This new component would then be able to be lifecycled separately and could potentially be adapted to leverage Cluster +API as an alternative if needed in the future. +Since it is not clear if this is an immediate need, we believe that the extra effort of creating a new component is not +necessary during the first iteration of this proposal, but the mechanism could be extracted into a separate component +in the future. + +### Leverage a PDB to prevent disruptions + +We could consider using a PDB (eg. etcd-quorum-guard) to prevent the removal of Machines by blocking the Machine from +being drained. However, by blocking the Machine controller from draining a Machine, we also block other components from +draining the Machine. This would in turn block normal day to day operations such as upgrades, where normally, MCO will +drain a Machine and reboot it to apply updates. + +Today, the etcd quorum guard protects the etcd cluster quorum by preventing more than 1 etcd member from being +disrupted at any one time. +These small interruptions for updates are tolerable as the etcd member should have a relatively small diff in its data +when it starts back up, meaning the cluster is degraded for only a short period. 
+ +When replacing the Machine, it is preferable to ensure that the replacement member is promotable before removing the +old member. This minimises the duration in which the etcd cluster could become degraded. +To prevent that member being removed, we would need a PDB that does not allow the MCO to drain nodes, +and as such, a PDB isn't a suitable mechanism for this use case. + +## Infrastructure Needed + +No additional infrastructure will be needed as a result of this proposal.