Conversation

@JoelSpeed (Contributor)

This adds the initial version of the design for the ControlPlaneSet CRD and operator to pair with it.

The aim here is to allow automated vertical scaling of Control Plane Machines within OpenShift HA clusters.
We have designed a protection mechanism (#943) which handles most of the safety aspects of the system separately.
This enhancement is purely about the management of the Machines themselves: how they will be updated when they need to be, and how users can interact with the new CRD.

@openshift-ci openshift-ci bot requested review from dgoodwin and jcantrill January 17, 2022 12:01
## Motivation

As OpenShift adoption increases and our managed services offerings become more popular, the manual effort required to
scale Control Plane Machines as clusters grow or shrink is becoming a significant drain on the OpenShift Dedicated team.
Member

Suggested change
- scale Control Plane Machines as clusters grow or shrink is becoming a significant drain on the OpenShift Dedicated team.
+ scale Control Plane Machines as clusters grow or shrink is becoming a significant drain on the SRE Platform team managing OpenShift Dedicated (OSD), Red Hat OpenShift on AWS (ROSA), and Azure Red Hat OpenShift (ARO).

### Goals

* Provide a "best practices" approach to declarative management of Control Plane Machines
* Allow users to "1-click" scale their Control Plane to larger or smaller instances
Member

Scaling down is a risky change. Is that really a goal? It either requires an acceptance of risk on the part of the cluster administrator or a lot more checks on the OCP side to ensure the cluster will not become unstable. Has this specifically been requested from SREP / customers?

Contributor Author

As I understood it from the initial requirements gathering calls, yes. We need to scale up the control plane to bigger instances when capacity is being stretched, and be able to scale down to smaller instances when it is being under-utilized.

Having any kind of code check for whether the user is scaling down the instance type sounds like it will be very complex; we would have to start interpreting VM types, which is not something we've done before.

@jeremyeder Could you weigh in here?

A new CRD `ControlPlaneSet` will be introduced into OpenShift and a respective `control-plane-set-operator` will be
introduced as a new second level operator within OpenShift to perform the operations described in this proposal.

The new CRD will define the specification for creating Machines for the Control Plane. The operator will be responsible
Member

Suggested change
- The new CRD will define the specification for creating Machines for the Control Plane. The operator will be responsible
+ The new CRD will define the specification for managing (creating, adopting, updating) Machines for the Control Plane. The operator will be responsible

// for the Control Plane Machines.
type ControlPlaneSetStrategyType string

const (
Member

Is there a reason why in-place upgrades are not listed? I expect this would have a much shorter downtime than a replacement and would mitigate concerns with new machine creation. SRE has an SOP for that on OSD here: https://github.com/openshift/ops-sop/blob/2ff3d49cf7b23353f3fc4928b367230f6dd32972/v4/howto/resize-masters.md

Contributor Author

The Machine API is by design an immutable infrastructure product; we don't support in-place mutation of machines.

As far as I understood it, the idea of manually stopping and resizing machines is a support exception to allow a quicker replacement until we have full automation in place.

IMO, bringing up new machines, while potentially wasteful, should actually be safer in the long term, as we know programmatically what we are doing there; with a shut-down-and-update approach, things get more confusing when you try to program the changes.
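
For orientation, a minimal Go sketch of how the strategy type and its values might look; the constant names here are assumptions based on the strategies discussed in this proposal (RollingUpdate, Recreate, OnDelete), not the final API.

```go
// ControlPlaneSetStrategyType names the update strategy applied to Control Plane Machines.
type ControlPlaneSetStrategyType string

const (
	// RollingUpdateStrategyType creates the replacement Machine and waits for it
	// to become ready before removing the outdated Machine.
	RollingUpdateStrategyType ControlPlaneSetStrategyType = "RollingUpdate"

	// RecreateStrategyType removes the outdated Machine before creating its
	// replacement (discussed later in this proposal as future work).
	RecreateStrategyType ControlPlaneSetStrategyType = "Recreate"

	// OnDeleteStrategyType waits for a user to delete a Machine before the
	// controller creates its replacement.
	OnDeleteStrategyType ControlPlaneSetStrategyType = "OnDelete"
)
```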


When adding a ControlPlaneSet to a new cluster, the end user will need to define the ControlPlaneSet resource by
copying the existing Control Plane Machine ProviderSpecs.
Once this is copied, they should remove the failure domain and add the desired failure domains to the `FailureDomains`
Member

Thinking of the OSD/ROSA/ARO case, this is a lot of changes specific to individual clusters, and the logic seems fairly straightforward. Is there a possibility adoption could be part of the ControlPlaneSet creation, triggered by some specific condition such as an annotation? This would ensure the rollout of this feature across managed OpenShift is done correctly and consistently. The alternative is some bespoke logic created by SRE, released in conjunction with the CPS to all managed clusters, that then has to be cleaned up in a future release. Having this as a part of the CPS operator would allow the deletion and recreation of the CR (previous section) to be nearly trivial from an administrator perspective.

Contributor Author

> Is there a possibility adoption could be part of the ControlPlaneSet creation, triggered by some specific condition such as an annotation?

Could you expand on your concern here? I'm not sure exactly what you're saying. The idea is that the adoption process happens when the CPS is created, so I would expect SRE to apply a new CPS to each individual cluster (possibly in batches) if they want to take advantage of the CPS on older clusters. My recommendation though is that we focus on new clusters where this functionality will be included from the start.

> Having this as a part of the CPS operator would allow the deletion and recreation of the CR (previous section) to be nearly trivial from an administrator perspective.

So your suggestion is that if you create an empty CPS, the ControlPlaneSet operator should populate the spec for you? This feels like an anti-pattern for Kube and the idea of declarative APIs. I'm hesitant about us suggesting this as there's no guarantee that the existing configs are accurate, eg if the machines were on a UPI cluster (they should have been removed, but maybe they weren't); at that point, we have introduced a spec that is invalid. If we infer it and get it wrong, that's on us; if a user specs it and gets it wrong, that's on them 🤔

We should also bear in mind the scope of this project: adding adoption into the mix, and having to deal with the safety aspects of that, will significantly expand the scope and extend the development time, which is likely to already be tight if we want to get something running for this within 4.11.

Member

Manually managing these for the managed OCP fleet isn't an option. But understood the concern with automatically creating the spec for CPS. For my use case though it's IPI 100% of the time. I am also not suggesting we populate CPS all the time; would it be a more tenable solution if it were opt-in?

@JoelSpeed (Contributor Author), Jan 19, 2022

I was expecting that the managed fleet would come up with some automation for creating the CPS; I don't think this warrants being in the core of the product though, and I'm concerned about the scope creep.

We could possibly have an opt-in adoption mechanism, but if we were to do that, it would have to be a stretch goal and not part of the MVP IMO; I wouldn't want to jeopardise the main project for this. I'm also not sure I'd want to expose this to all users, which makes me think again that this should be something separate. What is your hesitation about running some job on the cluster to create the CPS? Is it just the clean up?

Member

I can't speak for the use case across all OCP customers but it seems like a useful pattern to adopt via code in the operator instead of documentation and steps executed by a customer. There are a lot of clusters out in the wild, I presume management of masters is a useful capability. I know from the managed OCP perspective it's been a gap that is getting more attention lately.

So I'm not necessarily reluctant to run a job to create CPS resources, though the mechanics on how we can roll that out will make it painful. I'm asking if that's the right way to do it for OCP and our customers.

Contributor

i wonder if a valid option here would be for us to have a way for the user to create a local manifest that could be applied as a CPS?

for example, the user could issue an oc command, and it would return a manifest they could save and then inspect before applying. that might provide a nice middle ground where it could be automated but doesn't need to be.

Contributor Author

> for example, the user could issue an oc command, and it would return a manifest they could save and then inspect before applying. that might provide a nice middle ground where it could be automated but doesn't need to be.

We talked about this during the cluster lifecycle arch call last week. There were others with concerns about having this as an automated thing in-cluster: if we automate it and break the cluster, that's on us; if the customer breaks it, that's on them.

There was some support for a CLI kind of thing/script that would spit out a CPS manifest for the customer to apply if they wanted, with the caveat that they should verify it before they apply it. My initial stance was that this should be a stretch goal though; for the purpose of this enhancement and feature we have been focusing on the experience for new clusters, and older clusters can adopt later IMO.

#### A user deletes the ControlPlaneSet resource

Users will be allowed to remove ControlPlaneSet resources should they so desire. This shouldn't pose a risk to the
control plane as the ControlPlaneSet will orphan Machines it manages before it is removed from the cluster. More detail
Member

What happens if the finalizer on the CPS is removed before orphaning Machines is complete? I guess the machines get deleted due to owner references when CPS is deleted? Admins, especially ones under pressure and who haven't learned not to touch finalizers, will do this.

Contributor Author

> I guess the machines get deleted due to owner references when CPS is deleted?

Yep, the machines will be deleted, but they should also be protected from removal by the etcd operator's machine deletion hooks, so you'll have deleted Machines sitting around for a while doing nothing, an alert will fire, and the only way to resolve it would be to re-add a CPS, have it replace the Machines, and then remove the CPS properly.

Users should not interfere with finalizers, especially in sensitive applications like this one.

control plane as the ControlPlaneSet will orphan Machines it manages before it is removed from the cluster. More detail
is available in the [Removing/disabling the ControlPlaneSet](#Removingdisabling-the-ControlPlane) notes.

#### The ControlPlaneSet spec may not match the existing Machines
Member

If adoption of existing machines is managed by this operator the risk is mitigated a good bit, at least in the case where master machines are consistent with each other except in the failure domain (zone).

Contributor Author

I don't feel that really adds much for us though, as the only safe way for the operator to do an adoption like this is when the masters are all consistent, in which case, there's no risk anyway

## Alternatives

Previous iterations of this enhancement linked in the metadata describe alternative ways we could implement this
feature.
Contributor

Given our general desire to stop adding new things to the release payload, and the fact that this new feature seems to be opt-in, did you explore delivering this new operator via OLM instead of the payload?

Contributor Author

I was imagining that this would eventually be (4.12) default and set up by the installer, so I hadn't investigated the idea of OLM. If we do want to make this opt-in permanently then having it delivered via OLM could be an option.

When I've been talking with the etcd team about this in the past, we had imagined they would also observe the CPS resource to understand control plane scale in the longer term, as they don't currently have a good way to see how many replicas they should have.

Contributor Author

I'm going to add an open question about whether this is optional or not.

reviewers:
- TBD
approvers:
- TBD
Contributor

@soltysh @tkashem @hasbro17 and myself please.

@elmiko (Contributor) left a comment

@JoelSpeed i think this looks nice and i appreciate the level of detail. i left a few comments/questions inline. i also want to say that i like the idea of creating this as an operator with an operand, regardless of whether we deploy as part of the release payload or as something on OLM.

containing the business logic of the control plane set activity in an operator seems like a good way for us to plan for a future where we have more cluster-api integration, and also be flexible to other structural changes like that.

lastly, as kind of a cheeky question, given the discussion of OLM and deploying this operator through the standard marketplace, is there anything that would prevent a developer from creating this type of operator and releasing it on the community operator hub for use with okd?
(with credit to @lobziik who has also mentioned something like this)

prevent the Control Plane Machine from being removed until a replacement for the etcd member has been introduced into
the cluster. To allow for this use case, the end user can manually remove the etcd protection (via removing the Machine
deletion hook on the deleted Machine) allowing the rollout to proceed, the etcd operator will not re-add the
protection if the Machine is already marked for deletion.
Contributor

will the operator be able to detect that these manual intervention steps have been performed?

i'm curious how a user might discover that they should have done it, or what might happen if they forget.

(no worries if this falls under one of the open questions below)

Contributor Author

The operator will see that the Machine is not being drained (Machines have a condition for this), and it can report that back in its own status: machine foo isn't being drained because etcd is blocking it.

Then, when the etcd operator's pre-drain hook is removed by the end user, this will go away, the drain will occur, and the Machine goes away. The ControlPlaneSet will then spot that the Machine has gone and create the replacement.
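
For orientation, a rough sketch of what "removing the Machine deletion hook" could look like programmatically. The `lifecycleHooks` field shape follows my reading of the machine deletion hooks enhancement, and the etcd hook owner string is a placeholder; the real names may differ.

```go
package example

import (
	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
)

// removeEtcdPreDrainHook drops the etcd operator's pre-drain hook from a
// Machine so that the drain, and therefore the rollout, can proceed.
// "clusteroperator/etcd" is a placeholder for whatever owner string the etcd
// operator actually records on its hook.
func removeEtcdPreDrainHook(machine *machinev1beta1.Machine) {
	kept := machine.Spec.LifecycleHooks.PreDrain[:0]
	for _, hook := range machine.Spec.LifecycleHooks.PreDrain {
		if hook.Owner != "clusteroperator/etcd" {
			kept = append(kept, hook)
		}
	}
	machine.Spec.LifecycleHooks.PreDrain = kept
}
```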

When a change is required, any logic for rolling out the Machine will be paused until the ControlPlaneSet controller
observes a deleted Machine.
When a Machine is marked for deletion, it will create a new Machine based on the current spec.
It will then proceed with the replacement as normal.
Contributor

is there any option here for the user to specify that it should wait until the machine is actually deleted from the infrastructure before proceeding?

i'm thinking about resource constrained environments, and this sounds like it could be similar in some respects to the Recreate strategy.

Contributor Author

A Machine should never be removed from the cluster if its instance isn't gone from the provider, so that's how we would know.

For resource constrained environments we have the Recreate strategy; this strategy is just for trialing new configs or slowing down the rollout by manually choosing when each control plane member is replaced.

Notably it will need to ensure that there are no owner references on the Machines pointing to the ControlPlaneSet
instances. This will prevent the garbage collector from removing the Control Plane Machines when the ControlPlaneSet is
deleted.
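
A minimal sketch, using controller-runtime, of how the deletion path could strip its own owner references from the Control Plane Machines before the finalizer is removed. The namespace, selector label, and helper name are illustrative, not the real controller's code.

```go
package example

import (
	"context"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// orphanControlPlaneMachines removes owner references pointing at the
// ControlPlaneSet (identified by its UID) so the garbage collector will not
// delete the Machines once the ControlPlaneSet itself is deleted.
func orphanControlPlaneMachines(ctx context.Context, c client.Client, cpsUID types.UID) error {
	machines := &machinev1beta1.MachineList{}
	// Illustrative selector: the real controller would use the ControlPlaneSet's own selector.
	if err := c.List(ctx, machines, client.InNamespace("openshift-machine-api"),
		client.MatchingLabels{"machine.openshift.io/cluster-api-machine-role": "master"}); err != nil {
		return err
	}

	for i := range machines.Items {
		machine := &machines.Items[i]
		kept := []metav1.OwnerReference{}
		for _, ref := range machine.OwnerReferences {
			if ref.UID != cpsUID {
				kept = append(kept, ref)
			}
		}
		if len(kept) == len(machine.OwnerReferences) {
			continue // no reference to this ControlPlaneSet
		}
		machine.OwnerReferences = kept
		if err := c.Update(ctx, machine); err != nil {
			return err
		}
	}
	return nil
}
```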

Contributor

if we make the ControlPlaneSet a standard component would this capability go away? (eg user cannot delete ControlPlaneSet)

Contributor Author

I don't think so; I think we will still want to allow users to opt out of this feature (especially with regard to the composable OpenShift concept).



## Proposal

A new CRD `ControlPlaneSet` will be introduced into OpenShift and a respective `control-plane-set-operator` will be
Contributor

technically we have one control plane, not a set. So technically ControlPlaneMachineSet is the right term

Contributor Author

Yes, you're correct. Naming is not something I'm 100% sold on. I originally didn't include Machine in the name because I was trying not to make it sound like a MachineSet (which is akin to a ReplicaSet); I want users to think of this more as a StatefulSet. I was also wondering how it would look in docs, IIRC they write the names of resources out, so you could see "worker machine set" as an item in today's docs, and this resource would be a "control plane machine set"; is that confusing?

I'm not averse to calling this a ControlPlaneMachineSet though. Perhaps bringing docs into the naming conversation might yield some interesting discussion. @jeana-redhat Do you have any thoughts about what would be clearer to an end user in terms of what we want to name this CRD and operator? Currently it's ControlPlaneSet, but Stefan has correctly pointed out that it would be more accurate as ControlPlaneMachineSet.

@jeana-redhat, Jan 24, 2022

I am definitely concerned that a user reading about "machine sets" would see a "control plane machine set" as a subset or type of machine set. If there is a more accurate term that also can't cause accidental conflation of terms, that would be ideal.

NB: this would still be a point of possible confusion even if we are using resource names. ControlPlaneMachineSet also sounds like a type or subset of MachineSet.

@elmiko (Contributor), Jan 24, 2022

i agree with @sttts reasoning about this being a ControlPlaneMachineSet, but i also agree about the naming confusion with respect to using MachineSet in the name.

we chatted about this during our team standup and i'm curious if we could go in a divergent direction by calling this a ControlPlaneInfrastructureSet? this would make it clearer what the resource is defining and also leave us room for expansion in the future as we consider different types of infrastructures and APIs that openshift will use.

i don't have a strong opinion about the name, but i like the idea of making this more clear to users.

// +kubebuilder:validation:Enum:=3;5
// +kubebuilder:default:=3
// +kubebuilder:validation:Required
Replicas *int32 `json:"replicas"`
Contributor

why a pointer?

Contributor Author

I was under the impression that it's pretty normal for replicas to be a pointer within Kube; it allows distinction between an unset value and an explicit 0, both of which are valid in the case of a Deployment or StatefulSet.

Perhaps in this case 0 replicas doesn't make sense though? In that case we maybe don't need the pointer, though I'm still tempted to keep it to be in line with all of the other resources with a scale subresource that I've seen.
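
To illustrate the pointer point with a trivial, self-contained example (not code from the proposal): a `*int32` lets the API distinguish "not set, apply the default" from "explicitly zero".

```go
package main

import "fmt"

// describe shows why replicas is commonly a pointer in Kubernetes APIs:
// nil means "unset" and is distinguishable from an explicit zero.
func describe(replicas *int32) string {
	if replicas == nil {
		return "replicas unset: defaulting applies (3 for this CRD)"
	}
	return fmt.Sprintf("replicas explicitly set to %d", *replicas)
}

func main() {
	var zero, three int32 = 0, 3
	fmt.Println(describe(nil))    // unset
	fmt.Println(describe(&zero))  // explicit 0 (valid for Deployments, probably not here)
	fmt.Println(describe(&three)) // explicit 3
}
```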

// Strategy defines how the ControlPlaneSet will update
// Machines when it detects a change to the ProviderSpec.
// +optional
Strategy ControlPlaneSetStrategy `json:"strategy,omitempty"`
Contributor

needs a default

Contributor Author

I've added this; could you validate that the format is correct? A default tag for a struct is new to me.
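
For reference, a sketch (continuing the strategy-type sketch above) of how the defaulting could be expressed with controller-gen markers. The field-level default on `Type` is standard kubebuilder syntax; the commented struct-level default is my recollection of the marker format and is an assumption.

```go
// ControlPlaneSetStrategy defines how Machines are updated; the shape is illustrative.
type ControlPlaneSetStrategy struct {
	// Type of update strategy. Defaults to RollingUpdate.
	// +kubebuilder:default:=RollingUpdate
	// +optional
	Type ControlPlaneSetStrategyType `json:"type,omitempty"`
}

// On the parent spec, the whole struct could also be defaulted, e.g.:
//
//	// +kubebuilder:default:={type: RollingUpdate}
//	// +optional
//	Strategy ControlPlaneSetStrategy `json:"strategy,omitempty"`
```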

// Template describes the Control Plane Machines that will be created
// by this ControlPlaneSet.
// +kubebuilder:validation:Required
Template ControlPlaneSetTemplate `json:"template"`

Contributor Author

I don't think this is meant to be an embedded resource; it doesn't have a TypeMeta.

The ProviderSpec is an embedded resource however. Though this has just prompted me to realise that we aren't following the spec.template.spec pattern that we have elsewhere; I think the providerSpec needs to be replaced with a reference to https://github.com/openshift/api/blob/d74727069f6fb9193d6ba50ed7e3a5edb2477bec/machine/v1beta1/types_machine.go#L163-L200

It will take the desired configuration, the `ProviderSpec`, and ensure that an appropriate number of Machines exist
within the cluster which match this specification.

It will be introduced as a new second level operator so that, if there are issues with its operation, it may report
Contributor

does this mean it will have a distinct release cycle to OCP ? Is that feasible?

Contributor Author

It was my understanding that a second level operator was the term used to describe the group of operators deployed by CVO, perhaps that's not the correct wording?

I am fully intending for this to be an in-payload project deployed around the same run level as Machine API. Though, due to the impact this controller might have, I was planning to have it as a separate ClusterOperator so that it has a distinct status it can report for cluster health, rather than trying to shoehorn that into the Machine API one

We had some discussion about this being an OLM deployed operator during the Cluster Lifecycle architecture call on Thursday 20th, though I'm still leaning towards this being in payload.

Note: I've clarified this with staff engineers, a second level operator is something deployed by CVO, sometimes "CVO operator" or "OLM operator" are used, I could update the wording to clarify this

// appropriate location for the particular provider.
// +kubebuilder:validation:MinLength:=1
// +kubebuilder:validation:Required
FailureDomains []string `json:"failureDomains"`
Contributor

can this mean that users put regions here? Looks like it encourages that. Cross-region etcd is not supported.

Contributor Author

Ahh yes, I see what you mean, the wording on this isn't the best. The intention is only for availability zones to be put here. The issue we face is that they are called different names across different cloud providers. IIRC the failure domain label on a node is the AZ typically, so the nomenclature has come from that. I will try to clarify the description.
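
To make the intent concrete (example values only): the entries are availability zones within one region, named per the provider, never regions themselves.

```go
package example

// AWS: availability zones within a single region.
var awsFailureDomains = []string{"us-east-1a", "us-east-1b", "us-east-1c"}

// GCP: zones within a single region (cross-region etcd is not supported).
var gcpFailureDomains = []string{"us-central1-a", "us-central1-b", "us-central1-f"}
```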

As we are not planning to tackle horizontal scaling of the control plane initially, we will implement a validating
webhook to deny changes to this value once set.
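
A minimal sketch of the check such a webhook could perform on update, assuming it compares the old and new objects; the helper name and error details are illustrative.

```go
package example

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateReplicasUnchanged rejects any change to spec.replicas once it has
// been set, since horizontal scaling of the control plane is not supported.
func validateReplicasUnchanged(oldReplicas, newReplicas *int32) *field.Error {
	path := field.NewPath("spec", "replicas")
	if newReplicas == nil {
		return field.Required(path, "replicas must be set")
	}
	if oldReplicas != nil && *oldReplicas != *newReplicas {
		return field.Invalid(path, *newReplicas,
			fmt.Sprintf("replicas is immutable once set (current value %d)", *oldReplicas))
	}
	return nil
}
```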

For new clusters, the installer will create (timeline TBD) the ControlPlaneSet resource and will set the value based on
Contributor

I thought this should be a second-level operator. How can the installer then create a CR ?

Contributor Author

A second level operator is deployed by CVO so we can have the installer create a CR, as for example it does with Machine API. See my other comment for more detail

The RollingUpdate strategy will be the default. It mirrors the RollingUpdate strategy familiar to users from
Deployments. When a change is required to be rolled out, the Machine controller will create a new Control Plane
Machine, wait for this to become ready, and then remove the old Machine. It will repeat this process until all Machines
are up to date.
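
In sketch form, the ordering described above (create, wait for ready, then delete, one Machine at a time) might look like the following; the `Machine` type and the helper interface are stand-ins, not the real controller code.

```go
package example

import "context"

// Machine is a stand-in for the real Machine API object.
type Machine struct {
	Name string
}

// rollout's methods are placeholders for the real controller's logic.
type rollout interface {
	NeedsUpdate(m Machine) bool
	CreateReplacement(ctx context.Context, m Machine) (Machine, error)
	WaitForReady(ctx context.Context, m Machine) error
	// Delete is still gated by the etcd operator's Machine deletion hooks.
	Delete(ctx context.Context, m Machine) error
}

// rollingUpdate replaces outdated Control Plane Machines one at a time,
// never deleting a Machine before its replacement has become ready.
func rollingUpdate(ctx context.Context, r rollout, machines []Machine) error {
	for _, m := range machines {
		if !r.NeedsUpdate(m) {
			continue
		}
		replacement, err := r.CreateReplacement(ctx, m)
		if err != nil {
			return err
		}
		if err := r.WaitForReady(ctx, replacement); err != nil {
			return err
		}
		if err := r.Delete(ctx, m); err != nil {
			return err
		}
	}
	return nil
}
```
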
Contributor

Why does this even matter? With the other enhancement about etcd machine hooks, wouldn't there always be some kind of rolling update? Or would the machine api controller create the machine anyway and just the etcd membership would be "rolling"? Maybe reference the etcd machine hook enhancement here.

Contributor Author

I can add a reference for sure

I think the point I was trying to make is that for example, the instance size would be a rolling update as we create and delete machines to migrate over. You're right though, it isn't massively an issue, you could create 3 new machines and then delete all three old ones and etcd's protection mechanism should handle it. I think in previous discussions though we concluded it would be best to only surge a single instance if we can help it, to reduce the likelihood of etcd having issues with balancing the etcd members across zones (the zonal information is under MAPI control not etcd)


The Recreate strategy mirrors the Recreate strategy familiar to users from Deployments.
When a change is required to be rolled out, the Machine controller will first remove a Control Plane Machine, wait for
its removal, and then create a new Control Plane Machine.
Contributor

this won't work with etcd machine hook, will it?

Contributor

ack, you describe it two paragraphs further down 👍

Contributor Author

From a discussion in the architecture call last Thursday, I think we will make this a future goal. It's not required for our current use cases with customers so we can describe it here, but leave it out of the 4.11 plan.

This will give us time to figure out the kinks and how we handle the etcd protection question before we actually implement this part of the overall feature

- What alternative can we provide to satisfy resource constrained environments if we do not implement this strategy?
- Should we teach the etcd operator to remove the protection mechanism when the ControlPlaneSet is configured with this
strategy?
- To minimise risk, should we allow this strategy only on certain platforms? (Eg disallow the strategy on public clouds)
Contributor

an idea: allow it, even automate the whole mechanism, but let the user acknowledge it with some very clear value to set: RecreateDangerUnsupportedYouWillLoseYourDataPotentiallyDontWineDontComplain.

Contributor Author

We totally could rename the strategy to make it more obvious that it's dangerous; I'm not against that, but for now I might park this (add another open question) as we are now planning on skipping this strategy in the first iteration.


As some users may want to remove or disable the ControlPlaneSet, and we have no bulletproof way to prevent a user
deleting the ControlPlaneSet, a finalizer will be placed on the ControlPlaneSet to allow the controller to ensure a
safe removal of the ControlPlaneSet, while leaving the Control Plane Machines in place.
Contributor

you have an admission webhook. It can prevent deletion.

Contributor Author

You're right, we could set the failure policy to block deletion requests when the admission webhook is down, though I don't think we want to block deletion altogether, or else we block users from opting out of this feature.

I've removed the note about "bulletproof" as that was misleading.


1. How do we wish to introduce this feature into the cluster, should it be a Technical Preview in the first instance?
2. Do we want this to be an optional add-on (delivered via OLM) or do we want this to be a default operator for future
clusters?
Contributor

Is a distinct life-cycle from the MCO and the etcd operator even feasible?

Productizing things via cPaaS is still super complex. I would rather see the development time spent on making this mechanism rock-solid than waste it on wrongly understood modularity of the platform.

Contributor Author

I will add a section talking about why this should be in payload, based on some discussion from an arch call last week. I agree, I don't think this should be OLM based personally, though there are some who disagree, so I will write up the reasons/my argument for it being in payload to see if I can convince them otherwise.

- We expect these operations to be very infrequent within a cluster once established, so the impact should be
minimal
- The failure policy will be to fail closed, as such, when the webhook is down, no update operations will succeed
- The Kube API server operator already detects failing webhooks and as such will identify the issue early
Contributor

Add: the ControlPlaneSet is not critical for cluster operation. Everything done by the operator described here can be done manually to the machine(set) objects.

Contributor Author

Good suggestion, thanks.

@JoelSpeed (Contributor Author) left a comment

I've pushed some updates and some comments on each of the existing pieces of feedback that have been added so far, thanks for the reviews.

I have a todo list to get to this week, it includes:

  • Failure Domains: These need to be explained more and discuss how this works across different platforms
  • How does this fit in with Cluster API
  • Clarifying that this will be installed by default and the reasons for doing so
  • Making the Recreate strategy a future feature and exclude it from the initial scope
  • Add some notes about testing
  • Anything I've missed from this list that was called out during the cluster lifecycle call last week



@JoelSpeed (Contributor Author)

@staebler I've added an extra note to the API description that clarifies the subnet defaulting behaviour. I think I've resolved all of your comments now

Comment on lines 292 to 293
// If no subnet reference is provided, the Machine will be created in the first
// subnet returned by AWS when listing subnets within the provided availability zone.
Contributor

nit on spacing

Suggested change
// If no subnet reference is provided, the Machine will be created in the first
// subnet returned by AWS when listing subnets within the provided availability zone.
// If no subnet reference is provided, the Machine will be created in the first
// subnet returned by AWS when listing subnets within the provided availability zone.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 2, 2022
@JoelSpeed (Contributor Author)

I sent an email out last time this went stale; there has been no further activity and no changes as we are implementing. I think we should merge this now. CC @sdodson

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 3, 2022
@enxebre (Member)

enxebre commented May 3, 2022

Thanks for the exhaustive doc @JoelSpeed.
Just a note: I know there's been a long discussion around naming. Even though ControlPlane + Machine + Set makes sense technically, I find it very unfortunate that out of a universal range of possibilities for the name we chose the only one that clashes with other existing technology that behaves differently, i.e. MachineSet. As a user I would find it very confusing and would find it very difficult to understand the decision. No need to extend the discussion in this PR, but you might want to reconsider before shipping.

underlying hardware of my control plane and have these changes safely applied using immutable hardware concepts
- As a cluster administrator of OpenShift, I would like to be able to control rollout of changes to my Control Plane
Machines such that I can test changes with a single replica before applying the change to all replicas
- (Future work) As a cluster administrator of OpenShift with restricted hardware capacity, I would like to be able to

Does that mean the current approach only handles scaling up, not scaling down? I think it's important to make that clear, and to ensure that scaling down throws an unsupported error.

Contributor Author

The current approach is handling replacement with scale up and down; the point of this line in particular is that currently we require surging the capacity as part of the replacement. This story in particular is referencing a recreate-style replacement rather than a rolling-update-style replacement.


That sounds like a limitation that should be explicitly expressed.

Contributor Author

This is already expanded upon in the "The Recreate update strategy (FUTURE WORK)" section, but I can reword this here if you think there's a way I can make this clearer in the goals.

Contributor Author

Another note on this: this project has been designed with both cloud and local datacenter solutions in mind, though the configuration changes we are targeting, eg resizing, don't make sense in baremetal environments. So we aren't sure whether we actually want to support this Recreate strategy ever, but we left it in for posterity to facilitate a future conversation about this.

// More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
// Labels are required to match the ControlPlaneMachineSet selector.
// +kubebuilder:validation:Required
ObjectMeta ControlPlaneMachineSetTemplateObjectMeta `json:"metadata"`

Not sure why not just use metav1.ObjectMeta and set the fields you care about?

Contributor Author

As a rule in OpenShift APIs we avoid reusing upstream objects in our specs as they add fields to our APIs that we don't care about. For example, having the whole metav1.ObjectMeta here would mean fields like the name or creation timestamp are now a part of this API, yet they have no meaning to the controller and if a user sets them, are ignored.

It's better to have a struct with only those fields we actually care about so that the generated schema can give better feedback to the user about what is and isn't supported.
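
A sketch of the restricted metadata type this approach implies; the exact field set is an assumption.

```go
// ControlPlaneMachineSetTemplateObjectMeta carries only the metadata fields the
// controller actually honours, rather than the whole of metav1.ObjectMeta.
type ControlPlaneMachineSetTemplateObjectMeta struct {
	// Labels must include the labels matched by the ControlPlaneMachineSet selector.
	// +kubebuilder:validation:Required
	Labels map[string]string `json:"labels"`

	// Annotations to apply to the generated Machines.
	// +optional
	Annotations map[string]string `json:"annotations,omitempty"`
}
```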

// OpenShiftMachineV1Beta1Machine defines the template for creating Machines
// from the v1beta1.machine.openshift.io API group.
// +kubebuilder:validation:Required
OpenShiftMachineV1Beta1Machine *OpenShiftMachineV1Beta1MachineTemplate `json:"machines_v1beta1_machine_openshift_io,omitempty"`

*Spec should be the type, and Template is usually the field name. See https://github.com/kubernetes/api/blob/9b88471ea2b12de1a16e6c140bd1131c349f3811/apps/v1/types.go#L184 or other fields in that file.

Contributor Author

Because this is a discriminated union, the template is a one-of choice, so we needed to add an extra layer of indentation. We are otherwise compliant with the pattern from an end-user/JSON-structure perspective.
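
A minimal sketch of the discriminated-union shape being described, following the field names in the snippet above; the discriminator field and its value are assumptions.

```go
// OpenShiftMachineType is the discriminator value for the Machine API template.
const OpenShiftMachineType = "machines_v1beta1_machine_openshift_io"

// ControlPlaneMachineSetTemplate is a discriminated union: the discriminator
// selects which of the (pointer) template members is populated.
type ControlPlaneMachineSetTemplate struct {
	// MachineType selects the template member in use.
	// +unionDiscriminator
	// +kubebuilder:validation:Required
	MachineType string `json:"machineType"`

	// OpenShiftMachineV1Beta1Machine is used when MachineType is
	// "machines_v1beta1_machine_openshift_io".
	// +optional
	OpenShiftMachineV1Beta1Machine *OpenShiftMachineV1Beta1MachineTemplate `json:"machines_v1beta1_machine_openshift_io,omitempty"`
}
```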

Comment on lines +117 to +118
// +kubebuilder:validation:Required
Replicas *int32 `json:"replicas"`

So, I'm a bit confused: this enhancement is meant for vertical scaling, since horizontal was explicitly mentioned as a non-goal, but looking through the API I can only see Replicas, which is horizontal scaling, rather than machine descriptions denoting vertical scaling. What am I missing?

Deployments. When a change is required to be rolled out, the Machine controller will create a new Control Plane
Machine, wait for this to become ready, and then remove the old Machine. It will repeat this process until all Machines
are up to date.
During this process the etcd membership will be protected by the mechanism described in a [separate enhancement](https://github.com/openshift/enhancements/pull/943),

This merged so probably worth fixing the link.


There are however a few concerns about this idea which means we are planning to not implement this (at least for the
first iteration of the project):
- It may encourage users to attempt to use `ControlPlaneMachineSets` with UPI clusters which are incorrectly configured

Theoretically in the manual case you have a similar problem; the question is, are you planning on putting the same limitation on UPI as in the etcd enhancement?

Contributor Author

The controller is designed in a way such that, without an existing working set of Control Plane Machines it will degrade and not take any action, we will inform the user via the status of actions they need to take to resolve the issue.

So yes, unless the user has correctly configured a set of Machines, we will not do anything, in the same way the etcd protection does.

@sdodson (Member) left a comment

This all looks fine to me, I appreciate the attention to making this a smooth transition and that it's being made optional from the start.

Comment on lines +454 to +457
The `FailureDomains` field is expected to be populated by the user/installer based on the topology of the Control Plane
Machines. For example, we expect on AWS that this will contain a list of availability zones within a single region.
Note, we are explicitly not expecting users to add different regions to the `FailureDomains` as we do not support
running OpenShift across multiple regions.
Member

@patrickdillon Do we provision all the necessary infrastructure in all AZs within a region or will there need to be some inspection of AZs to determine which are fully capable of supporting a control plane node?

Member

Reading further, I guess perhaps we're just trusting the admin or installer to provide valid input?

Contributor Author

If the user doesn't provide valid input, then the new Machine will fail to come up and the operator will go degraded, prompting them to resolve the issue, so I think we are ok with this.

That said, AFAIK, we do provision across failure domains, eg on AWS I know when I create a US-East cluster, the installer creates 6 machinesets across all 6 of the failure domains, so maybe it does create the infrastructure

@openshift-ci (bot)

openshift-ci bot commented May 12, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2022
@sdodson (Member)

sdodson commented May 12, 2022

/lgtm
/hold
I don't believe anyone else has pending feedback; feel free to remove the hold whenever you feel everyone's required input has been addressed.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 12, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 12, 2022
@JoelSpeed (Contributor Author)

/hold cancel

I have responded to and addressed all of the comments through the various rounds of reviews this enhancement has had. I've had reviews from Machine API, Installer, Workloads, API/Auth, Service Delivery and docs. I have also sent several emails to the relevant parties to ensure they have had adequate time to review, stating my intention to merge if no further comments came through. The rate of comments has slowed to a point where I think it's better to merge this and resolve any follow ups in a separate PR.

/override ci/prow/makdownlint

New requirements were added for markdownlint since this was authored

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 13, 2022
@openshift-ci (bot)

openshift-ci bot commented May 13, 2022

@JoelSpeed: /override requires a failed status context or a job name to operate on.
The following unknown contexts were given:

  • ci/prow/makdownlint

Only the following contexts were expected:

  • ci/prow/markdownlint
  • pull-ci-openshift-enhancements-master-markdownlint
  • tide

In response to this:

/hold cancel

I have responded to and addressed all of the comment through the various rounds of reviews this enhancement has had. I've had reviews from Machine API, Installer, Workloads, API/Auth, Service Delivery and docs. I have also sent several emails to the relevant parties to ensure they have had adequate time to review, stating my intention to merge if no further comments came through. The rate of comments has slowed to a point where I think it's better to merge this and resolve any follow ups in a separate PR.

/override ci/prow/makdownlint

New requirements were added for markdownlint since this was authored

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@JoelSpeed (Contributor Author)

/override ci/prow/markdownlint

@openshift-ci (bot)

openshift-ci bot commented May 13, 2022

@JoelSpeed: /override requires a failed status context or a job name to operate on.
The following unknown contexts were given:

  • ci/prow/makdownlint

Only the following contexts were expected:

  • ci/prow/markdownlint
  • pull-ci-openshift-enhancements-master-markdownlint
  • tide

In response to this:

/override ci/prow/makdownlint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci (bot)

openshift-ci bot commented May 13, 2022

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/markdownlint


In response to this:

/override ci/prow/markdownlint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci (bot)

openshift-ci bot commented May 13, 2022

@JoelSpeed: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 5198f3a into openshift:master May 13, 2022
@JoelSpeed JoelSpeed deleted the control-plane-set branch May 13, 2022 11:27