Skip to content

Conversation

@JoelSpeed
Copy link
Contributor

@JoelSpeed JoelSpeed commented Feb 3, 2022

This PR adds the new ControlPlaneMachineSet CRD to the machine api group as described in openshift/enhancements#1008

To provide an example, a valid ControlPlaneMachineSet will look something like

apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
  selector:
    matchLabels:
      "machine.openshift.io/cluster-api-machine-role": master
      "machine.openshift.io/cluster-api-machine-type": master
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        aws:
        - placement:
            availabilityZone: us-east-1
        - placement:
            availabilityZone: us-east-2
      metadata:
        labels:
          "machine.openshift.io/cluster-api-machine-role": master
          "machine.openshift.io/cluster-api-machine-type": master
      spec:
        providerSpec:
          value:
            // Configuration for the provider specifics goes here

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2022
@JoelSpeed JoelSpeed force-pushed the control-plane-machine-set branch 2 times, most recently from 78d6932 to db4ef19 Compare February 9, 2022 17:16
@@ -0,0 +1,244 @@
package v1beta1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we won't change this and it will be on by default, it shoudl be v1. I'm thinking about writing a script to ban new APIs that try to be beta :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we won't change this

Now that is a good question, we are currently trying to make this as compatible with Cluster API as possible to support future work there, but, we will definitely want to be adding fields to this in the future and possibly deprecating some of it depending on how the Cluster API project stabilises their (v1beta1) API.

I've written a bunch about this in the enhancement which I thinks we may actually want to evolve this in the future.

For example, depending on how things go, we may want to move the current template into a discriminated union (which I believe could be achieved on upgrade using a conversion webhook) to allow a new machineTemplate or some other field to be introduced to support CAPI as well.

I've not really seen how other OpenShift APIs have actually evolved, so I'd be keen to learn more about this and what we are signing ourselves up for with these changes that we are predicting

// failure domain field. This will be overriden when the Machines
// are created based on the FailureDomains field.
// +kubebuilder:validation:Required
Spec MachineSpec `json:"spec"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

! not a reference, an actual MachineSpec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I hate this too! But, it matches our existing APIs (compare this CRD with the MachineSet CRD, they are very very similar). If we had a chance to start our API again, I would love to have a separate CRD for the MachineSpec that is then referenced (as CAPI does), but I don't think we can easily change our existing API to do this and I know historically there were a few strong voices against the proliferation of CRDs within the MAPI project.

I think it would be weird/confusing to have this as a v1 with a reference when the rest of our Machine API just inlines the specs

// More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
// Labels are required to match the ControlPlaneMachineSet selector.
// +kubebuilder:validation:Required
ObjectMeta metav1.ObjectMeta `json:"metadata"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What information is valid to set here and why would I set it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches the pattern of the MachineSet, where the labels and annotations in this objectmeta become the labels and annotations on the newly minted machines (much like in a deployment/replicaset/pod relationship).

I'm gonna take a stab and say you're going to suggest we trim this to our own objectmeta object which supports only the fields we need? (We sort of did this in MachineSet already, can't recall exactly the history of that, will do some archeology and come back)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we added that because of https://bugzilla.redhat.com/show_bug.cgi?id=1702089, for some reason the CRD generation wasn't generating a schema for the objectmeta back then and so the solution was to embed it in our own types so we could get schema validation. We copied from CAPI so that's why we have the fields we have, even though we only actually use labels and annotations 🤔

Template ControlPlaneMachineSetTemplate `json:"template"`
}

type ControlPlaneMachineSetTemplate struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this being used as a MachineSet that has different RBAC permissions? Would this be more naturally modelled as a MachineSet that only the controller you have managing this resource is allowed to update directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great question which I hope the [alternatives section(https://github.com/openshift/enhancements/pull/1008/files#diff-167c14da636a95920dc7f720fd0d0be7490305b02914e4113ee9aebf2f51a3bdR931-R965) in the enhancement can help to answer.

Yes, this is a bit like MachineSets, but with previous iterations of this enhancement which used a collection of MachineSets, there were an awful lots of caveats, gotchas and risks of "user does something" and we don't really have a safe way to react. Having spoken with several of the staff engineers while taking over this work they shared concerns similar to mine that by layering MachineSets, we are exposing a resource in the middle that isn't needed and, given this is the control plane we are trying to manage, could be dangerous if a customer goes in and starts modifying it

@JoelSpeed JoelSpeed force-pushed the control-plane-machine-set branch 2 times, most recently from ac4505e to 37c3a48 Compare March 3, 2022 09:54
@JoelSpeed JoelSpeed changed the title [WIP][OCPCLOUD-1416] Add ControlPlaneMachineSet to Machine API group [OCPCLOUD-1416] Add ControlPlaneMachineSet to Machine API group Mar 3, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2022
@JoelSpeed JoelSpeed force-pushed the control-plane-machine-set branch 2 times, most recently from 143d6d8 to 0058f53 Compare March 3, 2022 11:55
@JoelSpeed
Copy link
Contributor Author

As this project is not part of the no-ff process, I'll apply the labels manually

/label docs-approved
/label px-approved
/label qe-approved

@openshift-ci openshift-ci bot added docs-approved Signifies that Docs has signed off on this PR px-approved Signifies that Product Support has signed off on this PR qe-approved Signifies that QE has signed off on this PR labels Mar 3, 2022
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +genclient
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-namespaced. If someone wants to create an externally managed control plane variant, that would be a different type so they can evolve independently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with this statement

There are a number of arguments for making this namespaced:

  • The rest of the Machine API is namespaced, it is confusing to have part of the API non-namespaced
  • We have an existing design to migrate users from MAPI to CAPI, this mirrors resources and converts them to work with Cluster API. By making this a cluster scoped resource we would have to create some new migration mechanism specifically for this resource, which then would need documenting separately to the rest of Machine APi
  • If this is cluster scoped, we cannot use it with CAPI. CAPI Control Plane implementations (which we intended to make this), must be namespaced - if we make this cluster scoped, there is no point having the discriminated union over future Machine types, we would have to start again to implement a CAPI version of this (this is a technical limitation, Cluster API will add an owner reference to the ControlPlane implementation CR for a namespaced object, you cannot have an owner reference from a namespaced to a cluster scoped object)
  • This project has already been designed such that it could be used in a centralised machine management or Cluster API style management/guest cluster pattern, making it cluster scoped would prevent us from leveraging this within those use cases and we would have to create an identical resource with an identical implementation to make this work with multiple instances within the management cluster
  • Restricting this to cluster scoped seems to add a whole bunch of restrictions which in the future I think we will come to regret, as it will inevitable cause a lot of work for us in the future duplicating this to make it work multiple times

I'm aware of your concern with the fact that logically there should only be one of these in an HA cluster, I believe we can resolve that by restricting it initially to a single instance in a single namespace, we already restrict Machine API to a single namespace, this is something users are familiar with

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counterpoints in favor cluster scoped include making it impossible to let a user create one in the wrong namespace, making it impossible to accidentally grant privileges to namespace level users by using an uncareful *, making it impossible to use a different namespace that may eventually have an impact, and being at a scope that is universally recognized as privleged.

In addition, given that migration would still be one to one it could still be done with a namespaced reference.

With regard to "but we could use the same type for a different topology management mode", it seems extremely likely that the set of capabilities and control desired in self-managed versus externally managed be different as we've seen with other operators where the constraints imposed by external management are different. This is likely to lead to logically divergent schemas that you will you attempt to force into a single type.

However, in spite of all that, I am willing to let this play out if you really want it. But when you start developing divergent schemas, you will need a plan either maintain two different schemas (and thus a very confusing controller on top of it) or develop some killer validation routines that manage to avoid the cyclical re-bootstrapping problem.

// Currently, the only valid value is machine.v1beta1.machine.openshift.io.
// +unionDiscriminator
// +kubebuilder:default:="machines.v1beta1.machine.openshift.io"
// +optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no benefit to optional, let's go required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given there's only one valid value, I thought making it optional and defaulting it to that one valid value would be beneficial to users.

Is removing a default later in the cycle considered a breaking change? Eg once we add a separate machine type to this enum?

// OpenShiftMachineV1Beta1Machine defines the template for creating Machines
// from the v1beta1.machine.openshift.io API group.
// +kubebuilder:validation:Required
OpenShiftMachineV1Beta1Machine *OpenShiftMachineV1Beta1MachineTemplate `json:"machines.v1beta1.machine.openshift.io,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually know. is this a valid json key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell, it seems to be, I managed to create a resource with Kind and this CRD installed, and I can get the value out using jq, eg this works, you just have to make sure you've quoted the key

kubectl get controlplanemachineset cluster -o json | jq ".spec.template.\"machines.v1beta1.machine.openshift.io\""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I am wondering if some libraries won't handle this well, perhaps using underscores would be less likely to cause issues

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I've switched over to underscores which seems to work well as well, I suspect there are json libraries that split on periods so probably not the best choice

// +optional
Type ControlPlaneMachineSetStrategyType `json:"type,omitempty"`

// This is left as a struct to allow future rolling update
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for noting this. It would not have occurred to me otherwise.

}

// AWSFailureDomain configures failure domain information for the AWS platform
type AWSFailureDomain struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a "pick one" type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, in this struct you can have one, or the other, or both, is completely flexible and up to the user. I think in most cases users will need to use both

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation will need to enforce that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you enforce "you must set at least one of these" with kubebuilder validation tags? Perhaps MinProperties does this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yep, just checked that and it works, will double check other placements as well, they all have only one field so will have to make sure they're all required

// +listType=map
// +listMapKey=type
// +optional
Conditions []metav1.Condition `json:"conditions,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: conditions first please.

@JoelSpeed JoelSpeed force-pushed the control-plane-machine-set branch from 0058f53 to 1a56ef5 Compare March 4, 2022 10:55
@JoelSpeed
Copy link
Contributor Author

@deads2k I've switched the machines.v1beta1.machine.openshift.io to use underscores instead as I'm now thinking some json libraries won't like it, even though initial testing looked ok.

Just got one outstanding question on the default, which I've left alone for now, PTAL #1112 (comment)

@JoelSpeed JoelSpeed force-pushed the control-plane-machine-set branch from 1a56ef5 to 4388a4d Compare March 4, 2022 11:22
@JoelSpeed
Copy link
Contributor Author

Actually, the more I think about this, this resource has to be namespace scoped, PTAL #1112 (comment)

@JoelSpeed JoelSpeed force-pushed the control-plane-machine-set branch from 4388a4d to 73953e9 Compare March 4, 2022 16:05
@JoelSpeed JoelSpeed force-pushed the control-plane-machine-set branch from 73953e9 to 89e1891 Compare March 4, 2022 16:46
@deads2k
Copy link
Contributor

deads2k commented Mar 4, 2022

Actually, the more I think about this, this resource has to be namespace scoped, PTAL #1112 (comment)

I suggest against taking this approach and instead building the API that will allow evolution of different things to be different. I've left a comment above explaining why, but I will let this play out.

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 4, 2022
@JoelSpeed JoelSpeed force-pushed the control-plane-machine-set branch from 89e1891 to 94fcd74 Compare March 7, 2022 14:21
@JoelSpeed
Copy link
Contributor Author

Made some minor changes adding missing optional tags for the status fields and making the AWSResourceReference a discriminated union, I think this is now ready to go

Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one spelling error, otherwise looks good to me

// a validation error.
// +union
type AWSResourceReference struct {
// Type determines how the reference will fetch the AWS resouce.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Type determines how the reference will fetch the AWS resouce.
// Type determines how the reference will fetch the AWS resource.

@JoelSpeed JoelSpeed force-pushed the control-plane-machine-set branch from 94fcd74 to f162692 Compare March 7, 2022 14:49
Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 7, 2022
@openshift-bot
Copy link

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@JoelSpeed JoelSpeed force-pushed the control-plane-machine-set branch from f162692 to bf40233 Compare March 7, 2022 15:06
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 7, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 7, 2022

@JoelSpeed: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@elmiko
Copy link
Contributor

elmiko commented Mar 7, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 7, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 7, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, elmiko, JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. docs-approved Signifies that Docs has signed off on this PR lgtm Indicates that a PR is ready to be merged. px-approved Signifies that Product Support has signed off on this PR qe-approved Signifies that QE has signed off on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants