305 changes: 305 additions & 0 deletions enhancements/machine-api/cluster-api-integration.md
@@ -0,0 +1,305 @@
---
title: Cluster API Integration
authors:
- "@JoelSpeed"
- "@alexander-demichev"
reviewers:
- "@elmiko"
- "@Fedosin"
- "@lobziik"
- "@asalkeld"
- "@hardys"
approvers:
- "@elmiko"
- "@enxebre"
- "@asalkeld"
creation-date: 2021-09-16
last-updated: 2021-09-16
status: implementable
---

# Cluster API Integration

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement describes the process of integrating the upstream [Cluster API](https://github.com/kubernetes-sigs/cluster-api)
project into OpenShift standalone clusters.

## Motivation

We would like to give users the ability to use Cluster API for machine management, as an addition or supplement to Machine API.

### Goals

- Run Cluster API controllers for managing infrastructure in a similar way to Machine API.
- Provide forward compatibility between Machine API (MAPI) and Cluster API (CAPI).
- Ensure feature parity between MAPI and CAPI before migration.

### Non-Goals

- Deprecate or remove any existing APIs.
- Stop providing support for Machine API in the near future.
- Provide any automated integration or migration between MAPI and CAPI resources.
- Change current autoscaler behavior to use Cluster API. This will be handled after Technical Preview.

## Proposal

This proposal is about introducing Cluster API alongside Machine API as a technical preview in OpenShift clusters. Cluster API on OpenShift has the potential to unlock new infrastructure providers and community engagement for our users.
During the technical preview we will gather feedback on its usefulness as well as evaluate the feasibility of using Cluster API as a primary infrastructure resource API for OpenShift.

### User Stories

#### Story 1

As an OpenShift developer, I would like to leverage the upstream community Cluster API infrastructure providers and lower the barrier for OpenShift to support new providers.

#### Story 2

As an OpenShift developer, I would like to collaborate with third parties who already have vested interests in maintaining Machine controllers for various infrastructure providers so that I can benefit from their expertise as I add new features.

#### Story 3

As a cloud developer, I would like to easily onboard new infrastructure providers as this process is well documented by the CAPI community and any implementation will be able to be leveraged by both Kubernetes and OpenShift customers, increasing the value of implementing a new provider.

#### Story 4

As a developer, I would like to be able to use the same set of tools for infrastructure management in OpenShift as I can for vanilla Kubernetes.

#### Story 5

As a cloud operator, I would like to be able to use the CAPI infrastructure resource API for managing mixed infrastructure provider clusters.

#### Story 6

As a developer, I would like to have support for hub-spoke OpenShift clusters, where a management cluster can manage workload clusters that are running on different infrastructure providers.

#### Story 7

As a user, I would like to create new MachineSets using CAPI and be able to explore features that are not available in MAPI.

### Implementation Details

First, we need to establish Cluster API resource management by ensuring all required components (CRDs, controllers,
RBAC, secrets) are successfully installed and running within the OpenShift cluster.

Cluster API will only be present in the cluster (installed by a new operator) if and when a user enables a feature gate.
We will introduce a new, OpenShift specific, feature gate `ClusterAPIEnabled` and include it within the `TechPreviewNoUpgrade` FeatureSet.

Once installed, this preview will allow the user to create new MachineSets using CAPI and explore the
features available within CAPI, for comparison with MAPI. For example, availability set support in Azure was
already available in CAPI, but is only being introduced to MAPI as of 4.10.

During this timeframe, any user wanting to migrate or try out the preview will be left to manually migrate the MachineSet
or create a new one, using either the upstream documentation or documentation provided by OpenShift.

This preview will be intended to be used to create day 2 worker MachineSets and is not expected to be integrated into the install process in any way.

It's important to note that the preview will focus on supporting CRDs that our users are familiar with:
Machines, MachineSets and MachineHealthChecks. We are not planning to document support for other CRDs such as MachineDeployment;
however, they will be installed.

### Supported platforms

The technical preview aims to support: AWS, Azure, GCP, bare metal and OpenStack.

#### Cluster API resource management

In order to maintain the lifecycle of Cluster API related resources, we will create a new operator, `cluster-capi-operator`. This name was chosen to avoid confusion with the upstream Cluster API operator.
This operator will be responsible for all administrative tasks related to the deployment of the Cluster API project within the cluster.
During the tech preview phase, the new operator will leverage the new [CVO feature](https://github.com/openshift/enhancements/blob/master/enhancements/update/cvo-techpreview-manifests.md) for managing all Cluster API related CRDs.

`cluster-capi-operator` and its operands will be provisioned in a new `openshift-cluster-api` namespace.

The operator will perform the following tasks:

##### Reconcile FeatureGate object

While Cluster API integration is in tech preview, the operator will reconcile the cluster [`FeatureGate`](https://docs.openshift.com/container-platform/4.8/nodes/clusters/nodes-cluster-enabling-features.html) object and check for the presence of the `ClusterAPIEnabled` feature gate.
The operator will proceed with Cluster API installation if and only if the required feature gate is present.
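
For reference, a feature set is selected on the cluster-scoped `FeatureGate` object named `cluster`. The fragment below is a minimal sketch, assuming `ClusterAPIEnabled` ships inside `TechPreviewNoUpgrade` as proposed above; it is not taken from the operator's manifests.

```yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  # Enables every gate in the tech preview feature set,
  # which this proposal assumes will include ClusterAPIEnabled.
  featureSet: TechPreviewNoUpgrade
```

Note that `TechPreviewNoUpgrade` cannot be turned off once set and blocks cluster upgrades, so it is only suitable for test clusters.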

##### Deploy Cluster Machine Approver

In order for Cluster API machines to successfully join the cluster, the Kubelet CSRs need to be approved.
The operator will deploy a separate instance of the `cluster-machine-approver`, which will be configured to work with Cluster API machines by providing the recently introduced [`--apigroup`](https://github.com/openshift/cluster-machine-approver/blob/master/main.go#L54) flag.
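
A rough sketch of what the second approver instance could look like follows. Only the `--apigroup` flag comes from the text above, and `cluster.x-k8s.io` is the upstream Cluster API group. The Deployment name, labels, namespace and image are placeholders chosen for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: machine-approver-capi          # hypothetical name
  namespace: openshift-cluster-api     # assumed; the target namespace is not specified here
spec:
  replicas: 1
  selector:
    matchLabels:
      app: machine-approver-capi       # hypothetical label
  template:
    metadata:
      labels:
        app: machine-approver-capi
    spec:
      containers:
      - name: machine-approver-controller
        image: example.invalid/cluster-machine-approver:latest  # placeholder image
        args:
        # Watch Cluster API machines instead of machine.openshift.io ones.
        - --apigroup=cluster.x-k8s.io
```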

##### Install upstream CAPI operator

We will use the upstream [Cluster API operator](https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20201020-capi-provider-operator.md) for managing CRDs and deploying infrastructure providers.

##### Deploy core Cluster API

Once the upstream Cluster API Operator is installed, the next step is to create a `CoreProvider` CR along with a configmap that contains upstream Cluster API CRDs, Deployment, Webhooks and RBAC resources.
The example usage is described [here](https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20201020-capi-provider-operator.md#air-gapped-environment).
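
The pairing of a provider CR with a configmap can be sketched as follows. The field names (`fetchConfig`, the `components` and `metadata` keys) follow the linked upstream operator proposal as best understood and should be treated as assumptions; the configmap name, label and version are placeholders.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-api-components         # hypothetical name
  namespace: openshift-cluster-api
  labels:
    provider-components: cluster-api   # hypothetical label, matched by the selector below
data:
  metadata: |
    # contents of the provider's metadata.yaml
  components: |
    # CRDs, Deployment, webhooks and RBAC manifests
---
apiVersion: operator.cluster.x-k8s.io/v1alpha1
kind: CoreProvider
metadata:
  name: cluster-api
  namespace: openshift-cluster-api
spec:
  version: vX.Y.Z                      # placeholder version
  fetchConfig:
    selector:
      matchLabels:
        provider-components: cluster-api
```

The selector lets the upstream operator read the provider components from inside the cluster instead of fetching them from a remote repository.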

##### Deploy Cluster API infrastructure provider

The `cluster-capi-operator` operator will create an appropriate `InfrastructureProvider` CR (based on the cluster platform) and a configmap that contains upstream Cluster API cloud provider CRDs, Deployment, Webhooks and RBAC resources.

##### Reconcile Cluster object

Cluster API's main entity is a `Cluster`, which represents the cluster that is managed by Cluster API and the cluster's infrastructure. More details about the cluster object are [here](https://cluster-api.sigs.k8s.io/user/concepts.html).
The `cluster-capi-operator` will need to create the `Cluster` and a proper `InfrastructureCluster` resource for the OpenShift cluster.
Because we have our own infrastructure management strategy in OpenShift, we should leverage the [externally managed cluster infrastructure](https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20210203-externally-managed-cluster-infrastructure.md) feature.
This means that the created `InfrastructureCluster` should have the `cluster.x-k8s.io/managed-by:` annotation set.

The Cluster object reconciler can be implemented as a separate controller. The controller should:
- Wait until the Cluster and InfrastructureCluster CRDs are present.
- Create both Cluster and InfrastructureCluster objects with the externally managed cluster infrastructure annotation.
- Ensure spec/status of InfrastructureCluster are configured for the OpenShift cluster (infrastructure information can be sourced from resources within the OpenShift Cluster).
- Patch `Cluster` status to `Ready=true`.
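
The objects created by this controller might look like the sketch below, shown for AWS. The object names and the region are placeholders; the annotation key comes from the externally managed infrastructure proposal, where its presence (not its value) tells the provider to skip infrastructure reconciliation.

```yaml
apiVersion: cluster.x-k8s.io/v1alpha4
kind: Cluster
metadata:
  name: cluster-name                   # placeholder
  namespace: openshift-cluster-api
spec:
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
    kind: AWSCluster
    name: cluster-name
    namespace: openshift-cluster-api
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
kind: AWSCluster
metadata:
  name: cluster-name
  namespace: openshift-cluster-api
  annotations:
    # Marks the infrastructure as managed outside of Cluster API.
    cluster.x-k8s.io/managed-by: ""
spec:
  region: us-east-1                    # placeholder; sourced from the OpenShift cluster
```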

Member

Who is creating the namespace where the cluster and machinesets CRs are meant to live?
This needs to be handled in a way that does not clash with hypershift CAPI CRDs and controllers

Contributor

@enxebre does hypershift re-implement the capi controllers or does it install the capi controllers?
What I am getting at is, should they be able to work together (ignoring version differences), or can you deal with capi moving to openshift-cluster-api?

The CRDs have webhooks specifying a namespace (like this https://github.com/kubernetes-sigs/cluster-api/blob/main/config/crd/patches/webhook_in_clusters.yaml#L17) that will be different in the CRDs and WebHookConfigurations

Currently the downstream operator gathers and specifies the resources (including CRDs) and the upstream operator will install them.

Member

They should be able to operate independently. Hypershift owns the lifecycle of their CAPI controllers/CRs.

Contributor Author

We will only reconcile CAPI CRs in the openshift-cluster-api namespace. CRDs can possibly conflict if we use different CAPI versions.


Member

I don't see any mention of the control plane CR here. Which implementation are you planning to use?

Contributor

we are currently not installing the controlplane CR (but super easy to do so).

Contributor Author

We are not planning to support control planes for tech preview

Contributor

might be nice to put this in the non-goals as well

##### Create user data secret

Cluster API Machines will need a user data secret, similar to the one that Machine API uses.
This secret is created by the installer for Machine API.
While Cluster API components are in technical preview, and therefore not integrated into the OpenShift Installer, the operator can copy the worker user data secret from the `openshift-machine-api` namespace to `openshift-cluster-api`.
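
A sketch of the copied secret is shown below. The name matches the `dataSecretName` used in the example usage section. The data key is an assumption: upstream Cluster API conventionally reads bootstrap data from a `value` key, while whether the operator renames the key during the copy is not stated in this proposal.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: worker-user-data
  namespace: openshift-cluster-api
type: Opaque
data:
  # Assumed key name; the payload is the worker user data copied
  # from the secret of the same name in openshift-machine-api.
  value: "<base64-encoded user data>"
```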

At this point all Cluster API components should be installed and ready to use.

#### CVO management

A new `cluster-capi-operator` image will be built and included in every release payload.

#### Credentials management

The `cluster-capi-operator`'s manifests should contain an appropriate `CredentialsRequest` for each supported infrastructure provider.
This is similar to [machine-api-operator](https://github.com/openshift/machine-api-operator/blob/6f629682b791a6f4992b78218bfc6e41a32abbe9/install/0000_30_machine-api-operator_00_credentials-request.yaml).
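
As an illustration for AWS, such a `CredentialsRequest` could take the following shape. The request name, the target secret name and the list of IAM actions are placeholders; a real manifest would list every permission the infrastructure provider needs.

```yaml
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: openshift-cluster-api-aws      # hypothetical name
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: cluster-api-aws-credentials  # hypothetical secret name
    namespace: openshift-cluster-api
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - effect: Allow
      action:                          # illustrative subset only
      - ec2:DescribeInstances
      - ec2:RunInstances
      - ec2:TerminateInstances
      resource: "*"
```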

#### Cluster API cloud providers

Cluster API infrastructure providers will live in forks, similar to what is now done for Machine API. We are now evaluating moving the
current provider implementations to new repos that will be called `machine-api-provider-*` and resetting the current
`cluster-api-provider-*` repos to the latest upstream.

#### Example usage

Usage is similar to Machine API, with small differences: MachineSets reference the `Cluster` object and an infrastructure machine template.

```yaml
---
apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineSet
metadata:
  name: capi-ms
  namespace: openshift-cluster-api
spec:
  clusterName: cluster-name
  replicas: 1
  selector:
    matchLabels:
      test: example
  template:
    metadata:
      labels:
        test: example
    spec:
      bootstrap:
        dataSecretName: worker-user-data
      clusterName: cluster-name
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
        kind: AWSMachineTemplate
        # Must match the name of the AWSMachineTemplate defined below.
        name: capi-machine-template

---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
kind: AWSMachineTemplate
metadata:
  name: capi-machine-template
  namespace: openshift-cluster-api
spec:
  template:
    spec:
      uncompressedUserData: true
      iamInstanceProfile: ....
      instanceType: m5.large
      cloudInit:
        secureSecretsBackend: secrets-manager
        insecureSkipSecretsManager: true
      ami:
        id: ....
      subnet:
        filters:
        - name: tag:Name
          values:
          - ...
      additionalSecurityGroups:
      - filters:
        - name: tag:Name
          values:
          - ...
```
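
MachineHealthChecks are among the CRDs the preview focuses on. A minimal sketch of one targeting the machines from the MachineSet above follows; the object name, timeouts and `maxUnhealthy` value are illustrative choices, not recommendations from this proposal.

```yaml
apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineHealthCheck
metadata:
  name: capi-mhc                       # hypothetical name
  namespace: openshift-cluster-api
spec:
  clusterName: cluster-name
  selector:
    matchLabels:
      test: example                    # same label as the MachineSet template
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
  maxUnhealthy: 40%
```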

### Risks and Mitigations

- During tech preview `cluster-capi-operator` will have permissions to manage CRDs; this might not be a secure permission for an operator.

Contributor

minor nit

Suggested change
- During tech preview `cluster-capi-operator` will have permissions to manage CRDs, this might be a not secure permission for an operator.
- During tech preview `cluster-capi-operator` will have permissions to manage CRDs, this might not be a secure permission for an operator.

- Note, this permission should be restricted to creating CRDs only, as once installed, the technical preview cannot be uninstalled.
- CLI usage: once Cluster API is installed, commands like `oc get machine` will return Cluster API machines. In order to use Machine API, users will have to use the fully qualified name, `oc get machines.machine.openshift.io`.

Contributor

this makes me wonder if we shouldn't have some sort of warning message associated with oc get machines once the feature gate is active, but i'm not sure if that's even possible.

Contributor Author

Yeah, we still have to figure out what to do here.

Contributor

Where did the conversation end up with API team about changing the priority so we don't make this breaking change?

Contributor Author

We have a "hack" to set the preference in openshift. If we do this then any scripts should not break and users will have to use fully qualified names for CAPI resources.

Contributor

We should make a note of that hack and ideally get something to track that so we don't forget to do it

Contributor Author

I've just added a note

To avoid introducing this breaking change, we have to set the preferred API group in our [kubernetes fork](https://github.com/openshift/kubernetes/blob/master/pkg/controlplane/controller/crdregistration/patch.go).
- Feature parity: for the last year we've been trying to upstream all features introduced to Machine API, but we can't be sure all of them work in upstream. We need to have a good set of regression tests running periodically.

### API Extensions

With `ClusterAPIEnabled` feature enabled, the following API extensions will be added:

- Core Cluster API resources and webhooks, which can be found [here](https://github.com/kubernetes-sigs/cluster-api/tree/main/api/v1beta1).
- Depending on the provider where a cluster is running, infrastructure provider CRDs and webhooks will be added; see
[AWS](https://github.com/kubernetes-sigs/cluster-api-provider-aws/tree/main/api/v1beta1), [Azure](https://github.com/kubernetes-sigs/cluster-api-provider-azure/tree/main/api/v1beta1), [GCP](https://github.com/kubernetes-sigs/cluster-api-provider-gcp/tree/main/api/v1alpha4).
- Cluster API Operator CRDs will be added; see [here](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20201020-capi-provider-operator.md#new-api-types).

## Design Details

### Test Plan

- Cluster API providers should already come with a set of e2e tests; we will run these on each PR.
- `cluster-capi-operator` will include its own e2e suite ensuring that all Cluster API components are successfully installed.

### Operational Aspects of API Extensions
#### Failure Modes

If Cluster API starts failing, it will affect worker machine management, which is a critical
component of the OCP system. In case of Cluster API failures, users will be able to use the Machine API.

#### Support Procedures

The process of troubleshooting failure is similar to the process of troubleshooting Machine API failures.
We will be working on making sure that similar or equivalent events, metrics and alerts are present.

### Graduation Criteria

#### Dev Preview -> Tech Preview

- Write symptoms-based alerts for the component(s)
- Ability to have Cluster API installed using a feature gate
- Ability to use Cluster API for machine management
- End user documentation
- Running upstream e2e workflow on OpenShift

#### Tech Preview -> GA (Future Work)

- Cluster API will be installed in all OpenShift clusters by default.
- Bidirectional migration for MAPI and CAPI.
- New infrastructure providers implemented as CAPI.

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Implementation History

## Drawbacks

## Alternatives

## Infrastructure Needed