Merged
198 changes: 198 additions & 0 deletions enhancements/baremetal/enable-baremetal-on-other-platforms.md
@@ -0,0 +1,198 @@
---
title: enable-baremetal-on-other-platforms
authors:
- "@asalkeld"
- "@sadasu"
reviewers:
- "@hardys"
- "@romfreiman"
- "@dhellman"
approvers:
- "@hardys"
creation-date: 2021-08-20
last-updated: 2021-08-20
status: implementable
see-also:
- "/enhancements/baremetal/baremetal-provisioning-config.md"
replaces:
superseded-by:
---

# Enable baremetal on other Platforms to support centralized host management

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Baremetal Host API is only available when deploying an OpenShift cluster with the baremetal
platform (via the IPI or AI (Assisted Installer) workflow). Having the ability to
manage baremetal hosts from clusters without requiring the cluster to be on baremetal
would be beneficial to customers.

## Motivation

An initial driver of this feature is the centralized host management use case
in edge topologies, which, without this feature, is restricted to having the
central OpenShift cluster deployed on baremetal.

See:
- https://github.com/openshift/enhancements/blob/master/enhancements/installer/agent-based-installation-in-hive.md
- https://github.com/openshift/assisted-service/tree/master/docs/hive-integration

### Goals

The specific goals of this proposal are to:

Support the centralized host management use case by partially enabling Baremetal Host API
on the following on-premise platforms:

nit: it's on-premises. There's no singular form of 'premises'.

- None
- OpenStack
- vSphere

We will be successful when:

Centralized host management can deploy clusters when running on the above platforms.

### Non-Goals

Allow Baremetal Host API to be fully enabled on all platforms.

Allow Machine API integration across platform types.
- The centralized host management flow currently interacts directly via this API
without any Machine API integration.
- This means further work will be required via a future enhancement to enable the
single cluster case where a combination of e.g VM controlplane and Baremetal
workers is desired.

## Proposal

BMO (baremetal-operator) provides the Baremetal Host API. It is in turn configured
and managed by CBO (cluster-baremetal-operator).

CBO reads the Provisioning CR that is created by the installer on baremetal platforms
and uses that to configure and deploy BMO. In the case of non-baremetal platforms
the user (or automation) will need to define the Provisioning CR.

Currently, CBO checks the platform and, if it is not baremetal, enters a "disabled" state, i.e. it will:
1. set status.conditions Disabled=true, and
2. not read or process the Provisioning CR, and thus not deploy baremetal-operator.

This proposal is to allow CBO to be enabled on the following platforms only:
- Baremetal (current)
- None
- OpenStack
- vSphere

Note: The Hypershift use case will be explicitly disallowed by disabling CBO when
Infrastructure.Status.ControlPlaneTopology == "External".
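
The enablement rule described above can be sketched as a small predicate. This is an illustrative sketch only, not the CBO implementation: the function name and the `ENABLED_PLATFORMS` set are hypothetical, and the platform strings are assumed to match the values reported in the cluster's Infrastructure status.

```python
# Hypothetical sketch of the proposed enablement check; not the CBO source.
# Platform strings are assumed to match Infrastructure.Status.PlatformStatus.Type.
ENABLED_PLATFORMS = {"BareMetal", "None", "OpenStack", "VSphere"}


def cbo_enabled(platform_type: str, control_plane_topology: str) -> bool:
    """Return True when CBO should leave the "disabled" state."""
    # Hypershift (externally hosted control plane) is explicitly disallowed.
    if control_plane_topology == "External":
        return False
    # Every other platform keeps CBO disabled, as it is today.
    return platform_type in ENABLED_PLATFORMS
```

Checking the topology first keeps the Hypershift exclusion independent of the platform list, so adding a platform later cannot accidentally enable CBO on an external control plane.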

Further (to restrict the testing matrix) the allowed configuration options
of the Provisioning CR will be restricted to exactly those required by centralized host management.

*Only spec.provisioningNetwork=Disabled mode will be accepted in the Provisioning CR.*

If any other provisioningNetwork mode is set, the CBO webhook will refuse the change
in the usual way. However, a Provisioning CR defined before the operator is upgraded
never passes through the webhook, so the Reconcile loop must also always validate the
Provisioning CR and report an invalid one
(by setting ClusterOperator/baremetal condition[InvalidConfiguration] = true).
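
A minimal sketch of that validation, assuming the same check is shared by the webhook and the Reconcile loop. The function names and the shape of the returned condition are hypothetical; only the `InvalidConfiguration` condition name and the `Disabled` mode come from this proposal.

```python
from typing import Optional


def invalid_configuration_reason(platform_type: str,
                                 provisioning_network: str) -> Optional[str]:
    """Return a reason when the mode is not allowed on this platform, else None."""
    # Assumption: on the baremetal platform the existing validation is unchanged,
    # so this sketch does not restrict it.
    if platform_type == "BareMetal":
        return None
    if provisioning_network != "Disabled":
        return (f"provisioningNetwork={provisioning_network} is not supported "
                f"on platform {platform_type}; only Disabled is accepted")
    return None


def invalid_configuration_condition(platform_type: str,
                                    provisioning_network: str) -> dict:
    """Build the condition the Reconcile loop would report (hypothetical shape)."""
    reason = invalid_configuration_reason(platform_type, provisioning_network)
    return {
        "type": "InvalidConfiguration",
        "status": "True" if reason else "False",
        "message": reason or "",
    }
```

Running the check in the Reconcile loop as well as in the webhook covers a Provisioning CR that already existed before the operator was upgraded.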

Note:

1. When the Provisioning CR is set to provisioningNetwork=Disabled mode, worker
nodes are booted via virtual media. This removes the requirement for the
Provisioning Network, which can be expected to be available only on Baremetal platform types.

2. Documentation will need to be added to the centralized host management documentation
explaining how to create and update a Provisioning CR for these platforms.
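
For illustration, a minimal Provisioning CR of the kind a user (or automation) would need to define could look like the manifest built below. This is a sketch under assumptions: the `metal3.io/v1alpha1` API version and the singleton name `provisioning-configuration` are assumed here and should be checked against the installed CRD, and `provisioning_manifest` is a hypothetical helper.

```python
import json


def provisioning_manifest() -> dict:
    """Build a minimal Provisioning CR with the only mode this proposal accepts."""
    return {
        # Assumed API group/version and resource name; verify against the cluster's CRD.
        "apiVersion": "metal3.io/v1alpha1",
        "kind": "Provisioning",
        "metadata": {"name": "provisioning-configuration"},
        # Only Disabled is accepted on None, OpenStack and vSphere.
        "spec": {"provisioningNetwork": "Disabled"},
    }


if __name__ == "__main__":
    # JSON is valid input for `oc apply -f -`.
    print(json.dumps(provisioning_manifest(), indent=2))
```

The printed JSON could then be applied to the hub cluster, for example by piping it to `oc apply -f -`.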

### User Stories

#### Story 1 - Current IPI baremetal platform use case

No change.

#### Story 2 - centralized host management use case

As a user of a hub cluster that performs central infrastructure management, and
optionally zero-touch provisioning, I need to provision hosts using the k8s-native
API (Baremetal Hosts CR) even when the hub cluster has a platform of None, OpenStack, or vSphere.


### Risks and Mitigations

There is concern that *random* customers will use this feature out of context
and create a support burden. This is why we have not suggested enabling CBO on
all platforms and with the full feature set. However, it is still a potential issue.

Another mitigation for this is to avoid documenting this outside of the CIM/ZTP case.
For that reason this change won't be documented as a standalone feature, only in the context of CIM/ZTP.

## Design Details

### Test Plan

#### Unit Testing

We will add unit tests to confirm that cluster-baremetal-operator:
* is enabled on the required platforms.
* will restrict functionality on these platforms to ProvisioningNetwork=Disabled.

#### Functional Testing

An e2e test will be written in the Assisted Installer CI that will:
1. create one of the platforms above (SNO Platform=None might be the easiest) with Assisted Service.

+1

I have deployed this exact same scenario (with a custom-built 4.8 image and the original PR that sparked this enhancement).

2. confirm that CBO is enabled
3. create a Provisioning CR and confirm that BMO is running
4. provision a baremetal cluster

No need to wait for the full cluster to be deployed. Waiting for the agent's discovery phase to be over should be enough proof. This is assuming assisted will be used for this test.

Regardless, enabling CBO and BMO in an SNO node is a good, easy-enough, test


QE will validate the remaining platforms that are supported to reduce the load
on CI.

### Graduation Criteria

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

The feature will go to GA without tech preview.

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

cluster-baremetal-operator will upgrade as it currently does; this is only a
minor change in functionality.

On the platforms (None, OpenStack and vSphere) where the operator was in a disabled
state, it will move into an enabled state after being upgraded. However, in all but
the centralized host management use cases, nothing will change as there is no Provisioning CR.
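
The post-upgrade behaviour on the newly enabled platforms can be summarised as a small decision function. This is a sketch only: the function name and the returned labels are hypothetical, and the CR is modelled as a plain dictionary.

```python
from typing import Optional

NEWLY_ENABLED = {"None", "OpenStack", "VSphere"}


def outcome_after_upgrade(platform_type: str,
                          provisioning_cr: Optional[dict]) -> str:
    """Describe what CBO does after the upgrade (hypothetical labels)."""
    if platform_type not in NEWLY_ENABLED:
        return "unchanged"
    if provisioning_cr is None:
        # The common case outside centralized host management: enabled but idle.
        return "enabled, nothing deployed"
    mode = provisioning_cr.get("spec", {}).get("provisioningNetwork")
    if mode != "Disabled":
        return "enabled, InvalidConfiguration reported"
    return "enabled, baremetal-operator deployed"
```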

### Version Skew Strategy

None required as this is not dependent on other components.

## Implementation History

This PR is the current WIP implementation: https://github.com/openshift/cluster-baremetal-operator/pull/189

## Drawbacks

There is concern that *random* customers will use this feature out of context
and create a support burden.

## Alternatives

Customers can instead create a dedicated baremetal cluster to use as the hub
cluster.

Another alternative is to additionally distribute baremetal-operator as an optional
operator with OLM. The main downsides are the complexity of releasing and
distributing the same project two different ways, and the potential for install-time
confusion or conflict over which method should be used to install it.