OCPCLOUD-3344: Update installing-cluster-api-components-in-ocp #1918
> Runtime:
> * At startup, CAPI installer reads a list of provider images associated with the current OpenShift release
> * CAPI installer pulls these images and extracts manifests and metadata from the `/capi-operator-manifests` directory
> * The revision controller creates a new Revision that references all manifests relevant to the current cluster
> * The installer controller installs the new Revision using Boxcutter
> * Once successful, the installer controller deletes orphaned artifacts associated with previous Revisions
This may be further on, but so far not seeing how we exclude CRDs in this flow
I haven't included it, yet. I have an idea of how I will incorporate it (essentially by pulling unmanaged CRDs into a virtual initial component that installs CompatibilityRequirements and waits for them to be Admitted and Compatible), but I wanted to focus on getting the rest of the code written and functional before spending too much time on the detail.
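As a rough sketch of the readiness check such a virtual initial component might need, assuming CompatibilityRequirements surface `Admitted` and `Compatible` conditions in `status.conditions` (the type and condition names here come only from this comment, not a settled API):

```go
package capiinstall

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// compatibilityRequirementReady reports whether a hypothetical
// CompatibilityRequirement has been both Admitted and found Compatible.
// The condition types are assumptions taken from the comment above.
func compatibilityRequirementReady(conditions []metav1.Condition) bool {
	return meta.IsStatusConditionTrue(conditions, "Admitted") &&
		meta.IsStatusConditionTrue(conditions, "Compatible")
}
```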
> unmanagedCustomResourceDefinitions:
> - machines.cluster-api.x-k8s.io
> status:
>   currentRevision: 4.22.5-0d2d3148cd1faa581e3d2924cdd8e9122d298de41cda51cf4e55fcdc1f5d1463-1
What does the hash represent in this name?
This is so we can reflect spec changes in a uniquely-named new revision. For example, the spec currently contains `unmanagedCustomResourceDefinitions`: we would want a new revision if that value is altered, even if the manifests have not changed. Additionally, I anticipate a future where the spec contains a list of additional platforms to install. Again, we'd want to be able to reflect this in a new revision.
Why not use the generation of the resource instead of a hash? This is what other revision-based rollouts do in OpenShift. The generation will increment every time there is a spec update (and on annotation and label changes, IIRC).
The resource generation won't change on upgrade, although the ContentID referenced from provider images will.
In the implementation I've truncated it to 8 characters. It doesn't look too nasty.
You said the hash was of the spec, but the spec of what? I thought you meant the spec of the ClusterAPI resource, in which case generation would change at the same cadence as the hash.
TODO: Clarify somewhere in this document which 'content' is covered by this 'ContentID'.
It's the fully substituted manifests to be applied, in the order they will be applied, plus any modifications to those APIs based on the spec (e.g. unmanagedCRD). So the source is not just the spec, but also content pulled directly from provider images.
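To make the naming concrete, here is a minimal sketch of how such a name could be derived. The hash inputs follow the description above (the substituted manifests in apply order plus spec-driven modifications); the function shape, and showing the digest untruncated, are assumptions rather than the actual implementation (a comment above mentions truncating it to 8 characters).

```go
package capiinstall

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// revisionName builds a revision name of the form <release>-<contentID>-<sequence>,
// mirroring the example "4.22.5-0d2d...-1" quoted above. contentID is a digest over
// the fully substituted manifests in the order they will be applied, plus any
// spec-driven modifications (e.g. unmanagedCustomResourceDefinitions), so either a
// manifest change or a spec change produces a new, uniquely-named revision.
func revisionName(release string, orderedManifests [][]byte, specModifications []byte, sequence int) string {
	h := sha256.New()
	for _, m := range orderedManifests {
		h.Write(m)
	}
	h.Write(specModifications)
	contentID := hex.EncodeToString(h.Sum(nil))
	return fmt.Sprintf("%s-%s-%d", release, contentID, sequence)
}
```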
> The installer will process every profile in every provider image.
> The purpose of implementing multiple profiles is to enable future selection of different manifests based on FeatureGates.
> Note that FeatureGate support is not intended for the initial implementation.
How do we know which profile to install?
profile is a placeholder name. It is the name I'm currently giving to a single (manifest + metadata) within an image. An image may contain arbitrarily many (manifest + metadata)s. I'm calling them profiles until somebody comes up with a better name.
The revision controller will consider all profiles from all images equally. It will add any profile whose metadata matches its selection criteria, meaning that arbitrarily many profiles may be installed from a single image. It would be the responsibility of the provider to ensure that the metadata is mutually exclusive if only one is required.
Ok, so each profile will have some sort of metadata file that contains, for example, a featuregate that must be enabled to apply the manifests within? So the revision controller isn't looking at the name of the profile, but selecting based on metadata in this file.
I am not following the profiles much; how will we be consuming these?
- Will the user specify the desired profile in the spec somehow, based on the specific feature gates they desire to enable?
- Are these upstream or downstream feature gates?
- What happens to the ENV vars currently used in the CAPI manifests to select the FeatureGates?
@damdo It's what @JoelSpeed said. The revision controller doesn't pre-select profiles; it looks at all of them. So if an image contains a default and a techpreview profile, the revision controller doesn't care what they're called and doesn't distinguish between them. It processes both of them equally. It's down to the provider maintainer to ensure that the metadata included in a profile ensures that profile is only installed when it is required.
These would be downstream featuregates. The featuregate support isn't implemented or even designed, yet. We should copy an existing selection pattern from another component. There are a few to choose from.
The current CAPI installer doesn't actually support setting envsubst vars by feature gate. It sets one environment variable unconditionally: https://github.com/openshift/cluster-capi-operator/blob/88febbea0a8ff1d432a6b26828741ce30b71b295/cmd/cluster-capi-operator/main.go#L382-L397
I'm thinking I might actually move envsubst values into profile metadata, which would allow them to:
- be specified in the provider repo, not cluster-capi-operator
- be targeted at their specific provider
- be specified differently according to any other selection criteria we implement for profiles, e.g. featuregates
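To make the shape of this concrete, here is a hypothetical sketch of what such profile metadata and its selection rule could look like. The field names, the feature-gate criterion, and the envsubst map are placeholders drawn from this thread, not a settled format:

```go
package capiinstall

// ProfileMetadata is a hypothetical sketch of a per-profile metadata file as
// discussed above. Field names are placeholders, not a committed contract.
type ProfileMetadata struct {
	// RequiredFeatureGates lists downstream feature gates that must all be
	// enabled for this profile's manifests to be included in a revision.
	RequiredFeatureGates []string `json:"requiredFeatureGates,omitempty"`

	// EnvsubstValues are provider-specific substitution values, moved here
	// from cluster-capi-operator so that the provider repo owns them.
	EnvsubstValues map[string]string `json:"envsubstValues,omitempty"`
}

// profileSelected sketches the selection rule described in this thread: the
// revision controller never looks at a profile's name, it includes every
// profile whose metadata matches the cluster's enabled feature gates.
func profileSelected(meta ProfileMetadata, enabledGates map[string]bool) bool {
	for _, gate := range meta.RequiredFeatureGates {
		if !enabledGates[gate] {
			return false
		}
	}
	return true
}
```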
damdo left a comment:
Thanks @mdbooth
Looking good! A bunch of nits
> - Failure to apply manifests on upgrade
> - CAPI components are already installed.
>   The cluster has been upgraded.
>   The CAPI installer fails to apply the new manifests
> - Depending on the nature of the failure, it is possible that the CAPI components from the previous version will continue to operate correctly.
> - However, if the failure is because an upgraded Deployment has entered a crash loop, it will not be possible to provision new Machines in the cluster.
> - Either way, the `cluster-api` ClusterOperator will be marked Degraded with a Message indicating an installation failure.
> - The CAPI installer logs will contain details of the failure.
Will this mechanism also bubble up to the cluster-api's ClusterOperator conditions if a Deployment has issues at Runtime (not just during the installation)?
It should, yes. However, it will need a safety buffer so it doesn't trip immediately during normal temporary interruptions.
I've implemented something similar for the ephemeral errors reported by the Revision controller's Progressing condition. Essentially an ephemeral error like a 5xx from kube-apiserver is reported as Progressing=True, Reason=EphemeralError, Message=. If this situation persists for 5 minutes we also go Degraded. But we don't want to go Degraded immediately just because kube-apiserver hiccoughed.
I've actually implemented this logic directly in the Revision controller, but I wonder if I should instead move it to the (yet to be written) code which aggregates conditions in the ClusterOperator. i.e. The Revision controller would only ever set its Progressing condition, and the aggregator would upgrade that to Degraded after 5 minutes.
We could do the same for the installer controller: it would set the Progressing condition to true if it observes a Probe failure (which I'm pretty sure is how it will see a runtime interruption of an operand Deployment). The same aggregator can then upgrade that to Degraded only if it persists for a certain length of time.
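A minimal sketch of that aggregation rule, assuming standard `metav1.Condition` types; the helper name and structure are illustrative, and only the condition semantics and the 5-minute grace period come from the comment above:

```go
package capiinstall

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const ephemeralErrorGracePeriod = 5 * time.Minute

// shouldDegrade sketches the proposed rule: a controller only ever reports
// Progressing=True with Reason=EphemeralError, and the ClusterOperator
// aggregator upgrades that to Degraded only once the condition has persisted
// beyond the grace period.
func shouldDegrade(progressing *metav1.Condition, now time.Time) bool {
	if progressing == nil ||
		progressing.Status != metav1.ConditionTrue ||
		progressing.Reason != "EphemeralError" {
		return false
	}
	return now.Sub(progressing.LastTransitionTime.Time) >= ephemeralErrorGracePeriod
}
```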
> the aggregator would upgrade that to Degraded after 5 minutes.

+1 from me on this pattern.
> - Failure to apply revision after some components successfully applied
> - While applying a revision, some components were installed but one failed.
> - Depending on the nature of the failure, the component that failed to apply may or may not be in a working state.
> - If it is not in a working state, this will most likely result in an inability to provision new Machines using Cluster API.
> - The `cluster-api` ClusterOperator will be marked Degraded with a Message indicating an installation failure.
> - The CAPI installer logs will contain details of the failure.
It'd be nice if there was a way in k8s to atomically apply a set of manifests and roll them back if any fails.. alas
We could actually do that, tbh: we'd create a new revision which is identical to the currently applied revision. It would then re-apply the old revision before removing any dangling objects from the unsuccessfully applied revision.
We'd just need to come up with some way to poke that logic in the revision controller. Perhaps a spec.pin: CurrentAppliedRevision 🤔
We can add that later, though.
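For illustration only, a sketch of what the hypothetical `spec.pin` knob mentioned above might look like as a typed field; nothing here is a committed API:

```go
package capiinstall

// PinPolicy is a hypothetical spec field sketching the rollback idea above:
// when set to CurrentAppliedRevision, the revision controller would create a
// new revision identical to the currently applied one, re-apply it, and then
// prune dangling objects from the unsuccessfully applied revision.
type PinPolicy string

const (
	// PinNone is the default: new revisions track new content as usual.
	PinNone PinPolicy = ""
	// PinCurrentAppliedRevision asks the controller to regenerate the
	// currently applied revision's content rather than moving forward.
	PinCurrentAppliedRevision PinPolicy = "CurrentAppliedRevision"
)
```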
> * *PROs*: it would ensure compatibility with the clusterctl ecosystem and format/contract
> * *CONs*: it would require a lot of extra complexity (~500 more LOC) and dependencies to achieve a similar result
>
> ## Reference
Is this becoming a downstream CAPI provider contract? If so, should we mention this?
Yes, this is absolutely a downstream CAPI provider contract. I wrote it to be exactly that. I don't know if this is the best place for it to live, but it's a place it can live. If I were looking for it this is one of the places I'd look, so I figured I'd put it here.
> This must use the `registry.ci.openshift.org` registry.
> `manifests-gen` will return an error if it detects any image reference which does not use `registry.ci.openshift.org`.
>
> `protect-cluster-resource` indicates the name of the infrastructure cluster resource type which the CAPI operator will create for this provider.
Protect? In what way?
The next sentence is intended to explain that.
I didn't add this to manifests-gen, btw, it was already there! Not the command line option, but the behaviour.
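For the `registry.ci.openshift.org` requirement quoted above, here is a minimal sketch of the kind of validation `manifests-gen` is described as performing; the helper name and error message are illustrative, only the registry requirement comes from the document:

```go
package manifestsgen

import (
	"fmt"
	"strings"
)

const requiredRegistry = "registry.ci.openshift.org"

// validateImageRef returns an error for any image reference that does not use
// the required CI registry, matching the behaviour described above.
func validateImageRef(ref string) error {
	if !strings.HasPrefix(ref, requiredRegistry+"/") {
		return fmt.Errorf("image reference %q does not use %s", ref, requiredRegistry)
	}
	return nil
}
```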
> * Set the namespace of all namespaced objects to `openshift-cluster-api`
> * Replace cert-manager with OpenShift Service CA:
>   Upstream CAPI providers typically include `cert-manager` metadata and manifests for webhooks.
>   `manifests-gen` will automatically translate `cert-manager` manifests and metadata to use OpenShift Service CA instead.
> * Exclude `Namespace` and `Secret` objects from the manifests: we expect these to be created by other means.
> * The following set of changes to all Deployments:
>   * Set the annotation `target.workload.openshift.io/management: {"effect": "PreferredDuringScheduling"}`.
>   * Set resource requests of all containers to `cpu: 10m` and `memory: 50Mi`.
>   * Remove resource limits from all containers.
>   * Set the terminationMessagePolicy of all containers to `FallbackToLogsOnError`.
I worry this is likely to become stale quickly.
Is it worth referencing some external resource instead? Maybe an AI-generated summary of the changes made by manifests-gen? Maybe we can add it to the AGENTS.md rules for CCAPIO.
Can we try to solve that problem later?
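For readers skimming the quoted list of Deployment changes above, a minimal sketch of what that transformation could look like with the standard `k8s.io/api` types. The function name is hypothetical, and applying the workload annotation to the pod template (rather than the Deployment object itself) is an assumption:

```go
package manifestsgen

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// transformDeployment sketches the per-Deployment changes listed in the quoted
// section above; the actual manifests-gen code may be structured differently.
func transformDeployment(d *appsv1.Deployment) {
	// Workload management annotation (applied to the pod template here; the
	// exact target object is an assumption).
	if d.Spec.Template.Annotations == nil {
		d.Spec.Template.Annotations = map[string]string{}
	}
	d.Spec.Template.Annotations["target.workload.openshift.io/management"] = `{"effect": "PreferredDuringScheduling"}`

	for i := range d.Spec.Template.Spec.Containers {
		c := &d.Spec.Template.Spec.Containers[i]
		// Fixed requests, no limits.
		c.Resources.Requests = corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("10m"),
			corev1.ResourceMemory: resource.MustParse("50Mi"),
		}
		c.Resources.Limits = nil
		// Surface container logs on abnormal termination.
		c.TerminationMessagePolicy = corev1.TerminationMessageFallbackToLogsOnError
	}
}
```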
> In summary:
>
> This proposal does not seek to allow installing multiple infrastructure CAPI providers. Although we want to design things in a way so that we do not inhibit this being implemented in the future.
> * Add support for image-based CAPI manifests to the CAPI installer
Do we explicitly explain what this means (it may help with understanding the following sections), and why our current system is bad (motivation)?
> Runtime:
> * At startup, CAPI installer reads a list of provider images associated with the current OpenShift release
IIUC, images have manifests baked in now - so we basically do this:
$ oc adm release info registry.ci.openshift.org/ocp/release:<release version>
Name: 4.22.0-0.ci-2026-01-22-020420
Digest: sha256:33d624609594a63555fbd5496a8d80333268ddc441550a08bc2ce758ed795244
Created: 2026-01-22T02:04:39Z
OS/Arch: multi (linux/amd64)
Manifests: 837
Metadata files: 2
....
....
Images:
NAME DIGEST
agent-installer-api-server sha256:621fd9923cb65f6a3c51df07b2ecb46c0a8f569a3dec83bacdf76a84c01df6d1
agent-installer-csr-approver sha256:c54f4030017755cc69f4e334bd054af2fbbae857e99c4bdf9d92d70502e64a8f
agent-installer-node-agent sha256:a6d9228934f501db46be58543a206c10a3b5374f50710f3788edbd911b9bf0f3
agent-installer-orchestrator sha256:abde659d8a19a92742d06b155bf8c1a148f7c54841f1033261e96a1d020768e8
...
...and get a map of component -> image SHA, which we can then pull and extract manifests from `/capi-operator-manifests`.
This means on upgrade we know which manifests are in release n-1 as well as n.
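Under those assumptions, the extraction step might look roughly like this; the directory walk, the `.yaml` suffix filter, and single-document decoding are simplifications for illustration, not the actual installer code:

```go
package capiinstall

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/yaml"
)

// loadManifests walks an extracted provider image's /capi-operator-manifests
// directory and decodes every .yaml file into an unstructured object.
// Multi-document YAML files are not handled in this sketch.
func loadManifests(root string) ([]*unstructured.Unstructured, error) {
	var objs []*unstructured.Unstructured
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || !strings.HasSuffix(path, ".yaml") {
			return err
		}
		data, readErr := os.ReadFile(path)
		if readErr != nil {
			return readErr
		}
		obj := &unstructured.Unstructured{}
		if unmarshalErr := yaml.Unmarshal(data, obj); unmarshalErr != nil {
			return fmt.Errorf("decoding %s: %w", path, unmarshalErr)
		}
		objs = append(objs, obj)
		return nil
	})
	return objs, err
}
```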
> Runtime:
> * At startup, CAPI installer reads a list of provider images associated with the current OpenShift release
> * CAPI installer pulls these images and extracts manifests and metadata from the `/capi-operator-manifests` directory
> * The revision controller creates a new Revision that references all manifests relevant to the current cluster
Can we go into some more detail here? I'm not following: what's an example revision here, and what is it generated from?
Edit: just saw the TODO below.