ARO-13667 hypershift: support disabling image registry via capability #1729

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master from
flavianmissi:hypershift-image-registry-capability-ARO-13667
Mar 17, 2025
Conversation

@flavianmissi
Member

No description provided.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 19, 2024
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch from 3f45963 to dda1779 Compare December 19, 2024 16:07
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 4 times, most recently from 09fcfa5 to a975798 Compare December 20, 2024 12:20
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 9 times, most recently from 6da7e55 to 59ccb87 Compare January 8, 2025 13:55
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 2 times, most recently from 2ed863b to 87ef4db Compare January 8, 2025 14:36
Comment on lines 107 to 131
The `CVO` reconciler needs to edit the `ClusterVersion` resource, setting
`spec.capabilities.baselineCapabilitySet` to `"None"`, then calculating the
difference between `ClusterVersionCapabilitySets[ClusterVersionCapabilitySetCurrent]`
and `HostedCluster.spec.disabledCapabilities`, assigning the resulting set to
field `spec.capabilities.additionalEnabledCapabilities`.
Contributor

this is not strictly related to this enhancement, but working through the caps I've noticed that the installer has a dedicated (non-API based) validation we might need to replicate:
https://github.com/openshift/installer/blob/main/pkg/types/validation/installconfig.go#L205-L253

Things like

if c.Platform.BareMetal != nil && !enabledCaps.Has(configv1.ClusterVersionCapabilityBaremetal)

are obviously bogus for Hypershift, but the ClusterVersionCapabilitySetCurrent will still define BareMetal, so we're also setting this implicitly.

It seems to me that a decoupled and dedicated CapabilitySet for Hypershift would be a better idea, but I defer this to @deads2k - this is clearly more work

Member

it's not clear to me why the installer's baremetal logic wouldn't apply to HyperShift. Does HyperShift not run on baremetal? Or does it run on baremetal but somehow allow the baremetal component to be disabled? I wonder what things like bring-your-own-nodes workflows might look like on HyperShift, but I could see some future where folks want just the Kube API and don't need the MachineAPI (or some future ClusterAPI?) component enabled, to tell the HostedControlPlane controller that it can skip the Cluster-API Machine(Set), etc. reconcilers.

Contributor

@tjungblu tjungblu Jan 17, 2025

@wking what's the impact when we run a Hypershift control plane (e.g. on AWS) with enabled baremetal capability?

Member Author

[...] the installer has a dedicated (non-API based) validation we might need to replicate:

Added a section about this under "Risks and Mitigations": https://github.com/openshift/enhancements/pull/1729/files#diff-36fbf863a674d8de905f7aa65cdb6158ec8c27b70d2fe5eed4272ca0a964a730R278-R289

@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 4 times, most recently from 1017891 to 2ee2bed Compare January 10, 2025 12:54
Comment on lines 244 to 359
1. Does Hypershift's CVO need to support [implicit capabilities][implicit-capabilities]?

[implicit-capabilities]: https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#updates
Member Author

@flavianmissi flavianmissi Jan 10, 2025

@sjenning can you help us answer this question? This came up in a Slack discussion.
I don't fully understand the role of the CVO in Hypershift, so I'm having difficulty understanding how potential changes in the resources belonging to a capability might affect upgrades.

Here's the relevant section from the linked enhancement:

Updates

During updates, resources may move from being associated with one capability, to another, or move from being part of the core (always installed) to being part of an optional capability. In such cases, the governing logic is that once a resource has been applied to a cluster, it must continue to be applied to the cluster (unless it is removed from the payload entirely, of course). Furthermore, if any resource from a given capability is applied to a cluster, all other resources from that capability must be applied to the cluster, even if the cluster configuration would otherwise filter out that capability. These are referred to as implicitly enabled capabilities and the purpose is to avoid scenarios that would break cluster functionality. Once a resource is applied to a cluster, it will always be reconciled in future versions, and that if any part of a capability is present, the entire capability will be present.

This means the CVO must apply the following logic:

  1. For the outgoing payload, find all manifests that are [currently included](https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#manifest-annotations).
  2. For the new payload, find all the manifests that would be included, in the absence of capability filtering.
  3. Match manifests using group, kind, namespace, and name.
  4. Union all capabilities from matched manifests with the set of previously-enabled capabilities to compute the new enabledCapabilities set.

As in [the general case](https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#capabilities-cannot-be-uninstalled), when the requested spec.capabilities diverge from enabledCapabilities, the CVO will set ImplicitlyEnabledCapabilities=True in ClusterVersion's status.conditions with a message pointing out the divergence.

cc @tjungblu

Contributor

Let me start by saying that I didn't know about ImplicitlyEnabledCapabilities until just now, but I think the ramifications are this:

Capabilities cannot be disabled once enabled. That we already know. The effect is that if a user creates a cluster with the registry enabled or enables it after initially disabling it at install time, that cluster is locked to having the image registry enabled forever. The only possible toggle is from disabled -> enabled and it is a one-way trip.

ImplicitlyEnabledCapabilities is for converting more components into capabilities over time.

For example, in the y-release before image registry became a capability, ClusterVersionCapabilitySet4_13, ImageRegistry would not have been in the enabled capabilities set because it didn't exist yet. But once you upgraded to the y-stream where it was a capability, CV spec.capabilities would still be ClusterVersionCapabilitySet4_13 but CVO will set ImplicitlyEnabledCapabilities=True reflecting that ImageRegistry was not in the spec.capabilities but is still enabled because it was present pre-upgrade.

All that being said, I don't think we have to worry about this from a HyperShift API perspective because we are not exposing the capabilities API directly on the HostedCluster resource. As long as the HostedCluster API reflects the limitations of the underlying capabilities API (i.e. that enabling is a one-way trip and disabling can only happen at cluster creation time) we should be fine. We can let the CVO deal with ImplicitlyEnabledCapabilities, and because we don't support any OCP versions from before the ImageRegistry capability was created, we don't have to worry about the situation above.

Member

@wking wking Jan 14, 2025

Summary

You'll want to figure out how to handle at least some of implicit-enablement, to avoid dropping DeploymentConfig in clusters updating into OCPSTRAT-1465.

Details

It's taken us a while to get real-world examples, but 4.18's OperatorLifecycleManagerV1 is the first capability born out of the blue (for GA clusters). The functionality was tech-preview in 4.17. So a GA 4.17 cluster updating to 4.18 with the None capability set would have not had the OperatorLifecycleManagerV1 capability enabled. You don't currently allow anything like a None capability set, so this case isn't relevant for you yet.

But the other case, where a capability is removed from vCurrent, is relevant. We're expecting DeploymentConfig to be dropped soon (say in 4.20, just to simplify discussion, but that's not a commitment, I'm not sure if or when it will actually happen). Clusters updating from 4.19 to 4.20 in vCurrent will implicitly enable the capability. If you leave the cluster in vCurrent without implicitly enabling the capability, the OpenShift API server says "hmm, I guess I reject this DeploymentConfig API request..." and all the DeploymentConfig-based Pods in the cluster get garbage collected, or something. Which would probably upset the workload admins who'd been using DeploymentConfigs pre-update.

  • You could maybe narrowly dance around it by unioning ClusterVersion's status.capabilities.enabledCapabilities with the vCurrent - disabledCapabilities set, but that would be feeding hosted-cluster-side ClusterVersion status back into management-cluster-side HostedControlPlane controller activities, which we try to avoid.
  • The CVO avoids that kind of feedback loop at update-time by holding both the outgoing and incoming manifest sets in memory at once, and comparing them to see which capabilities need enabling. But the CVO is still pulling ClusterVersion status.capabilities.enabledCapabilities back into its control loop during CVO-container-restart, because we don't have an alternative location to store that state. And having status that cannot be reconstructed without a time machine to see what older clusters were reconciling can make some backup/recovery flows more difficult.
  • Another possible workaround is adding your own status property, e.g. to HostedControlPlane, to hold the implicitly-enabled capabilities on the management cluster without tainting them with hosted-cluster-side data. But this still suffers from the "control loop fed by status" backup/recovery exposure that the current standalone CVO is exposed to.
  • Another possible workaround would be to grow the enabledCapabilities property you float now, in this initial phase, and then you can hard-code logic like "when a cluster updates from 4.19, always add DeploymentConfig to enabledCapabilities" to the 4.20 HostedControlPlane controller. Or other permutations of that kind of hard-coding, which is tedious to maintain, but possibly less brittle than relying on the preservation of status properties.

Aside from moving whole capabilities like that, the current CVO code handles cases like:

  1. There was a component built into OCP, without an associated ClusterVersion capability knob in 4.y.
  2. The component maintainers try to create a new ClusterVersion capability for their component in 4.(y+1).
  3. But they missed a manifest! The PrometheusRule gets the capability annotation in 4.(y+2).

Folks with the capability disabled on 4.(y+1) will have it implicitly enabled upon updating to 4.(y+2), because we don't support having a half-enabled capability. This situation doesn't come up that often, but it has happened. If you're willing to have the occasional cluster ambiguously running a partial capability, you can probably ignore it.

Member Author

Thank you both for your thoughtful answers.

The functionality was tech-preview in 4.17. So a GA 4.17 cluster updating to 4.18 with the None capability set would have not had the OperatorLifecycleManagerV1 capability enabled. You don't currently allow anything like a None capability set, so this case isn't relevant for you yet.

This EP proposes we start setting BaselineCapabilitySet to None, so this would become relevant...

Clusters updating from 4.19 to 4.20 in vCurrent will implicitly enable the capability. If you leave the cluster in vCurrent without implicitly enabling the capability, the OpenShift API server says "hmm, I guess I reject this DeploymentConfig API request..." and all the DeploymentConfig-based Pods in the cluster get garbage collected, or something.

Given that CVO in Hypershift (as well as Hypershift itself) currently don't handle implicitly enabled capabilities, doesn't it mean that we would already be in the state you describe if the DeploymentConfig capability gets removed from vCurrent today, even without the changes proposed in this EP?


I'm still processing the suggestions, will probably have more questions once I have fully digested each.

Member

Given that CVO in Hypershift (as well as Hypershift itself) currently don't handle implicitly enabled capabilities, doesn't it mean that we would already be in the state you describe if the DeploymentConfig capability gets removed from vCurrent today, even without the changes proposed in this EP?

Yikes, you're right! I've warned the workloads folks here and here.

Member Author

I've added a section about this under Risks and Mitigations: https://github.com/openshift/enhancements/pull/1729/files#diff-36fbf863a674d8de905f7aa65cdb6158ec8c27b70d2fe5eed4272ca0a964a730R233

I'm still uncertain about what the correct approach here is. Would be great if you could give us your input @deads2k.

Member

Just wiring threads together, David's response seems to be over here.

@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 5 times, most recently from e5cd838 to 2759a36 Compare January 22, 2025 09:08
@flavianmissi flavianmissi changed the title WIP ARO-13667 hypershift: support disabling image registry via capability ARO-13667 hypershift: support disabling image registry via capability Jan 22, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 22, 2025
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 3 times, most recently from 4ad9211 to ee7cf65 Compare January 22, 2025 13:30
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch from ee7cf65 to 10e09f2 Compare February 6, 2025 10:57
initially disabled at cluster install time.

#### Evolving the Hypershift capabilities API

Member

In HCP the "baremetal" capability should be disabled by default so clusters are installed without the CBO.
This makes a case for capturing capability opt-in semantics. A possible path forward:

type Capabilities struct {
	// disabledCapabilities when specified, sets the cluster version baselineCapabilitySet to None
	// and sets all additionalEnabledCapabilities BUT the ones supplied in disabledCapabilities.
	// This effectively disables that capability on the hosted cluster.
	//
	// When this is not supplied, the cluster will use the DefaultCapabilitySet defined for the respective
	// OpenShift version.
	//
	// Once set, this field cannot be changed.
	//
	// +listType=atomic
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="disabledCapabilities is immutable"
	// +optional
	DisabledCapabilities []OptionalCapability `json:"disabledCapabilities,omitempty"`

	// enabledCapabilities specifies opt-in for capabilities that are not enabled by default.
	// Its values are mutually exclusive with those in disabledCapabilities.
	EnabledCapabilities
...
}
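The mutual-exclusivity constraint stated in the comment above could be sketched like this. In the real API this would more likely be a CEL validation rule on the CRD; the helper below is only an illustration with plain strings, and its name is an assumption:

```go
package main

import "fmt"

// validateCapabilities rejects a Capabilities spec in which the same
// capability appears in both enabledCapabilities and disabledCapabilities.
func validateCapabilities(enabled, disabled []string) error {
	off := make(map[string]bool, len(disabled))
	for _, c := range disabled {
		off[c] = true
	}
	for _, c := range enabled {
		if off[c] {
			return fmt.Errorf("capability %q cannot be both enabled and disabled", c)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateCapabilities([]string{"ImageRegistry"}, []string{"ImageRegistry"}))
}
```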

Member Author

Thanks for the example, @enxebre ! I updated this section to reflect this.

@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 2 times, most recently from 23e3354 to 3673bd2 Compare March 5, 2025 13:10
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch from 3673bd2 to 7f3cdc6 Compare March 6, 2025 08:48
@sjenning
Contributor

/approve

@openshift-ci
Contributor

openshift-ci bot commented Mar 12, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 12, 2025
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch from 7f3cdc6 to 7190ad4 Compare March 12, 2025 16:26
@sjenning
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 17, 2025
@openshift-ci
Contributor

openshift-ci bot commented Mar 17, 2025

@flavianmissi: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 3e63d73 into openshift:master Mar 17, 2025
2 checks passed
@flavianmissi flavianmissi deleted the hypershift-image-registry-capability-ARO-13667 branch March 17, 2025 13:59