AGENT-1250: Simplified cluster operations for disconnected environments #1821
Conversation
@andfasano: This pull request references AGENT-1250, which is a valid Jira issue.
/assign @zaneb
Force-pushed from b1878df to 4b793e5.
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver in each of the affected files.
Force-pushed from 4b793e5 to 50f9cc0.
```yaml
name: openshift-releases
spec:
  releases:
    - "quay.io/openshift-release-dev/ocp-release:4.18.0-x86_64"
```
We'll need to figure out how to map these names to the temporary registry.
It's also essential for security reasons that releases here are referenced only by digest and never by tag.
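As a sketch, a digest-pinned entry in the quoted `spec.releases` list would look like the following (the digest is a placeholder, not a real release digest):

```yaml
name: openshift-releases
spec:
  releases:
    # pinned by digest, never by tag
    - "quay.io/openshift-release-dev/ocp-release@sha256:<release-digest>"
```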
> …was always available in the environment. Removing such dependency across all the possible locations appeared to be particularly complex and challenging (and moreover no formal way to enforce it once removed)
I'd say all 3 of the alternatives from the discussion doc are worth documenting as rejected alternatives:
- the initial version of this proposal, before we realised that we would have to expose the local registries via api-int.
- the version using PinnedImageSet
- the idea of writing a registry that exposes the containers-storage as a registry
> the idea of writing a registry that exposes the containers-storage as a registry
See also https://github.com/cgwalters/cstor-dist/.
Yeah, I have chatted to Colin about that. Unfortunately there are some fundamental limitations to what is standardised in the OCI Distribution protocol that we concluded made this approach too high-risk for our purposes.
> * The `spec.releases` will be used to configure the desired release contents. Adding (or removing) an entry from the list will trigger an update on all the control plane nodes.
Where do the control plane nodes source the images from?
Once copied, from the disk. Did you maybe mean the exact location on disk?
I think I meant: how does the control plane know where to copy the images from? I didn't see anywhere in the API where you configure "the ISO is loaded at this mount point".
This wasn't clear to me either.
A temporary registry will be started when the ISO is mounted (a udev rule matching on a disk label triggers a systemd service that starts it). The MCD will know how to get the image from that registry - it could be e.g. a static pod with a Service that gives it a well-known URL accessible within the cluster.
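A minimal sketch of that trigger chain, assuming a hypothetical filesystem label, unit name, mount point and registry invocation; the enhancement would define the real values:

```ini
# /etc/udev/rules.d/90-ove-release-iso.rules (hypothetical path and label)
# When a block device with the expected filesystem label appears, pull in the
# temporary-registry unit via udev's systemd integration.
ACTION=="add", SUBSYSTEM=="block", ENV{ID_FS_LABEL}=="OVE-RELEASE", TAG+="systemd", ENV{SYSTEMD_WANTS}+="ove-temp-registry.service"

# /etc/systemd/system/ove-temp-registry.service (hypothetical unit)
[Unit]
Description=Temporary registry serving release content from the attached ISO

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/usr/bin/mkdir -p /mnt/ove-release
ExecStartPre=/usr/bin/mount -o ro /dev/disk/by-label/OVE-RELEASE /mnt/ove-release
# Serve the pre-populated distribution storage tree shipped on the ISO.
ExecStart=/usr/bin/podman run -d --name ove-temp-registry --net host \
  -v /mnt/ove-release/registry:/var/lib/registry:ro docker.io/library/registry:2
ExecStop=/usr/bin/podman rm -f ove-temp-registry
ExecStopPost=/usr/bin/umount /mnt/ove-release
```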
(sorry, the file rename due to the markdownlint failures scrambled the latest comments a little bit; I will keep addressing the unresolved ones in the next commits - hopefully no more file renaming 😥)
> …to each local control plane node registry), managed by the Machine Config Daemon (MCD). Once the copy has been completed on all the control plane nodes, the user could start the usual offline upgrade process via `oc adm upgrade --allow-explicit-upgrade --to-image=<release-image>`.
What is the suggested approach to extract the relevant 'release-image' from the ISO?
Or maybe, since the ISO is constructed for a specific release image anyway, we can keep it transparent for users? E.g. apply a MachineConfig that starts the upgrade (and detects the right version internally).
The assumption is that the user will have previously downloaded a specific OCP ISO version, so it should be known information upfront, explicitly specified by the user.
This sounds completely disconnected from the graph; does that matter? Do you care that you can easily go from 4.20.100 to 4.21.0 and thereby fall back in time by two years?
If we had OSUS here this would just work normally, and you could use symbolic names. But we don't have a registry, so we don't have OSUS. We could at least verify that the current version is listed in the target release's list of previous versions; that would prevent upgrading into an older version, a path we would never have tested and that will almost certainly have some regression if you move to a considerably older release. Perhaps put the IRI resource into a failed state if the current version is neither the version of the IRI nor in its list of previous versions?
```console
$ oc adm release info 4.19.8 -o json | jq '.metadata.previous'
[
  "4.18.19",
  "4.18.20",
  "4.18.21",
  "4.18.22",
  "4.19.0",
  "4.19.1",
  "4.19.2",
  "4.19.3",
  "4.19.4",
  "4.19.5",
  "4.19.6",
  "4.19.7"
]
```
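A sketch of such a check, using only the command above plus the cluster's current version; `TARGET_RELEASE_PULLSPEC` is a placeholder for the release being added, and where the check is applied (copy time vs. upgrade time) is exactly the open question discussed below:

```bash
# Sketch: warn (or fail the IRI entry) if the current version is not a tested
# upgrade path to the target release. TARGET_RELEASE_PULLSPEC is a placeholder.
current=$(oc get clusterversion version -o jsonpath='{.status.desired.version}')
oc adm release info "${TARGET_RELEASE_PULLSPEC}" -o json \
  | jq -e --arg v "$current" '.metadata.previous | index($v) != null' >/dev/null \
  || echo "warning: ${current} is not in the target release's previous-versions list"
```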
I like that idea, although ideally we would apply that at the time of upgrading rather than at the time of copying the image into the local registry. Is this something we should build into CVO in the absence of OSUS?
I've lost track of the specific validations the CVO does, but I believe the way it's intended to work today is: if you provide a symbolic version, it has to be on the graph; as soon as you provide a specific pull spec (--allow-explicit-upgrade), the CVO assumes you know what you're doing, not that that's ever been documented explicitly. There's a lot of blurred lines between CLI-side guards and what the API accepts.
@wking can you check me here? What protections are in place, and what do you think would be the best way to keep users from making a mistake by potentially selecting the wrong mirrored image when expecting to upgrade? Maybe it's time to move all the client-side protections into the ClusterVersion CR and CVO via a new interface?
It would be great if we had symbolic versions. What I think we'd need to do that is a mapping of tags to sha digests that is cryptographically signed. If that existed it would be fairly easy to build into the ISO, but the infrastructure to set it up if it doesn't would be potentially daunting. Do you know of any existing artifacts that could act as building blocks here? A brief reading of sigstore docs suggests that it won't help.
FWIW: some time ago, before OSUS and oc-mirror were around, OCP4 upgrades on disconnected environments had to be done with --allow-explicit-upgrade, and that did carry the risks you are pointing out here of not properly obeying the graph. In that era, the message was that you had to be careful to follow the graph on your own, or you risked not being properly supported.
However, I'd prefer to have safeguard mechanisms like the ones you are considering. The older approach of following the graph on your own is too risky.
Arguably there's a high probability that most disconnected upgrades still follow this path and don't use OSUS, and that we should therefore punt this to a future enhancement rather than block on it now. (I agree that I'd prefer to have it, all things being equal.)
> Initially the registry image will be placed into a read-only [additional image …
Currently we simply use 'podman load' on the registry tar file.
Are you referring to the same flow?
Not exactly. By placing the registry image directly in the (podman) additional image store, there would be no need to load it beforehand (and, being read-only, it could not be pruned).
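As a sketch of that mechanism, the additional image store would be declared in the node's containers-storage configuration; the store path below is illustrative:

```toml
# /etc/containers/storage.conf (fragment) - hypothetical store location
[storage]
driver = "overlay"

[storage.options]
# Read-only additional image store pre-populated with the registry image.
additionalimagestores = [
  "/usr/lib/ove/registry-image-store",
]
```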
Hmm, so how is it simpler than 'podman load'? I mean, we would still need some service/script to place the image in the store, right?
The assumption was to have it already available at build time, but I think you're right for the live phase: since currently it doesn't seem to be (easily) possible to extend the ISO content, in that case we'll have to fall back to the podman load approach.
As for the installation phase instead (so after the reboot), the Assisted Installer could directly copy it to the target node (as mentioned here for the backend data). If so, I will rephrase this part to make it clearer.
> At this point the user could edit the existing IRI resource to add the new release to trigger the copy process (from the temporary registry to each local control plane node registry), managed by the Machine Config Daemon (MCD). Once the copy has been completed on all the control plane nodes, …
It might be helpful to have a single controller orchestrating the copy process across the MCDs running on the control plane nodes. This makes it easier to aggregate the individual node states into overall conditions and to report error modes at the operator level.
This design was inspired by PinnedImageSet in the hope that much of that can be... if not reused, at least used as a template.
IIUC PinnedImageSet does have its own controller running inside the MCO. I don't think the language here was intended to exclude that, so it could probably be clarified slightly.
Ah, I see. Agreed that we should clarify whether PIS is going to be reused here.
They're specified at the MachineConfigPool level, and the node-level images are represented in the status of the MachineConfigNode objects. We should be able to leverage that. We'd need some sort of translation between InternalReleaseImage <=> PinnedImageSets; maybe at the control-plane level?
I'm also not sure if PIS currently works in bootstrap/firstboot mode, so that might be something I can try getting clarification from the team on (cc @yuqi-zhang if you have context here). The flow above hints that the IRI manifest is added at the end of the installation, so perhaps the MCO translation piece is not relevant at bootstrapping time?
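If the PIS machinery were reused, the translated object might look roughly like the following; this is a sketch from memory of the PinnedImageSet API, and the group/version, pool association and naming may well differ:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: PinnedImageSet
metadata:
  # hypothetical name derived from the corresponding IRI release entry
  name: iri-ocp-release-bundle-4-19-0
spec:
  pinnedImages:
    # PIS pins by digest; the digest below is a placeholder
    - name: quay.io/openshift-release-dev/ocp-release@sha256:<release-digest>
```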
Hmm, I guess I'm not too clear on this either. It reads to me that the install flow is completely handled by the agent installer (i.e. it will "set up" the registry and image content from the install ISO on all the control plane nodes), and upgrade-wise that responsibility falls onto the MCD via specifications in IRI? Does IRI come into the install process at all through the worker nodes?
Re: upgrade, my read of the process outlined below is that:
- the user plugs a USB disk ("attach the extended ISO") into the first control plane node
- the user writes IRI for that specific node, so that the MCD copies it
- the user unplugs it, plugs it into the second control plane node
- write IRI to the second node
repeat until done for the control plane, then trigger the upgrade for the whole cluster, where each of the 3 control plane nodes reads off itself and the workers then read off the control plane?
If that's the case, I think we'd need some daemon process to handle the actual "copy" regardless. We probably also need some controller handling the central state, though, to ensure the user has successfully completed all copies and to write some status before they can upgrade.
@djoshy it's not literally a PinnedImageSet, but I am saying that the architecture of PIS and any lessons learned are readily adaptable to the architecture of this (i.e. it's a thing that has to run on every node, report back when it is done, and roll up status so we can see when the whole cluster is up to date). Hopefully that can speed up the design, because it is not uncharted territory.

> It reads to me that the install flow is completely handled by the agent installer (i.e. it will "set up" the registry and image content from the install iso on all the control plane nodes)

This is how the Dev Preview prototype works now. The plan was indeed for this to continue, because that means we can make sure that all of the nodes have their images cached on disk before proceeding with the installation (although we don't yet do that).
It would certainly be advantageous if we could rely on the IRI and MCO even at install time to create the local registry, as it would avoid duplicating the implementation, and eliminate the risk of ending up with both files on disk due to the ones created by the installation not being managed by the MCO. We could monitor the MCN status and not let the install continue until it is complete on the first 2 control plane nodes, but we would have no way of verifying the third node (the one running assisted-service) after it reboots, although theoretically since we'd have two working copies in the cluster there'd be nothing to stop it from eventually converging. This would require us to run an accessible registry on the bootstrap node, and pull data over the network even though it was previously available locally (though this is the case for worker nodes and during upgrades anyway). Perhaps a middle ground is to pre-populate just the images, and let MCO set up the registry service. @andfasano WDYT?
> Does IRI come into the install process at all through the worker nodes?
We're proposing that worker nodes will not have a local registry, and that the control plane will operate as an HA registry for all nodes.
> the user unplugs it, plugs it into the second control plane node
No, the idea is that when it is plugged in to the first control plane node, we'll start a temporary registry that is accessible from all control plane nodes. So they'll all ~simultaneously pull the image from the node with the ISO attached.
> I think regardless we'd need some daemon process to handle the actual "copy". We probably also need some controller handling the central state though, to ensure the user has successfully completed all copies and write some status before they can upgrade.
Agree.
> It would certainly be advantageous if we could rely on the IRI and MCO even at install time to create the local registry
On reflection, this wouldn't work for single-node 🙁
> Perhaps a middle ground is to pre-populate just the images, and let MCO set up the registry service. @andfasano WDYT?
If this means letting the AS copy the registry data before the first reboot, then yes, I think it'd be more convenient to let the MCO set up the registry as soon as possible. The registry must be available at least before machine-config-daemon-pull.service and resolv-prepender.
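As an illustration of that ordering requirement, the MCO-injected unit could declare it directly; the unit name, paths and registry invocation below are assumptions, and the exact resolv-prepender unit name varies by platform so it is only indicated in a comment:

```ini
# Sketch: /etc/systemd/system/iri-local-registry.service (hypothetical unit name)
[Unit]
Description=Node-local release image registry
# Must be up before the MCD pulls its image and before name resolution is rewritten.
Before=machine-config-daemon-pull.service
# Before=<resolv-prepender unit for this platform>
Wants=network-online.target
After=network-online.target

[Service]
# Serve the pre-populated storage tree on the host network, on the proposed port.
ExecStart=/usr/bin/podman run --rm --name iri-local-registry --net host \
  -e REGISTRY_HTTP_ADDR=:22625 \
  -v /var/lib/iri-registry:/var/lib/registry:Z docker.io/library/registry:2
Restart=always

[Install]
WantedBy=multi-user.target
```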
palonsoro left a comment
I'd have some questions and I also found a typo.
```yaml
enabled: true
status:
  releases:
    - "quay.io/openshift-release-dev/ocp-release:4.18.0-x86_64"
```
Thinking about what we should report here, I wonder if each individual status entry for each payload should have subfields around progressing/success/error/etc., instead of overall conditions, so it's easier to see which payloads are there, which are being pulled, etc.
Although since a user is unlikely to specify more than 2 payloads here at a time, perhaps we just report the most recent error? Would the MCD be constantly reconciling all releases it expects to be on disk, or just the latest added/removed one?
> Would the MCD be constantly reconciling all releases it expects to be on disk? Or just the latest added/removed one?

It should reconcile them all. Otherwise, mid-upgrade scenarios could potentially break if the images from the older release are not being properly synchronized for some reason but are still needed because their component is not yet fully upgraded.
In that case I think the status should have sub-sections per release, i.e. it would be nice to see how many nodes have successfully pulled each release, what errors there are, etc.
I fully agree @yuqi-zhang. If a problem of any kind arises, we would indeed need that degree of detail. Otherwise, we would only know that "something failed somewhere".
So given this direction, what subfields will the release status have?
In general I see the action of adding/removing a specific release as an atomic operation, with its (global) status reported here. For the detailed error, could the previously proposed approach be reused?
Is it possible that a user is both adding and removing a release simultaneously? If so, and there are errors both in adding and in removing those release images, how would that be reported at the top level?
I think adding some more status to this list for each image (perhaps even tallying which release(s) are currently in use by CVO) makes sense, so that we can work out problems with the individual releases.
These could just be a set of conditions for each release, perhaps InUse, Ready, etc., as sketched below.
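A rough sketch of what such per-release conditions could look like; the condition types, reasons and digests are illustrative, not a settled API:

```yaml
status:
  releases:
    - name: ocp-release-bundle-4.18.0-x86_64
      image: quay.io/openshift-release-dev/ocp-release@sha256:<digest-4.18>
      conditions:
        - type: Ready        # copied to every control plane node registry
          status: "True"
          reason: CopyCompleted
        - type: InUse        # referenced by ClusterVersion (desired or recent history)
          status: "True"
          reason: CurrentClusterVersion
    - name: ocp-release-bundle-4.19.0-x86_64
      image: quay.io/openshift-release-dev/ocp-release@sha256:<digest-4.19>
      conditions:
        - type: Ready
          status: "False"
          reason: CopyInProgress
          message: "2/3 control plane nodes have pulled the release contents"
        - type: InUse
          status: "False"
          reason: NotReferenced
```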
> …extended RHCOS ISO is attached, and then mount it to launch a temporary registry to serve its content. After the user modified the IRI resource with the new release pullspec, the new OCP release image payload will be copied on each node registry using skopeo.
Is skopeo pointed directly at the host service, or does it also get an IDMS entry so that skopeo thinks it's talking to quay.io? If it had an IDMS entry, could the IRI controller be used, unaltered, in a connected environment, because it would eventually synchronize from the origin?
There's something to be said for that, and also something to be said for not doing that because users who are connected but bandwidth-constrained will hate us if we download data over the network when they already supplied it locally 😄
Either way, the answer needs to be made explicit in this proposal.
They'd have supplied it locally via an IDMS pointed at their local registry, right? I assume there's a way to say "never even attempt to reach the origin, fail instead"?
Sadly I don't think there is.
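For reference, a sketch of the copy step described in the quoted text above; the registry endpoints, repository name, port, auth file and digest/tag are all illustrative:

```bash
# MCD-side sketch: copy the new release payload from the temporary registry
# (on the node with the ISO attached) into the node-local registry.
skopeo copy --all --preserve-digests \
  --src-tls-verify=false \
  --dest-authfile /etc/iri-registry/auth.json \
  docker://<temp-registry-host>:5000/openshift/release-images@sha256:<release-digest> \
  docker://localhost:22625/openshift/release-images:4.19.0-x86_64
```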
> …adopted by the Machine Config Server (MCS). This is a new requirement that will need to be properly documented. In addition, to minimize the risks, the registry will have to be secured via a pull-secret.
How would a user configure the appropriate pull secrets that will be allowed to pull from this registry?
Do we really need a user to do that? Could auto-generating the credentials (during the registry setup) and eventually saving them into an openshift-config/iri-registry CM be enough? I don't expect the user to have any specific need to access it (from outside the cluster), though.
We'll probably need to document how the user can force the credentials to be rotated.
I imagine that at installation time we'll generate credentials and save them in a Secret, as well as add the auth data to the cluster's pull secret.
If the user wants to rotate, they'll need to delete the Secret (MCO will automatically replace it?) and manually update the cluster pull secret.
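A hedged sketch of that rotation flow; the Secret name and namespace are assumptions, as is the generated Secret being in dockerconfigjson form, while updating the global pull secret via `oc set data secret/pull-secret -n openshift-config` follows the documented mechanism:

```bash
# 1. Delete the generated Secret so the controller re-creates it with fresh
#    credentials (hypothetical name/namespace).
oc delete secret iri-registry-credentials -n openshift-machine-config-operator

# 2. Merge the regenerated auth entry into the cluster-wide pull secret.
oc get secret iri-registry-credentials -n openshift-machine-config-operator \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > /tmp/iri-auth.json
oc get secret pull-secret -n openshift-config \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > /tmp/pull-secret.json
jq -s '.[0] * .[1]' /tmp/pull-secret.json /tmp/iri-auth.json > /tmp/merged.json
oc set data secret/pull-secret -n openshift-config \
  --from-file=.dockerconfigjson=/tmp/merged.json
```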
This sounds like an important workflow to document in the EP
…operations.md Co-authored-by: Pablo Alonso Rodriguez <[email protected]>
> To allow pulling images from the cluster, an ImageDigestMirrorSet (IDMS) policy will be added to the cluster manifests - using the `api-int` DNS entry - so that the set of registries running within the control plane nodes will become the nominal pullspec of the current release…
This is ambiguous.
We will certainly need an IDMS because the containers inside the release payload refer to quay images by digest, and we will need to get them from the mirror.
It doesn't follow from this that the nominal pullspec (the one the release image is referred to as in CVO) is using api-int, so the word 'so' is out of place here. It could be any of:
- quay.io with tag + ITMS (possibly has to be this to use the upgrade graph checking in CVO?)
- quay.io with digest (covered by existing IDMS)
- api-int with tag
- api-int with digest
We need to be explicit about which.
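For the by-digest part, a sketch of what such an IDMS could look like; the mirror host, port and resource name are assumptions, and which form the nominal pullspec takes is exactly the open question above:

```yaml
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: internal-release-mirror   # illustrative name
spec:
  imageDigestMirrors:
    - source: quay.io/openshift-release-dev/ocp-release
      mirrors:
        - api-int.<cluster-domain>:22625/openshift-release-dev/ocp-release
    - source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
      mirrors:
        - api-int.<cluster-domain>:22625/openshift-release-dev/ocp-v4.0-art-dev
```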
> …desired release entry value.
>
> A ValidatingAdmissionPolicy will be added to prevent the user from removing a release entry that is currently being used by the cluster (as reported by…
We can look at ClusterVersion's status.desired.image to get the target version.
If an upgrade is in progress... presumably we have to look at status.history and include every release back to the last one with status=Completed? (Not looking forward to seeing the VAP code for this 😬)
Also how can we tell if the release is using the internal registry? Presumably we need to use api-int as the nominal pullspec to tell them apart from ones pulled directly from the Internet.
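For illustration only, a minimal VAP along those lines might look like the following; the IRI API group and resource are placeholders, the release entries are assumed to carry an `image` field, and the expression only handles the simple desired-image case (walking `status.history` would be considerably hairier, as noted above):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: iri-protect-in-use-release
spec:
  failurePolicy: Fail
  paramKind:
    apiVersion: config.openshift.io/v1
    kind: ClusterVersion
  matchConstraints:
    resourceRules:
      - apiGroups: ["<iri-api-group>"]       # placeholder
        apiVersions: ["*"]
        operations: ["UPDATE"]
        resources: ["internalreleaseimages"] # placeholder
  validations:
    - expression: >-
        object.spec.releases.exists(r, r.image == params.status.desired.image)
      message: "the release currently in use by the cluster cannot be removed"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: iri-protect-in-use-release
spec:
  policyName: iri-protect-in-use-release
  validationActions: ["Deny"]
  paramRef:
    name: version                 # the ClusterVersion singleton
    parameterNotFoundAction: Deny
```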
> ##### Removal of InternalReleaseImage resource
>
> The deletion of the IRI `cluster` resource will be handled by a finalizer, and it will stop all the registry services running on the control plane nodes.
Does 'stop' here mean systemctl stop + systemctl disable? Or is it 'roll out a new MachineConfig without the service'?
If the former, how will we get it to stick without the drift detection re-enabling it?
From an implementation point of view, it will be handled by the IRI controller by deleting the local registry MachineConfig resource
|  | ||
|
|
||
| A systemd service injected by MCO will ensure the registry will be available | ||
| after the reboot. |
During an upgrade, MCO also needs to replace the registry container image with the one from the new release image.
> A systemd service injected by MCO will ensure the registry will be available after the reboot. The registry will have to run on a host network port on each node to ensure it's accessible from the outside, specifically port 22625 (to be [registered](https://github.com/openshift/enhancements/blob/master/dev-guide/host-port-registry.md)).
Suggest registering as part of this PR 🙂
> Finally, a new `InternalReleaseImage` custom resource, managed by MCO, will also be added as a cluster manifest to keep track and manage the currently stored internal release image.
Add details of the OLM aspect of the installation here?
Do you maybe mean reporting the list of what exactly was included in the current release bundle? If so, I will update the proposal soon with more details on this topic (see also this comment).
cc @danmanor for additional comments
Added a more detailed description for its usage, and proposed a new release identifier (automatically discovered) to be used for adding a new release bundle
pablintino left a comment
Adding some minor comments. Tomorrow I'll give this a second read, but I think it's OK from an MCO point of view.
```yaml
releases:
  - name: ocp-release-bundle-4.18.0-x86_64
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5bca02661d61955b62889e7e2e648905b7202d5788f5ba5ab69055a73dffdb5c
    conditions:
```
nit/question: I'm a bit surprised to see conditions outside the status.conditions section. Although it makes sense, I think it's worth double-checking whether we already have other examples like this one. If this usage of conditions is fine, should we also report, among the global conditions, one indicating whether the IRI is ready or not from a node perspective?
I went for conditions as the most flexible way to describe the story of each individual release bundle on a given node, useful especially for troubleshooting in case of issues. The alternative could be to switch to a more rigid model with a predefined set of fields. Initially I checked the PinnedImageSet example, but it was slightly different.
Anyhow, with the conditions approach I wouldn't see the need for additional fields to report the global condition, as the IRI controller could look for a specific condition reason.
Perhaps to help differentiate we can tweak the naming
yuqi-zhang left a comment
Some more questions inline
```yaml
availableReleases:
  - name: ocp-release-bundle-4.19.0-x86_64
releases:
  - name: ocp-release-bundle-4.18.0-x86_64
    image: quay.io/openshift-release-dev/ocp-release@sha256:5bca02661d61955b62889e7e2e648905b7202d5788f5ba5ab69055a73dffdb5c
  - name: ocp-release-bundle-4.19.0-x86_64
    image: quay.io/openshift-release-dev/ocp-release@sha256:3482dbdce3a6fb2239684d217bba6fc87453eff3bdb72f5237be4beb22a2160b
```
Maybe we want to think more about the naming of these two fields. "Available" to me suggests this is something I could use, and I'm not sure exactly what the second releases field is for.
If I've read correctly, availableReleases is the currently mounted release, and releases is the list of stored releases in the local registry, right?
Yes. In particular status.availableReleases indicates the currently mounted and not yet installed bundles, and it's meant to provide an auto-discovery facility to the end user.
We may want to think of a name that better distinguishes this from releases that are already installed. mountedReleases?
"Available" to me is too close to "I can use this for my cluster now", just thinking about how we use the term available in other contexts.
maybe a more explicit installableReleases?
Just my 2 cents: if we grab the idea from the CVO ClusterVersion CR, which already has the concept of candidates, maybe availableReleases would be more aligned. IMHO mountedReleases exposes the mount implementation detail of how the feature internally works.
As I understand it, one of the ambiguities with the word "available" comes from the uncertainty over whether it's "available for the cluster to use" or "available to be pulled in, so that the cluster can use it".
So the argument for disambiguation would be mountedRelease vs. locallyAvailableRelease, although I feel like that wordplay might make it more confusing than it helps, since mounted could also be understood as "locally mounted" or similar.
Alternatively, like Pablo mentions, the CVO considers what is "available to be upgraded to, but not yet on the cluster" as available, so maybe following that pattern would be simpler (and we can clarify via godocs). Then we'd get availableReleases vs. activeReleases/localReleases or something like that.
(I guess what I'm saying is I feel like neither field should just be releases, to help differentiate, but I have no strong opinion beyond that.)
I'd tend to agree with @pablintino, as mountedReleases seems too technical to me as well.
@andfasano: all tests passed!
> The release image and operators images will be stored inside the extended RHCOS ISO in the storage format used by the [distribution/distribution](https://github.com/distribution/distribution) registry's filesystem storage plugin. The Appliance builder will generate a new `release-bundle` (signed) image…
Is it just a blank image with the bundle.json file? What's the process of signing this container image?
Currently yes, just the bundle.json file. I'm not aware of the specific details of the signing process, but I do expect we'll have to address it while building the OVE ISO in Konflux (I'd expect something similar to CVO; cc @bfournie in case he has more details about it).
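For reference, the quoted text above says the ISO ships the content in the distribution registry's filesystem storage format; that on-disk layout looks roughly like this (repository and tag names are placeholders):

```text
docker/registry/v2/
  blobs/sha256/<xx>/<digest>/data            # deduplicated blobs (layers, configs, manifests)
  repositories/<repo-name>/
    _layers/sha256/<digest>/link             # layer blobs linked into the repository
    _manifests/revisions/sha256/<digest>/link
    _manifests/tags/<tag>/current/link       # tag -> manifest digest
```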
> [Assisted Service](https://github.com/openshift/assisted-service/) and [Assisted Installer](https://github.com/openshift/assisted-installer) will be enhanced accordingly to take care of this step, and to ensure it will be properly monitored during the installation process.
So the benefit of this approach is basically the ability to precisely monitor the installation, right?
If so, I just wonder about the priority of such an effort. I mean, it sounds like the right direction, but it's a pretty major change. Do you think it's a mandatory change that should be handled first? Compared to the current behavior, do we get any advantages other than being able to fail more gracefully?
Just want to understand whether it's merely a nice-to-have initiative :/
IMO it's not a nice-to-have feature but a critical step to make the installation workflow more robust and properly monitored.
Copying 40GB+ of data takes a good amount of time on a bare metal instance (up to 1 hour as per the current testing), and several things can go wrong during the process, with the potential risk of leaving the data incomplete or corrupted (as already experienced in the 4.19 DP release). The user should therefore be properly informed that everything went fine (or that some problem happened), to avoid later issues during the joining process, which are usually very painful to troubleshoot.
During the installation phase, the Assisted Installer is currently the eligible actor on each node responsible for carrying out all the required tasks (such as writing the RHCOS image), and monitoring is already supported via communication with the Assisted Service, which the user can consume via the openshift-install wait-for command or via the local Assisted UI. In this way we could also ensure that, prior to rebooting a node to let it join the control plane, all the data required for it to work will effectively be available.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so.

/lifecycle stale

/remove-lifecycle stale
/draft
This patch contains the enhancement proposal for simplified cluster operations in disconnected environments.