From 766b9a1279ddb7bd3b0e0ca22a8909e1ad2f470e Mon Sep 17 00:00:00 2001 From: Uri Lublin Date: Mon, 27 Oct 2025 19:11:21 +0200 Subject: [PATCH] Enhancement proposal for Confidential Clusters --- .../security/confidential-clusters.md | 620 ++++++++++++++++++ 1 file changed, 620 insertions(+) create mode 100644 enhancements/security/confidential-clusters.md diff --git a/enhancements/security/confidential-clusters.md b/enhancements/security/confidential-clusters.md new file mode 100644 index 0000000000..44b44db5d5 --- /dev/null +++ b/enhancements/security/confidential-clusters.md @@ -0,0 +1,620 @@ +--- +title: confidential-clusters-enhancement-proposal +authors: +- "@uril" +- "@travier" +reviewers: +- "@confidential-cluster-team" # for the Confidential Cluster operator +- "@coreos-team" # for RHCOS changes +- "TBD" # Someone from the @mco team +- "TBD" # Someone from the @installer team +approvers: +- "@sdodson" +- "TBD" +api-approvers: +- "TBD" # c.f. Confidential Cluster Operator API +creation-date: 2025-10-23 +last-updated: 2025-10-23 +status: implementable +tracking-link: +- "https://issues.redhat.com/browse/OCPSTRAT-2023" +- "https://issues.redhat.com/browse/OCPSTRAT-2316" +- "https://issues.redhat.com/browse/OCPSTRAT-1940" +see-also: +- "https://github.com/confidential-computing" +replaces: +- N/A +superseded-by: +- N/A +--- + +# OpenShift Enhancement: Confidential Clusters + +## Summary + +This enhancement proposes the integration of **confidential computing** +capabilities into **OpenShift clusters**, enabling the deployment of +**Confidential Clusters**. A confidential cluster is an OpenShift cluster where +all nodes run on Confidential Virtual Machines (CVMs) and are remotely attested +before they join the cluster. By leveraging CVMs, the memory for all workloads +and their management services is automatically shielded from the underlying host +infrastructure and each node's disk is encrypted. 
This provides a foundational +layer of protection for sensitive data in memory. All nodes of the cluster are +also remotely attested to be running valid versions of RHCOS before they join +the cluster and on every boot. + +## Motivation + +In today's cloud-first world, organizations are increasingly migrating sensitive +workloads to public cloud environments. While cloud providers offer significant +scalability and flexibility, concerns around data confidentiality and integrity, +from the cloud provider itself or other unauthorized parties, remain a +significant barrier for highly regulated industries. + +Traditional cloud deployments expose workload memory and disk content to the +host, creating a potential attack surface. Confidential Clusters address this by +ensuring all OpenShift nodes run on CVMs, automatically encrypting and +protecting, from the host, memory of workloads and management services as well +as the content on the disk. + +It is also required to be able to attest with high confidence that those +protections are effectively in place on the cloud provider’s +infrastructure. Thus in Confidential Clusters, all nodes of the cluster are +asked to send hardware signed quotes to a remote attestation server to validate +the confidential computing features enabled for the virtual machines and to +verify the version of the operating system that is booted. + +Those added security layers enhance the security posture of OpenShift +deployments, making it a viable platform for even the most sensitive +applications. This enhancement proposal is meant to explain how we can integrate +confidential computing technology with OpenShift and expose this capability for +the management cluster's lifecycle. 
+ +### User Stories + +Here are several scenarios where Confidential Clusters would provide immense +value: + +* As a regulated company (Finance, Healthcare, etc), I want to run my + applications and data on OpenShift in the cloud, knowing that the data in + memory and on the disk is protected from the cloud provider and unauthorized + access. + +* As a company manager, I want to provision separated, isolated confidential + OpenShift clusters for each department, such that strict data segregation and + protection is maintained for their highly sensitive operations. + +* As a data scientist or an AI developer, I want to run OpenShift AI workloads, + including training models and processing proprietary datasets, while being + confident that my data and models are protected in memory and on the disk + throughout their lifecycle. + +### Goals + +* Enable the deployment of OpenShift clusters where all nodes operate as + Confidential VMs (CVMs), minimizing exposure risk to the cloud provider or + other unauthorized entities. + +* Implement a robust remote attestation process for CVM nodes to verify their + trustworthiness before sharing secrets and joining the cluster. The remote + attestation process ensures that the software running on the node (kernel, + operating system binaries, etc.) is exactly what is configured for the cluster + and that the nodes operate in confidential mode. + +* Provide a seamless integration of confidential computing and remote + attestation from the cluster admin perspective. + +* Support cluster upgrades and other lifecycle operations while preserving + cluster confidentiality. + +### Non-Goals + +* This enhancement does not aim to protect from a malicious cluster operator or + from an attacker that managed to elevate their privileges to cluster admin. 
+ +* This enhancement does not aim to provide data encryption outside of the + confidential computing environment (for example network encryption, additional + disk encryption), though existing OpenShift mechanisms to do that are + available. + +* This enhancement does not cover changes to application-level data + encryption. It focuses on protecting data in memory and on the disk at the + infrastructure layer. + +* This enhancement does not address the security of the underlying cloud + provider's hardware or hypervisor outside of the CVM's confidential execution + environment. + +## Proposal + +Run all OpenShift nodes on Confidential VMs (CVMs). Use remote attestation to +verify the integrity and authenticity of a new node's hardware and software +before sharing secrets with that node and allowing it to join the cluster. + +This implementation will happen in two phases. + +* In the first phase, we will consider the bootstrap node and the first boot of + each new node to be trusted. In this phase, only the confidentiality of the + cluster will be guaranteed. We will assume the attacker can read data but not + write data (to the disk, cloud metadata config, etc.). + +* In the second phase, we will remove the need to trust the bootstrap node and + the first boot of each node. Once completed, both confidentiality and + integrity will be guaranteed. + +We are working on a more detailed threat model, which will be submitted in a +later stage. + +In the first phase (confidentiality), the following changes are needed to those +components: + +* OpenShift API + * Allow nodes to be marked as confidential. This is specific per cloud + provider and per Hardware manufacturer. + * Request/Instruct cloud providers to run nodes as CVMs. + +* Installer + * Allow users to specify they want to run OpenShift as a Confidential Cluster + (cloud provider specific). 
+ * Deploy the Confidential Cluster Operator on the bootstrap node. + +* Confidential Cluster Operator + * Set up a Trustee (attestation service) instance in the cluster to attest + nodes. + * Set up attestation and resource access policies in Trustee. + * Provide a registration server for new nodes to trigger the provisioning of + secrets. + * Set up a MachineConfig to instruct new nodes to attest themselves. + * Watch for cluster or OS image updates, then compute and update the set of + reference values (expected "correct" values) in Trustee. + +* RHEL CoreOS + * Add support for composefs (native), UKI, and systemd-boot to bootc (Bootable + Containers). + * Build and upload disk images using UKI and systemd-boot to cloud providers. + * Add an attestation client to the operating system, such that nodes can request + attestation and fetch secrets upon a successful attestation. + * Add a Clevis Trustee pin to fetch the LUKS passphrase upon a successful + attestation and encrypt/decrypt the disk. + * Modify Ignition to support the Clevis Trustee pin. + +In the second phase (integrity), the following changes are needed to those +components: + +* Installer + * Generate Trustee configuration and reference values to let administrators + set up an external Trustee instance used by the bootstrap node. + +* Confidential Cluster Operator + * Support syncing secrets and reference values to an out-of-cluster Trustee + instance. + +* RHEL CoreOS + * Support verifying the integrity of the disk content during re-partitioning + on first boot. + * Set the PK/KEK/db/dbx configuration when uploading disk images to cloud + providers. + * Modify Ignition to support fetching configs from a Trustee resource after + remote attestation. 
+ * Measure Ignition config in a PCR value, before parsing it + +* Machine Config Operator + * Ensure that MachineConfigs are only served to attested nodes + * Option: Store MachineConfigs as Trustee resources, stop serving configs via + the MCS + +* Cluster Machine Approver + * Ensure that the logic in the CMA guarantees that only nodes passing + attestation can get their CSR signed. + +### Workflow Description + +#### Cluster Administrator Workflow + +The changes in the workflow for cluster creation differ based on the phase +implemented. + +##### Cluster creation for the first phase + +1. The cluster creator selects the Confidential Cluster option in the OpenShift +installer. +1. The rest of the installation process should not differ from the cluster +creation perspective. + +##### Cluster creation for the second phase + +1. The cluster creator chooses a domain name or IP which will be used to host +the initial, external, out of cluster, Trustee instance. This instance can be +hosted via a container on another system or using the Trustee operator in an +existing OpenShift cluster. +1. The cluster creator selects the Confidential Cluster option in the OpenShift +installer, passing in the URL of the external Trustee instance chosen above. +1. The OpenShift installer generates a set of configuration files for the + external Trustee instance. +1. If the cluster creator adds/removes/modifies MachineConfigs, the +configurations above need to be re-generated again. +1. The cluster creator configures the external Trustee instance with those + configuration files. +1. The cluster creator then resumes provisioning the cluster, starting with the +bootstrap node. +1. The cluster creator verifies that the bootstrap node has been properly + attested. +1. The rest of the installation process should not differ from the cluster +creation perspective. + +##### New node creation + +The cluster administrator flow should not change when adding new nodes to the +cluster. 
The Confidential Cluster Operator will perform the necessary +configuration to allow new nodes to join the cluster. + +##### Cluster update + +The cluster administrator flow should not change when updating a cluster. The +Confidential Cluster Operator will perform the necessary configuration to allow +nodes to attest to the cluster using the new version of RHCOS. + +##### Shutting down and restarting Confidential Clusters + +1. The cluster administrator synchronizes the policies and secrets configured in + the Trustee instance to an external Trustee instance. +1. The cluster administrator verifies that all control plane nodes are + configured to use the external Trustee instance as a fallback in the Clevis + Trustee PIN configuration. +1. The cluster administrator shuts down the cluster. +1. Before restarting any node, the Trustee instance must be made available at + the domain or IP configured above. +1. The cluster administrator restarts the control plane nodes, which attest + themselves to the external Trustee instance. +1. The cluster administrator restarts the worker nodes, which attest themselves + to the internal or to the external Trustee instance. + +#### User (Application Administrator) Workflow + +This enhancement does not introduce any change to user workflows. + +## API Extensions + +This enhancement introduces some new API extensions: + +* **Running nodes on cloud CVMs**: +For each supported cloud provider, confidential computing types and code need to +be added to + * OpenShift API: types_.go + * Cluster API: cluster-api-provider- + * Machine API: machine-api-provider- + * Machine API operator: add a webhook to validate confidential cluster + configuration + * OpenShift Installer: parse and set up confidential cluster configurations + +* **ConfidentialCluster CRD**: This custom resource is used to configure the + Confidential Cluster Operator and indirectly the Trustee instance that is used + to attest nodes in the cluster and provide secrets. 
+ It is namespaced, versioned and contains: + * TrusteeImage - the container image of Trustee attestation service + * PcrsComputeImage - the container image for computing PCRs reference values + * RegisterServerImage - the container image of node registration service + * PublicTrusteeAddr - the IP address of Trustee attestation server, to be + accessed by attesting nodes + * TrusteeKbsPort - the port that Trustee serves on + * RegisterServerPort - the port that the registration service serves on + +* **Ignition spec changes**: The Ignition configuration specification will be + extended to support: + * configuring the Clevis trustee pin + * enable fetching remote config after remote attestation + +## Topology Considerations + +### Hypershift / Hosted Control Planes + +Initially, this enhancement will not support a hosted control plane +topology. However, the design can be extended to support it. + +In a HCP scenario, the operator will only be responsible for the worker +nodes. As the Confidential Cluster Operator will be hosted in the control plane, +the nodes hosting those services are considered part of the Trusted Computing +Base (TCB). + +As HCP operators are not allowed to set up MachineConfigs, we will need an +option during HCP cluster creation to set up a MachineConfig in the control +plane and tell the Confidential Cluster Operator to use it. + +### Standalone Clusters + +Standalone Clusters running on cloud providers supporting confidential virtual +machines are the primary target for this enhancement. + +In the future, we might want to extend the remote attestation feature of this +enhancement to be able to use it for Bare Metal OpenShift clusters to get +stronger guarantees that nodes have not been tampered with (i.e. "Attested +Clusters"). 
In this case, the nodes would not be running as Confidential VMs and +their memory would not be encrypted, but the guarantees around which operating +system version is used on each node and its integrity would be provided to +cluster operators. + +### Single-node Deployments or MicroShift + +Initially, this enhancement will not support SNO & MicroShift +deployments. However, the design can be extended to support them. + +Single node deployments require an external Trustee instance to be available +when the node boots. Thus in this scenario, the Confidential Cluster Operator +and the Trustee instance would be running in a management cluster responsible +for multiple SNO/MicroShift deployments. The Operator would coordinate reference +value updates in tandem with the management tools (for example ACS). + +Confidential Clusters run on confidential VMs, so they require running on VMs +and on special hardware. + +## Implementation Details/Notes/Constraints + +### Operating system integrity and confidentiality guarantees + +To guarantee the integrity of the operating system, we are adding composefs, UKI +& systemd-boot support to bootc (Bootable Containers). Unified Kernel Images +(UKIs) bundle the kernel, initrd and kernel command line into a single PE +binary that is signed for Secure Boot. Each UKI also includes the hash of the +composefs image used for the operating system, thus strongly tying a booted UKI +to a version of the operating system. + +To make sure only Red Hat signed (or eventually customer signed) UKIs can be +booted, we will set the Secure Boot configuration for cloud instances to only +trust Red Hat’s (or the customer’s) Secure Boot certificates. + +In order to verify that those validations effectively took place, we are using a +remote attestation process which relies on the measurements of the boot chain +components via the TPM. 
The measurements are stored in PCR banks which are +signed by hardware components and sent to a remote Trustee instance for +validation. + +### Adding a new node to the cluster + +Each node of the cluster will be started as a confidential VM. As part of the +first boot process, in the initramfs, Ignition runs and fetches its +configuration from the cloud provider instance metadata service (user-data). + +In phase 1, we will trust that this configuration has not been tampered with. + +In phase 2, we will measure this configuration in a PCR value before processing +it. + +The initial Ignition configuration mainly consists of a directive that asks +Ignition to replace the entire configuration with the content that it will fetch +over HTTPS from the Machine Config Server (MCS). + +In phase 1, this will not be changed. + +In phase 2, the initial configuration will be modified to tell Ignition to fetch +the new configuration from a remotely attested resource endpoint. The MCS will +not serve Ignition configs directly for nodes anymore but will store those as +resources in a Trustee instance. To access those configurations, the node will +have to successfully remotely attest itself first. + +Included in the new configuration provided by the MCS, a directive tells +Ignition to fetch an additional element of configuration from a new service: the +registration server from the Confidential Cluster Operator. + +What happens in the operator as part of this registration step is described in +. + +On first boot, the content of the operating system is in clear text on the +disk. The additional configuration fetched from the registration service +includes a directive that tells Ignition to encrypt the entire root disk using +LUKS. Ignition first reads the operating system content from the disk into +memory, then re-partitions the disk, sets up LUKS and then writes back the +content in the root partition. 
+ +When setting up the keys for unlocking the LUKS device, the configuration tells +Ignition to use the Clevis Trustee Pin which fetches a resource from a Trustee +instance that is used as the secret to bind the LUKS device. To access this +resource, the node must pass remote attestation successfully. To ensure that a +node can only fetch a single secret at a time, a unique identifier is provided +in the additional Ignition configuration provided by the registration server and +this value is measured in a PCR value that is validated as part of the remote +attestation process. + +In phase 1, the content read from the disk will not be fully verified for +integrity. + +In phase 2, the content read from the disk will be verified for integrity. + +Finally, once the content of the root partition has been written back to the +disk, the system resumes booting and later joins the cluster. + +If any attestation step fails, the node keeps retrying each configured Trustee +server in turn, indefinitely. This is required as a Trustee server may be +offline at any given point in time or because the reference values accepted by +Trustee have not yet been updated by the operator or the cluster +administrator. This infinite retry loop gives the cluster +operator the opportunity to investigate the failure and potentially manually update the +reference values accepted for the cluster. This is similar to how Ignition +keeps retrying when an error occurs. + +The remote attestation flow is demonstrated in this presentation: + +* + +* + +### Second boot + +On second boot, the initrd opens the LUKS device. The LUKS device header stores +the configuration needed for the Clevis Trustee Pin to perform the request to +the Trustee servers. The response to this request is the secret needed to unlock +the LUKS device and resume booting. 
+ +### Confidential Cluster Operator + +The Confidential Cluster Operator provides two services: + +* A registration service which provides individualized Ignition configs to each + node on first boot. + +* A Trustee instance which stores secrets (LUKS root keys). + +For each new machine registering with the service, the operator creates a custom resource that +includes a uniquely generated UUID. This UUID is given back to the new node. The +operator watches for new Machine custom resources, sets up attestation and resource +policy in the Trustee instance, and generates random secret values to be used as +LUKS root keys. + +For more details about this flow, see: + + +### Cluster installation + +As part of the cluster installation process in cloud platforms, a bootstrap node +is created, which hosts a temporary control plane used to create the final +control plane and worker nodes of the cluster. + +In phase 1, the Confidential Cluster Operator is deployed on this bootstrap +node, which is considered trusted and is used to bootstrap the trust for the +rest of the cluster. + +In phase 2, the bootstrap node itself must be attested to establish trust. It is +thus required to set up an external Trustee instance (outside of the cluster as +it does not exist yet) that is accessible from the bootstrap node to attest +itself. In the future, key material should be fetched from this external Trustee +server instead of being passed to the bootstrap node directly. + +Once the Confidential Cluster Operator is running on the bootstrap node, the +rest of the cluster is bootstrapped using the flow described above. + +### Cluster update & downgrade + +The Confidential Cluster Operator watches for changes in the desired OpenShift +release payload. When a new update is selected, the Confidential Cluster +Operator gets the URL/sha256 that points to the new container image (of RHCOS) +that is part of the desired release payload. + +It then computes the expected PCR values for this bootable container image. 
It +can either read a specific LABEL from the container image where those values +have been pre-computed and stored, or pull the container image itself and +directly compute the values. + +The PCR pre-calculation flow is demonstrated in this presentation: + +* + +* + +Once the new expected values have been computed, the operator updates the +reference values configured in the Trustee instance for the cluster. + +Initially, we will never remove previous reference values. Thus downgrading the +version of a node will not be an issue. In the future, reference values from +older versions of the cluster will progressively be garbage collected, to +prevent downgrade attacks. + +## Risks and Mitigations + +* **Performance Overhead**: The memory and disk encryption used for CVMs can + introduce a slight performance overhead. This will be mitigated by providing + clear guidance on performance expectations for confidential workloads. + +* **Cost**: CVMs require support for features only present in newer, more + powerful CPUs, which can lead to slightly higher costs. This is a trade-off + for enhanced security that users must accept. + +* **Attestation Complexity**: To be useful and offer real security guarantees, + the remote attestation process must be as precise as possible (i.e. we must + measure and verify as many elements as possible from the boot chain). The more + elements measured, the more complex the implementation. If any part of the + verification fails during remote attestation, the node must not be able to + boot, otherwise it would compromise the integrity of the cluster. Any mistake + thus significantly impacts the availability of the cluster. + +While the remote attestation process is complex, the role of the operator is to +manage that complexity in order to free cluster administrators from having to +manually handle setting and updating reference values. + +* **Cloud Provider Dependency**: This feature relies on underlying cloud + provider CVM capabilities. 
The design aims for portability where possible but + will initially target specific cloud environments with mature CVM offerings. + +* **Debugging Challenges**: Debugging on CVMs can be difficult as some + traditional methods may fail. For example, if attestation fails in the + initramfs phase, gathering logs may be challenging. This can be mitigated by + providing a way to enable debugging on a CVM that is not part of the + cluster. We are also implementing KubeVirt support as a target for development + clusters in order to let OpenShift developers reproduce potential issues + locally without having to deploy an entire cluster in a cloud environment. + +* **Trustee / Confidential Cluster Operator availability**: If either Trustee + or the Operator is unavailable, nodes will not be able to boot. This can be + mitigated by using an external (out of cluster) Trustee instance, especially + for scenarios where the cluster is expected to be shut down completely. + +* **Security review**: This enhancement will need careful security review. We + are working on a more detailed threat model, which will be submitted at a + later stage. + +## Drawbacks + +This enhancement introduces a lot of complexity, notably for the first boot and for +updates. While we will try to hide this complexity from the cluster +administrators as much as possible, bugs can always happen and debugging will be +harder. + +## Alternatives (Not Implemented) + +We chose to host the Operator in the cluster in order to be able to implement +the entire PCR pre-calculation logic and include it in the operator instead of +letting users compute and manage those themselves. This should make for a better +user experience. The alternative is to instead host the Trustee instance outside +of the cluster and have the Operator be a different component outside of the +cluster. This would require users to manage reference values on cluster updates +and node creations. 
+ +## Open Questions [optional] + +To be updated with incoming questions. + +## Test Plan + +We will need E2E tests on all supported cloud platforms. + +While we don’t want to support that for production, it should also be possible +to test adding confidential nodes to a non-confidential cluster where the +Confidential Cluster Operator would be running, making testing easier. + +We plan to support running on KubeVirt (CNV), at least for development and +testing, using TPM only (i.e. no Confidential Computing) remote attestation +checks. Remote attestation support will also be tested independently as part of +FCOS/RHCOS and general Image Mode / bootc testing. + +## Graduation Criteria + +**Note:** *Section not required until targeted at a release.* + +(Section not yet filled out) + +## Upgrade / Downgrade Strategy + +This component will (in the end) be part of the core OpenShift payload and +updated alongside the rest of the cluster. + +## Version Skew Strategy + +The protocol used by the nodes' attestation client and the Trustee server must +match. This means that we may have to keep multiple versions of the Trustee +instance running in parallel until the boot image is updated in the cluster. + +## Operational Aspects of API Extensions + +* If the Confidential Cluster Operator is not available, new nodes will fail to + boot and join the cluster. +* If the Confidential Cluster Operator is not available, the policy and + reference values will not be updated in the cluster and updates cannot be + performed. Manual updates will be required. +* If the Trustee instance is not available, nodes will fail to attest themselves + on boot. Configuring a backup Trustee instance mitigates this. + +## Support Procedures + +(Section not yet filled out) + +## Infrastructure Needed [optional] + +(Section not yet filled out)