---
title: kms-encryption-foundations
authors:
- "@ardaguclu"
reviewers:
- "@flavianmissi"
- "@ibihim"
approvers:
- "@benluddy"
api-approvers:
- "@JoelSpeed"
creation-date: 2025-12-03
last-updated: 2026-01-08
tracking-link:
- "https://issues.redhat.com/browse/OCPSTRAT-108"
see-also:
- "enhancements/kube-apiserver/encrypting-data-at-datastore-layer.md"
- "enhancements/etcd/storage-migration-for-etcd-encryption.md"
- "[encrypt data at rest with KMS](https://github.com/openshift/enhancements/pull/1872)"
replaces:
- "[KMS Encryption Provider for Etcd Secrets](https://github.com/openshift/enhancements/pull/1682/)"
---

# KMS Encryption Foundations

## Summary

Extend OpenShift encryption controllers to support external Key Management Services (KMS v2) alongside existing local encryption modes (aescbc, aesgcm).
This allows encryption keys to be stored and managed outside the cluster for enhanced security.

This enhancement:
- Uses existing `config.openshift.io/v1/APIServer` resource `encryption.type` field to enable KMS mode
- Extends encryption controllers in `openshift/library-go` to support KMS as a new encryption mode
- Maintains feature parity with existing encryption modes (migration, monitoring, key rotation)
- Provider-agnostic implementation supporting any KMS v2-compatible plugin

## Motivation

OpenShift currently manages AES keys locally for encrypting data at rest in etcd.
KMS support enables integration with external key management systems where encryption keys are stored outside the cluster, protecting against attacks where control plane nodes are compromised.

### Goals

- Support KMS v2 as a new encryption mode in existing encryption controllers
- Seamless migration between encryption modes (aescbc ↔ KMS, KMS ↔ KMS)
- Provider-agnostic implementation with minimal provider-specific code
- Feature parity with existing modes (monitoring, migration, key rotation)

### Non-Goals

- Implementing KMS plugins (provided by upstream Kubernetes/vendors)
- KMS plugin deployment/lifecycle management
- KMS plugin health checks (Tech Preview v2)
- Recovery from KMS key loss (separate EP for GA)
- Automatic `key_id` rotation detection (Tech Preview v2)

## Proposal

Extend the existing encryption controller framework in `openshift/library-go` to support KMS encryption in two phases:

**Tech Preview v1 (External Plugin Management):**

Users deploy KMS plugins manually on all control plane nodes as static pods or systemd units at a predefined socket path (`unix:///var/run/kmsplugin/kms.sock`).
Encryption controllers reference the static endpoint in the generated EncryptionConfiguration. KMS-to-KMS migrations are not supported in Tech Preview v1, since only one plugin can listen at the static socket path at a time.

**Tech Preview v2 (Managed Plugin Lifecycle):**

Users specify plugin-specific configuration for managed KMS provider types (e.g. Vault).
From the encryption controllers' perspective, the core logic remains the same; only the tracked fields change.

**Key changes in library-go:**
1. Add a KMS mode constant to the encryption state types
2. Track the KMS configuration in encryption key secrets (see the sketch after this list)
3. Manage encryption key secrets with KMS configuration (actual keys are stored externally in KMS provider)
4. Detect configuration changes to trigger migration
5. Reuse existing migration controller (no changes needed)
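
A minimal Go sketch of items 1 and 2, assuming hypothetical names (`KMS`, `KMSConfig`, `KeyState`); the actual types live in library-go's encryption packages and may be shaped differently:

```go
// Hypothetical additions to library-go's encryption state types.
// Names and shapes are illustrative only; the final API may differ.
package state

// Mode mirrors the existing encryption mode type in library-go.
type Mode string

const (
	AESCBC   Mode = "aescbc"
	AESGCM   Mode = "aesgcm"
	Identity Mode = "identity"
	// KMS is the new mode proposed by this enhancement. Unlike the local
	// modes, no key material is generated in-cluster for it.
	KMS Mode = "kms"
)

// KMSConfig is the configuration tracked per encryption key secret when the
// mode is KMS. In Tech Preview v1 it only carries the static socket endpoint;
// Tech Preview v2 is expected to add key_id and plugin-specific settings.
type KMSConfig struct {
	Endpoint string `json:"endpoint"`        // e.g. unix:///var/run/kmsplugin/kms.sock
	KeyID    string `json:"keyID,omitempty"` // Tech Preview v2: reported by the plugin Status call
}

// KeyState captures what the controllers track for one encryption key secret.
type KeyState struct {
	Mode Mode
	// Key holds local AES key material for aescbc/aesgcm; it stays empty for KMS.
	Key []byte
	// KMS holds the externally managed configuration for the KMS mode.
	KMS *KMSConfig
}
```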

**Additional Tech Preview v2 capabilities:**
- Poll KMS plugin Status endpoint for health checks and `key_id` changes to detect external key rotation

### Workflow Description

#### Actors in the Workflow

**cluster admin** is a human user responsible for configuring and maintaining the cluster.

**KMS** is the external Key Management Service that stores and manages the Key Encryption Key (KEK).

**KMS plugin** is a gRPC service implementing the Kubernetes KMS v2 API, running as a static pod or systemd unit on each control plane node. It communicates with the external KMS to encrypt and decrypt data encryption keys (DEKs); a simplified sketch of this contract follows the actor list.

**API server operator** is the OpenShift operator (kube-apiserver-operator, openshift-apiserver-operator, or authentication-operator) managing API server deployments.
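
The KMS v2 contract the plugin implements boils down to three gRPC calls: Status, Encrypt, and Decrypt. The Go interface below is a simplified, non-authoritative paraphrase of that contract (the generated types live upstream, e.g. under `k8s.io/kms`); method and field names here are illustrative.

```go
package kmsv2sketch

import "context"

// StatusResult paraphrases the KMS v2 Status response: plugin version,
// health string ("ok" when healthy), and the currently active key identifier.
type StatusResult struct {
	Version string
	Healthz string
	KeyID   string
}

// EncryptResult paraphrases the KMS v2 Encrypt response.
type EncryptResult struct {
	Ciphertext []byte
	KeyID      string
	// Annotations carry provider-specific metadata stored alongside the ciphertext.
	Annotations map[string][]byte
}

// Plugin is a simplified view of the KMS v2 gRPC service the kube-apiserver
// talks to over the unix socket; the authoritative definition is the upstream
// KMS v2 protobuf contract.
type Plugin interface {
	// Status reports health and the currently active key_id.
	Status(ctx context.Context) (*StatusResult, error)
	// Encrypt wraps a data encryption key (DEK) with the externally managed KEK.
	Encrypt(ctx context.Context, uid string, plaintext []byte) (*EncryptResult, error)
	// Decrypt unwraps a previously wrapped DEK.
	Decrypt(ctx context.Context, uid string, ciphertext []byte, keyID string, annotations map[string][]byte) ([]byte, error)
}
```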

#### Encryption Controllers

**keyController** manages encryption key lifecycle. Creates encryption key secrets in `openshift-config-managed` namespace. For KMS mode, creates secrets storing KMS configuration.

**stateController** generates EncryptionConfiguration for API server consumption. Implements a distributed state machine ensuring all API servers converge to the same revision.
For KMS mode, generates EncryptionConfiguration using the KMS configuration.

**migrationController** orchestrates resource re-encryption. Marks resources as migrated after rewriting in etcd. Works with all encryption modes including KMS.

**pruneController** prunes inactive encryption key secrets. Maintains N keys (currently 10) for rollback scenarios.

**conditionController** determines when controllers should act. Provides status conditions (`EncryptionInProgress`, `EncryptionCompleted`, `EncryptionDegraded`).

#### Steps for Enabling KMS Encryption (Tech Preview v1)

1. Cluster admin deploys the KMS plugin on all control plane nodes (listening at `unix:///var/run/kmsplugin/kms.sock`) as a static pod or systemd unit, and updates the APIServer resource to enable KMS encryption.
To give the apiservers access to the KMS plugin, the `/var/run/kmsplugin` directory is mounted as a hostPath volume in all the apiserver pods (see the sketch after the example below).
```yaml
apiVersion: config.openshift.io/v1
kind: APIServer
spec:
  encryption:
    type: KMS
```
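
A minimal sketch of how the hostPath mount mentioned in step 1 could be wired into an apiserver pod, using core/v1 types; the volume name is a hypothetical placeholder and the actual wiring is rendered by the operators' pod templates.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// kmsSocketMount returns a hostPath volume and matching mount that expose
// /var/run/kmsplugin (where the admin-deployed plugin listens) inside an
// API server container. "kms-plugin-socket" is an illustrative name.
func kmsSocketMount() (corev1.Volume, corev1.VolumeMount) {
	hostPathType := corev1.HostPathDirectory
	vol := corev1.Volume{
		Name: "kms-plugin-socket",
		VolumeSource: corev1.VolumeSource{
			HostPath: &corev1.HostPathVolumeSource{
				Path: "/var/run/kmsplugin",
				Type: &hostPathType,
			},
		},
	}
	mount := corev1.VolumeMount{
		Name:      "kms-plugin-socket",
		MountPath: "/var/run/kmsplugin",
	}
	return vol, mount
}

func main() {
	vol, mount := kmsSocketMount()
	fmt.Printf("volume: %s -> %s\nmount: %s\n", vol.Name, vol.HostPath.Path, mount.MountPath)
}
```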

2. keyController detects the new encryption mode.

3. keyController creates encryption key secret with KMS configuration:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: openshift-kube-apiserver-encryption-1
  namespace: openshift-config-managed
  annotations:
    encryption.apiserver.operator.openshift.io/mode: "kms"
data:
  encryption.apiserver.operator.openshift.io-key: "<base64-encoded-kms-config>"
# Contains base64-encoded structured data with KMS configuration:
# - Tech Preview v1: Static endpoint path (unix:///var/run/kmsplugin/kms.sock)
# - Tech Preview v2: Will also include key_id and other plugin-specific configuration for other kms provider types
```
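
To make the `<base64-encoded-kms-config>` value above concrete, the sketch below shows one plausible shape for that payload and how a controller could round-trip it through the secret's `Data` field. The struct and its fields are hypothetical, not a committed format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kmsKeyData is a hypothetical shape for the structured data stored under the
// encryption.apiserver.operator.openshift.io-key entry for KMS-mode keys.
type kmsKeyData struct {
	// Tech Preview v1: only the static plugin endpoint is tracked.
	Endpoint string `json:"endpoint"`
	// Tech Preview v2: the key_id reported by the plugin would be added here.
	KeyID string `json:"keyID,omitempty"`
}

func main() {
	payload := kmsKeyData{Endpoint: "unix:///var/run/kmsplugin/kms.sock"}

	// Secret.Data values are raw bytes; the API machinery base64-encodes them
	// on the wire, so the controller only marshals the structure.
	raw, err := json.Marshal(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw)) // {"endpoint":"unix:///var/run/kmsplugin/kms.sock"}

	// Reading the secret back, the controller unmarshals and compares the
	// tracked configuration to decide whether a new key secret (and hence a
	// migration) is needed.
	var observed kmsKeyData
	if err := json.Unmarshal(raw, &observed); err != nil {
		panic(err)
	}
	fmt.Println(observed.Endpoint)
}
```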

4. stateController generates EncryptionConfiguration using the endpoint:
```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: [configmaps]
    providers:
      - kms:
          name: configmap-1
          endpoint: unix:///var/run/kmsplugin/kms.sock
          apiVersion: v2
```

5. migrationController detects the new key secret and initiates re-encryption (no code changes; it works with any mode). Re-encryption amounts to rewriting the affected resources in etcd, as sketched below, after the note.

6. conditionController updates status conditions: `EncryptionInProgress`, then `EncryptionCompleted`.

**Note:** Automatic weekly key rotation (used for aescbc/aesgcm) is disabled for KMS since rotation is triggered externally.
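
Conceptually, the re-encryption in step 5 rewrites every affected resource so that it is stored again under the current write provider. The dynamic-client loop below is an illustrative reduction of that idea; the real migrationController in library-go adds pagination, rate limiting, retries, and progress tracking, and `reencrypt` is a hypothetical name.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// reencrypt forces re-encryption of all objects of one resource by issuing
// unchanged updates: the storage layer rewrites an object whose stored form
// is stale (encrypted with an old provider/key) even if its content did not
// change, so it ends up encrypted with the first (write) provider of the
// current EncryptionConfiguration.
func reencrypt(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource) error {
	list, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range list.Items {
		obj := &list.Items[i]
		// Conflict handling and retries are omitted in this sketch.
		if _, err := client.Resource(gvr).Namespace(obj.GetNamespace()).Update(ctx, obj, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Wiring of the dynamic client from a kubeconfig or in-cluster config is
	// omitted; secrets and configmaps are the resources encrypted by default.
	_ = reencrypt
	_ = schema.GroupVersionResource{Version: "v1", Resource: "secrets"}
}
```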

#### Variation: KMS Key Rotation (Tech Preview v2)

When external KMS rotates the key internally:

1. keyController polls KMS plugin Status endpoint for `key_id`.
2. Compares `key_id` with `key_id` stored in secret `Data` field.
3. If `key_id` differs:
- Creates new encryption key secret with new `key_id`
- migrationController automatically triggers re-encryption
4. If `key_id` matches: No action.

> **Note:** API server operators are not privileged and cannot directly communicate with KMS plugins running as static pods on control plane nodes.
> Tech Preview v2 will require introducing a mechanism to poll KMS plugin Status endpoints for `key_id` changes and health monitoring, and expose this information to the operators.

**Two change detection mechanisms:**
- Tracking the KMS configuration detects admin-driven configuration changes
- Tracking the `key_id` detects external key rotation (sketched below)
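
A sketch of the detection logic, assuming Tech Preview v2 grows some mechanism (left open by the note above) that surfaces the plugin's Status response to the operator; `needsNewKeySecret`, `pluginStatus`, and the field names are hypothetical.

```go
package main

import "fmt"

// pluginStatus is the information assumed to be surfaced from the KMS
// plugin's Status endpoint to the operator.
type pluginStatus struct {
	Healthz string
	KeyID   string
}

// needsNewKeySecret compares the key_id tracked in the newest encryption key
// secret with the key_id reported by the plugin. A mismatch means the external
// KEK was rotated, so a new key secret must be created, which in turn triggers
// the migration controller.
func needsNewKeySecret(trackedKeyID string, status pluginStatus) bool {
	if status.Healthz != "ok" {
		// Unhealthy plugin: do nothing rather than churn key secrets.
		return false
	}
	return status.KeyID != trackedKeyID
}

func main() {
	fmt.Println(needsNewKeySecret("key-1", pluginStatus{Healthz: "ok", KeyID: "key-2"})) // true  -> rotate
	fmt.Println(needsNewKeySecret("key-1", pluginStatus{Healthz: "ok", KeyID: "key-1"})) // false -> no action
}
```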

#### Variation: Migration Between Encryption Modes

**From aescbc to KMS:**
1. Admin deploys KMS plugin and updates APIServer: `type: KMS` with KMS configuration.
2. keyController creates a KMS key secret (no local key material; the secret carries the tracked KMS configuration and the `kms` mode annotation).
3. migrationController re-encrypts resources using external KMS.

**From KMS to aescbc:**
1. Admin updates APIServer: `type: aescbc`.
2. keyController creates aescbc secret (with actual key material).
3. migrationController re-encrypts resources using local AES key.

The migration controller reuses the existing logic; no changes are required. In both directions the EncryptionConfiguration lists the new provider first (used for writes) and keeps the previous provider as a read provider until re-encryption finishes, as sketched below.
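
A sketch of the intermediate configuration for an aescbc → KMS migration, built with the upstream encryption configuration Go types. The import path reflects the current upstream location (older Kubernetes releases use `k8s.io/apiserver/pkg/apis/config/v1`), and the provider names and key values are placeholders.

```go
package main

import (
	"fmt"

	apiserverv1 "k8s.io/apiserver/pkg/apis/apiserver/v1"
)

func main() {
	// Mid-migration: KMS is listed first, so all new writes go through the
	// external KMS; the old aescbc key stays listed so existing data remains
	// readable until the migration controller has rewritten everything.
	cfg := apiserverv1.EncryptionConfiguration{
		Resources: []apiserverv1.ResourceConfiguration{
			{
				Resources: []string{"secrets", "configmaps"},
				Providers: []apiserverv1.ProviderConfiguration{
					{
						KMS: &apiserverv1.KMSConfiguration{
							APIVersion: "v2",
							Name:       "kms-1", // placeholder provider name
							Endpoint:   "unix:///var/run/kmsplugin/kms.sock",
						},
					},
					{
						AESCBC: &apiserverv1.AESConfiguration{
							Keys: []apiserverv1.Key{
								{Name: "1", Secret: "<base64-encoded-32-byte-key>"},
							},
						},
					},
					// identity is typically kept as a read provider for data
					// written before encryption was enabled.
					{Identity: &apiserverv1.IdentityConfiguration{}},
				},
			},
		},
	}
	fmt.Printf("write provider: %+v\n", cfg.Resources[0].Providers[0])
}
```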

### User Stories

- As a cluster admin, I want to enable KMS encryption by updating the APIServer resource, so I can declaratively configure encryption without manually managing keys.
- As a cluster admin, I want the same migration and monitoring experience for KMS as local encryption, so I don't need to learn new procedures.
- As a security admin, I want encryption keys stored outside the cluster, so compromised control plane nodes cannot access keys.

### API Extensions

**APIServer Resource** ([config.openshift.io/v1](https://github.com/openshift/api/blob/master/config/v1/types_kmsencryption.go)):

**Current Behavior:**

The `encryption.type` field already supports the `KMS` value ([EncryptionType](https://github.com/openshift/api/blob/master/config/v1/types_apiserver.go#L214)), and the `KMSConfig` struct exists in the API.
These fields are gated by the `KMSEncryptionProvider` feature gate (DevPreviewNoUpgrade, TechPreviewNoUpgrade).
However, the encryption controllers do not implement KMS support today: enabling the `KMSEncryptionProvider` feature gate and setting `type: KMS` has no effect, because the controllers ignore it and no encryption occurs.

**Tech Preview v1**

For Tech Preview v1, no new API fields are added to the APIServer resource.
Users simply set `encryption.type: KMS` ([EncryptionType](https://github.com/openshift/api/blob/6fb7fdae95fd20a36809d502cfc0e0459550d527/config/v1/types_apiserver.go#L214))
and deploy KMS plugins at the hardcoded endpoint `unix:///var/run/kmsplugin/kms.sock`. The existing `KMSConfig` struct will not be used.

### Topology Considerations

#### Hypershift / Hosted Control Planes

Hypershift has a parallel implementation that supports AESCBC and KMS without using the encryption controllers in library-go.
Unifying the two implementations is out of scope for this enhancement.

#### Standalone Clusters

This enhancement applies to standalone clusters.
The controllers run in the cluster-kube-apiserver-operator, cluster-openshift-apiserver-operator, and cluster-authentication-operator.

#### Single-node Deployments or MicroShift

Resource consumption impact is minimal - the controllers already exist and are extended with KMS-specific logic.
Single-node deployments will see slightly increased CPU usage during key rotation detection (gRPC Status calls), but this is negligible.

MicroShift may adopt this enhancement if KMS encryption is desired, but the configuration mechanism may differ (file-based vs API resource).

#### OpenShift Kubernetes Engine

This feature does not depend on the features that are excluded from the OKE product offering.

### Implementation Details/Notes/Constraints

### Risks and Mitigations

**Risk: KMS Plugin Unavailable During Controller Sync**
- **Impact:** Controllers cannot detect key rotation
- **Mitigation:** None in Tech Preview v1. Tech Preview v2 will add health checks and surface failures to the cluster admin through operator Degraded conditions

**Risk: etcd Backup Restoration Without KMS Key Access**
- **Impact:** Cannot decrypt data if KMS key deleted/unavailable/expired
- **Mitigation:** No mitigation in Tech Preview. Document KMS key retention requirements.

### Drawbacks

- Adds complexity to encryption controllers for KMS-specific logic
- Dependency on KMS plugin health for controller operations (health checks in Tech Preview v2)

## Test Plan

**Unit Tests**:
- `key_controller_test.go`: KMS key creation, rotation detection, endpoint changes (a skeletal example follows the test lists below)
- `migration_controller_test.go`: KMS migration scenarios
- `state_controller_test.go`: KMS state changes

**Integration Tests**:
- State transitions in encryption controllers in library-go
- Explore MOM framework for integration tests in apiserver operators (add tests if it makes sense)

**E2E Tests** (v1):
- Migration between identity ↔ KMS

**E2E Tests** (v2):
- Full cluster with KMS encryption enabled
- Migration between encryption modes (aescbc → KMS, KMS → KMS)
- Verify data re-encryption completes
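
As an illustration of the rotation-detection unit test named above, a table-driven skeleton against the hypothetical `needsNewKeySecret` helper sketched in the key-rotation section (it assumes the same package as that sketch); the real tests will target the actual library-go controller functions.

```go
package main

import "testing"

// TestNeedsNewKeySecret exercises the hypothetical rotation-detection helper
// from the earlier sketch; the real unit tests will run against the
// library-go keyController implementation instead.
func TestNeedsNewKeySecret(t *testing.T) {
	cases := []struct {
		name       string
		trackedID  string
		status     pluginStatus
		wantNewKey bool
	}{
		{"key_id unchanged", "key-1", pluginStatus{Healthz: "ok", KeyID: "key-1"}, false},
		{"key_id rotated", "key-1", pluginStatus{Healthz: "ok", KeyID: "key-2"}, true},
		{"plugin unhealthy", "key-1", pluginStatus{Healthz: "unhealthy", KeyID: "key-2"}, false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := needsNewKeySecret(tc.trackedID, tc.status); got != tc.wantNewKey {
				t.Errorf("needsNewKeySecret() = %v, want %v", got, tc.wantNewKey)
			}
		})
	}
}
```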

## Graduation Criteria

### Dev Preview -> Tech Preview

None

### Tech Preview -> GA

- Dynamic `key_id` fetching via KMS plugin Status endpoint
- Full support for key rotation, with automated data re-encryption
- Migration support between different KMS providers, with automated data re-encryption
- Health check preconditions (block operations when plugin unhealthy)
- Comprehensive integration and E2E test coverage
- Production validation in multiple environments

### Removing a deprecated feature

N/A

## Upgrade / Downgrade Strategy

**Upgrade:**

This feature is gated by the `KMSEncryptionProvider` feature gate (DevPreviewNoUpgrade, TechPreviewNoUpgrade feature sets); upgrades are not permitted on clusters with these feature sets enabled.

**Downgrade:**

When KMS encryption is enabled and actively used, downgrade is not supported if the previous version lacks KMS support. The API server requires access to encryption keys to decrypt resources stored in etcd.

To downgrade:
1. Migrate from KMS to a supported encryption mode (aescbc, aesgcm, or identity)
2. Wait for migration to complete
3. Proceed with cluster downgrade

## Version Skew Strategy

Encryption controllers run in operator pods (not nodes). Version skew concerns:
- **kube-apiserver:** Must support KMS v2 API (Kubernetes 1.27+)
- **library-go:** Each operator must vendor a library-go version that includes the KMS encryption support
- **KMS plugin:** No version skew concerns; plugins communicate with the apiservers via the standardized KMS v2 API contract, so compatibility does not depend on the plugin version

No special handling required.

## Operational Aspects of API Extensions

**Monitoring:**
- Operator conditions: `EncryptionControllerDegraded`, `EncryptionMigrationControllerProgressing`, `KMSPluginDegraded`
- Metrics: `apiserver_storage_transformation_operations_total`, `apiserver_storage_transformation_duration_seconds`

**Impact:**
- API latency: KMS call required, mitigated by DEK caching
- API throughput: minor reduction under normal conditions

### Failure Modes

**KMS Plugin Unavailable:**
- New resource creation fails
- Existing resources readable (if DEKs remain cached in API server memory; cache clears on restart)
- Detection: `KMSPluginDegraded=True`
- Recovery: Plugin restart (automatic or manual)

**Invalid KMS Configuration:**
- Plugin fails to start
- Detection: Plugin container crash loops
- Recovery: Fix APIServer configuration

**Key Rotation Stuck:**
- Migration unable to complete
- Detection: `EncryptionMigrationControllerProgressing=True` for extended period
- Recovery: Check migration controller logs, verify KMS health

## Support Procedures

### Detecting KMS Rotation Issues
```bash
# Check encryption key secrets
oc get secrets -n openshift-config-managed -l encryption.apiserver.operator.openshift.io/component=encryption-key

# Check controller logs
oc logs -n openshift-kube-apiserver-operator deployment/kube-apiserver-operator | grep -i kms
```

### Disabling KMS Encryption

1. Update the APIServer resource: `spec.encryption.type: "aescbc"`
2. Wait for the migration to complete
3. Remove the KMS plugin static pods or systemd units from the control plane nodes (the plugin is user-managed in Tech Preview v1, so the operators do not remove it)

### etcd Backup and Restore
- Before backup: Document KMS configuration, verify key availability
- Before restore: Verify KMS key accessible, credentials valid
- Critical: Deleting KMS key makes backups unrestorable

## Alternatives (Not Implemented)

### Alternative: Separate KMS-Specific Controllers

Instead of extending existing controllers, create new KMS-only controllers.

**Why not chosen:**
- Code duplication (migration logic, state management)
- User confusion (different controllers for different encryption types)
- More operational burden (additional monitoring, alerts)


## Infrastructure Needed

None - extends existing library-go code.