169 changes: 169 additions & 0 deletions docs/proposals/20240807-in-place-updates-implementation-notes.md
@@ -86,3 +86,172 @@ sequenceDiagram
MS2 (NewMS)-->>MS Controller: Yes, M1!
MS Controller->>M1: Remove annotation ".../pending-acknowledge-move": ""
```

## Notes about in-place update implementation for KubeadmControlPlane

- In-place updates respect the existing control plane update strategy:
  - KCP controller uses `rollingUpdate` strategy with `maxSurge` (0 or 1)
  - When `maxSurge` is 0, no new machines are created during rollout; updates are performed only on existing machines via in-place updates or by scaling down outdated machines
  - When `maxSurge` is 1:
    - The controller first scales up by creating one new machine to maximize fault tolerance
    - Once `maxReplicas` (desiredReplicas + 1) is reached, it evaluates whether to in-place update or scale down old machines
    - For each old machine needing rollout, the controller evaluates if it is eligible for in-place update. If so, it performs the in-place update on that machine. Otherwise, it scales down the outdated machine (which will be replaced by a new one in the next reconciliation cycle)
    - This pattern repeats until all machines are up-to-date; the controller then scales back to the desired replica count

- The implementation respects the existing set of responsibilities:
  - KCP controller manages control plane Machines directly
  - KCP controller enforces `maxSurge` limits during rolling updates
  - KCP controller decides when to scale up, scale down, or perform in-place updates
  - KCP controller runs preflight checks to ensure the control plane is stable before in-place updates
  - KCP controller calls the `CanUpdateMachine` hook to verify if extensions can handle the changes
  - When in-place update is possible, the KCP controller triggers the update by reconciling the desired state

- The in-place update decision flow:
Review comment (Member):

Suggested change
- The in-place update decision flow:
- The in-place update decision flow definition is:

Or something along those lines; there is a missing verb or adjective.

  - If `currentReplicas < maxReplicas` (desiredReplicas + maxSurge), scale up first to maximize fault tolerance
  - If `currentReplicas >= maxReplicas`, select a machine needing rollout and evaluate options:
    - Check if selected Machine is eligible for in-place update (determined by `UpToDate` function)
    - Check if we already have enough up-to-date replicas (if `currentUpToDateReplicas >= desiredReplicas`, skip in-place and scale down)
    - Run preflight checks to ensure control plane stability
    - Call the `CanUpdateMachine` hook on registered runtime extensions
    - If all checks pass, trigger in-place update. Otherwise, fall back to scale down/recreate
  - This flow repeats on each reconciliation until all machines are up-to-date

- Orchestration of in-place updates uses two key annotations:
  - `in-place-updates.internal.cluster.x-k8s.io/update-in-progress` - Marks a Machine as undergoing in-place update
  - `runtime.cluster.x-k8s.io/pending-hooks` - Tracks pending `UpdateMachine` runtime hooks
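
The decision flow listed above can be sketched as a single pure function. This is a minimal illustration rather than the KCP controller code: the `RolloutState` struct, its field names, and the `Decide` function are hypothetical, and the three boolean inputs stand in for the eligibility check, the preflight checks, and the `CanUpdateMachine` hook result.

```go
package main

import "fmt"

// Action is the outcome of one pass through the decision flow.
type Action string

const (
	ScaleUp       Action = "ScaleUp"
	InPlaceUpdate Action = "InPlaceUpdate"
	ScaleDown     Action = "ScaleDown"
)

// RolloutState is a hypothetical snapshot of the inputs to the decision.
// The field names are illustrative and not taken from the KCP code base.
type RolloutState struct {
	DesiredReplicas         int
	MaxSurge                int // 0 or 1 for KCP
	CurrentReplicas         int
	CurrentUpToDateReplicas int
	EligibleForInPlace      bool // the selected Machine is eligible for in-place update
	PreflightChecksPassed   bool // the control plane is stable
	ExtensionsCanUpdate     bool // outcome of the CanUpdateMachine hook
}

// Decide walks the checks in the order listed above and returns the first
// action that applies.
func Decide(s RolloutState) Action {
	maxReplicas := s.DesiredReplicas + s.MaxSurge
	if s.CurrentReplicas < maxReplicas {
		return ScaleUp // surge first to maximize fault tolerance
	}
	if !s.EligibleForInPlace {
		return ScaleDown
	}
	if s.CurrentUpToDateReplicas >= s.DesiredReplicas {
		return ScaleDown // enough up-to-date replicas already exist
	}
	if !s.PreflightChecksPassed || !s.ExtensionsCanUpdate {
		return ScaleDown // fall back to scale down/recreate
	}
	return InPlaceUpdate
}

func main() {
	s := RolloutState{DesiredReplicas: 3, MaxSurge: 1, CurrentReplicas: 3}
	fmt.Println(Decide(s)) // below maxReplicas, so the controller scales up

	s.CurrentReplicas = 4
	s.CurrentUpToDateReplicas = 1
	s.EligibleForInPlace, s.PreflightChecksPassed, s.ExtensionsCanUpdate = true, true, true
	fmt.Println(Decide(s)) // all checks pass, so the Machine is updated in place
}
```

With `MaxSurge: 0` the first branch never fires while the replica count matches the desired count, so the function goes straight to the in-place versus scale-down evaluation.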

The following diagrams provide an overview of the in-place update workflow for KCP.

Workflow #1: KCP controller determines that a Machine can be updated in-place and triggers the update.

```mermaid
sequenceDiagram
autonumber
participant KCP Controller
participant RX as Runtime Extension
participant M1 as Machine
participant IM1 as InfraMachine
participant KC1 as KubeadmConfig

KCP Controller->>KCP Controller: Select Machine for rollout
KCP Controller->>KCP Controller: Run preflight checks on control plane
KCP Controller->>RX: CanUpdateMachine(current, desired)?
RX-->>KCP Controller: Yes, with patches to indicate supported changes

KCP Controller->>M1: Set annotation "update-in-progress": ""
KCP Controller->>IM1: Apply desired InfraMachine spec<br/>Set annotation "update-in-progress": ""
KCP Controller->>KC1: Apply desired KubeadmConfig spec<br/>Set annotation "update-in-progress": ""
KCP Controller->>M1: Apply desired Machine spec<br/>Set annotation "pending-hooks": "UpdateMachine"
```
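
The write sequence at the end of the diagram above can be sketched with in-memory objects. This is only an illustration under simplifying assumptions: the `Object` type, the string-valued `Spec`, and the `triggerInPlaceUpdate` function are hypothetical, and real writes would go through the API server rather than mutate structs.

```go
package main

import "fmt"

const (
	updateInProgressAnnotation = "in-place-updates.internal.cluster.x-k8s.io/update-in-progress"
	pendingHooksAnnotation     = "runtime.cluster.x-k8s.io/pending-hooks"
)

// Object is a toy stand-in for a Machine, InfraMachine, or KubeadmConfig.
type Object struct {
	Spec        string
	Annotations map[string]string
}

// annotate sets an annotation, allocating the map on first use.
func (o *Object) annotate(key, value string) {
	if o.Annotations == nil {
		o.Annotations = map[string]string{}
	}
	o.Annotations[key] = value
}

// triggerInPlaceUpdate performs the writes in the order shown in the diagram.
func triggerInPlaceUpdate(machine, infraMachine, kubeadmConfig *Object, desiredMachine, desiredInfra, desiredConfig string) {
	// Mark the Machine as being updated before touching anything else.
	machine.annotate(updateInProgressAnnotation, "")

	// Apply the desired InfraMachine spec together with the marker annotation.
	infraMachine.Spec = desiredInfra
	infraMachine.annotate(updateInProgressAnnotation, "")

	// Apply the desired KubeadmConfig spec together with the marker annotation.
	kubeadmConfig.Spec = desiredConfig
	kubeadmConfig.annotate(updateInProgressAnnotation, "")

	// Finally apply the desired Machine spec and register the pending hook.
	machine.Spec = desiredMachine
	machine.annotate(pendingHooksAnnotation, "UpdateMachine")
}

func main() {
	m, im, kc := &Object{Spec: "old"}, &Object{Spec: "old"}, &Object{Spec: "old"}
	triggerInPlaceUpdate(m, im, kc, "new-machine", "new-infra", "new-config")
	fmt.Println(m.Spec, im.Spec, kc.Spec)
	fmt.Println(m.Annotations[pendingHooksAnnotation])
}
```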

Workflow #2: The Machine controller detects the pending `UpdateMachine` hook and calls the runtime extension to perform the update.

```mermaid
sequenceDiagram
autonumber
participant Machine Controller
participant RX as Runtime Extension
participant M1 as Machine
participant IM1 as InfraMachine
participant KC1 as KubeadmConfig

Machine Controller-->>M1: Has "update-in-progress" and "pending-hooks: UpdateMachine"?
M1-->>Machine Controller: Yes!

Machine Controller->>RX: UpdateMachine(desired state)
RX-->>Machine Controller: Status: InProgress, RetryAfterSeconds: 30

Note over Machine Controller: Wait and retry

Machine Controller->>RX: UpdateMachine(desired state)
RX-->>Machine Controller: Status: Done

Machine Controller->>IM1: Remove annotation "update-in-progress"
Machine Controller->>KC1: Remove annotation "update-in-progress"
Machine Controller->>M1: Remove annotation "update-in-progress"<br/>Remove "UpdateMachine" from "pending-hooks"
```
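
The call-and-retry loop in the diagram above can be sketched as one reconcile step that either finishes or asks to be requeued. The types below are simplified stand-ins that mirror the labels used in the diagram (`Status`, `RetryAfterSeconds`); they are assumptions for illustration and not the runtime hook API types, and `fakeExtension` is a hypothetical test double.

```go
package main

import (
	"fmt"
	"time"
)

// UpdateStatus mirrors the status values shown in the diagram.
type UpdateStatus string

const (
	StatusInProgress UpdateStatus = "InProgress"
	StatusDone       UpdateStatus = "Done"
)

// UpdateResponse is a simplified stand-in for the hook response.
type UpdateResponse struct {
	Status            UpdateStatus
	RetryAfterSeconds int
}

// UpdateMachineFunc is a hypothetical signature for calling the extension.
type UpdateMachineFunc func(desiredState string) (UpdateResponse, error)

// Result tells the caller whether the update finished or when to retry.
type Result struct {
	Done         bool
	RequeueAfter time.Duration
}

// reconcileUpdate calls the extension once and translates its answer.
func reconcileUpdate(call UpdateMachineFunc, desiredState string) (Result, error) {
	resp, err := call(desiredState)
	if err != nil {
		return Result{}, err
	}
	if resp.Status == StatusDone {
		return Result{Done: true}, nil
	}
	return Result{RequeueAfter: time.Duration(resp.RetryAfterSeconds) * time.Second}, nil
}

// fakeExtension reports InProgress for the first n calls and Done afterwards.
func fakeExtension(n int) UpdateMachineFunc {
	calls := 0
	return func(string) (UpdateResponse, error) {
		calls++
		if calls <= n {
			return UpdateResponse{Status: StatusInProgress, RetryAfterSeconds: 30}, nil
		}
		return UpdateResponse{Status: StatusDone}, nil
	}
}

func main() {
	ext := fakeExtension(1)
	for {
		res, err := reconcileUpdate(ext, "desired state")
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		if res.Done {
			fmt.Println("update done: remove annotations and pending hook")
			return
		}
		// A real controller would requeue instead of looping immediately.
		fmt.Println("update in progress, requeue after", res.RequeueAfter)
	}
}
```

Only once the step reports completion would the annotations be removed from the InfraMachine, KubeadmConfig, and Machine, as in the last three messages of the diagram.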

Workflow #3: The KCP controller waits for the in-place update to complete before proceeding with further operations.

```mermaid
sequenceDiagram
autonumber
participant KCP Controller
participant M1 as Machine

KCP Controller-->>M1: Is in-place update in progress?
M1-->>KCP Controller: Yes! ("update-in-progress" or "pending-hooks: UpdateMachine")

Note over KCP Controller: Wait for update to complete<br/>Requeue on Machine changes

KCP Controller-->>M1: Is in-place update in progress?
M1-->>KCP Controller: No! (annotations removed)

Note over KCP Controller: Continue with next Machine rollout or other operations
```
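
The "is in-place update in progress?" check from the diagram above can be sketched as a predicate over a Machine's annotations. The function name is hypothetical, and treating the `pending-hooks` value as a comma-separated list of hook names is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	updateInProgressAnnotation = "in-place-updates.internal.cluster.x-k8s.io/update-in-progress"
	pendingHooksAnnotation     = "runtime.cluster.x-k8s.io/pending-hooks"
	updateMachineHook          = "UpdateMachine"
)

// isUpdateInProgress reports whether either marker is still present:
// the update-in-progress annotation, or UpdateMachine among the pending hooks.
func isUpdateInProgress(annotations map[string]string) bool {
	if _, ok := annotations[updateInProgressAnnotation]; ok {
		return true
	}
	// Assumption: pending hooks are stored as a comma-separated list.
	for _, hook := range strings.Split(annotations[pendingHooksAnnotation], ",") {
		if strings.TrimSpace(hook) == updateMachineHook {
			return true
		}
	}
	return false
}

func main() {
	updating := map[string]string{
		updateInProgressAnnotation: "",
		pendingHooksAnnotation:     "UpdateMachine",
	}
	fmt.Println(isUpdateInProgress(updating))            // markers present: keep waiting
	fmt.Println(isUpdateInProgress(map[string]string{})) // markers removed: continue rollout
}
```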

## Notes about managedFields refactoring for in-place updates (KCP/MS)

To enable correct in-place updates of BootstrapConfigs and InfraMachines, CAPI v1.12 introduced a refactored managedFields structure. This change was necessary for the following reasons:

- In CAPI <= v1.11, BootstrapConfigs/InfraMachines were only created, never updated
- Starting with CAPI v1.12, BootstrapConfigs/InfraMachines need to be updated during in-place updates. Server-Side Apply (SSA) is used because it handles co-ownership of fields and allows fields to be unset during updates
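
To illustrate why apply semantics matter for unsetting fields, the toy model below mimics one aspect of SSA: a field that a manager applied earlier and omits from a later apply is removed, unless another manager also owns it. This is a simplified sketch and not the API server implementation; conflict detection is deliberately left out, and the `Object` type, the flat field paths, and the field names are hypothetical.

```go
package main

import "fmt"

// Object is a toy API object: flat field paths with values, plus a record
// of which field managers own each path.
type Object struct {
	Fields map[string]string
	Owners map[string]map[string]bool // path -> set of managers
}

// NewObject returns an empty object with initialized maps.
func NewObject() *Object {
	return &Object{Fields: map[string]string{}, Owners: map[string]map[string]bool{}}
}

// Apply sets every field in intent and records the manager as an owner.
// Fields the manager owned before but omits now are released, and removed
// from the object when no other manager owns them.
func (o *Object) Apply(manager string, intent map[string]string) {
	for path, value := range intent {
		o.Fields[path] = value
		if o.Owners[path] == nil {
			o.Owners[path] = map[string]bool{}
		}
		o.Owners[path][manager] = true
	}
	for path, owners := range o.Owners {
		if !owners[manager] {
			continue
		}
		if _, stillWanted := intent[path]; stillWanted {
			continue
		}
		delete(owners, manager)
		if len(owners) == 0 {
			delete(o.Owners, path)
			delete(o.Fields, path)
		}
	}
}

func main() {
	obj := NewObject()
	obj.Apply("capi-machineset", map[string]string{"spec.fieldA": "1", "spec.fieldB": "2"})
	fmt.Println(obj.Fields)

	// The second apply omits spec.fieldB, which unsets it.
	obj.Apply("capi-machineset", map[string]string{"spec.fieldA": "1"})
	fmt.Println(obj.Fields)
}
```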

### A "two field managers" approach

The refactoring uses **two separate field managers** to separate responsibilities:

1. **Metadata manager** (`capi-kubeadmcontrolplane-metadata` / `capi-machineset-metadata`):
   - Continuously syncs labels and annotations
   - Updates on every reconciliation via `syncMachines`

2. **Spec manager** (`capi-kubeadmcontrolplane` / `capi-machineset`):
   - Manages the spec and in-place update specific annotations
   - Updates only when creating objects or triggering in-place updates

### ManagedFields structure comparison

**CAPI <= v1.11** (legacy):
- Machine:
  - spec, labels, and annotations are owned by `capi-kubeadmcontrolplane` / `capi-machineset` (Apply)
- BootstrapConfig / InfraMachine:
  - labels and annotations are owned by `capi-kubeadmcontrolplane` / `capi-machineset` (Apply)
  - spec is owned by `manager` (Update)

**CAPI >= v1.12** (new):
- Machine (unchanged):
  - spec, labels, and annotations are owned by `capi-kubeadmcontrolplane` / `capi-machineset` (Apply)
- BootstrapConfig / InfraMachine:
  - labels and annotations are owned by `capi-kubeadmcontrolplane-metadata` / `capi-machineset-metadata` (Apply)
  - spec is owned by `capi-kubeadmcontrolplane` / `capi-machineset` (Apply)

### Object creation workflow (CAPI >= v1.12)

When creating a new BootstrapConfig/InfraMachine:

1. **Initial creation**:
   - Apply BootstrapConfig/InfraMachine with spec (manager: `capi-kubeadmcontrolplane` / `capi-machineset`)
   - Remove managedFields for labels and annotations
   - Result: labels and annotations are orphaned, spec is owned

2. **First syncMachines call** (happens immediately after):
   - Apply labels and annotations (manager: `capi-kubeadmcontrolplane-metadata` / `capi-machineset-metadata`)
   - Result: Final desired managedField structure is established

3. **Ready for operations**:
   - Continuous `syncMachines` calls update labels/annotations without affecting the spec of a Machine
   - In-place updates can now properly update spec fields and unset fields as needed
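
The ownership hand-off in the first two steps above can be traced with a small model that tracks only which manager owns which field path. It is a sketch under assumptions: the `Ownership` type and the coarse paths (`spec`, `metadata.labels`, `metadata.annotations`) are hypothetical, the MachineSet manager names are used as the example, and it assumes the initial apply also carries labels and annotations, which is why their managedFields entries are removed afterwards.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	specManager     = "capi-machineset"
	metadataManager = "capi-machineset-metadata"
)

// Ownership maps a field path to the field manager that owns it.
// A path that is absent from the map is orphaned (it has no owner).
type Ownership map[string]string

// apply records the manager as owner of the given paths.
func (o Ownership) apply(manager string, paths ...string) {
	for _, p := range paths {
		o[p] = manager
	}
}

// dropOwnership removes the managedFields entries under a path prefix.
func (o Ownership) dropOwnership(prefix string) {
	for p := range o {
		if strings.HasPrefix(p, prefix) {
			delete(o, p)
		}
	}
}

// snapshot returns a copy so intermediate states can be inspected later.
func (o Ownership) snapshot() Ownership {
	c := Ownership{}
	for p, m := range o {
		c[p] = m
	}
	return c
}

// createAndSync replays steps 1 and 2 and returns the ownership after each.
func createAndSync() (afterCreate, afterSync Ownership) {
	o := Ownership{}

	// Step 1: initial creation by the spec manager, then orphan the metadata.
	o.apply(specManager, "spec", "metadata.labels", "metadata.annotations")
	o.dropOwnership("metadata.")
	afterCreate = o.snapshot()

	// Step 2: the first syncMachines call claims labels and annotations.
	o.apply(metadataManager, "metadata.labels", "metadata.annotations")
	return afterCreate, o
}

func main() {
	afterCreate, afterSync := createAndSync()
	fmt.Println("after create:", afterCreate)
	fmt.Println("after sync:  ", afterSync)
}
```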

### In-place update object modifications

When triggering in-place updates:

1. Apply BootstrapConfig/InfraMachine with:
   - Updated spec (owned by the spec manager)
   - `update-in-progress` annotation (owned by the spec manager)
   - For InfraMachine: `cloned-from` annotations (owned by the spec manager)

2. Result after the in-place update trigger:
   - labels and annotations are owned by the metadata manager
   - spec is owned by the spec manager
   - in-progress and cloned-from annotations are owned by the spec manager