# CAPM3 machine health check and remediation
## Introduction
The `Cluster API` includes an optional
[MachineHealthCheck (MHC)](https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html)
component that provides automated health checking. With the
`CAPM3 Remediation Controller` it is possible to plug in Metal3-specific
remediation strategies to remediate unhealthy nodes, while relying on the
Cluster API MHC to determine which nodes are unhealthy.
## CAPI MachineHealthCheck
A MachineHealthCheck is a Cluster API resource that allows users to define
conditions under which Machines within a Cluster should be considered
unhealthy. Users can also specify a timeout for each condition they define to
check on the Machine's Node. If any of these conditions is met for the duration
of its timeout, the Machine will be remediated. Within CAPM3 we use MHC to
create remediation requests based on the `Metal3RemediationTemplate` and
`Metal3Remediation` CRDs to plug in our own remediation solution.
## Remediation Controller
The `CAPM3 Remediation Controller (RC)` reconciles `Metal3Remediation` objects
created by the CAPI MachineHealthCheck. The RC locates the Machine with the
same name as the Metal3Remediation CR and uses existing BMO and CAPM3 APIs to
remediate the associated unhealthy baremetal node. The remediation controller
supports the reboot strategy specified in the Metal3Remediation CRD and uses
the same object to store the state of the current remediation cycle.
### Basic remediation workflow
- The RC watches for the presence of Metal3Remediation CRs.
- Based on the remediation strategy defined in `.spec.strategy.type` of the
  Metal3Remediation, the RC uses BMO APIs to get hosts back into a healthy or
  manageable state.
- The RC uses `.status.phase` to store the state of the current remediation.
  Available states are `running`, `waiting` and `deleting machine`.
- After the RC has finished its remediation, it waits for the Metal3Remediation
  CR to be removed. (When using the CAPI MachineHealthCheck controller, MHC
  notices that the Node has become healthy and deletes the instantiated
  Metal3Remediation CR.) A hypothetical instantiated object is sketched below.
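For illustration, here is a minimal sketch of a Metal3Remediation object as MHC
might instantiate it from a Metal3RemediationTemplate. The object name (which
matches the unhealthy Machine) and the status values are hypothetical:
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Remediation
metadata:
  # named after the unhealthy Machine; "test1-workers-abc12" is hypothetical
  name: test1-workers-abc12
  namespace: metal3
spec:
  # copied from .spec.template.spec of the Metal3RemediationTemplate
  strategy:
    type: "Reboot"
    retryLimit: 2
    timeout: 300s
status:
  # phase is managed by the RC: running, waiting or deleting machine
  phase: running
```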
### Workflow during retry and after remediation failure
- `.spec.strategy.retryLimit` and `.spec.strategy.timeout` defined in the
  Metal3Remediation set the limit for reboot retries and the time to wait
  between retries.
- If the RC's last `.spec.strategy.timeout` for the Node to become healthy
  expires, the RC sets `capi.MachineOwnerRemediatedCondition` to `False` on the
  Machine object to start deletion of the unhealthy Machine and the
  corresponding Metal3Remediation.
- When that timeout expires, the RC also annotates the BareMetalHost with the
  `capi.metal3.io/unhealthy` annotation, as sketched after this list.
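As a sketch of the failure path, a BareMetalHost whose remediation timed out
would carry the annotation roughly as follows. The host name `node-0` is
hypothetical, and the empty annotation value is an assumption:
```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-0
  namespace: metal3
  annotations:
    # added by the RC once the remediation timeout expires
    capi.metal3.io/unhealthy: ""
```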
---
### Configuration
Use the following examples as a basis for creating a MachineHealthCheck and
Metal3RemediationTemplate for worker nodes:
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-healthcheck
  namespace: metal3
spec:
  # clusterName is required to associate this MachineHealthCheck
  # with a particular cluster
  clusterName: test1
  # (Optional) maxUnhealthy prevents further remediation if the cluster is
  # already partially unhealthy
  maxUnhealthy: 100%
  # (Optional) nodeStartupTimeout determines how long a MachineHealthCheck
  # should wait for a Node to join the cluster, before considering a Machine
  # unhealthy. Defaults to 10 minutes if not specified.
  # Set to 0 to disable the node startup timeout.
  # Disabling this timeout will prevent a Machine from being considered
  # unhealthy when the Node it created has not yet registered with the
  # cluster. This can be useful when Nodes take a long time to start up or
  # when you only want condition based checks for Machine health.
  nodeStartupTimeout: 0m
  # selector is used to determine which Machines should be health checked
  selector:
    matchLabels:
      nodepool: nodepool-0
  # Conditions to check on Nodes for matched Machines; if any condition is
  # matched for the duration of its timeout, the Machine is considered
  # unhealthy
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
  remediationTemplate: # added infrastructure reference
    kind: Metal3RemediationTemplate
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    name: worker-remediation-request
```
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3RemediationTemplate
metadata:
  name: worker-remediation-request
  namespace: metal3
spec:
  template:
    spec:
      strategy:
        type: "Reboot"
        retryLimit: 2
        timeout: 300s
```
Use the following examples as a basis for creating a MachineHealthCheck and
Metal3RemediationTemplate for control plane nodes:
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: controlplane-healthcheck
  namespace: metal3
spec:
  clusterName: test1
  maxUnhealthy: 100%
  nodeStartupTimeout: 0m
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
  remediationTemplate: # added infrastructure reference
    kind: Metal3RemediationTemplate
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    name: controlplane-remediation-request
```
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3RemediationTemplate
metadata:
  name: controlplane-remediation-request
  namespace: metal3
spec:
  template:
    spec:
      strategy:
        type: "Reboot"
        retryLimit: 1
        timeout: 300s
```