medik8s/machine-deletion-remediation

Machine-API Driven Remediation

This operator conforms to the External Remediation of NodeHealthCheck and is designed to work with Node Health Check to reprovision unhealthy nodes using the Machine API. It follows the annotation on the Node to the associated Machine object, confirms that the Machine has an owning controller (e.g. MachineSetController), and deletes the Machine. Once the Machine CR has been deleted, the owning controller creates a replacement.
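
The flow described above can be sketched in a few lines of Python over plain in-memory dictionaries. This is only an illustration, not the operator's code: the annotation key machine.openshift.io/machine, its namespace/name value format, and the helper name remediate are assumptions made for the example.

```python
# Assumed annotation key; the real key depends on the Machine API in use.
MACHINE_ANNOTATION = "machine.openshift.io/machine"


def remediate(node, machines):
    """Delete the Machine backing `node` if a controller owns it.

    `machines` maps (namespace, name) -> Machine dict and stands in for the
    cluster API. Returns the deleted key, or None if nothing was deleted.
    """
    annotations = node.get("metadata", {}).get("annotations", {})
    ref = annotations.get(MACHINE_ANNOTATION)
    if not ref or "/" not in ref:
        return None  # Node is not associated with a Machine
    namespace, name = ref.split("/", 1)
    key = (namespace, name)
    machine = machines.get(key)
    if machine is None:
        return None  # Machine is already gone
    owners = machine.get("metadata", {}).get("ownerReferences", [])
    # Delete only when an owning controller exists to create a replacement.
    if not any(owner.get("controller") for owner in owners):
        return None
    del machines[key]
    return key
```

The owner check is what keeps the deletion safe in this sketch: a Machine without an owning controller would not be recreated, so it is left alone.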

Pre-requisites

  • Machine API based cluster that is able to programmatically destroy and create cluster nodes
  • Nodes are associated with Machines
  • Machines are declaratively managed
  • Node Health Check is installed and running

Installation

  • Deploy MDR (Machine-deletion-remediation) as a container in a pod on the cluster. Try make deploy; official images are coming soon.
  • Load the YAML manifest of the MDR template (see below).
  • Modify the NodeHealthCheck CR to use MDR as its remediator. This is a specific use case of the External Remediation of NodeHealthCheck. To set it up, make sure that Node Health Check is running and the Machine-deletion-remediation controller exists, then create the necessary CRs.

Example CRs

An example MDR template object.

apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
kind: MachineDeletionRemediationTemplate
metadata:
  name: group-x
  namespace: default
spec:
  template:
    spec: {}

These CRs are created by the admin and are used as a template by NodeHealthCheck for creating the CRs that represent a request for a Node to be recovered.
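
As a rough illustration of how a template might be turned into a remediation request, the sketch below derives a CR from a template dictionary. The naming rules (dropping the Template suffix from the kind, naming the CR after the Node, reusing the template's namespace) are inferred from the examples in this document, and the function name instantiate is hypothetical; NodeHealthCheck's actual logic may differ.

```python
import copy


def instantiate(template, node_name):
    """Build a remediation CR for `node_name` from a template dict."""
    kind = template["kind"]
    if not kind.endswith("Template"):
        raise ValueError(f"{kind!r} is not a template kind")
    return {
        "apiVersion": template["apiVersion"],
        "kind": kind[: -len("Template")],
        "metadata": {
            "name": node_name,
            "namespace": template["metadata"]["namespace"],
        },
        # Deep-copy so later edits to the CR never touch the template.
        "spec": copy.deepcopy(template["spec"]["template"]["spec"]),
    }


template = {
    "apiVersion": "machine-deletion-remediation.medik8s.io/v1alpha1",
    "kind": "MachineDeletionRemediationTemplate",
    "metadata": {"name": "group-x", "namespace": "default"},
    "spec": {"template": {"spec": {}}},
}
request = instantiate(template, "worker-0-21")
```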

Configuring NodeHealthCheck to use the example group-x template above.

apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample
spec:
  remediationTemplate:
    kind: MachineDeletionRemediationTemplate
    apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
    name: group-x
    namespace: default

While the admin may define many NodeHealthCheck domains, they can all use the same MDR template if desired.

An example remediation request for Node worker-0-21 (NOTE: uid is the nodehealthcheck-sample's UID).

apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
kind: MachineDeletionRemediation
metadata:
  name: worker-0-21
  namespace: default
spec: {}

These CRs are created by NodeHealthCheck when it detects a failed node. The MDR operator watches for them to be created, looks up the Machine CR associated with the Node, and deletes that Machine. MDR CRs are deleted by NodeHealthCheck when it sees the Node is healthy again.
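
The create-and-delete lifecycle of these CRs can be pictured as a small reconcile step. The sketch below is an assumption-laden simplification, not NodeHealthCheck's implementation: node health is reduced to a boolean per node name, and the function name reconcile is hypothetical.

```python
def reconcile(node_healthy, requests):
    """Return (to_create, to_delete) lists of node names.

    node_healthy: dict mapping node name -> bool (True when healthy)
    requests: set of node names that currently have a remediation CR
    """
    unhealthy = {name for name, ok in node_healthy.items() if not ok}
    # Unhealthy nodes without a CR need a new remediation request.
    to_create = sorted(unhealthy - requests)
    # A CR is removed only once its Node is observed healthy again.
    to_delete = sorted(n for n in requests if node_healthy.get(n) is True)
    return to_create, to_delete
```

In this sketch a CR whose Node is no longer reported at all is left in place, since the text above only describes deletion once the Node is seen healthy again.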