OSDOCS-429: Adding disaster recovery docs #14859
@@ -0,0 +1,9 @@
[id="backup-etcd"]
= Backing up etcd
include::modules/common-attributes.adoc[]
:context: backup-etcd

toc::[]

// Backing up etcd data
include::modules/backup-etcd.adoc[leveloffset=+1]
@@ -0,0 +1 @@
../images
@@ -0,0 +1 @@
../modules
@@ -0,0 +1,21 @@
[id="dr-infrastructure-recovery"]
= Recovering from lost master hosts
include::modules/common-attributes.adoc[]
:context: dr-infrastructure-recovery

toc::[]

You can recover from a complete loss of a master host, including situations where a majority of master hosts have been lost, leading to etcd quorum loss and the cluster going offline.

At a high level, the procedure is to:

. Restore etcd quorum on a remaining master host.

Contributor: I'd chunk the module in this assembly more finely and remove this list. With finer chunking, the TOC would be sufficient. You might also be able to reuse some of the modules between scenarios 1 and 2.

. Create new master hosts.
. Correct DNS and load balancer entries.
. Grow etcd to full membership.

If the majority of master hosts have been lost, you will need a xref:../disaster_recovery/backing-up-etcd.html#backing-up-etcd-data_backup-etcd[backed up etcd snapshot] to restore etcd quorum on the remaining master host.

// Recovering from lost master hosts
include::modules/dr-recover-lost-control-plane-hosts.adoc[leveloffset=+1]
@@ -0,0 +1,11 @@
[id="dr-restoring-cluster-state"]
= Restoring back to a previous cluster state
include::modules/common-attributes.adoc[]
:context: dr-restoring-cluster-state

toc::[]

To restore the cluster to a previous state, you must have previously xref:../disaster_recovery/backing-up-etcd.html#backing-up-etcd-data_backup-etcd[backed up etcd data] by creating a snapshot. You will use this snapshot to restore the cluster state.

// Restoring back to a previous cluster state
include::modules/dr-restoring-cluster-state.adoc[leveloffset=+1]
@@ -0,0 +1,9 @@
[id="dr-recovering-expired-certs"]
= Recovering from expired control plane certificates
include::modules/common-attributes.adoc[]
:context: dr-recovering-expired-certs

Contributor: Can we have a sentence or two about when and why your certs expired?

toc::[]

// Recovering from expired control plane certificates
include::modules/dr-recover-expired-control-plane-certs.adoc[leveloffset=+1]
@@ -0,0 +1,24 @@
// Module included in the following assemblies:
//
// * disaster_recovery/backing-up-etcd.adoc

[id="backing-up-etcd-data_{context}"]
= Backing up etcd data

Follow these steps to back up etcd data by creating a snapshot. This snapshot can be saved and used at a later time if you need to restore etcd.

.Prerequisites

* SSH access to a master host.

.Procedure

. Access a master host as the root user.

. Run the `etcd-snapshot-backup.sh` script and pass in the location to save the etcd snapshot to.

Contributor: So, is the script available by default on all master hosts in that location?

+
----
$ sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/backup/snapshot.db
----
+
In this example, the snapshot is saved to `./assets/backup/snapshot.db` on the master host.
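+
As a quick sanity check (not part of the documented procedure, and assuming the example path used above), you can confirm that the snapshot file was written and is not empty:
+
----
$ ls -lh ./assets/backup/snapshot.db
----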

Contributor: Is this where we want to keep the snapshot?
@@ -0,0 +1,189 @@
// Module included in the following assemblies:
//
// * disaster_recovery/scenario-3-expired-certs.adoc

[id="dr-scenario-3-recovering-expired-certs_{context}"]
= Recovering from expired control plane certificates

Follow this procedure to recover from a situation where your control plane certificates have expired.

Contributor: Maybe "You can generate new control plane certificates if yours expired."

.Prerequisites

* SSH access to master hosts.

.Procedure

. Access a master host with an expired certificate as the root user.

. Obtain the `cluster-kube-apiserver-operator` image reference for a release.
+
----
# RELEASE_IMAGE=<release_image> <1>
----
<1> An example value for `<release_image>` is `quay.io/openshift-release-dev/ocp-release:4.1.0`.
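+
If you are not sure which release image your cluster is running, one possible way to look it up, assuming you still have a working kubeconfig for the cluster, is to query the ClusterVersion object:
+
----
$ oc get clusterversion version -o jsonpath='{.status.desired.image}'
----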

Contributor: How do I know how to get this value?
+
----
# KAO_IMAGE=$( oc adm release info --registry-config='/var/lib/kubelet/config.json' "${RELEASE_IMAGE}" --image-for=cluster-kube-apiserver-operator )
----

Contributor: Does this command belong to another step?
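+
Optionally, you can confirm that the variable resolved to an image pull spec before continuing; the exact value depends on your release:
+
----
# echo "${KAO_IMAGE}"
----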

. Pull the `cluster-kube-apiserver-operator` image.
+
----
# podman pull --authfile=/var/lib/kubelet/config.json "${KAO_IMAGE}"
----

. Create a recovery API server.
+
----
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver create
----

. Export the `KUBECONFIG` variable by running the `export KUBECONFIG` command that is printed in the output of the previous command. The `oc` commands later in this procedure require this variable to point at the recovery kubeconfig.

Contributor: I'm not sure what part of the previous command I need to use in this one. I'd remove the second clause and add something like "You must export the
+
----
# export KUBECONFIG=/<path_to_recovery_kubeconfig>/admin.kubeconfig
----

. Wait for the recovery API server to become available.

Contributor: I'd s/come up/become available
+
----
# until oc get namespace kube-system 2>/dev/null 1>&2; do echo 'Waiting for recovery apiserver to come up.'; sleep 1; done
----

. Run the `regenerate-certificates` command.

Contributor: I'd move the second sentence after the command.
+
----
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" regenerate-certificates
----
+
This command fixes the certificates in the API, overwrites the old certificates on the local drive, and restarts static Pods to pick them up.
. After the certificates are fixed in the API, use the following commands to force new rollouts for the control plane.

Contributor: I'd move the second sentence after the command.
+
----
# oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
----
# oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
----
# oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
The control plane reinstalls itself on the other nodes because the kubelet is connected to the API servers through an internal load balancer.
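+
One possible way to follow the rollouts, assuming the recovery kubeconfig is still exported, is to watch the related cluster Operators until they finish progressing:
+
----
# oc get clusteroperators kube-apiserver kube-controller-manager kube-scheduler
----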

. Create a bootstrap kubeconfig with a valid user.

Contributor: How do I know if I have a valid user?

.. Create a file called `restore_kubeconfig.sh` with the following contents.
+
----
#!/bin/bash

set -eou pipefail

# context
intapi=$(oc get infrastructures.config.openshift.io cluster -o "jsonpath={.status.apiServerURL}")
context="$(oc config current-context)"
# cluster
cluster="$(oc config view -o "jsonpath={.contexts[?(@.name==\"$context\")].context.cluster}")"
server="$(oc config view -o "jsonpath={.clusters[?(@.name==\"$cluster\")].cluster.server}")"
# token
ca_crt_data="$(oc get secret -n openshift-machine-config-operator node-bootstrapper-token -o "jsonpath={.data.ca\.crt}" | base64 --decode)"
namespace="$(oc get secret -n openshift-machine-config-operator node-bootstrapper-token -o "jsonpath={.data.namespace}" | base64 --decode)"
token="$(oc get secret -n openshift-machine-config-operator node-bootstrapper-token -o "jsonpath={.data.token}" | base64 --decode)"

export KUBECONFIG="$(mktemp)"
kubectl config set-credentials "kubelet" --token="$token" >/dev/null
ca_crt="$(mktemp)"; echo "$ca_crt_data" > $ca_crt
kubectl config set-cluster $cluster --server="$intapi" --certificate-authority="$ca_crt" --embed-certs >/dev/null
kubectl config set-context kubelet --cluster="$cluster" --user="kubelet" >/dev/null
kubectl config use-context kubelet >/dev/null
cat "$KUBECONFIG"
----

.. Make the script executable.
+
----
# chmod +x restore_kubeconfig.sh
----

.. Run the script and save the output to a file called `kubeconfig`.

Contributor: Not "execute"
+
----
# ./restore_kubeconfig.sh > kubeconfig
----

.. Copy the `kubeconfig` file to all master hosts and save it as `/etc/kubernetes/kubeconfig`.
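+
For example, one possible way to distribute the file, assuming SSH access as the `core` user and using `<master_hostname>` as a placeholder:
+
----
# scp kubeconfig core@<master_hostname>:
# ssh core@<master_hostname> 'sudo mv kubeconfig /etc/kubernetes/kubeconfig'
----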

Contributor: s/move it to

. Recover the kubelet on all master hosts.

.. On a master host, stop the kubelet.
+
----
# systemctl stop kubelet
----

.. Delete stale kubelet data.
+
----
# rm -rf /var/lib/kubelet/pki /var/lib/kubelet/kubeconfig
----

.. Restart the kubelet.
+
----
# systemctl start kubelet
----

.. Repeat these steps on all other master hosts.

. If necessary, recover the kubelet on the worker nodes.
+
After the master nodes are restored, the worker nodes might restore themselves. You can verify this by running the `oc get nodes` command, as shown in the following example. If the worker nodes are not listed, then perform the following steps on each worker node.
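+
For example, to list the nodes that are currently registered (the exact output depends on your cluster):
+
----
# oc get nodes
----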
+
.. Stop the kubelet.
+
----
# systemctl stop kubelet
----

.. Delete stale kubelet data.
+
----
# rm -rf /var/lib/kubelet/pki /var/lib/kubelet/kubeconfig
----

.. Restart the kubelet.
+
----
# systemctl start kubelet
----

. Approve the pending `node-bootstrapper` certificate signing requests (CSRs).

.. Get the list of current CSRs.
+
----
# oc get csr
----

Contributor: Do you need to run these commands as root? If not, s/#/$

.. Review the details of a CSR to verify that it is valid.

Contributor: s/verify it/verify that it
+
----
# oc describe csr <csr_name> <1>
----
<1> `<csr_name>` is the name of a CSR from the list of current CSRs.

.. Approve each valid CSR.
+
----
# oc adm certificate approve <csr_name>
----
+
Be sure to approve all pending `node-bootstrapper` CSRs.
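+
For example, if you have already confirmed that every pending CSR is valid, one possible way to approve them all in a single pass is:
+
----
# oc get csr -o name | xargs oc adm certificate approve
----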

. Destroy the recovery API server because it is no longer needed.
+
----
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver destroy
----
+
Wait for the control plane to restart and pick up the new certificates. This process might take up to 10 minutes.
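+
One way to check on the restart, for example, is to watch the Pods in the `openshift-kube-apiserver` namespace from a host with a working kubeconfig for the cluster; the exact Pod names and timing vary:
+
----
# oc get pods -n openshift-kube-apiserver
----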

Contributor: How do you know if it's restarted? This might need to be a separate step. s/This/This process

Contributor: s/This document describes the process to/You can
s/host. This includes/host, including