13 changes: 13 additions & 0 deletions _topic_map.yml
@@ -639,6 +639,19 @@ Topics:
- Name: What huge pages do and how they are consumed by apps
File: what-huge-pages-do-and-how-they-are-consumed-by-apps
---
Name: Disaster recovery
Dir: disaster_recovery
Distros: openshift-origin,openshift-enterprise
Topics:
- Name: Backing up etcd data
File: backing-up-etcd
- Name: Recovering from lost master hosts
File: scenario-1-infra-recovery
- Name: Restoring back to a previous cluster state
File: scenario-2-restoring-cluster-state
- Name: Recovering from expired control plane certificates
File: scenario-3-expired-certs
---
Name: CLI reference
Dir: cli_reference
Distros: openshift-enterprise,openshift-origin,openshift-dedicated
9 changes: 9 additions & 0 deletions disaster_recovery/backing-up-etcd.adoc
@@ -0,0 +1,9 @@
[id="backup-etcd"]
= Backing up etcd
include::modules/common-attributes.adoc[]
:context: backup-etcd

toc::[]

// Backing up etcd data
include::modules/backup-etcd.adoc[leveloffset=+1]
1 change: 1 addition & 0 deletions disaster_recovery/images
1 change: 1 addition & 0 deletions disaster_recovery/modules
21 changes: 21 additions & 0 deletions disaster_recovery/scenario-1-infra-recovery.adoc
@@ -0,0 +1,21 @@
[id="dr-infrastructure-recovery"]
= Recovering from lost master hosts
include::modules/common-attributes.adoc[]
:context: dr-infrastructure-recovery

toc::[]

You can recover from a complete loss of a master host, including situations where a majority of master hosts have been lost, leading to etcd quorum loss and the cluster going offline.

At a high level, the procedure is to:

. Restore etcd quorum on a remaining master host.
. Create new master hosts.
. Correct DNS and load balancer entries.
. Grow etcd to full membership.

Reviewer: I'd chunk the module in this assembly more finely and remove this list. With finer chunking, the TOC would be sufficient. You might also be able to reuse some of the modules between scenarios 1 and 2.

If the majority of master hosts have been lost, you need a xref:../disaster_recovery/backing-up-etcd.html#backing-up-etcd-data_backup-etcd[backed up etcd snapshot] to restore etcd quorum on the remaining master host.

// Recovering from lost master hosts
include::modules/dr-recover-lost-control-plane-hosts.adoc[leveloffset=+1]
11 changes: 11 additions & 0 deletions disaster_recovery/scenario-2-restoring-cluster-state.adoc
@@ -0,0 +1,11 @@
[id="dr-restoring-cluster-state"]
= Restoring back to a previous cluster state
include::modules/common-attributes.adoc[]
:context: dr-restoring-cluster-state

toc::[]

To restore the cluster to a previous state, you must have xref:../disaster_recovery/backing-up-etcd.html#backing-up-etcd-data_backup-etcd[backed up etcd data] by creating a snapshot. You use this snapshot to restore the cluster state.


// Restoring back to a previous cluster state
include::modules/dr-restoring-cluster-state.adoc[leveloffset=+1]
9 changes: 9 additions & 0 deletions disaster_recovery/scenario-3-expired-certs.adoc
@@ -0,0 +1,9 @@
[id="dr-recovering-expired-certs"]
= Recovering from expired control plane certificates
include::modules/common-attributes.adoc[]
:context: dr-recovering-expired-certs

Reviewer: Can we have a sentence or two about when and why your certs expired?

toc::[]

// Recovering from expired control plane certificates
include::modules/dr-recover-expired-control-plane-certs.adoc[leveloffset=+1]
24 changes: 24 additions & 0 deletions modules/backup-etcd.adoc
@@ -0,0 +1,24 @@
// Module included in the following assemblies:
//
// * disaster_recovery/backing-up-etcd.adoc

[id="backing-up-etcd-data_{context}"]
= Backing up etcd data

You can back up etcd data by creating a snapshot. The snapshot can be saved and used at a later time if you need to restore etcd.

.Prerequisites

* SSH access to a master host.

.Procedure

. Access a master host as the root user.

. Run the `etcd-snapshot-backup.sh` script and specify the location to save the etcd snapshot to:
Reviewer: So, is the script available by default on all master hosts in that location?

+
----
$ sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/backup/snapshot.db
----
+
In this example, the snapshot is saved to `./assets/backup/snapshot.db` on the master host.
Reviewer: Is this where we want to keep the snapshot?
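The module stops after writing the snapshot. As a minimal sketch (the helper name and the non-empty check are my own assumptions, not part of the documented procedure), you could confirm the snapshot file was actually written before relying on it:

```shell
# Hypothetical helper: fail early if the etcd snapshot file is missing
# or empty. The check is illustrative only; it does not validate the
# snapshot's internal consistency.
verify_snapshot() {
  local f="$1"
  if [ -s "$f" ]; then
    echo "ok: $f ($(wc -c < "$f") bytes)"
  else
    echo "error: $f is missing or empty" >&2
    return 1
  fi
}
```

For example, run `verify_snapshot ./assets/backup/snapshot.db` after the backup script completes. Copying the snapshot off the master host is also worth considering, since the host itself may be lost in a disaster.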

189 changes: 189 additions & 0 deletions modules/dr-recover-expired-control-plane-certs.adoc
@@ -0,0 +1,189 @@
// Module included in the following assemblies:
//
// * disaster_recovery/scenario-3-expired-certs.adoc

[id="dr-scenario-3-recovering-expired-certs_{context}"]
= Recovering from expired control plane certificates

You can generate new control plane certificates to recover from a situation where your control plane certificates have expired.


.Prerequisites

* SSH access to master hosts.

.Procedure

. Access a master host with an expired certificate as the root user.

. Obtain the `cluster-kube-apiserver-operator` image reference for a release.
+
----
# RELEASE_IMAGE=<release_image> <1>
----
<1> An example value for `<release_image>` is `quay.io/openshift-release-dev/ocp-release:4.1.0`.
Reviewer: How do I know how to get this value?

+
----
# KAO_IMAGE=$( oc adm release info --registry-config='/var/lib/kubelet/config.json' "${RELEASE_IMAGE}" --image-for=cluster-kube-apiserver-operator )
----
Reviewer: Does this command belong to another step?
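The reviewer asks how to obtain the `<release_image>` value. One possible source, assuming `oc` can still reach a working API server (an assumption on my part; this module does not document it), is the cluster's `clusterversion` resource:

```shell
# Hypothetical helper: read the release image the running cluster is at.
# Assumes a reachable API server and a logged-in `oc` session, which may
# not hold in this recovery scenario.
release_image() {
  oc get clusterversion version -o jsonpath='{.status.desired.image}'
}
```

Then `RELEASE_IMAGE=$(release_image)`. If the API server is down, the value has to come from installation records instead.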

. Pull the `cluster-kube-apiserver-operator` image.
+
----
# podman pull --authfile=/var/lib/kubelet/config.json "${KAO_IMAGE}"
----

. Create a recovery API server.
+
----
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver create
----

. Run the `export KUBECONFIG` command that is printed in the output of the preceding command. You must export the `KUBECONFIG` to run the `oc` commands in the following steps.

+
----
# export KUBECONFIG=/<path_to_recovery_kubeconfig>/admin.kubeconfig
----

. Wait for the recovery API server to become available.

+
----
# until oc get namespace kube-system 2>/dev/null 1>&2; do echo 'Waiting for recovery apiserver to come up.'; sleep 1; done
----

. Run the `regenerate-certificates` command:
+
----
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" regenerate-certificates
----
+
This command fixes the certificates in the API, overwrites the old certificates on the local drive, and restarts static Pods to pick them up.

. After the certificates are fixed in the API, force new rollouts for the control plane:
+
----
# oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
----
# oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
----
# oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
The control plane reinstalls itself on the other nodes because the kubelet is connected to the API servers through an internal load balancer.

. Create a bootstrap kubeconfig with a valid user.
Reviewer: How do I know if I have a valid user?


.. Create a file called `restore_kubeconfig.sh` with the following contents:
+
----
#!/bin/bash

set -eou pipefail

# context
intapi=$(oc get infrastructures.config.openshift.io cluster -o "jsonpath={.status.apiServerURL}")
context="$(oc config current-context)"
# cluster
cluster="$(oc config view -o "jsonpath={.contexts[?(@.name==\"$context\")].context.cluster}")"
server="$(oc config view -o "jsonpath={.clusters[?(@.name==\"$cluster\")].cluster.server}")"
# token
ca_crt_data="$(oc get secret -n openshift-machine-config-operator node-bootstrapper-token -o "jsonpath={.data.ca\.crt}" | base64 --decode)"
namespace="$(oc get secret -n openshift-machine-config-operator node-bootstrapper-token -o "jsonpath={.data.namespace}" | base64 --decode)"
token="$(oc get secret -n openshift-machine-config-operator node-bootstrapper-token -o "jsonpath={.data.token}" | base64 --decode)"

export KUBECONFIG="$(mktemp)"
kubectl config set-credentials "kubelet" --token="$token" >/dev/null
ca_crt="$(mktemp)"; echo "$ca_crt_data" > $ca_crt
kubectl config set-cluster $cluster --server="$intapi" --certificate-authority="$ca_crt" --embed-certs >/dev/null
kubectl config set-context kubelet --cluster="$cluster" --user="kubelet" >/dev/null
kubectl config use-context kubelet >/dev/null
cat "$KUBECONFIG"
----
Reviewer: @rphillips is this a long lived token?
Reviewer: I'm not sure. @deads2k do you know if this token is long lived?

.. Make the script executable.
+
----
# chmod +x restore_kubeconfig.sh
----

.. Run the script and save the output to a file called `kubeconfig`:

+
----
# ./restore_kubeconfig.sh > kubeconfig
----

.. Copy the `kubeconfig` file to all master hosts and place it at `/etc/kubernetes/kubeconfig`.


. Recover the kubelet on all master hosts.

.. On a master host, stop the kubelet.
+
----
# systemctl stop kubelet
----

.. Delete stale kubelet data.
+
----
# rm -rf /var/lib/kubelet/pki /var/lib/kubelet/kubeconfig
----

.. Restart the kubelet.
+
----
# systemctl start kubelet
----

.. Repeat these steps on all other master hosts.

. If necessary, recover the kubelet on the worker nodes.
+
After the master nodes are restored, the worker nodes might restore themselves. You can verify this by running the `oc get nodes` command. If the worker nodes are not listed, then perform the following steps on each worker node.
+
.. Stop the kubelet.
+
----
# systemctl stop kubelet
----

.. Delete stale kubelet data.
+
----
# rm -rf /var/lib/kubelet/pki /var/lib/kubelet/kubeconfig
----

.. Restart the kubelet.
+
----
# systemctl start kubelet
----
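An earlier step suggests verifying worker recovery with `oc get nodes`. A small sketch of that check (the helper name and the reliance on the second output column are my assumptions, not part of the documented procedure):

```shell
# Hypothetical helper: print the names of nodes whose STATUS column is
# anything other than "Ready". An empty result means all nodes have
# rejoined the cluster.
not_ready_nodes() {
  oc get nodes --no-headers | awk '$2 != "Ready" { print $1 }'
}
```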

. Approve the pending `node-bootstrapper` certificates signing requests (CSRs).

.. Get the list of current CSRs.
+
----
# oc get csr
----
Reviewer: Do you need to run these commands as root? If not, s/#/$

.. Review the details of a CSR to verify that it is valid:

+
----
# oc describe csr <csr_name> <1>
----
<1> `<csr_name>` is the name of a CSR from the list of current CSRs.

.. Approve each valid CSR.
+
----
# oc adm certificate approve <csr_name>
----
+
Be sure to approve all pending `node-bootstrapper` CSRs.
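When many `node-bootstrapper` CSRs are pending, a loop saves repetition. This bulk-approve helper is my own illustration, not part of the documented procedure; review the CSR list first, because approving requests blindly is unsafe outside a recovery scenario:

```shell
# Hypothetical helper: approve every CSR currently listed by `oc get csr`.
# Inspect the list before running this; it approves indiscriminately.
approve_all_csrs() {
  oc get csr -o name | while read -r csr; do
    if [ -n "$csr" ]; then
      oc adm certificate approve "$csr"
    fi
  done
}
```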

. Destroy the recovery API server:

+
----
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver destroy
----
. Wait for the control plane to restart and pick up the new certificates. This process might take up to 10 minutes.
+
Reviewer: How do you know if it's restarted?
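The reviewer asks how to tell that the control plane has restarted. One possible check, mirroring the wait loop used earlier in this module (the polling helper and the timeout value are my assumptions, not documented behavior):

```shell
# Hypothetical helper: poll until the API server answers `oc get`, or
# give up after a timeout (seconds). Mirrors the earlier wait loop.
wait_for_apiserver() {
  timeout="${1:-600}"; elapsed=0
  until oc get namespace kube-system >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for the API server" >&2
      return 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "API server is up"
}
```

You could also watch `oc get pods -n openshift-kube-apiserver` until the new static Pods are running.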
