OSDOCS-429: Adding disaster recovery docs #14859
@@ -0,0 +1,9 @@
[id="backup-etcd"]
= Backing up etcd
include::modules/common-attributes.adoc[]
:context: backup-etcd

toc::[]

// Backing up etcd data
include::modules/backup-etcd.adoc[leveloffset=+1]
@@ -0,0 +1 @@
../images
@@ -0,0 +1 @@
../modules
@@ -0,0 +1,21 @@
[id="dr-infrastructure-recovery"]
= Recovering from lost master hosts
include::modules/common-attributes.adoc[]
:context: dr-infrastructure-recovery

toc::[]

You can recover from a complete loss of a master host, including situations where a majority of master hosts have been lost, leading to etcd quorum loss and the cluster going offline.

At a high level, the procedure is to:

. Restore etcd quorum on a remaining master host.

Contributor: I'd chunk the module in this assembly more finely and remove this list. With finer chunking, the TOC would be sufficient. You might also be able to reuse some of the modules between scenarios 1 and 2.

. Create new master hosts.
. Correct DNS and load balancer entries.
. Grow etcd to full membership.

If the majority of master hosts have been lost, you will need a xref:../disaster_recovery/backing-up-etcd.html#backing-up-etcd-data_backup-etcd[backed up etcd snapshot] to restore etcd quorum on the remaining master host.

// Recovering from lost master hosts
include::modules/dr-recover-lost-control-plane-hosts.adoc[leveloffset=+1]
@@ -0,0 +1,11 @@
[id="dr-restoring-cluster-state"]
= Restoring back to a previous cluster state
include::modules/common-attributes.adoc[]
:context: dr-restoring-cluster-state

toc::[]

To restore the cluster to a previous state, you must have previously xref:../disaster_recovery/backing-up-etcd.html#backing-up-etcd-data_backup-etcd[backed up etcd data] by creating a snapshot. You will use this snapshot to restore the cluster state.

// Restoring back to a previous cluster state
include::modules/dr-restoring-cluster-state.adoc[leveloffset=+1]
@@ -0,0 +1,9 @@
[id="dr-recovering-expired-certs"]
= Recovering from expired control plane certificates
include::modules/common-attributes.adoc[]
:context: dr-recovering-expired-certs

Contributor: Can we have a sentence or two about when and why your certs expired?

toc::[]

// Recovering from expired control plane certificates
include::modules/dr-recover-expired-control-plane-certs.adoc[leveloffset=+1]
@@ -0,0 +1,24 @@
// Module included in the following assemblies:
//
// * disaster_recovery/backing-up-etcd.adoc

[id="backing-up-etcd-data_{context}"]
= Backing up etcd data

Follow these steps to back up etcd data by creating a snapshot. This snapshot can be saved and used at a later time if you need to restore etcd.

.Prerequisites

* SSH access to a master host.

.Procedure

. Access a master host as the root user.

. Run the `etcd-snapshot-backup.sh` script and pass in the location to save the etcd snapshot to.

Contributor: So, is the script available by default on all master hosts in that location?

+
----
$ sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/backup/snapshot.db
----
+
In this example, the snapshot is saved to `./assets/backup/snapshot.db` on the master host.
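+
As a quick sanity check (not part of the documented procedure, and assuming the example path used above), you can confirm that the snapshot file was written and is not empty:
+
----
$ ls -lh ./assets/backup/snapshot.db
----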

Contributor: Is this where we want to keep the snapshot?
@@ -0,0 +1,189 @@
// Module included in the following assemblies:
//
// * disaster_recovery/scenario-3-expired-certs.adoc

[id="dr-scenario-3-recovering-expired-certs_{context}"]
= Recovering from expired control plane certificates

Follow this procedure to recover from a situation where your control plane certificates have expired.

Contributor: Maybe "You can generate new control plane certificates if yours expired."

.Prerequisites

* SSH access to master hosts.

.Procedure

. Access a master host with an expired certificate as the root user.

. Obtain the `cluster-kube-apiserver-operator` image reference for a release.
+
----
# RELEASE_IMAGE=<release_image> <1>
----
<1> An example value for `<release_image>` is `quay.io/openshift-release-dev/ocp-release:4.1.0`.
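+
If you are not sure which release image your cluster is running, one possible way to look it up, assuming you still have a working kubeconfig for the cluster, is to query the ClusterVersion object:
+
----
$ oc get clusterversion version -o jsonpath='{.status.desired.image}'
----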

Contributor: How do I know how to get this value?
+
----
# KAO_IMAGE=$( oc adm release info --registry-config='/var/lib/kubelet/config.json' "${RELEASE_IMAGE}" --image-for=cluster-kube-apiserver-operator )
----

Contributor: Does this command belong to another step?
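+
Optionally, you can confirm that the variable resolved to an image pull spec before continuing; the exact value depends on your release:
+
----
# echo "${KAO_IMAGE}"
----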

. Pull the `cluster-kube-apiserver-operator` image.
+
----
# podman pull --authfile=/var/lib/kubelet/config.json "${KAO_IMAGE}"
----

. Create a recovery API server.
+
----
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver create
----

. Export the `KUBECONFIG` variable by running the `export KUBECONFIG` command that is printed in the output of the previous command. The `oc` commands later in this procedure require this variable to point at the recovery kubeconfig.

Contributor: I'm not sure what part of the previous command I need to use in this one. I'd remove the second clause and add something like "You must export the
+
----
# export KUBECONFIG=/<path_to_recovery_kubeconfig>/admin.kubeconfig
----

. Wait for the recovery API server to become available.

Contributor: I'd s/come up/become available
+
----
# until oc get namespace kube-system 2>/dev/null 1>&2; do echo 'Waiting for recovery apiserver to come up.'; sleep 1; done
----

. Run the `regenerate-certificates` command.

Contributor: I'd move the second sentence after the command.
+
----
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" regenerate-certificates
----
+
This command fixes the certificates in the API, overwrites the old certificates on the local drive, and restarts static Pods to pick them up.
. After the certificates are fixed in the API, use the following commands to force new rollouts for the control plane.

Contributor: I'd move the second sentence after the command.
+
----
# oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
----
# oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
----
# oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
The control plane reinstalls itself on the other nodes because the kubelet is connected to the API servers through an internal load balancer.
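+
One possible way to follow the rollouts, assuming the recovery kubeconfig is still exported, is to watch the related cluster Operators until they finish progressing:
+
----
# oc get clusteroperators kube-apiserver kube-controller-manager kube-scheduler
----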

. Create a bootstrap kubeconfig with a valid user.

Contributor: How do I know if I have a valid user?

.. Create a file called `restore_kubeconfig.sh` with the following contents.
+
----
#!/bin/bash

set -eou pipefail

# context
intapi=$(oc get infrastructures.config.openshift.io cluster -o "jsonpath={.status.apiServerURL}")
context="$(oc config current-context)"
# cluster
cluster="$(oc config view -o "jsonpath={.contexts[?(@.name==\"$context\")].context.cluster}")"
server="$(oc config view -o "jsonpath={.clusters[?(@.name==\"$cluster\")].cluster.server}")"
# token
ca_crt_data="$(oc get secret -n openshift-machine-config-operator node-bootstrapper-token -o "jsonpath={.data.ca\.crt}" | base64 --decode)"
namespace="$(oc get secret -n openshift-machine-config-operator node-bootstrapper-token -o "jsonpath={.data.namespace}" | base64 --decode)"
token="$(oc get secret -n openshift-machine-config-operator node-bootstrapper-token -o "jsonpath={.data.token}" | base64 --decode)"

export KUBECONFIG="$(mktemp)"
kubectl config set-credentials "kubelet" --token="$token" >/dev/null
ca_crt="$(mktemp)"; echo "$ca_crt_data" > $ca_crt
kubectl config set-cluster $cluster --server="$intapi" --certificate-authority="$ca_crt" --embed-certs >/dev/null
kubectl config set-context kubelet --cluster="$cluster" --user="kubelet" >/dev/null
kubectl config use-context kubelet >/dev/null
cat "$KUBECONFIG"
----

.. Make the script executable.
+
----
# chmod +x restore_kubeconfig.sh
----

.. Run the script and save the output to a file called `kubeconfig`.

Contributor: Not "execute"
+
----
# ./restore_kubeconfig.sh > kubeconfig
----

.. Copy the `kubeconfig` file to all master hosts and save it as `/etc/kubernetes/kubeconfig`.
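+
For example, one possible way to distribute the file, assuming SSH access as the `core` user and using `<master_hostname>` as a placeholder:
+
----
# scp kubeconfig core@<master_hostname>:
# ssh core@<master_hostname> 'sudo mv kubeconfig /etc/kubernetes/kubeconfig'
----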

Contributor: s/move it to

. Recover the kubelet on all master hosts.

.. On a master host, stop the kubelet.
+
----
# systemctl stop kubelet
----

.. Delete stale kubelet data.
+
----
# rm -rf /var/lib/kubelet/pki /var/lib/kubelet/kubeconfig
----

.. Restart the kubelet.
+
----
# systemctl start kubelet
----

.. Repeat these steps on all other master hosts.

. If necessary, recover the kubelet on the worker nodes.
+
After the master nodes are restored, the worker nodes might restore themselves. You can verify this by running the `oc get nodes` command, as shown in the following example. If the worker nodes are not listed, then perform the following steps on each worker node.
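+
For example, to list the nodes that are currently registered (the exact output depends on your cluster):
+
----
# oc get nodes
----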
+
.. Stop the kubelet.
+
----
# systemctl stop kubelet
----

.. Delete stale kubelet data.
+
----
# rm -rf /var/lib/kubelet/pki /var/lib/kubelet/kubeconfig
----

.. Restart the kubelet.
+
----
# systemctl start kubelet
----

. Approve the pending `node-bootstrapper` certificate signing requests (CSRs).

.. Get the list of current CSRs.
+
----
# oc get csr
----

Contributor: Do you need to run these commands as root? If not, s/#/$

.. Review the details of a CSR to verify that it is valid.

Contributor: s/verify it/verify that it
+
----
# oc describe csr <csr_name> <1>
----
<1> `<csr_name>` is the name of a CSR from the list of current CSRs.

.. Approve each valid CSR.
+
----
# oc adm certificate approve <csr_name>
----
+
Be sure to approve all pending `node-bootstrapper` CSRs.
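+
For example, if you have already confirmed that every pending CSR is valid, one possible way to approve them all in a single pass is:
+
----
# oc get csr -o name | xargs oc adm certificate approve
----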

. Destroy the recovery API server because it is no longer needed.
+
----
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver destroy
----
+
Wait for the control plane to restart and pick up the new certificates. This process might take up to 10 minutes.
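+
One way to check on the restart, for example, is to watch the Pods in the `openshift-kube-apiserver` namespace from a host with a working kubeconfig for the cluster; the exact Pod names and timing vary:
+
----
# oc get pods -n openshift-kube-apiserver
----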

Contributor: How do you know if it's restarted? This might need to be a separate step. s/This/This process

Contributor: s/This document describes the process to/You can
s/host. This includes/host, including