From a38a0cd4ac9dd28bb4566dc9a216e5844f7c60ab Mon Sep 17 00:00:00 2001
From: David Eads
Date: Thu, 22 Aug 2019 13:16:19 -0400
Subject: [PATCH 1/2] add auto-recovery design

---
 kube-apiserver/auto-cert-recovery.md | 126 +++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)
 create mode 100644 kube-apiserver/auto-cert-recovery.md

diff --git a/kube-apiserver/auto-cert-recovery.md b/kube-apiserver/auto-cert-recovery.md
new file mode 100644
index 0000000000..f72c3c4dcf
--- /dev/null
+++ b/kube-apiserver/auto-cert-recovery.md
@@ -0,0 +1,126 @@
+---
+title: automatic-cert-recovery-for-kube-apiserver-kube-controller-manager
+authors:
+  - "@deads2k"
+reviewers:
+  - "@tnozicka"
+  - "@sttts"
+approvers:
+  - "@sttts"
+creation-date: 2019-08-22
+last-updated: 2019-08-22
+status: implementable
+see-also:
+replaces:
+superseded-by:
+---
+
+# Automatic Cert Recovery for kube-apiserver and kube-controller-manager
+
+## Release Signoff Checklist
+
+- [x] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift/docs]
+
+## Summary
+
+Fully automate the recovery of kube-apiserver and kube-controller-manager certificates currently documented [here](https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html).
+Currently, there are are helper commands to make the effort more practical, but we think we can fully automate the process
+to avoid human error and intervention.
+
+
+## Motivation
+
+If the kube-apiserver and kube-controller-manager operators are offline for an extended period of time, the cluster
+cannot automatically restart itself because the certificates are invalid. This comes up in training clusters where
+clusters are suspended frequently. It also comes up for products like code-ready-containers which creates and suspends
+VMs for later restart. It is theoretically possible to automate the recovery steps, but they are slow and error prone.
+
+### Goals
+
+1. Provide a zero touch, always-on certificate recovery for the kube-apiserver and kube-controller-manager
+
+### Non-Goals
+
+1. Provide automation for any other part of disaster recovery.
+2. Provide mechanisms to keep certificates up to date for any other component (kubelet for instance).
+3. Provide mechanisms to approve CSRs. That is still the domain of the cloud team.
+
+## Proposal
+
+We will take our existing `cluster-kube-apiserver-operator regenerated-certificates` command and create a simple, non-leader-elected
+controller which will watch for expired certificates and regenerate them. It will connect to the kube-apiserver using
+localhost with an SNI name option wired to a 10 year cert. When there is no work to do, this controller wil do nothing.
+The recovery flow will look like this:
+
+1. kas-static-pod/kube-apiserver starts with expired certificates
+2. kas-static-pod/cert-syncer connects to localhost kube-apiserver using a long-lived SNI cert (localhost-recovery). It sees expired certs.
+3. kas-static-pod/cert-regenerator connects to localhost kube-apiserver with a long-lived SNI cert (localhost-recovery). It sees expired certs and refreshes them as appropriate. Being in the same
+   repo, it uses the same logic. We will probably add an overall option to the library-go cert rotation to say, "only refresh on expired"
+   so that it never collides with the operator during normal operation. The library-go cert rotation impl is resilient to
+   multiple actors already.
+4. kas-static-pod/cert-syncer sees updated certs and places them for reload. (this already works)
+5. kas-static-pod/kube-apiserver starts serving with new certs. (this already works)
+6. kcm-static-pod/kube-controller-manager starts with expired certificates
+7. kcm-static-pod/cert-syncer connects to localhost kube-apiserver using a long-lived SNI cert (localhost-recovery). It sees expired certs.
+8. kcm-static-pod/cert-regenerator connects to localhost kube-apiserver with a long-lived SNI cert (localhost-recovery). It sees expired certs and refreshes them as appropriate. Being in the same
+   repo, it uses the same logic. We will probably add an overall option to the library-go cert rotation to say, "only refresh on expired"
+   so that it never collides with the operator during normal operation. The library-go cert rotation impl is resilient to
+   multiple actors already.
+9. kcm-static-pod/cert-syncer sees updated certs and places them for reload. (this already works)
+10. kcm-static-pod/kube-controller-manager wires up a library-go/pkg/controller/fileobserver to the CSR signer and suicides on the update
+
+### Implementation Details/Notes/Constraints
+
+This requires these significant pieces
+
+- [ ] kcm fileobserver
+- [ ] kcm-o to rewire configuration to auto-refresh CSR signer
+- [ ] kcm-o to provide a cert regeneration controller
+- [ ] kas-o to provide a cert regeneration controller
+- [ ] kas-o to create and wire a long-lived serving cert/key pair for localhost-recovery
+- [ ] library-go cert rotation library to support an override for only rotating when certs are expired
+
+### Risks and Mitigations
+
+1. If we wire the communication unsafely we can get a CVE.
+2. If we don't delay past "normal" rotation, the kas-o logs will be hard to interpret.
+3. If something goes wrong, manual recovery may be harder.
+
+## Design Details
+
+### Test Plan
+
+Disaster recovery tests are still outstanding with an epic that may not be approved. Lack of testing here doesn't introduce
+additional risk beyond that already accepted.
+
+This will be tested as part of normal disaster recovery tests. It's built on already unit tested libraries and affects
+destination files already checked with unit and e2e tests.
+
+### Graduation Criteria
+
+This will start as GA.
+
+### Upgrade / Downgrade Strategy
+
+Being attached to the existing static pod, upgrades and downgrades will produce matching containers, so our producer and
+consumers are guaranteed to match.
+
+### Version Skew Strategy
+
+Because each deployment in the payload is atomic, it will not skew. There are no external changes.
+
+## Implementation History
+
+Major milestones in the life cycle of a proposal should be tracked in `Implementation
+History`.
+
+## Drawbacks
+
+The alternative is a laborious and error-prone manual process that three existing teams have already had trouble with.
+
+## Alternatives
+
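
The recovery flow in this first patch hinges on the proposed "only refresh on expired" mode: the cert-regenerator must stay idle whenever certificates are still valid so that it never races the operators' normal rotation. The sketch below is only an illustration of that gate under the assumptions of this proposal, not the library-go cert rotation implementation; the certificate path in `main` is hypothetical.

```go
// Illustration only: a stand-alone sketch of the "only refresh on expired"
// gate, not the library-go cert rotation code.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

// needsRecovery reports whether the serving certificate at certPath is
// missing, unparseable, or already expired. Normal operator rotation runs
// well before NotAfter, so a controller gated this way stays idle and never
// competes with the operator.
func needsRecovery(certPath string, now time.Time) bool {
	data, err := os.ReadFile(certPath)
	if err != nil {
		return true // a missing cert is treated like an expired one
	}
	block, _ := pem.Decode(data)
	if block == nil {
		return true
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return true
	}
	// Act only after expiry; "soon to expire" is left to normal rotation.
	return now.After(cert.NotAfter)
}

func main() {
	// Hypothetical path; the real static-pod cert locations are wired by the operator.
	path := "/etc/kubernetes/static-pod-certs/secrets/serving-cert/tls.crt"
	fmt.Printf("certificate %s needs recovery: %v\n", path, needsRecovery(path, time.Now()))
}
```

Gating on NotAfter (already expired) rather than "close to expiry" is what keeps the recovery controller inert during normal operation.
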
From 3e93e2c487a4dfa708b28c00673ef73b68995e1c Mon Sep 17 00:00:00 2001
From: David Eads
Date: Fri, 11 Oct 2019 11:46:52 -0400
Subject: [PATCH 2/2] add clarification to automatic cert recovery flow

---
 .../kube-apiserver}/auto-cert-recovery.md | 23 +++++++++++++------
 1 file changed, 16 insertions(+), 7 deletions(-)
 rename {kube-apiserver => enhancements/kube-apiserver}/auto-cert-recovery.md (82%)

diff --git a/kube-apiserver/auto-cert-recovery.md b/enhancements/kube-apiserver/auto-cert-recovery.md
similarity index 82%
rename from kube-apiserver/auto-cert-recovery.md
rename to enhancements/kube-apiserver/auto-cert-recovery.md
index f72c3c4dcf..3363cec097 100644
--- a/kube-apiserver/auto-cert-recovery.md
+++ b/enhancements/kube-apiserver/auto-cert-recovery.md
@@ -8,7 +8,7 @@ reviewers:
 approvers:
   - "@sttts"
 creation-date: 2019-08-22
-last-updated: 2019-08-22
+last-updated: 2019-10-11
 status: implementable
 see-also:
 replaces:
@@ -28,7 +28,7 @@ superseded-by:
 ## Summary
 
 Fully automate the recovery of kube-apiserver and kube-controller-manager certificates currently documented [here](https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html).
-Currently, there are are helper commands to make the effort more practical, but we think we can fully automate the process
+Currently, there are helper commands to make the effort more practical, but we think we can fully automate the process
 to avoid human error and intervention.
 
 
@@ -53,13 +53,15 @@ VMs for later restart. It is theoretically possible to automate the recovery st
 We will take our existing `cluster-kube-apiserver-operator regenerated-certificates` command and create a simple, non-leader-elected
 controller which will watch for expired certificates and regenerate them. It will connect to the kube-apiserver using
-localhost with an SNI name option wired to a 10 year cert. When there is no work to do, this controller wil do nothing.
+localhost with an SNI name option wired to a 10 year cert. When there is no work to do, this controller will do nothing.
+This controller will run as another container in our existing static pods.
 The recovery flow will look like this:
 
 1. kas-static-pod/kube-apiserver starts with expired certificates
 2. kas-static-pod/cert-syncer connects to localhost kube-apiserver using a long-lived SNI cert (localhost-recovery). It sees expired certs.
-3. kas-static-pod/cert-regenerator connects to localhost kube-apiserver with a long-lived SNI cert (localhost-recovery). It sees expired certs and refreshes them as appropriate. Being in the same
-   repo, it uses the same logic. We will probably add an overall option to the library-go cert rotation to say, "only refresh on expired"
+3. kas-static-pod/cert-regenerator connects to localhost kube-apiserver with a long-lived SNI cert (localhost-recovery).
+   It sees expired certs and refreshes them as appropriate. Being in the same repo, it uses the same logic.
+   We will add an overall option to the library-go cert rotation to say, "only refresh on expired"
    so that it never collides with the operator during normal operation. The library-go cert rotation impl is resilient to
    multiple actors already.
 4. kas-static-pod/cert-syncer sees updated certs and places them for reload. (this already works)
@@ -72,6 +74,13 @@ The recovery flow will look like this:
    multiple actors already.
 9. kcm-static-pod/cert-syncer sees updated certs and places them for reload. (this already works)
 10. kcm-static-pod/kube-controller-manager wires up a library-go/pkg/controller/fileobserver to the CSR signer and suicides on the update
+11. At this point, kas and kcm are both up and running with valid serving certs and valid CSR signers.
+12. Kubelets will start creating CSRs for signers, but the machine approver is down.
+    **A cluster-admin must manually approve client CSRs for the master kubelets**
+13. Master kubelets are able to communicate to the kas and get listings of pods,
+    1. kcm creates pods for operators including sdn-o and sdn,
+    2. kube-scheduler sees a valid kas serving cert and schedules those pods to masters,
+    3. master kubelets run the sdn, sdn-o, kas-o, kcm-o, ks-o and the system re-bootstraps.
 
 ### Implementation Details/Notes/Constraints
 
@@ -83,6 +92,7 @@ This requires these significant pieces
 - [ ] kcm-o to provide a cert regeneration controller
 - [ ] kas-o to provide a cert regeneration controller
 - [ ] kas-o to create and wire a long-lived serving cert/key pair for localhost-recovery
 - [ ] library-go cert rotation library to support an override for only rotating when certs are expired
+- [ ] remove old manual recovery commands
 
 ### Risks and Mitigations
@@ -118,9 +128,8 @@ Because each deployment in the payload is atomic, it will not skew. There are n
 Major milestones in the life cycle of a proposal should be tracked in `Implementation
 History`.
 
-## Drawbacks
+## Alternatives
 
 The alternative is a laborious and error-prone manual process that three existing teams have already had trouble with.
 
-## Alternatives
 
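
Step 10 of the recovery flow and the "kcm fileobserver" work item have the kube-controller-manager watch its CSR signer and exit when the cert-syncer writes a fresh one, so the static pod restarts it with the new signer. The sketch below is a minimal polling stand-in for that exit-on-change behavior, not the actual library-go/pkg/controller/fileobserver API; the watched path and poll interval are hypothetical.

```go
// Illustration only: a minimal polling stand-in for the exit-on-change
// behavior described in step 10, not the library-go/pkg/controller/fileobserver API.
package main

import (
	"bytes"
	"log"
	"os"
	"time"
)

// exitOnChange polls path and terminates the process as soon as the file
// content differs from what was present at startup. The kubelet then
// restarts the static pod container, which picks up the new CSR signer.
func exitOnChange(path string, interval time.Duration) {
	initial, _ := os.ReadFile(path) // nil/empty if the file is absent at startup
	for {
		time.Sleep(interval)
		current, _ := os.ReadFile(path)
		if !bytes.Equal(initial, current) {
			log.Printf("%s changed, exiting so the static pod restarts with the new CSR signer", path)
			os.Exit(0)
		}
	}
}

func main() {
	// Hypothetical path and interval; the real locations are wired by kcm-o.
	go exitOnChange("/etc/kubernetes/static-pod-resources/secrets/csr-signer/tls.crt", 10*time.Second)

	// ... the controller-manager's actual work would run here ...
	select {}
}
```

Exiting and letting the kubelet restart the container is deliberately simpler than trying to hot-reload the signer inside a running controller-manager.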