service-ca-rotation: Ensure pod restart before expiry of pre-rotation CA #193
@@ -102,29 +102,83 @@ and the functionality could be separately extracted for operator reuse.
  original expiry of the pre-rotation CA to refresh without breaking trust.
- The duration of the service CA should be extended from its current value of
  12 months to a value that exceeds the maximum expected upgrade interval
  (currently 12 months). A value of 14 months should be sufficient.
- Timelines:
  - With 12 month CA duration:
  12 months to 26 months and the minimum CA duration should be extended to 13
  months. These values ensure that pods will be guaranteed to be restarted in
  a cluster supported by Red Hat before the expiry of the pre-rotation CA.
  - Pods in the OCP control plane automatically refresh key material
    (certificates and CA bundles) without pod restart.
  - For all other pods, the timing of upgrades is key to determining CA
    duration:
    - Services that do not refresh key material automatically must be
      restarted after CA rotation.
    - Upgrades restart all pods, so ensuring that an upgrade takes place
      after rotation and prior to expiry of the pre-rotation CA guarantees
      that services will always be using valid key material.
  - Computing CA duration:
    - The maximum interval, *I*, between upgrades of a cluster supported by
      Red Hat is 12 months.
      - At most 3 OCP releases are supported at one time, and 3 releases are
        expected in a given year.
      - In the event that an LTS release strategy is chosen, a supported
        cluster would still be expected to be upgraded at least once per year
        if only with a patch release.
    - When the minimum remaining CA duration, *M*, is reached, automatic
      rotation will be triggered. *M* must be greater than *I* to ensure that
      an upgrade occurs before the expiry of the pre-rotation CA.
    - Let the interval between creation of a CA and its rotation be *R*.
      - *R* must be greater than *I* to ensure that an upgrade occurs before
        the expiry of the pre-rotation CA.
      - *R* must be greater than or equal to the minimum remaining CA
        duration, *M*, to ensure that an upgrade occurs before a subsequent
        rotation.
|
Contributor
this I don't understand. With

Contributor (Author)
The only requirement is that an upgrade/restart occurs after a rotation. Once that occurs, there is no harm in either a subsequent rotation or the expiry of the pre-rotation CA.

      - Since cross-signing is only supported between the current and
        previous CAs, rotating before all pods have been restarted to use key
        material from the current CA risks breaking trust with key material
        issued by the previous CA.
    - The interval between creation and rotation, *R*, can be computed as the
      total duration *D* less the minimum remaining duration *M*:
      - *R* = *D* - *M*
    - Reordering to solve for *D*:
      - *D* = *R* + *M*
    - Since each of *R* and *M* must be greater than *I*:
      - *D* = *R* + *M* > 2 * *I*
    - Since *R* >= *M* should be true, simplify to *R* = *M*:
      - *D* = 2 * *R* > 2 * *I*
    - Substitute *I* = 12:
      - *D* = 2 * *R* > 24
    - Picking *R* = *M* = 13 will satisfy the relation, resulting in *D* = 26
  - Worst-case timelines with old and new values for CA duration:
    - Let minimum duration *M* = 6 months, total duration *D* = 12 months:
      - T+0m - Cluster installed with new CA or existing CA is rotated (CA-1)
      - T+6m - Automated rotation replaces CA-1 with CA-2 when CA-1 duration < 6m
      - T+12m - Cluster is upgraded and all pods are restarted
      - T+12m - CA-1 expires. If cluster was not upgraded before this
        happens, services using the old key material may break.
  - With 14 month CA duration:
    - Let minimum duration *M* = 13 months, total duration *D* = 26 months:
      - T+0m - Cluster installed with new CA or existing CA is rotated (CA-1)
      - T+8m - Automated rotation replaces CA-1 with CA-2 when CA-1 duration < 6m
      - T+12m - Cluster is upgraded and all pods are restarted
      - T+14m - CA-1 expires. No impact because of the restart at time of upgrade
- Services that do not refresh key material automatically must be restarted
  after ca rotation.
- Upgrades restart all pods, and ensuring that an upgrade takes place after
  rotation and prior to expiry of the pre-rotation ca guarantees that
  services will always be using valid key material.
- If the ca duration does not exceed the expected upgrade interval, there is
  the possibility, however slight, that an upgrade will not occur after
  automated rotation and before expiry of the pre-rotation ca.
      - T+13m - Automated rotation replaces CA-1 with CA-2 when CA-1 duration < 13m
      - T+24m - Cluster is upgraded and all pods are restarted
      - T+26m - CA-1 expires. No impact because of the restart at time of upgrade
- Since all clusters that do not support automated rotation will have CAs with
  a 12 month total duration, the remaining CA duration for those clusters will
  always be less than the minimum upgrade interval of 12 months. Rotation will
  be triggered when upgrading to a release that supports automated CA rotation
  and the cluster administrator should manually restart all pods.
  - Without this intervention, pod restart before expiry of the pre-rotation
    CA cannot be guaranteed.
  - The requirement to manually restart pods after a cluster is upgraded to
    a release that supports automated CA rotation is a one-time thing. All
    subsequent upgrades and rotations will ensure restart before expiry
    without manual intervention.
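
The duration arithmetic above can be sanity-checked with a short sketch. This is a hypothetical snippet for illustration only; `restartGuaranteed` and its parameters are not part of the proposal or the operator:

```go
// A minimal sketch re-checking the arithmetic above: rotation fires once the
// remaining CA duration drops below M, and an upgrade (which restarts every
// pod) is guaranteed at most I months after the previous one, so the window
// between rotation and expiry of the pre-rotation CA must be longer than I.
package main

import "fmt"

// restartGuaranteed reports whether at least one upgrade is guaranteed to land
// between automated rotation (at R = D - M months after issuance) and expiry
// of the pre-rotation CA (at D months), given a maximum upgrade interval I.
// All values are in months.
func restartGuaranteed(d, m, i int) bool {
	r := d - m     // R: age of the CA when rotation triggers
	return d-r > i // the rotation-to-expiry window (== M) must exceed I
}

func main() {
	// Current values: D = 12, M = 6 -> the window is only 6 months.
	fmt.Println(restartGuaranteed(12, 6, 12)) // false
	// Proposed values: D = 26, M = 13 -> the window is 13 months.
	fmt.Println(restartGuaranteed(26, 13, 12)) // true
}
```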
|
Comment on lines +171 to +174

Contributor
This should not include restarting control-plane pods that should be rotated by their operators

Contributor (Author)
Why would it make sense to differentiate between pods running user workloads and control plane pods if manual restart involves restarting all pods?

Contributor
They should be capable of rotating the certs for their payloads themselves, but I suppose it's safer to restart all anyway...
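
For context on what refreshing key material without a restart can look like on the service side, here is a minimal, hypothetical Go sketch that reloads the mounted serving certificate on each TLS handshake; the mount path and server wiring are assumptions, not how any particular OCP component actually does it:

```go
// A minimal sketch (not the operator's code) of one way a service can pick up
// rotated serving-cert key material without a pod restart: reload the
// certificate from the mounted secret on every TLS handshake.
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	// Illustrative mount path for a serving-cert secret.
	const certFile = "/var/run/secrets/serving-cert/tls.crt"
	const keyFile = "/var/run/secrets/serving-cert/tls.key"

	server := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			// GetCertificate runs per handshake, so a cert rewritten on disk
			// after rotation is served without restarting the pod.
			GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
				cert, err := tls.LoadX509KeyPair(certFile, keyFile)
				if err != nil {
					return nil, err
				}
				return &cert, nil
			},
		},
	}
	// Empty cert/key arguments: the certificate comes from GetCertificate.
	log.Fatal(server.ListenAndServeTLS("", ""))
}
```
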
  - The requirement to manually restart pods after upgrading to a release
    that supports automated rotation will need to be communicated via
    documentation.
    - A Prometheus alert doesn't seem like a viable option except for the
      current pending release (4.4) due to the requirement to key it off of
      an existing metric, and the general policy of not backporting metrics
      and alert additions to previous releases of OCP.

- Generate an intermediate CA certificate with the same public key as the new CA
  but signed with the private key of the current CA.
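
For illustration, a minimal sketch of that cross-signing step using the Go standard library might look like the following; the function name, package, and validity choices are assumptions, not the operator's actual implementation:

```go
// A sketch of cross-signing: a certificate that carries the new CA's subject
// and public key but is signed with the current (pre-rotation) CA's private
// key, so clients that still trust only the current CA can validate serving
// certs issued by the new CA.
package rotation

import (
	"crypto"
	"crypto/rand"
	"crypto/x509"
)

// crossSignIntermediate issues an intermediate with newCA's subject and public
// key, signed by currentCA/currentKey. The returned DER bytes can be appended
// to serving cert chains issued by the new CA.
func crossSignIntermediate(newCA, currentCA *x509.Certificate, currentKey crypto.Signer) ([]byte, error) {
	template := &x509.Certificate{
		SerialNumber:          newCA.SerialNumber,
		Subject:               newCA.Subject,
		NotBefore:             newCA.NotBefore,
		NotAfter:              newCA.NotAfter, // validity bounds are a policy choice
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
	}
	// newCA.PublicKey is the certified key; currentKey produces the signature.
	return x509.CreateCertificate(rand.Reader, template, currentCA, newCA.PublicKey, currentKey)
}
```

Appending such an intermediate to serving cert chains is what lets clients that still trust only the pre-rotation CA validate certificates issued by the new CA.
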
@@ -250,11 +304,15 @@ or CA bundle (i.e. recognize when it changes, load the new value).
- Since trust verification does not depend on interaction with a cluster, the
  code to do so could be shared between unit and e2e testing.
  - https://github.com/openshift/service-ca-operator/blob/master/test/util/rotate.go#L22
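
As an illustration, a cluster-independent trust check of the kind such a shared helper could provide might look like this sketch; `checkTrust` and its signature are hypothetical, not the existing helper's API:

```go
// A minimal, cluster-independent trust check: verify that a serving
// certificate chains to a CA bundle, optionally via intermediates such as a
// cross-signed CA. It needs no API server, so the same helper can back both
// unit tests and e2e assertions.
package util

import (
	"crypto/x509"
	"fmt"
)

// checkTrust verifies that servingCert chains to at least one CA in bundle.
func checkTrust(bundle, intermediates []*x509.Certificate, servingCert *x509.Certificate) error {
	roots := x509.NewCertPool()
	for _, ca := range bundle {
		roots.AddCert(ca)
	}
	inters := x509.NewCertPool()
	for _, c := range intermediates {
		inters.AddCert(c)
	}
	if _, err := servingCert.Verify(x509.VerifyOptions{Roots: roots, Intermediates: inters}); err != nil {
		return fmt.Errorf("serving cert is not trusted by the CA bundle: %w", err)
	}
	return nil
}
```
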
- Unit testing should be used to validate the logic of the rotation method.
  - https://github.com/openshift/service-ca-operator/blob/master/pkg/operator/rotate_test.go#L176

- E2E testing should be used to validate the rotation method in combination with
  the required API interaction.
  - https://github.com/openshift/service-ca-operator/blob/master/test/e2e/e2e_test.go#L1174
  - https://github.com/openshift/service-ca-operator/blob/master/test/e2e/e2e_test.go#L1182

- Exploratory testing should be performed to determine if rotation induces
  unexpected behavior in other components.

there's been whispers of an LTS version, wouldn't that break this assumption?

afaik there will be no such thing as an LTS release that doesn't still require an upgrade at least once a year if only to a patch release. Any upgrade will do, not just between minor versions.
@smarterclayton Can you confirm?

Added mention of LTS and upgrade to patch release in fixup #2

We don't require that customers "upgrade", but in general customers should be updating production clusters more than once a year, given that we are providing them (at least with micro-updates) weekly updates.