OCPBUGS-30119: certrotation: update all secret types to kubernetes.io/tls preserving existing content#1681
Conversation
…ng existing content Pre-4.7 clusters had `SecretTypeTLS` type set, so on 4.15 upgrade these clusters had the secret delete and recreated. This commit will ensure these are being created without cert and key being regenerated
|
@vrutkovs: This pull request references Jira Issue OCPBUGS-30119, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
c296833 to
84116c5
Compare
benluddy
left a comment
There was a problem hiding this comment.
I'm still concerned about the recreate path of ApplySecretImproved. If control flow is interrupted between the delete and create (e.g. due to process crash), or the context being used for the client requests is canceled, what happens to the cert data? It seems to me like we would lose it in that case. Even if unlikely, the consequences could be painful. Could we temporarily stage the data in a second secret before deleting the target secret, or something similar?
| secret.Type = corev1.SecretTypeTLS | ||
| needsMetadataUpdate = true |
There was a problem hiding this comment.
Would it be worthwhile to check that the secret we're converting has tls.crt and tls.key before updating the type, since updating the type makes validation stricter (https://github.com/kubernetes/kubernetes/blob/d51e2da869350b2b20a33d060e17bd1d2165e02d/pkg/apis/core/validation/validation.go#L6392-L6398), or will it be enough to see UpdateSecretFailed event spam?
There was a problem hiding this comment.
Right, we'd get UpdateSecretFailed event and have it re-created. If the existing secret doesn't have tls.crt and tls.key then its contents is unusable and it needs to be recreated
There was a problem hiding this comment.
Recreated by whom? We don't react to events but the secret content, right?
There was a problem hiding this comment.
Recreated by ApplySecret, which attempts apply first, but if it fails it does delete+create. If create with kubernetes.io/tls and existing content fails, it would throw an error, delete the content and generate a new cert.
There was a problem hiding this comment.
it would throw an error, delete the content and generate a new cert
It would throw the error, but probably would loop
There was a problem hiding this comment.
Updated ensureMetadataUpdate to update secret type (removing data if tls.crt/tls.key are missing) in this case, added unit tests
Right, |
Is it safe to solve separately? IIUC, merging this PR to migrate secret types will present a lot of chances to force an unnecessary rotation on older clusters. |
|
We already do "migration" in 4.15 with content replace anyway, so this PR helps in happy path. Corner case - cert sync is interrupted right during the migration - can be safely fixed outside of this PR |
😂 Oh, of course. Thanks! |
|
/lgtm |
|
@vrutkovs: This pull request references Jira Issue OCPBUGS-30119, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| // apply necessary metadata (possibly via delete+recreate) if secret exists | ||
| // this is done before content update to prevent unexpected rollouts | ||
| if ensureMetadataUpdate(signingCertKeyPairSecret, c.Owner, c.AdditionalAnnotations) && len(signingCertKeyPairSecret.ResourceVersion) > 0 { | ||
| actualSigningCertKeyPairSecret, _, err := resourceapply.ApplySecret(ctx, c.Client, c.EventRecorder, signingCertKeyPairSecret) |
There was a problem hiding this comment.
nit: the following seems as a bit of a cleaner method to achieve the same
var err error
signingCertKeyPairSecret, _, err = resourceapply.ApplySecret(ctx, c.Client, c.EventRecorder, signingCertKeyPairSecret)There was a problem hiding this comment.
Following the pattern as in if needed, reason := needNewSigningCertKeyPair
84116c5 to
1ca53d1
Compare
b20d854 to
dc9a1e5
Compare
| // Existing secret not found - no need to update metadata (will be done by needNewSigningCertKeyPair / NeedNewTargetCertKeyPair) | ||
| if len(secret.ResourceVersion) == 0 { | ||
| return false | ||
| } |
There was a problem hiding this comment.
this probably belongs to the other function, otherwise nothing sets the ownerRef - or does it?
There was a problem hiding this comment.
Yeah, good point, this also shows different function flows
dc9a1e5 to
ed06b9e
Compare
|
/approve |
|
Adding approved label based on #1681 (comment) |
| } | ||
|
|
||
| // convert outdated secret type (created by pre 4.7 installer) | ||
| if secret.Type != corev1.SecretTypeTLS { |
There was a problem hiding this comment.
It should behave the same in practice, but since we're correcting for a specific mistake, how about:
| if secret.Type != corev1.SecretTypeTLS { | |
| if secret.Type == "SecretTypeTLS" { |
And as an aside, can we have an issue to track removing this after N releases?
There was a problem hiding this comment.
Confusingly SecretTypeTLS is kubernetes.io/tls and not "SecretTypeTLS". So if secret.Type != corev1.SecretTypeTLS would make sure any non-kubernetes.io/tls secret would converted to kubernetes.io/tls (clearing up data if necessary)
There was a problem hiding this comment.
Right, I understand, but my point is that we expect any secret managed by this controller will either have type kubernetes.io/tls or, if it was created pre-4.7, SecretTypeTLS. If we discover that there is some instance of a managed secret with a completely unexpected type, I would rather understand why our expectation was wrong than stomp it.
There was a problem hiding this comment.
Its perfectly possible for users to find a way to replace these certificates behind cert-syncer back and replace it with invalid contents. This code will ensure that we attempt to convert cases like pre-4.7 if possible, but also prioritize availability - invalid content would be stomped for cluster to continue functioning. Why the bad secret was stomped can be found out via audit logs (or etcd contents).
This change will also protects us from other similar cases like "old/misconfigured installer has generated certificates which we no longer expect"
| if len(actual.Annotations) == 0 { | ||
| t.Errorf("expected certificates to be annotated") | ||
| } |
There was a problem hiding this comment.
This looks redundant, any value for annotations that would fail here is also going to fail below.
There was a problem hiding this comment.
Yes, but I'd prefer to a clear error message
There was a problem hiding this comment.
Fine with me given that nothing out of the control of this test should be touching annotations.
| if len(actual.Annotations) == 0 { | ||
| t.Errorf("expected certificates to be annotated") | ||
| } | ||
| ownershipValue, found := actual.Annotations[annotations.OpenShiftComponent] |
There was a problem hiding this comment.
I prefer to avoid sharing values like this between tests and implementation, since it creates the opportunity to accidentally make a breaking change without seeing a test failure.
Feel free to ignore if you disagree or are comfortable with it in this instance because it is declared in openshift/api.
There was a problem hiding this comment.
it creates the opportunity to accidentally make a breaking change without seeing a test failure
We explicitly set AdditionalAnnotations in https://github.com/openshift/library-go/pull/1681/files#diff-d7a4e30c6635930d8721bf060fe2803587326298e0eace9bada256f1b7377459R228-R230
There was a problem hiding this comment.
That link didn't work for me. In this case I'm specifically referring to a change to the constant OpenShiftComponent, which would be a breaking change but would not cause this test to fail.
There was a problem hiding this comment.
Ah, hmm, JiraComponent and OpenshiftComponent are linked in https://github.com/openshift/library-go/blob/master/pkg/operator/certrotation/annotations.go#L27-L30, not sure how to make sure unit tests don't depend on it though
| t.Error(actions[1]) | ||
| } | ||
| if !actions[2].Matches("create", "secrets") { | ||
| t.Error(actions[0]) |
There was a problem hiding this comment.
| t.Error(actions[0]) | |
| t.Error(actions[2]) |
| t.Error(actions[1]) | ||
| } | ||
| if !actions[2].Matches("create", "secrets") { | ||
| t.Error(actions[0]) |
There was a problem hiding this comment.
| t.Error(actions[0]) | |
| t.Error(actions[2]) |
| t.Error(actions[1]) | ||
| } | ||
| if !actions[2].Matches("create", "secrets") { | ||
| t.Error(actions[0]) |
There was a problem hiding this comment.
There are few more of these:
| t.Error(actions[0]) | |
| t.Error(actions[2]) |
| _, _, err := resourceapply.ApplySecret(ctx, c.Client, c.EventRecorder, signingCertKeyPairSecret) | ||
| // apply necessary metadata (possibly via delete+recreate) if secret exists | ||
| // this is done before content update to prevent unexpected rollouts | ||
| if ensureMetadataUpdate(signingCertKeyPairSecret, c.Owner, c.AdditionalAnnotations) && ensureSecretTLSTypeSet(signingCertKeyPairSecret) { |
There was a problem hiding this comment.
Why is this && and not ||?
There was a problem hiding this comment.
We want to update labels before content changes only when:
- secret type won't change (
ensureSecretTLSTypeSetreturns true) - metadata (annotations, ownership) needs updating (
ensureMetadataUpdatereturns true)
The function are also mutating signingCertKeyPairSecret, changing it type and clearing data, so the secret would be deleted + created during if needed, reason := needNewSigningCertKeyPair clause. It needs to be done once to avoid updating it twice for metadata and content ("update no annotations" and "update SecretTLSType secrets" unit test cases)
There was a problem hiding this comment.
You'll want to make sure your secret always passes both functions, right?
Either operator might short-circuit in a different scenario.
There was a problem hiding this comment.
You'll want to make sure your secret always passes both functions, right?
correct, if either doesn't pass possible updates will be applied by ApplySecret in needNewSigningCertKeyPair
There was a problem hiding this comment.
so the secret would be deleted + created during if needed, reason := needNewSigningCertKeyPair clause
Thanks, this is the part that I was missing.
There was a problem hiding this comment.
This would stuck on invalid secret type with valid metadata, #1687 fixes this case
4bde05d to
dfce6ae
Compare
dfce6ae to
c1b3d39
Compare
|
@vrutkovs: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: benluddy, stlaz, vrutkovs The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@vrutkovs: Jira Issue OCPBUGS-30119: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-30119 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherry-pick release-4.14 |
|
@sdodson: #1681 failed to apply on top of branch "release-4.14": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Fix included in accepted release 4.16.0-0.nightly-2024-03-06-073110 |
Pre-4.7 clusters had
SecretTypeTLStype set, so on 4.15 upgrade these clusters had the secret delete and recreated. This commit will ensure these are being created without cert and key being regenerated.Fixes two issues wrt handling secrets:
kubernetes.io/tls. If it cannot be done (i.e.tls.crtortls.keyis missing) it would be regenerated withSignerUpdateRequiredevent