OCPBUGS-30119: certrotation: update all secret types to kubernetes.io/tls preserving existing content #1681

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:master from vrutkovs:signer-rotate-preserve-type
Mar 5, 2024

Conversation

@vrutkovs
Contributor

@vrutkovs vrutkovs commented Mar 1, 2024

Pre-4.7 clusters had the SecretTypeTLS type set, so on upgrade to 4.15 these clusters had the secret deleted and recreated. This commit ensures these secrets are recreated without the cert and key being regenerated.

Fixes two issues with secret handling:

  • the existing secret type is converted to kubernetes.io/tls. If that cannot be done (i.e. tls.crt or tls.key is missing), the secret is regenerated with a SignerUpdateRequired event
  • if the type can be converted, existing data is preserved, so born-before-4.7 clusters look identical to 4.15 clusters with regard to secret type
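The two bullets above can be sketched in a few lines of Go. This is a minimal illustration, not the code from this PR: the `secret` struct and the `convertToTLS` helper are hypothetical stand-ins, and the constants only mirror the values of `corev1.SecretTypeTLS`, `corev1.TLSCertKey` and `corev1.TLSPrivateKeyKey` so that the sketch has no dependencies. Treating an empty value the same as a missing key is an assumption of this sketch.

```go
package main

import "fmt"

// Local mirrors of the corev1 values, so the sketch needs no dependencies.
const (
	secretTypeTLS = "kubernetes.io/tls" // value of corev1.SecretTypeTLS
	tlsCertKey    = "tls.crt"           // value of corev1.TLSCertKey
	tlsKeyKey     = "tls.key"           // value of corev1.TLSPrivateKeyKey
)

// secret is a hypothetical, trimmed-down stand-in for corev1.Secret.
type secret struct {
	Type string
	Data map[string][]byte
}

// convertToTLS is a hypothetical helper. It sets the type to kubernetes.io/tls
// and reports whether the existing data could be kept. When tls.crt or tls.key
// is missing or empty, the data is cleared so the caller regenerates the pair.
func convertToTLS(s *secret) (preserved bool) {
	if s.Type == secretTypeTLS {
		return true // nothing to convert
	}
	s.Type = secretTypeTLS
	if len(s.Data[tlsCertKey]) == 0 || len(s.Data[tlsKeyKey]) == 0 {
		s.Data = map[string][]byte{}
		return false
	}
	return true
}

func main() {
	// A pre-4.7 style secret with usable content keeps its data.
	old := &secret{Type: "SecretTypeTLS", Data: map[string][]byte{
		tlsCertKey: []byte("cert"), tlsKeyKey: []byte("key"),
	}}
	fmt.Println(convertToTLS(old), old.Type, len(old.Data))

	// A secret without tls.crt/tls.key is converted but its data is dropped.
	broken := &secret{Type: "Opaque", Data: map[string][]byte{"foo": []byte("bar")}}
	fmt.Println(convertToTLS(broken), broken.Type, len(broken.Data))
}
```

In the first case the caller can update the secret in place; in the second it would have to generate a new cert and key.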

…ng existing content

Pre-4.7 clusters had the `SecretTypeTLS` type set, so on upgrade to 4.15 these clusters had the secret deleted and recreated. This commit ensures these secrets are recreated without the cert and key being regenerated
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 1, 2024
@openshift-ci-robot

@vrutkovs: This pull request references Jira Issue OCPBUGS-30119, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci Bot requested review from hexfusion, stlaz and wangke19 March 1, 2024 11:28
@vrutkovs
Contributor Author

vrutkovs commented Mar 1, 2024

/retest

@vrutkovs vrutkovs force-pushed the signer-rotate-preserve-type branch from c296833 to 84116c5 Compare March 1, 2024 11:59
Contributor

@benluddy benluddy left a comment

I'm still concerned about the recreate path of ApplySecretImproved. If control flow is interrupted between the delete and create (e.g. due to process crash), or the context being used for the client requests is canceled, what happens to the cert data? It seems to me like we would lose it in that case. Even if unlikely, the consequences could be painful. Could we temporarily stage the data in a second secret before deleting the target secret, or something similar?

Comment thread pkg/operator/certrotation/metadata.go Outdated
Comment on lines +18 to +19
secret.Type = corev1.SecretTypeTLS
needsMetadataUpdate = true
Contributor

Would it be worthwhile to check that the secret we're converting has tls.crt and tls.key before updating the type, since updating the type makes validation stricter (https://github.com/kubernetes/kubernetes/blob/d51e2da869350b2b20a33d060e17bd1d2165e02d/pkg/apis/core/validation/validation.go#L6392-L6398), or will it be enough to see UpdateSecretFailed event spam?

Contributor Author

Right, we'd get an UpdateSecretFailed event and have it re-created. If the existing secret doesn't have tls.crt and tls.key then its contents are unusable and it needs to be recreated

Contributor

Recreated by whom? We don't react to events but the secret content, right?

Contributor Author

Recreated by ApplySecret, which attempts apply first, but if it fails it does delete+create. If create with kubernetes.io/tls and existing content fails, it would throw an error, delete the content and generate a new cert.

Contributor Author

it would throw an error, delete the content and generate a new cert

It would throw the error, but probably would loop

Contributor Author

Updated ensureMetadataUpdate to update secret type (removing data if tls.crt/tls.key are missing) in this case, added unit tests

@vrutkovs
Contributor Author

vrutkovs commented Mar 1, 2024

the context being used for the client requests is canceled, what happens to the cert data? It seems to me like we would lose it in that case

Right, ApplySecretImproved is not atomic - imo we should first create a copy, then do delete + recreate. I'd however solve this separately, as it may affect a lot more cases - and doesn't need to be merged urgently. Tracking this in https://issues.redhat.com/browse/API-1716
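The "create a copy first, then delete + recreate" idea could look roughly like the sketch below. It is only an illustration under loud assumptions: `store` is a hypothetical in-memory stand-in for the Secrets API, `recreateWithBackup` and the `-staging` suffix are invented for this example, and a failed create stands in for an interruption (a real process crash would additionally need recovery from the staging copy on the next sync).

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical in-memory stand-in for the Secrets API.
type store map[string]map[string][]byte

// errInterrupted simulates a failure between delete and create.
var errInterrupted = errors.New("interrupted")

// recreateWithBackup copies the data to a staging entry, then deletes and
// recreates the target. The staging entry is removed only after the create
// succeeds, so a failed create leaves the data recoverable.
func recreateWithBackup(s store, name string, create func(store, string, map[string][]byte) error) error {
	data, ok := s[name]
	if !ok {
		return fmt.Errorf("secret %q not found", name)
	}
	backup := name + "-staging" // hypothetical naming scheme
	s[backup] = data
	delete(s, name)
	if err := create(s, name, data); err != nil {
		return fmt.Errorf("recreate %q failed, data kept in %q: %w", name, backup, err)
	}
	delete(s, backup)
	return nil
}

// okCreate stores the secret; failingCreate simulates the interruption.
func okCreate(s store, name string, data map[string][]byte) error {
	s[name] = data
	return nil
}

func failingCreate(store, string, map[string][]byte) error { return errInterrupted }

func main() {
	s := store{"signer": {"tls.crt": []byte("cert")}}
	err := recreateWithBackup(s, "signer", failingCreate)
	_, staged := s["signer-staging"]
	fmt.Println("failed:", err != nil, "staging copy kept:", staged)

	s = store{"signer": {"tls.crt": []byte("cert")}}
	err = recreateWithBackup(s, "signer", okCreate)
	_, staged = s["signer-staging"]
	fmt.Println("succeeded:", err == nil, "staging copy kept:", staged)
}
```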

@benluddy
Contributor

benluddy commented Mar 1, 2024

the context being used for the client requests is canceled, what happens to the cert data? It seems to me like we would lose it in that case

Right, ApplySecretImproved is not atomic - imo we should first create a copy, then do delete + recreate. I'd however solve this separately, as it may affect a lot more cases - and doesn't need to be merged urgently

Is it safe to solve separately? IIUC, merging this PR to migrate secret types will present a lot of chances to force an unnecessary rotation on older clusters.

@vrutkovs
Contributor Author

vrutkovs commented Mar 1, 2024

We already do "migration" in 4.15 with content replace anyway, so this PR helps in happy path. Corner case - cert sync is interrupted right during the migration - can be safely fixed outside of this PR

@benluddy
Contributor

benluddy commented Mar 1, 2024

"migration"

😂 Oh, of course. Thanks!

@benluddy
Contributor

benluddy commented Mar 1, 2024

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 1, 2024
@openshift-ci-robot

@vrutkovs: This pull request references Jira Issue OCPBUGS-30119, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

// apply necessary metadata (possibly via delete+recreate) if secret exists
// this is done before content update to prevent unexpected rollouts
if ensureMetadataUpdate(signingCertKeyPairSecret, c.Owner, c.AdditionalAnnotations) && len(signingCertKeyPairSecret.ResourceVersion) > 0 {
actualSigningCertKeyPairSecret, _, err := resourceapply.ApplySecret(ctx, c.Client, c.EventRecorder, signingCertKeyPairSecret)
Contributor

nit: the following seems like a slightly cleaner way to achieve the same

var err error
signingCertKeyPairSecret, _, err = resourceapply.ApplySecret(ctx, c.Client, c.EventRecorder, signingCertKeyPairSecret)

Contributor Author

This follows the pattern used in `if needed, reason := needNewSigningCertKeyPair`

@vrutkovs vrutkovs force-pushed the signer-rotate-preserve-type branch from 84116c5 to 1ca53d1 Compare March 4, 2024 13:07
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 4, 2024
@vrutkovs vrutkovs force-pushed the signer-rotate-preserve-type branch 3 times, most recently from b20d854 to dc9a1e5 Compare March 4, 2024 16:00
Comment thread pkg/operator/certrotation/metadata.go Outdated
Comment on lines +10 to +13
// Existing secret not found - no need to update metadata (will be done by needNewSigningCertKeyPair / NeedNewTargetCertKeyPair)
if len(secret.ResourceVersion) == 0 {
return false
}
Contributor

this probably belongs to the other function, otherwise nothing sets the ownerRef - or does it?

Contributor Author

Yeah, good point, this also shows different function flows

Contributor Author

Fixed

@vrutkovs vrutkovs force-pushed the signer-rotate-preserve-type branch from dc9a1e5 to ed06b9e Compare March 4, 2024 16:18
@stlaz
Contributor

stlaz commented Mar 4, 2024

/approve

@sdodson sdodson added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 4, 2024
@sdodson
Member

sdodson commented Mar 4, 2024

Adding approved label based on #1681 (comment)

}

// convert outdated secret type (created by pre 4.7 installer)
if secret.Type != corev1.SecretTypeTLS {
Contributor

It should behave the same in practice, but since we're correcting for a specific mistake, how about:

Suggested change
- if secret.Type != corev1.SecretTypeTLS {
+ if secret.Type == "SecretTypeTLS" {

And as an aside, can we have an issue to track removing this after N releases?

Contributor Author

@vrutkovs vrutkovs Mar 5, 2024

Confusingly, SecretTypeTLS is kubernetes.io/tls and not "SecretTypeTLS". So if secret.Type != corev1.SecretTypeTLS would make sure any non-kubernetes.io/tls secret is converted to kubernetes.io/tls (clearing data if necessary)
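The name/value mismatch is easy to see in a small sketch. The `secretType` type and the constant below only mirror `corev1.SecretType` and `corev1.SecretTypeTLS` locally, and `needsConversion` is a hypothetical predicate with the same shape as the check in the diff.

```go
package main

import "fmt"

// secretType mirrors corev1.SecretType, which is a string type.
type secretType string

// SecretTypeTLS mirrors corev1.SecretTypeTLS: the Go identifier is
// SecretTypeTLS, but the value stored in the API object is "kubernetes.io/tls".
const SecretTypeTLS secretType = "kubernetes.io/tls"

// needsConversion is a hypothetical predicate shaped like the check in the diff.
func needsConversion(t secretType) bool { return t != SecretTypeTLS }

func main() {
	fmt.Println(needsConversion("SecretTypeTLS"))     // literal type string on pre-4.7 secrets
	fmt.Println(needsConversion("kubernetes.io/tls")) // already the expected type
	fmt.Println(needsConversion("Opaque"))            // any other type is converted too
}
```

With `!=` every type other than kubernetes.io/tls is converted, whereas the suggested `== "SecretTypeTLS"` comparison would only match the literal string found on pre-4.7 secrets.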

Contributor

Right, I understand, but my point is that we expect any secret managed by this controller will either have type kubernetes.io/tls or, if it was created pre-4.7, SecretTypeTLS. If we discover that there is some instance of a managed secret with a completely unexpected type, I would rather understand why our expectation was wrong than stomp it.

Contributor Author

It's perfectly possible for users to find a way to replace these certificates behind the cert-syncer's back with invalid contents. This code ensures that we attempt to convert cases like pre-4.7 if possible, but also prioritizes availability: invalid content would be stomped so the cluster continues functioning. Why the bad secret was stomped can be found out via audit logs (or etcd contents).
This change also protects us from other similar cases like "old/misconfigured installer has generated certificates which we no longer expect"

Comment on lines +54 to +56
if len(actual.Annotations) == 0 {
t.Errorf("expected certificates to be annotated")
}
Contributor

This looks redundant, any value for annotations that would fail here is also going to fail below.

Contributor Author

Yes, but I'd prefer a clear error message

Contributor

Fine with me given that nothing out of the control of this test should be touching annotations.

if len(actual.Annotations) == 0 {
t.Errorf("expected certificates to be annotated")
}
ownershipValue, found := actual.Annotations[annotations.OpenShiftComponent]
Contributor

I prefer to avoid sharing values like this between tests and implementation, since it creates the opportunity to accidentally make a breaking change without seeing a test failure.

Feel free to ignore if you disagree or are comfortable with it in this instance because it is declared in openshift/api.

Contributor Author

it creates the opportunity to accidentally make a breaking change without seeing a test failure

We explicitly set AdditionalAnnotations in https://github.com/openshift/library-go/pull/1681/files#diff-d7a4e30c6635930d8721bf060fe2803587326298e0eace9bada256f1b7377459R228-R230

Contributor

That link didn't work for me. In this case I'm specifically referring to a change to the constant OpenShiftComponent, which would be a breaking change but would not cause this test to fail.

Contributor Author

Ah, hmm, JiraComponent and OpenshiftComponent are linked in https://github.com/openshift/library-go/blob/master/pkg/operator/certrotation/annotations.go#L27-L30, not sure how to make sure unit tests don't depend on it though

t.Error(actions[1])
}
if !actions[2].Matches("create", "secrets") {
t.Error(actions[0])
Contributor

Suggested change
- t.Error(actions[0])
+ t.Error(actions[2])

Contributor Author

Fixed

t.Error(actions[1])
}
if !actions[2].Matches("create", "secrets") {
t.Error(actions[0])
Contributor

Suggested change
- t.Error(actions[0])
+ t.Error(actions[2])

t.Error(actions[1])
}
if !actions[2].Matches("create", "secrets") {
t.Error(actions[0])
Contributor

There are a few more of these:

Suggested change
- t.Error(actions[0])
+ t.Error(actions[2])

_, _, err := resourceapply.ApplySecret(ctx, c.Client, c.EventRecorder, signingCertKeyPairSecret)
// apply necessary metadata (possibly via delete+recreate) if secret exists
// this is done before content update to prevent unexpected rollouts
if ensureMetadataUpdate(signingCertKeyPairSecret, c.Owner, c.AdditionalAnnotations) && ensureSecretTLSTypeSet(signingCertKeyPairSecret) {
Contributor

Why is this && and not ||?

Contributor Author

We want to update labels before content changes only when:

  • secret type won't change (ensureSecretTLSTypeSet returns true)
  • metadata (annotations, ownership) needs updating (ensureMetadataUpdate returns true)

The functions also mutate signingCertKeyPairSecret, changing its type and clearing data, so the secret would be deleted + created during the if needed, reason := needNewSigningCertKeyPair clause. This needs to happen once, to avoid updating the secret twice for metadata and content (see the "update no annotations" and "update SecretTLSType secrets" unit test cases)
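Because both functions mutate the secret, the choice of operator also decides which mutations actually run. The sketch below shows only the general Go short-circuit behaviour and does not reproduce the return-value semantics of the PR's functions: `ensureAnnotation`, `ensureType`, the `owner` annotation key and the trimmed-down `secret` struct are all hypothetical, with each function returning true when it changed something.

```go
package main

import "fmt"

// secret is a hypothetical, trimmed-down stand-in for corev1.Secret.
type secret struct {
	Type        string
	Annotations map[string]string
}

// ensureAnnotation and ensureType are hypothetical mutating checks: each one
// fixes up the secret and returns true if it changed something.
func ensureAnnotation(s *secret) bool {
	if s.Annotations["owner"] != "" {
		return false
	}
	if s.Annotations == nil {
		s.Annotations = map[string]string{}
	}
	s.Annotations["owner"] = "example"
	return true
}

func ensureType(s *secret) bool {
	if s.Type == "kubernetes.io/tls" {
		return false
	}
	s.Type = "kubernetes.io/tls"
	return true
}

// applyShortCircuit skips ensureType whenever ensureAnnotation returns false.
func applyShortCircuit(s *secret) bool { return ensureAnnotation(s) && ensureType(s) }

// applyBoth always runs both mutations and reports whether either changed anything.
func applyBoth(s *secret) bool {
	a := ensureAnnotation(s)
	b := ensureType(s)
	return a || b
}

func main() {
	s := &secret{Type: "SecretTypeTLS", Annotations: map[string]string{"owner": "x"}}
	fmt.Println(applyShortCircuit(s), s.Type) // the type is left untouched
	fmt.Println(applyBoth(s), s.Type)         // the type is converted
}
```

With `||` the mirror-image case applies: the second function is skipped whenever the first returns true, which is why evaluating both into variables first is the safer shape when both mutations must always run.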

Contributor

You'll want to make sure your secret always passes both functions, right?

Either operator might short-circuit in a different scenario.

Contributor Author

You'll want to make sure your secret always passes both functions, right?

correct, if either doesn't pass possible updates will be applied by ApplySecret in needNewSigningCertKeyPair

Contributor

so the secret would be deleted + created during if needed, reason := needNewSigningCertKeyPair clause

Thanks, this is the part that I was missing.

Contributor Author

This would get stuck on an invalid secret type with valid metadata; #1687 fixes this case

@vrutkovs vrutkovs force-pushed the signer-rotate-preserve-type branch 2 times, most recently from 4bde05d to dfce6ae Compare March 5, 2024 10:56
@vrutkovs vrutkovs force-pushed the signer-rotate-preserve-type branch from dfce6ae to c1b3d39 Compare March 5, 2024 11:34
@openshift-ci
Contributor

openshift-ci Bot commented Mar 5, 2024

@vrutkovs: all tests passed!

@benluddy
Contributor

benluddy commented Mar 5, 2024

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2024
@openshift-ci
Contributor

openshift-ci Bot commented Mar 5, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, stlaz, vrutkovs

@openshift-merge-bot openshift-merge-bot Bot merged commit 18ee827 into openshift:master Mar 5, 2024
@openshift-ci-robot

@vrutkovs: Jira Issue OCPBUGS-30119: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-30119 has been moved to the MODIFIED state.

@sdodson
Member

sdodson commented Mar 5, 2024

/cherry-pick release-4.14

@openshift-cherrypick-robot

@sdodson: #1681 failed to apply on top of branch "release-4.14":

Applying: certrotation: update all secret types to `kubernetes.io/tls` preserving existing content
Using index info to reconstruct a base tree...
M	pkg/operator/certrotation/signer.go
M	pkg/operator/certrotation/target.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/operator/certrotation/target.go
CONFLICT (content): Merge conflict in pkg/operator/certrotation/target.go
Auto-merging pkg/operator/certrotation/signer.go
CONFLICT (content): Merge conflict in pkg/operator/certrotation/signer.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 certrotation: update all secret types to `kubernetes.io/tls` preserving existing content
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-03-06-073110

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

7 participants