API-1802: cert-rotation: allow specifying multiple target certs in CertRotationController #1722

Open
wants to merge 3 commits into
base: master
from

Conversation

vrutkovs
Member

@vrutkovs vrutkovs commented Apr 19, 2024

Instead of defining several controllers managing the same signer/CA bundle pair and different target certs the same controller can accept a list of target certs to create.

Tested with

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 19, 2024
@openshift-ci openshift-ci bot requested review from hexfusion and stlaz April 19, 2024 13:24
@vrutkovs vrutkovs force-pushed the cert-rotation-multiple-targets branch 2 times, most recently from 9130e0c to d10d787 Compare April 22, 2024 12:17
@vrutkovs vrutkovs changed the title WIP cert-rotation: allow specifying multiple target certs in CertRotationController cert-rotation: allow specifying multiple target certs in CertRotationController Apr 22, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2024
@vrutkovs vrutkovs changed the title cert-rotation: allow specifying multiple target certs in CertRotationController API-1802: cert-rotation: allow specifying multiple target certs in CertRotationController Apr 24, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 24, 2024
@openshift-ci-robot

openshift-ci-robot commented Apr 24, 2024

@vrutkovs: This pull request references API-1802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.

In response to this:

Instead of defining several controllers managing the same signer/CA bundle pair and different target certs the same controller can accept a list of target certs to create.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@vrutkovs
Member Author

vrutkovs commented May 9, 2024

/cc @tkashem @p0lyn0mial

@openshift-ci openshift-ci bot requested review from p0lyn0mial and tkashem May 9, 2024 06:44
Member

@dinhxuanvu dinhxuanvu left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 20, 2024
Contributor

@tkashem tkashem left a comment

a) RotatedSigningCASecret creates the signer CA secret
b) CABundleConfigMap creates a configmap, and
c) RotatedSelfSignedCertKeySecret creates a secret

I like the idea of a controller doing one thing, can we explore the idea of the individual controllers?
a) SignerCAController: this controller manages the signer secret object.
b) CABundleController: it watches the secret object from a) and creates and manages a single configmap.
c) CertKeySecretController: this watches the objects from a) and b), and creates and manages a secret with the cert/key.

With this, we can have N instances of CertKeySecretController, where each instance derives its cert/key from a single instance of a) and b).

	WithPostStartHooks(
		c.targetCertRecheckerPostRunHook,
	).
	ToController("CertRotationController", recorder.WithComponentSuffix("cert-rotation-controller").WithComponentSuffix(name))
Contributor

MultipleTargetCertRotationController, so we have two distinct names?

Member Author

Good idea

}
}(ch)
}

Contributor

I would like to avoid making any runtime behavioral changes to NewCertRotationController if possible, we have the following options:
a) completely separate implementations: CertRotationController for NewCertRotationController, and MultipleTargetCertRotationController for NewCertRotationControllerMultipleTargets

b) abstract out the targetCertRecheckerPostRunHook implementations: singleTargetCertRecheckerPostRunHook and multiTargetCertRecheckerPostRunHook. This way we can reuse CertRotationController for both single and multiple. NewCertRotationControllerMultipleTargets will use multiTargetCertRecheckerPostRunHook.

c) can we have a single channel <-chan time.Time to be shared by multiple instances of CertCreator (for example ServingRotation), then the logic inside targetCertRecheckerPostRunHook does not need to change at all.

I prefer c, if doable

Member Author

Reworked this to make it look like b). Added a test which verifies goroutines don't leak

I don't quite understand what c) is meant for - make controller accept a single channel reused across all CertCreators? Not sure what's the benefit of that

targetRefresh := refresher.RecheckChannel()
aggregateTargetRefresher := make(chan struct{})
for _, ch := range targetRefreshers {
	go func(c <-chan struct{}) {
Contributor

go func does not have the same guarantee as go wait.Until(func() {}, time.Minute, ctx.Done())

for _, ch := range targetRefreshers {
	go func(c <-chan struct{}) {
		for msg := range c {
			aggregateTargetRefresher <- msg
Contributor

Goroutine leak: these goroutines never select on <-ctx.Done(). Could that be a problem for integration tests that check for goroutine leaks?

Member Author

ctx.Done would close them iiuc, but yeah, worth adding a unit test which uses runhook and ensures no goroutines are leaking

@vrutkovs
Member Author

vrutkovs commented Jun 25, 2024

can we explore the idea of the individual controllers?

That's possible, but the goal is to create target certs; signers and CA bundles are merely prerequisites for that. We could have separate signer/CA controllers, but we might end up with signer certs that don't produce any target certs, or CA bundles without any new signers, etc.

@vrutkovs vrutkovs force-pushed the cert-rotation-multiple-targets branch from d10d787 to a528215 Compare June 26, 2024 15:06
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2024
Contributor

openshift-ci bot commented Jun 26, 2024

New changes are detected. LGTM label has been removed.

@vrutkovs vrutkovs force-pushed the cert-rotation-multiple-targets branch from a528215 to d3c0949 Compare June 26, 2024 15:07

// Ensure both target certs have been called exactly three times
// initial sync and two hook calls for target certs
// TODO[vrutkovs]: informers make unpredictable number of calls
Member Author

Not sure how to tackle that - or how to make sure two hook syncs were included in regular informer sync. informerFactory and NewCertRotationControllerMultipleTargets promise to sync every minute, but it happens much more often

@openshift-ci-robot

openshift-ci-robot commented Jun 27, 2024

@vrutkovs: This pull request references API-1802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

Instead of defining several controllers managing the same signer/CA bundle pair and different target certs the same controller can accept a list of target certs to create.

Tested with

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Jun 27, 2024

@vrutkovs: This pull request references API-1802 which is a valid jira issue.

In response to this:

Instead of defining several controllers managing the same signer/CA bundle pair and different target certs the same controller can accept a list of target certs to create.

Tested with

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@p0lyn0mial
Contributor

An alternative design would be to have a controller per certificate – this would actually preserve the current behaviour. I prefer having a single controller per certificate as it is easier to debug, report status, retry on error, and reason about. Thoughts?

Our issue is that both the RotatedSigningCASecret and CABundleConfigMap are reconciled by multiple controllers (NewCertRotationController) without any coordination.

I think the least invasive change would be to make both RotatedSigningCASecret and CABundleConfigMap thread-safe. Is there an easy way to achieve this?

Another idea that comes to mind (I think Abu suggested the same) is to turn both RotatedSigningCASecret and CABundleConfigMap into controllers and slightly modify NewCertRotationController to read secrets instead. Thoughts?

@vrutkovs
Member Author

An alternative design would be to have a controller per certificate – this would actually preserve the current behaviour. I prefer having a single controller per certificate as it is easier to debug, report status, retry on error, and reason about. Thoughts?

Similar to Abu's idea in #1722 (review)? That would prevent races, but it would complicate the sequence of events needed for a proper rollout (three different controllers would need to be properly synced so that the CA bundle is updated before the target cert, etc.). It may also lead to "orphan" controllers managing a CA bundle without a target cert.

Our issue is that both the RotatedSigningCASecret and CABundleConfigMap are reconciled by multiple controllers (NewCertRotationController) without any coordination.

Yes, this issue still potentially remains. The PR focuses on solving a much more widespread issue of multiple target certs

I think the least invasive change would be to make both RotatedSigningCASecret and CABundleConfigMap thread-safe. Is there an easy way to achieve this?

I don't know if it's feasible, as it's not just thread-safety but also process-safety we're concerned about

@p0lyn0mial
Contributor

would make sequence of events to do a proper rollout complicated (three different controllers would need to be properly synced so that CA bundle would be updated before target cert etc.)

Don't we do it all the time?

The aggregator controller waits until a service is created before it wires an HTTP handler.
The degraded webhook controller waits until a webhook is created before it can validate it.

What I like about these and other controllers is that when you look inside, you will see that these controllers are reconciling a single resource. They are simply reading their prerequisites and reacting to any changes before reconciling.

In our case, it would boil down to three separate controllers that reconcile their resources, where the second and the third controller read the crypto material from the lister before reconciling.

For example: NewCertRotationController would:

Read signerCA from the lister 
Parse the signerCA 
Stop on any error 

Read the caBundle from the lister 
Parse the caBundle 
Stop on any error 

Reconcile clientCertificate with signerCA and caBundle

When there are issues with the signerCA you go to RotatedSigningCASecret to debug it since it is responsible for generating the resource. Thoughts ?
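
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `lister`, `parse`, and `reconcileTargetCert` are hypothetical stand-ins, not library-go types, and the "parsing" merely rejects empty input instead of decoding real PEM/x509 data.

```go
package main

import (
	"errors"
	"fmt"
)

// lister is a hypothetical read-only cache keyed by object name.
type lister map[string][]byte

func (l lister) get(name string) ([]byte, error) {
	v, ok := l[name]
	if !ok {
		return nil, fmt.Errorf("%q not found in cache", name)
	}
	return v, nil
}

// parse is a placeholder for real certificate parsing; it only rejects
// empty input so the control flow can be demonstrated.
func parse(raw []byte) (string, error) {
	if len(raw) == 0 {
		return "", errors.New("empty crypto material")
	}
	return string(raw), nil
}

// reconcileTargetCert reads and parses its prerequisites from the cache and
// stops on the first error. It never creates or updates the signer or the
// bundle; those are assumed to be owned by other controllers.
func reconcileTargetCert(l lister, signerName, bundleName string) (string, error) {
	rawSigner, err := l.get(signerName)
	if err != nil {
		return "", fmt.Errorf("reading signer: %w", err)
	}
	signer, err := parse(rawSigner)
	if err != nil {
		return "", fmt.Errorf("parsing signer: %w", err)
	}

	rawBundle, err := l.get(bundleName)
	if err != nil {
		return "", fmt.Errorf("reading CA bundle: %w", err)
	}
	bundle, err := parse(rawBundle)
	if err != nil {
		return "", fmt.Errorf("parsing CA bundle: %w", err)
	}

	// Placeholder for issuing the target certificate.
	return fmt.Sprintf("cert signed by %s, trusted via %s", signer, bundle), nil
}

func main() {
	cache := lister{"signer": []byte("signerCA")}

	// The bundle is not in the cache yet, so reconciliation stops early.
	_, err := reconcileTargetCert(cache, "signer", "bundle")
	fmt.Println("error:", err)

	cache["bundle"] = []byte("caBundle")
	cert, err := reconcileTargetCert(cache, "signer", "bundle")
	fmt.Println(cert, err)
}
```

In a real controller the early return would simply end the sync, and the next change to the signer or bundle in the cache would trigger another attempt.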

@p0lyn0mial
Contributor

I don't quite get why you need to multi-thread those with goroutines or even need multiple control loops

It is not about multi-threading for performance reasons. It is about having a single control loop per resource. I think this is already a well-established pattern upstream. I think that having a single controller that manages a resource is easy to understand and debug.

Sure, this PR is just an interim fix for us to make it until the 4.17 feature freeze. Once we establish one way of process and thread safety - and have e2e tests passing - we can experiment with more substantial code rework

This is a crucial piece of code, and I wouldn't rush it. Besides, we still need to fix kubelet, client-go, and tons of other things before the platform will be able to recover itself from expired certificates. Thus, I don't see the point in developing temporary solutions. We are not dealing with an escalation that requires an immediate fix. I would rather implement a proper fix or not fix it at all.

@tjungblu
Contributor

tjungblu commented Jul 9, 2024

This is a crucial piece of code, and I wouldn't rush it.

so you propose to rewrite the entire codebase to fit some upstream pattern? :) sounds great 👍

@p0lyn0mial
Contributor

so you propose to rewrite the entire codebase to fit some upstream pattern? :) sounds great 👍

I'm proposing wrapping CABundleConfigMap and RotatedSigningCASecret into separate controllers (the controllers would simply call the Ensure... methods) and writing a new controller for CertRotation.

The new CertRotation wouldn't differ much from the existing one. The only difference would be reading the signer and the CA from the cache/lister. We could also consider removing the side chan for the hostname for simplicity - just as you did for your controller.

Then, when we need to compose these controllers, we would only have a single instance of CABundleConfigMap and RotatedSigningCASecret and many instances of CertRotation.

Does it make sense to you as well?

@vrutkovs
Member Author

vrutkovs commented Jul 9, 2024

I'm proposing wrapping

This is what we agreed to a few weeks back and no one is debating that choice. The immediate question is why all of this is being discussed in an unrelated PR with a temporary fix for 4.17 - and why it has been stalled for several weeks already

@vrutkovs
Member Author

vrutkovs commented Jul 9, 2024

This is a crucial piece of code, and I wouldn't rush it. Besides, we still need to fix kubelet, client-go, and tons of other things before the platform will be able to recover itself from expired certificates. Thus, I don't see the point in developing temporary solutions. We are not dealing with an escalation that requires an immediate fix. I would rather implement a proper fix or not fix it at all.

We kind of have to. If the rework is significant and 4.18 branches, we won't be able to backport it.

Besides, we still need to fix kubelet, client-go, and tons of other things before the platform will be able to recover itself from expired certificates

For indefinite suspend period - yes. For 90 days / 1 year on SNO - no, not really, we already recover with approved manual steps.

We are not dealing with an escalation that requires an immediate fix

We do: this feature (limited to 90 days etc.) is on the 4.17 plan

@vrutkovs vrutkovs force-pushed the cert-rotation-multiple-targets branch from d3c0949 to 126e202 Compare July 11, 2024 15:19
Contributor

openshift-ci bot commented Jul 11, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dinhxuanvu, vrutkovs
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vrutkovs
Member Author

/test unit

@vrutkovs
Member Author

/cc @soltysh @deads2k

@openshift-ci openshift-ci bot requested review from deads2k and soltysh July 23, 2024 10:22
RotatedSelfSignedCertKeySecret RotatedSelfSignedCertKeySecret

// RotatedTargetSecrets contains a list of key and cert signed by a signing CA to rotate.
RotatedTargetSecrets []RotatedSelfSignedCertKeySecret
Contributor

This will change the external API and clients would need to update if they're accessing this field directly.

Since you're already adding a new "Create" function for the multi rotation controller, would it make more sense to have two cert controllers? One for single rotation and one for multi.

Member Author

@vrutkovs vrutkovs Jul 23, 2024

This will change the external API and clients would need to update if they're accessing this field directly

Correct, this would potentially require users to update their code. So far we have identified that only cluster-kube-apiserver-operator and cluster-kube-controller-manager-operator use multi-target controllers (and they don't need to access RotatedTargetSecrets directly)

would it make more sense to have two cert controllers? One for single rotation and one for multi.

It may be beneficial later to move common parts into functions and reuse them, however at this point only two different New... functions are used externally

vrutkovs added 3 commits July 29, 2024 12:58
…Controller

Instead of defining several controllers managing the same signer/CA
bundle pair and different target certs the same controller can accept
a list of target certs to create.
Collect a list of secrets/configmaps tracked in controllers to
make sure any secret/configmap is not being declared twice
to avoid races when updating it
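
As a rough illustration of the tracking idea in that last commit message (not the code from this PR), a registry can reject a second declaration of the same secret or configmap. All names here (`resourceKey`, `registry`, `claim`) are hypothetical.

```go
package main

import "fmt"

// resourceKey identifies a secret or configmap; the fields are illustrative.
type resourceKey struct{ kind, namespace, name string }

// registry remembers which controller declared each tracked resource.
type registry struct{ owners map[resourceKey]string }

func newRegistry() *registry {
	return &registry{owners: map[resourceKey]string{}}
}

// claim records that controller manages key and fails if the resource has
// already been declared, which is what would otherwise lead to two writers.
func (r *registry) claim(controller string, key resourceKey) error {
	if owner, taken := r.owners[key]; taken {
		return fmt.Errorf("%s %s/%s is already managed by %q",
			key.kind, key.namespace, key.name, owner)
	}
	r.owners[key] = controller
	return nil
}

func main() {
	r := newRegistry()
	signer := resourceKey{kind: "Secret", namespace: "example-ns", name: "signer"}

	fmt.Println(r.claim("controller-a", signer)) // first declaration is accepted
	fmt.Println(r.claim("controller-b", signer)) // second declaration is rejected
}
```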
@vrutkovs vrutkovs force-pushed the cert-rotation-multiple-targets branch from 6a2734e to 26bcb48 Compare July 29, 2024 11:05
@deads2k
Contributor

deads2k commented Jul 31, 2024

Is the goal of this change to avoid races between multiple updating controllers? If so, this solution appears to miss the critical change required to resolve the cross-binary active-active problem. The core mistake is using the .Apply functions that smash on conflicts. Changing this code back to the original .Update will cause conflicts and force retries. The retry would fail until the cache updates and only one operator will win the update.

@vrutkovs
Member Author

Is the goal of this change to avoid races between multiple updating controllers?

This prevents races within one controller (thread-level). Component PRs like this ensure that multiple controllers don't run simultaneously.

The core mistake is using the .Apply functions that smash on conflicts

We don't use Apply in applyConfigMap/applySecret - counterintuitively these use Update:

actual, err := client.ConfigMaps(required.Namespace).Update(ctx, existingCopy, metav1.UpdateOptions{})

@deads2k
Contributor

deads2k commented Jul 31, 2024

We don't use Apply in applyConfigMap/applySecret - counterintuitively these use Update:

Because applySecret reads the live secret and uses its RV instead of the RV the controller used to make its decision, conflicts are avoided, which produces a situation that looks like:

  1. client/1 reads rv/6 to make decision about update
  2. client/2 reads rv/6 to make decision about update
  3. client/1 calls applySecret first,
    1. apply secret reads secret rv/6
    2. apply secret writes secret rv/7
  4. client/2 calls applySecret second
    1. apply secret reads secret rv/7 *** this is not good.
    2. apply secret writes secret rv/8

Using .Update instead should eliminate the read in 4.1 and result in a conflict instead of the race.

@vrutkovs
Member Author

vrutkovs commented Jul 31, 2024

Am I missing something? We already use .Update there - also discussed it with Lukasz in #1763 (comment) and he wants me to change this to use .Update too.

Also, this PR would prevent client/1 and client/2 from ever happening. If applySecret needs to be updated to handle races it should be done in a separate PR (as the change would affect a lot more components)

@p0lyn0mial
Contributor

Am I missing something? We already use .Update there - also discussed it with Lukasz in #1763 (comment) and he wants me to change this to use .Update too.

@vrutkovs let's set up a pair programming session tomorrow/this-week where I can show you how we could use optimistic concurrency control built into kube to solve the race.

@vrutkovs
Member Author

vrutkovs commented Aug 2, 2024

Ah, the problem here was:

  • we do a first Get and modify the secret
  • applySecret is called
  • applySecret does another Get and then does Update

Without the second Get we get conflicts as expected - tested this in openshift/cluster-kube-apiserver-operator#1719. So perhaps we should have a setting for applySecret to skip its own Get to avoid this?

However, it's orthogonal to this PR - we can both remove racing controllers and make sure applySecret can be thread-safe.

Contributor

openshift-ci bot commented Oct 14, 2024

@vrutkovs: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 13, 2025
Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

9 participants