-
Notifications
You must be signed in to change notification settings - Fork 231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API-1802: cert-rotation: allow specifying multiple target certs in CertRotationController #1722
base: master
Are you sure you want to change the base?
API-1802: cert-rotation: allow specifying multiple target certs in CertRotationController #1722
Conversation
9130e0c
to
d10d787
Compare
@vrutkovs: This pull request references API-1802 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/cc @tkashem @p0lyn0mial |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a) RotatedSigningCASecret
creates the signer CA secret
b) CABundleConfigMap
creates a configmap, and
c) RotatedSelfSignedCertKeySecret
creates a secret
I like the idea of a controller doing one thing, can we explore the idea of the individual controllers?
a) SignerCAController
: this controller manager the signer secret object.
b) CABundleController
: it watched the secret object from a
and creates a single configmap and manages it.
c) CertKeySecretController
: this can watch objects from a
and b
and creates a secret with cert/key and manages it.
With this, we can have N instances of CertKeySecretController
, where each instance derives it cert/key from a single instance of a
and b
WithPostStartHooks( | ||
c.targetCertRecheckerPostRunHook, | ||
). | ||
ToController("CertRotationController", recorder.WithComponentSuffix("cert-rotation-controller").WithComponentSuffix(name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MultipleTargetCertRotationController
, so we have two distinct names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea
} | ||
}(ch) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to avoid making any runtime behavioral changes to NewCertRotationController
if possible, we have the following options:
a) completely separate implementations: CertRotationController
for NewCertRotationController
, and MultipleTargetCertRotationController
for NewCertRotationControllerMultipleTargets
b) abstract out the targetCertRecheckerPostRunHook
implementations: singleTargetCertRecheckerPostRunHook
and multiTargetCertRecheckerPostRunHook
. This way we can reuse CertRotationController
for both single and multiple. NewCertRotationControllerMultipleTargets
will use multiTargetCertRecheckerPostRunHook
.
c) can we have a single channel <-chan time.Time
to be shared by multiple instances of CertCreator
(for example ServingRotation
), then the logic inside targetCertRecheckerPostRunHook
does not need to change at all.
I prefer c
, if doable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked this to make it look like b). Added a test which verifies goroutines don't leak
I don't quite understand what c) is meant for - make controller accept a single channel reused across all CertCreators
? Not sure what's the benefit of that
targetRefresh := refresher.RecheckChannel() | ||
aggregateTargetRefresher := make(chan struct{}) | ||
for _, ch := range targetRefreshers { | ||
go func(c <-chan struct{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
go func
does not have the same guarantee as go wait.Until(func() {}, time.Minute, ctx.Done())
for _, ch := range targetRefreshers { | ||
go func(c <-chan struct{}) { | ||
for msg := range c { | ||
aggregateTargetRefresher <- msg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
goroutine leaking: <-ctx.Done()
, could be a problem for integration tests that check goroutine leakages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ctx.Done
would close them iiuc, but yeah, worth adding a unit test which uses runhook and ensures no goroutines are leaking
That's possible, but the goal is to create target certs, signers and CA are merely prerequisites to it. We could have separate signer/ca controllers, but we might end up with signer certs not producing any target certs or CA bundles without any new signers etc. |
d10d787
to
a528215
Compare
New changes are detected. LGTM label has been removed. |
a528215
to
d3c0949
Compare
|
||
// Ensure both target certs have been called exactly three times | ||
// initial sync and two hook calls for target certs | ||
// TODO[vrutkovs]: informers make unpredictable number of calls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how to tackle that - or how to make sure two hook syncs were included in regular informer sync. informerFactory
and NewCertRotationControllerMultipleTargets
promise to sync every minute, but it happens much more often
An alternative design would be to have a controller per certificate – this would actually preserve the current behaviour. I prefer having a single controller per certificate as it is easier to debug, report status, retry on error, and reason about. Thoughts? Our issue is that both the I think the least invasive change would be to make both Another idea that comes to mind (I think Abu suggested the same) is to turn both |
Similar to Abu's idea in #1722 (review)? That would prevent races, but would make sequence of events to do a proper rollout complicated (three different controllers would need to be properly synced so that CA bundle would be updated before target cert etc.). Also may lead to "orphan" controllers managing CA bundle without a target cert.
Yes, this issue still potentially remains. The PR focuses on solving a much more widespread issue of multiple target certs
I don't know if its feasible, as its not just thread-safety but also process-safety we're concerned about |
Don't we do it all the time? The aggregator controller waits until a service is created before it wires an HTTP handler. What I like about these and other controllers is that when you look inside, you will see that these controllers are reconciling a single resource. They are simply reading their prerequisites and reacting to any changes before reconciling. In our case, it would boil down to three separate controllers that reconcile their resources, where the second and the third controller read the crypto material from the lister before reconciling. For example: NewCertRotationController would:
When there are issues with the signerCA you go to |
It is not about multi-threading for performance reasons. It is about having a single control loop per resource. I think this is already a well-established pattern upstream. I think that having a single controller that manages a resource is easy to understand and debug.
This is a crucial piece of code, and I wouldn't rush it. Besides, we still need to fix kubelet, client-go, and tons of other things before the platform will be able to recover itself from expired certificates. Thus, I don't see the point in developing temporary solutions. We are not dealing with an escalation that requires an immediate fix. I would rather implement a proper fix or not fix it at all. |
so you propose to rewrite the entire codebase to fit some upstream pattern? :) sounds great 👍 |
I'm proposing wrapping The new Then, when we need to compose these controllers we would only have a singe instance of Does it make sense to you as well ? |
This is what we agreed to few weeks back and noone is debating that choice. The immediate question is why all of this is being discussed in an unrelated PR with a temporary fix for 4.17 - and why is it being stalled for several weeks already |
We kind of have to. If the rework is significant and 4.18 branches we won't be able to backport it.
For indefinite suspend period - yes. For 90 days / 1 year on SNO - no, not really, we already recover with approved manual steps.
We do, this feature (limited to 90 days etc.) is on 4.17 plan |
d3c0949
to
126e202
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: dinhxuanvu, vrutkovs The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
b2298d8
to
6a2734e
Compare
/test unit |
RotatedSelfSignedCertKeySecret RotatedSelfSignedCertKeySecret | ||
|
||
// RotatedTargetSecrets contains a list of key and cert signed by a signing CA to rotate. | ||
RotatedTargetSecrets []RotatedSelfSignedCertKeySecret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will change the external API and clients would need to update if they're accessing this field directly.
Since you're already adding a new "Create" function for the multi rotation controller, would it make more sense to have two cert controllers? One for single rotation and one for multi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will change the external API and clients would need to update if they're accessing this field directly
Correct, this would require users to update the code potentially. So far we identified that only cluster-kube-apiserver-operator and cluster-kube-control-manager-operator use multitarget controllers (and they don't need to access RotatedTargetSecrets
directly)
would it make more sense to have two cert controllers? One for single rotation and one for multi.
It may be beneficial later to move common parts into functions and reuse them, however at this point only two different New...
functions are used externally
…Controller Instead of defining several controllers managing the same signer/CA bundle pair and different target certs the same controller can accept a list of target certs to create.
Collect a list of secrets/configmaps tracked in controllers to make sure any secret/configmap is not being declared twice to avoid races when updating it
6a2734e
to
26bcb48
Compare
Is the goal of this change to avoid races between multiple updating controllers? If so, this solution appears to miss the critical change required to resolve the cross-binary active-active problem. The core mistake is using the |
This prevents races within one controller (thread-level). Component PRs like this ensure that multiple controllers don't run simultaneously.
We don't use
|
By having it read the live secret and use its RV instead of the RV the controller used to make decisions, conflicts are avoided and produce a situation that looks like
Using .Update instead should eliminate the read in 4.1 and result in a conflict instead of the race. |
Am I missing something? We already use .Update there - also discussed it with Lukasz in #1763 (comment) and he wants me to change this to use .Update too. Also, this PR would prevent client/1 and client/2 from ever happening. If |
@vrutkovs let's set up a pair programming session tomorrow/this-week where I can show you how we could use optimistic concurrency control built into kube to solve the race. |
Ah, the problem here was:
Without second However, its orthogonal to this PR - we can both remove racing controllers and make sure |
@vrutkovs: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Instead of defining several controllers managing the same signer/CA bundle pair and different target certs the same controller can accept a list of target certs to create.
Tested with
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=300","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=600","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=900","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=1200","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=1500","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=1800","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=2100","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=2400","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"