Bug 2117524: Update CRLs when they expire #828
Conversation
@rfredette: This pull request references Bugzilla bug 2117524, which is invalid.
Force-pushed: ced092e to 0e8a65f
/bugzilla refresh
@rfredette: This pull request references Bugzilla bug 2117524, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug. Requesting review from QA contact.
@rfredette: This pull request references Bugzilla bug 2117524, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
gcs278 left a comment:
Looks good, just a question about testing.
```go
if nextCRLUpdate, ok := ctx.Value("nextCRLUpdate").(time.Time); ok && !nextCRLUpdate.IsZero() {
	log.Info("Requeueing when next CRL expires", "requeue time", nextCRLUpdate.String(), "time until requeue", time.Until(nextCRLUpdate))
	// Re-reconcile when any of the CRLs expire
	return reconcile.Result{RequeueAfter: time.Until(nextCRLUpdate)}, nil
```
Is there a risk of this requeueing multiple times for the same next CRL update? I.e. on every reconcile, do we queue up this request? Is that okay?
Thoughts?
Thoughts @rfredette?
Sorry it took me a while to get back to you on this one. I think that's a valid concern. I'm not sure at first glance what the best way to solve it is, but let me look into it.
@gcs278 I've pushed a change that seems to fix this from my manual testing; please take a look when you get a chance and let me know what you think.
I added a global variable to track when the next CRL update is expected, and now a requeue is only scheduled when the next update, as computed in desiredCRLConfigMap, will be sooner than any already-pending requeue.
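A minimal sketch of that gating logic, assuming a package-level `currentCRLUpdate` like the one this PR adds; the helper name `shouldScheduleRequeue` and the mutex are illustrative, not the PR's exact code:

```go
import (
	"sync"
	"time"
)

var (
	currentCRLUpdateMu sync.Mutex
	// currentCRLUpdate tracks when the soonest already-scheduled requeue
	// will fire; the zero value means nothing is scheduled.
	currentCRLUpdate time.Time
)

// shouldScheduleRequeue reports whether a requeue at next would fire sooner
// than anything already pending, and records it if so.
func shouldScheduleRequeue(next time.Time) bool {
	currentCRLUpdateMu.Lock()
	defer currentCRLUpdateMu.Unlock()
	// Schedule if nothing is pending, the pending requeue has already
	// passed, or the new update time is strictly sooner.
	if currentCRLUpdate.IsZero() || !currentCRLUpdate.After(time.Now()) || next.Before(currentCRLUpdate) {
		currentCRLUpdate = next
		return true
	}
	return false
}
```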
> Is there a risk of this requeueing multiple times for the same next CRL update? I.e. on every reconcile, do we queue up this request? Is that okay?
Are you worried that the queue may never be empty? I think it is all right. Is there some other potential problem with the logic as it was when you asked about it?
> Are you worried that the queue may never be empty?
I don't think that's what I meant. I just noticed that we queue up a request on every reconcile for the same next CRL update. That's not a big problem; it just seemed a bit extraneous. We might have 10 or 20 requests that will all trigger at the same time (or immediately after each other) for the same nextCRLUpdate, depending on how many times the CRL was reconciled.
I don't know if there is an easy way to say "this request is already in the queue, don't queue it again." Just pointing it out more than anything.
I was able to do a bit more testing on this. I got the operator to requeue a reconcile for the same time 10 times, but when the time came, it executed the reconcile twice and then stopped retrying (at least for a while; when the new CRL expired 5 minutes later, it re-reconciled again). I'm not sure why it ran twice; maybe there's a rate limit implemented somewhere, but it doesn't look like this will flood the operator.
Yeah, I think we're all right here. I did a little more investigation:
- controller-runtime's `reconcileHandler` calls our `Reconcile` method (technically, it calls the controller's `Reconcile` method, but that ultimately results in ours being called) and requeues the request using `AddAfter` if the result has a nil error and a non-zero `RequeueAfter` (cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go, lines 320 to 333 in 362a3ec):

  ```go
  result, err := c.Reconcile(ctx, req)
  switch {
  case err != nil:
  	c.Queue.AddRateLimited(req)
  	ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
  	ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, labelError).Inc()
  	log.Error(err, "Reconciler error")
  case result.RequeueAfter > 0:
  	// The result.RequeueAfter request will be lost, if it is returned
  	// along with a non-nil error. But this is intended as
  	// We need to drive to stable reconcile loops before queuing due
  	// to result.RequestAfter
  	c.Queue.Forget(obj)
  	c.Queue.AddAfter(req, result.RequeueAfter)
  ```

- `AddAfter` adds the item to the queue using `Add` or sends a `waitFor` to a channel (cluster-ingress-operator/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go, lines 161 to 179 in 362a3ec):

  ```go
  // AddAfter adds the given item to the work queue after the given delay
  func (q *delayingType) AddAfter(item interface{}, duration time.Duration) {
  	// don't add if we're already shutting down
  	if q.ShuttingDown() {
  		return
  	}
  	q.metrics.retry()
  	// immediately add things with no delay
  	if duration <= 0 {
  		q.Add(item)
  		return
  	}
  	select {
  	case <-q.stopCh:
  		// unblock if ShutDown() is called
  	case q.waitingForAddCh <- &waitFor{data: item, readyAt: q.clock.Now().Add(duration)}:
  ```

- The queue has a goroutine that reads the `waitFor` off the channel and adds it to its internal `waitingForQueue` queue using `insert` or adds it to the queue using `Add` (delaying_queue.go, lines 253 to 257 in 362a3ec):

  ```go
  case waitEntry := <-q.waitingForAddCh:
  	if waitEntry.readyAt.After(q.clock.Now()) {
  		insert(waitingForQueue, waitingEntryByData, waitEntry)
  	} else {
  		q.Add(waitEntry.data)
  ```

- `insert` checks whether there is an existing entry for the reconcile request; if there is one, it updates the existing entry in the queue instead of pushing a duplicate entry onto the queue (delaying_queue.go, lines 267 to 277 in 362a3ec):

  ```go
  // insert adds the entry to the priority queue, or updates the readyAt if it already exists in the queue
  func insert(q *waitForPriorityQueue, knownEntries map[t]*waitFor, entry *waitFor) {
  	// if the entry already exists, update the time only if it would cause the item to be queued sooner
  	existing, exists := knownEntries[entry.data]
  	if exists {
  		if existing.readyAt.After(entry.readyAt) {
  			existing.readyAt = entry.readyAt
  			heap.Fix(q, existing.index)
  		}
  		return
  ```

- The queue itself defines `Add` to prevent adding duplicates (cluster-ingress-operator/vendor/k8s.io/client-go/util/workqueue/queue.go, lines 119 to 137 in 362a3ec):

  ```go
  // Add marks item as needing processing.
  func (q *Type) Add(item interface{}) {
  	q.cond.L.Lock()
  	defer q.cond.L.Unlock()
  	if q.shuttingDown {
  		return
  	}
  	if q.dirty.has(item) {
  		return
  	}
  	q.metrics.add(item)
  	q.dirty.insert(item)
  	if q.processing.has(item) {
  		return
  	}
  	q.queue = append(q.queue, item)
  ```
Between Ryan's testing and my understanding of the queue and delaying_queue logic, I believe that there should be no issue with duplicate items in the queue or the delaying_queue's internal channel or queue.
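To see the coalescing behavior end to end, here is a minimal standalone sketch (an illustration, not code from this PR) that calls `AddAfter` repeatedly with the same item on a client-go delaying queue and observes that only one entry lands on the queue:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewDelayingQueue()
	defer q.ShutDown()

	// Queue the same reconcile request ten times with the same delay, as a
	// reconciler returning RequeueAfter on every pass effectively would.
	for i := 0; i < 10; i++ {
		q.AddAfter("openshift-ingress-operator/default", 100*time.Millisecond)
	}

	// Wait for the delay to elapse; insert() coalesced the duplicates, so
	// only a single item should be ready.
	time.Sleep(200 * time.Millisecond)
	fmt.Println("items ready:", q.Len()) // expected: items ready: 1
}
```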
```diff
 	crlConfigmap.SetOwnerReferences([]metav1.OwnerReference{ownerRef})
 
-	return true, &crlConfigmap, nil
+	return true, &crlConfigmap, context.WithValue(ctx, "nextCRLUpdate", nextCRLUpdate), nil
```
I've never seen context used like this, but that could just be because we've never needed to. What's the reason behind using context vs. returning a nextCRLUpdate variable? Is it because it's time/deadline related? Just curious; I don't have an opinion.
I went back and forth for a while on how to return nextCRLUpdate because it's kind of tangential to the main purpose of desiredCRLConfigMap, and adding that kind of extra return value for callers to manage feels like a bad design choice.
Ideally, nextCRLUpdate would be calculated in its own function to keep the code clearer, but doing that would require all the CRLs to be parsed a second time, and I understand that in certain scenarios users are running up against the maximum size of a configmap (1MiB) for the CRL configmap, so parsing everything again could be pretty wasteful.
I think this is a reasonable compromise, where nextCRLUpdate is available for Reconcile to deal with, but intermediate functions don't need to worry about it other than to pass the context back up.
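As an aside, the standard library's context documentation recommends an unexported key type rather than a bare string, to avoid collisions between packages. A sketch of that variant (the names `contextKey`, `withNextCRLUpdate`, and `nextCRLUpdateFrom` are hypothetical, not this PR's code):

```go
import (
	"context"
	"time"
)

// contextKey is an unexported type, so no other package can collide with
// values stored under it.
type contextKey string

const nextCRLUpdateKey contextKey = "nextCRLUpdate"

// withNextCRLUpdate returns a context carrying the computed update time.
func withNextCRLUpdate(ctx context.Context, t time.Time) context.Context {
	return context.WithValue(ctx, nextCRLUpdateKey, t)
}

// nextCRLUpdateFrom extracts the update time; ok is false if none was stored.
func nextCRLUpdateFrom(ctx context.Context) (t time.Time, ok bool) {
	t, ok = ctx.Value(nextCRLUpdateKey).(time.Time)
	return t, ok
}
```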
Sure, I can buy that; it sounds like you just want to decouple nextCRLUpdate from the output of desiredCRLConfigMap.
```go
// indicating whether a configmap is desired, the configmap if one is desired,
// the context (containing the next CRL update time as "nextCRLUpdate"), and an
// error if one occurred
func desiredCRLConfigMap(ctx context.Context, ic *operatorv1.IngressController, ownerRef metav1.OwnerReference, clientCAData []byte, crls map[string]*pkix.CertificateList) (bool, *corev1.ConfigMap, context.Context, error) {
```
I noticed we don't have unit testing for this function. Should we have a unit test or e2e test to verify this CRL update logic? Can we have static CAs in our test code that can trigger some of the logical paths?
As far as expiration goes, I saw in CoreDNS that they pass in now as an argument; that way, in unit testing, they can change now (aka teleport) to the future to test the logical code paths for expiration. This is probably not trivial, though...
> I saw in CoreDNS that they pass in now as an argument; that way, in unit testing, they can change now (aka teleport) to the future to test the logical code paths for expiration.
I haven't included tests because I haven't had a good idea of how to automate testing this, since the certificates are time-sensitive. My plan is to follow up later with an e2e test that generates certificates and CRLs at test run time, but this is a great idea for unit testing.
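A sketch of the injected-now idea, assuming a hypothetical helper `nextCRLUpdateTime` factored out of the configmap logic; the names are illustrative, not the operator's actual API:

```go
import (
	"crypto/x509/pkix"
	"time"
)

// nextCRLUpdateTime returns the earliest NextUpdate across the given CRLs.
// Taking now as a parameter lets unit tests "teleport" into the future
// instead of waiting for real certificates to expire.
func nextCRLUpdateTime(now time.Time, crls map[string]*pkix.CertificateList) (time.Time, bool) {
	var next time.Time
	for _, crl := range crls {
		nu := crl.TBSCertList.NextUpdate
		if nu.Before(now) {
			// A CRL is already expired; the caller should update immediately.
			return now, true
		}
		if next.IsZero() || nu.Before(next) {
			next = nu
		}
	}
	return next, !next.IsZero()
}

// In a unit test, expiration paths can be exercised without waiting:
//   next, ok := nextCRLUpdateTime(time.Now().Add(48*time.Hour), testCRLs)
```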
Cluster bootstrap failed on these tests, but I've since been able to bring up a cluster with clusterbot, so I don't think it's an issue with this PR.
/retest
/retest
1 similar comment
/retest
Pre-merge test passed; see https://bugzilla.redhat.com/show_bug.cgi?id=2117524#c16
/label qe-approved
@lihongan I've pushed another change to address one of Grant's comments. It seems fine to me when I test it, but I'd appreciate it if you could verify that everything still looks good.
@rfredette I retested the PR with cluster-bot and found no issues; the new CRL can be downloaded, and the configmap is updated as well. Operator logs:
```go
var log = logf.Logger.WithName(controllerName)

var currentCRLUpdate time.Time
```
Is using a single global variable going to work properly when there are multiple ingresscontrollers with CRLs?
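If the single global turned out to be a problem, one possible shape is a map keyed per ingresscontroller. This is a hypothetical sketch in answer to the question, not the PR's code:

```go
import (
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/types"
)

var (
	crlUpdateMu sync.Mutex
	// nextCRLUpdateFor tracks the pending requeue time per ingresscontroller,
	// so one controller's schedule cannot mask another's.
	nextCRLUpdateFor = map[types.NamespacedName]time.Time{}
)

// shouldRequeueSooner records next for the given ingresscontroller and reports
// whether it is sooner than whatever was already pending for it.
func shouldRequeueSooner(ic types.NamespacedName, next time.Time) bool {
	crlUpdateMu.Lock()
	defer crlUpdateMu.Unlock()
	pending, ok := nextCRLUpdateFor[ic]
	if !ok || !pending.After(time.Now()) || next.Before(pending) {
		nextCRLUpdateFor[ic] = next
		return true
	}
	return false
}
```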
Force-pushed: 8d8f052 to 0e8a65f
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Miciah. The full list of commands accepted by this bot can be found here; the pull request process is described here.
e2e-aws-operator failed. e2e-gcp-ovn-serial failed. Also, some disruption tests failed.
/test e2e-gcp-ovn-serial
/test e2e-azure-operator
/assign
{Operator degraded (NodeInstaller_InstallerPodFailed): NodeInstallerDegraded: 1 nodes are failing on revision 14:
/test e2e-aws-operator
Oct 27 03:31:38.882 E ns/openshift-ingress-canary pod/ingress-canary-t5pbh node/ci-op-j31fxmni-9dec8-kwjmp-worker-c-mcfzm uid/dc52d4fb-36e4-475a-9af4-a082f90cfc1e container/serve-healthcheck-canary reason/ContainerExit code/2 cause/Error serving on 8888\nserving on 8080\nServing canary healthcheck request\nServing canary healthcheck request\nServing canary healthcheck request
/test e2e-gcp-ovn-serial
@rfredette this failure may need investigation. error running backup collection: errors occurred while gathering data:
/test e2e-azure-operator
Cluster setup failed before tests began: level=error msg=Error: creating EC2 Instance: InvalidParameterValue: Value (ci-op-hs93zt1b-43abb-4lmvz-bootstrap-profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name
/test e2e-aws-operator
e2e-gcp-ovn-serial failed because disruption tests failed.
/test e2e-gcp-ovn-serial
e2e-aws-operator failed because kube-apiserver, kube-controller-manager, and ovnkube-master ran into problems during pod rollouts. Also, must-gather and deprovisioning failed.
@Miciah: Overrode contexts on behalf of Miciah: ci/prow/e2e-aws-operator
/refresh
e2e-gcp-ovn-serial failed because disruption tests failed.
/override ci/prow/e2e-gcp-ovn-serial
@Miciah: Overrode contexts on behalf of Miciah: ci/prow/e2e-gcp-ovn-serial
/refresh
/tide refresh
/test all
e2e-aws-operator failed because kube-apiserver reported
@Miciah: Overrode contexts on behalf of Miciah: ci/prow/e2e-aws-operator
e2e-gcp-ovn-serial failed because disruption tests failed. Similar failures have been observed in many other PRs, so I believe that the failures are not related to the changes in this PR.
@Miciah: Overrode contexts on behalf of Miciah: ci/prow/e2e-gcp-ovn-serial
@rfredette: all tests passed! Full PR test history. Your PR dashboard.
@rfredette: All pull requests linked via external trackers have merged: Bugzilla bug 2117524 has been moved to the MODIFIED state.
/cherry-pick release-4.11
@rfredette: new pull request created: #853
Determine when the next CRL needs to be updated based on the CRL's nextUpdate field. Re-reconcile at that time in order to pick up the updated CRL.