-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restart destination, proxy-injector controllers on config change. #11440
Restart destination, proxy-injector controllers on config change. #11440
Conversation
Signed-off-by: Cameron Boulton <[email protected]>
Hi @iAnomaly. Thanks for submitting a fix. Could you help us understand the problem you encountered more specifically? It looks like you hit an issue related to RBAC changing? How did that problem arise? How does tracking the SHA of the rbac template resolve that for you? |
Looks like you need to run @olix0r, I believe this is related to needing to restart the webhook controllers when the webhook credentials change. It's a bit misleading because the credentials secret is in the |
Signed-off-by: Cameron Boulton <[email protected]>
Different/unrelated integration tests now failing because Docker.io APIs are down? Not sure if either of you can rerun those for me at some point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick turnaround @iAnomaly . This looks good to me, but could you also add the annotation for the tap-injector template? Also, I took the liberty of editing this PR's description for easier reference without having to look into the original issue 😉
Good call on Thanks for the PR description. |
Just kidding, |
Sorry for the back and forth @iAnomaly , but I took another look at this and now realize why we had dropped these annotations... The proxy-injector, tap-injector and jaeger-injector make use of this server which automatically reloads the server if the TLS cert changes. And the policy validator does the same (with the help of the |
Chart versions:
Steps being performed:
When I follow the selector/targets for
|
I couldn't reproduce the problem locally, but this might be an issue of Secrets not propagating quickly enough (I know that's a problem with ConfigMaps, not sure about Secrets). Could you help me checking this by introducing a delay (at least a minute) in between the control-plane upgrade and the extensions upgrades? |
I can confirm even with a delay/ This isn't a great workaround for us though as it will consistently make CI/CD take 30s longer for each environment/cluster we deploy Linkerd to AND if Secret/ConfigMap update event propagation delay is the root cause here its also possible that could delay past 30s or even 60s in rare cases where the API/event processing is considerably back pressured. Is there a strong technical reason for utilizing server hot reload vs. Pod replacement? I suppose hot reload is needed to support Cert Manager/external certificates for example that might update the Secret without a Helm upgrade? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a strong technical reason for utilizing server hot reload vs. Pod replacement? I suppose hot reload is needed to support Cert Manager/external certificates for example that might update the Secret without a Helm upgrade?
Great point, yes, that hot-reload logic is necessary for external (cert-manager) driven Secret updates. After discussing with colleagues we came to the conclusion that this PR is the right approach, there were cert management hasn't been outsourced. But please mind that upgrades in these conditions can produce temporary downtime, while the workloads get rolled out with the new Secrets mounted. This is why for production we recommend not relying on this self-signed certificate generation that linkerd uses by default, and manage them explicitly, with something like cert-manager.
This edge release includes a fix addressing an issue during upgrades for instances not relying on automated webhook certificate management (like cert-manager provides). * Added a `checksum/config` annotation to the destination and proxy injector deployment manifests to force restarting those workloads whenever their webhook secrets change during upgrade (thanks @iAnomaly!) ([#11440]) * Fixed policy controller error when deleting a gateway API HTTPRoute resource ([#11471])
## edge-23.10.2 This edge release includes a fix addressing an issue during upgrades for instances not relying on automated webhook certificate management (like cert-manager provides). * Added a `checksum/config` annotation to the destination and proxy injector deployment manifests, to force restarting those workloads whenever their webhook secrets change during upgrade (thanks @iAnomaly!) ([#11440]) * Fixed policy controller error when deleting a Gateway API HTTPRoute resource ([#11471]) [#11440]: #11440 [#11471]: #11471
Is this something that would be back ported to older releases? Or a doc on how to add these to 2.12+. I currently just set my hook to ignore everything so I can upgrade. |
…1440) Fixes #6940 Added a `checksum/config` annotation into the destination, proxy-injector and tap-injector workloads, whose value is calculated as the SHA256 of the template file containing the TLS cert they depend on. This is necessary so that every time those other files change (they get re-generated on every upgrade or config update via `linkerd upgrade`), the workloads change as well. We had this in place before, but with the 2.12 helm charts migrations we dropped it by mistake. Signed-off-by: Cameron Boulton <[email protected]>
## stable-2.14.5 This stable release fixes a proxy regression where bursts of TCP connections could result in EOF errors, due to an incorrect queue capacity. In addition, it includes fixes for the control plane, dependency upgrades, and support for image digests in Linkerd manifests. * Added a controlPlaneVersion override to the `linkerd-control-plane`` Helm chart to support including SHA256 image digests in Linkerd manifests (thanks @cromulentbanana!) ([#11406]; fixes [#11312]) * Added a `checksum/config `annotation to the destination and proxy injector deployment manifests, to force restarting those workloads whenever their webhook secrets change during upgrade (thanks @iAnomaly!) ([#11440]; fixes [#6940]) * Updated the Policy controller's OpenSSL dependency to v3, as OpenSSL 1.1.1 is EOL ([#11625]) * proxy: Increased `DEFAULT_OUTBOUND_TCP_QUEUE_CAPACITY` to prevent EOF errors during bursts of TCP connections (proxy PR [#2521][proxy-2521]) [#11406]: #11406 [#11312]: #11312 [#11440]: #11440 [#6940]: #6940 [#11625]: #11625 [proxy-2521]: linkerd/linkerd2-proxy#2521
## stable-2.14.5 This stable release fixes a proxy regression where bursts of TCP connections could result in EOF errors, due to an incorrect queue capacity. In addition, it includes fixes for the control plane, dependency upgrades, and support for image digests in Linkerd manifests. * Added a controlPlaneVersion override to the `linkerd-control-plane`` Helm chart to support including SHA256 image digests in Linkerd manifests (thanks @cromulentbanana!) ([#11406]; fixes [#11312]) * Added a `checksum/config `annotation to the destination and proxy injector deployment manifests, to force restarting those workloads whenever their webhook secrets change during upgrade (thanks @iAnomaly!) ([#11440]; fixes [#6940]) * Updated the Policy controller's OpenSSL dependency to v3, as OpenSSL 1.1.1 is EOL ([#11625]) * proxy: Increased `DEFAULT_OUTBOUND_TCP_QUEUE_CAPACITY` to prevent EOF errors during bursts of TCP connections (proxy PR [#2521][proxy-2521]) [#11406]: #11406 [#11312]: #11312 [#11440]: #11440 [#6940]: #6940 [#11625]: #11625 [proxy-2521]: linkerd/linkerd2-proxy#2521
## stable-2.14.5 This stable release fixes a proxy regression where bursts of TCP connections could result in EOF errors, due to an incorrect queue capacity. In addition, it includes fixes for the control plane, dependency upgrades, and support for image digests in Linkerd manifests. * Added a controlPlaneVersion override to the `linkerd-control-plane`` Helm chart to support including SHA256 image digests in Linkerd manifests (thanks @cromulentbanana!) ([#11406]; fixes [#11312]) * Added a `checksum/config `annotation to the destination and proxy injector deployment manifests, to force restarting those workloads whenever their webhook secrets change during upgrade (thanks @iAnomaly!) ([#11440]; fixes [#6940]) * Updated the Policy controller's OpenSSL dependency to v3, as OpenSSL 1.1.1 is EOL ([#11625]) * proxy: Increased `DEFAULT_OUTBOUND_TCP_QUEUE_CAPACITY` to prevent EOF errors during bursts of TCP connections (proxy PR [#2521][proxy-2521]) [#11406]: #11406 [#11312]: #11312 [#11440]: #11440 [#6940]: #6940 [#11625]: #11625 [proxy-2521]: linkerd/linkerd2-proxy#2521
This PR fixes ignoreDifferences otherwise argocd will be out of sync every few minutes and restart continously restart linkerd-destination and linkerd-proxy-injector pods Fix the group field for the MutatingWebhookConfiguration and ValidatingWebhookConfiguration entries by removing the API version as per argocd requirements here image Adds 3 additional entries. 1 to ignore the CronJob schedule and 2 entries to ignore the checksum/config annotation in the linkerd-destination and linkerd-proxy-injector Deployments (See Restart destination, proxy-injector controllers on config change. linkerd/linkerd2#11440)
Fixes #6940
Added a
checksum/config
annotation into the destination, proxy-injector and tap-injector workloads, whose value is calculated as the SHA256 of the template file containing the TLS cert they depend on. This is necessary so that every time those other files change (they get re-generated on every upgrade or config update vialinkerd upgrade
), the workloads change as well.We had this in place before, but with the 2.12 helm charts migrations we dropped it by mistake.
Signed-off-by: Cameron Boulton [email protected]