Restart destination, proxy-injector controllers on config change. #11440

Merged

iAnomaly
Contributor

@iAnomaly iAnomaly commented Sep 28, 2023

Fixes #6940

Added a checksum/config annotation into the destination, proxy-injector and tap-injector workloads, whose value is calculated as the SHA256 of the template file containing the TLS cert they depend on. This is necessary so that every time those other files change (they get re-generated on every upgrade or config update via linkerd upgrade), the workloads change as well.
We had this in place before, but we dropped it by mistake during the 2.12 Helm chart migration.
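
The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the chart's actual template logic: it assumes the annotation value is the SHA-256 hex digest of the rendered template text, and the sample template strings and the `pod_annotations` helper are hypothetical stand-ins.

```python
import hashlib


def checksum(rendered_template: str) -> str:
    """Return the SHA-256 hex digest of a rendered template (assumed input)."""
    return hashlib.sha256(rendered_template.encode("utf-8")).hexdigest()


def pod_annotations(rendered_template: str) -> dict:
    """Build the pod-template annotations for a workload (hypothetical helper)."""
    return {"checksum/config": checksum(rendered_template)}


# Two renders of the same template that differ only in the embedded TLS cert.
before = pod_annotations("tls.crt: OLD-CERT\n")
after = pod_annotations("tls.crt: NEW-CERT\n")

# A changed annotation means a changed pod template, which is what causes
# a Deployment to roll out new Pods.
rollout_needed = before != after
print(rollout_needed)
```

Because the digest depends on every byte of the rendered file, any regenerated cert yields a new annotation value, while an unchanged render leaves the workload untouched.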

Signed-off-by: Cameron Boulton [email protected]

@olix0r
Member

olix0r commented Sep 28, 2023

Hi @iAnomaly. Thanks for submitting a fix.

Could you help us understand the problem you encountered more specifically? It looks like you hit an issue related to RBAC changing? How did that problem arise? How does tracking the SHA of the rbac template resolve that for you?

@iAnomaly
Contributor Author

Hi @olix0r, please see conversation with @alpeb here.

@adleong
Member

adleong commented Sep 28, 2023

Looks like you need to run `go test ./... --update` to update the golden test files to reflect this change in the template.

@olix0r, I believe this is related to needing to restart the webhook controllers when the webhook credentials change. It's a bit misleading because the credentials secret is in the destination-rbac.yaml template file.

Signed-off-by: Cameron Boulton <[email protected]>
@iAnomaly
Contributor Author

Different, unrelated integration tests are now failing, apparently because the Docker.io APIs are down: failed to authorize: failed to fetch oauth token: unexpected status from POST request to https://auth.docker.io/token: 503 Service Unavailable

Not sure if either of you can rerun those for me at some point.

Member

@alpeb alpeb left a comment

Thanks for the quick turnaround @iAnomaly . This looks good to me, but could you also add the annotation for the tap-injector template? Also, I took the liberty of editing this PR's description for easier reference without having to look into the original issue 😉

@iAnomaly
Contributor Author

Good call on tap-injector @alpeb , I missed that one; thanks. Will add that shortly.

Thanks for the PR description.

@iAnomaly
Contributor Author

Just kidding, tap-injector template already has the annotation @alpeb.

@alpeb
Member

alpeb commented Sep 28, 2023

Sorry for the back and forth, @iAnomaly, but I took another look at this and now realize why we had dropped these annotations... The proxy-injector, tap-injector and jaeger-injector make use of this server, which automatically reloads itself if the TLS cert changes. And the policy validator does the same (with the help of the kubert library).
Can you put up a simple series of steps for us to repro the issue you're experiencing?
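
To make the hot-reload behaviour concrete, here is a minimal sketch of the general technique. It is not the code of the server linked above or of kubert; it only shows the idea of re-reading a certificate file when its contents change. The `ReloadingCert` class and the file name are hypothetical.

```python
import hashlib
import os
import tempfile


class ReloadingCert:
    """Caches a cert file and reloads it when the bytes on disk change."""

    def __init__(self, path: str):
        self.path = path
        self.reloads = 0
        self._digest = None
        self.pem = b""
        self.refresh()

    def refresh(self) -> bool:
        """Re-read the file; return True if the cached cert was replaced."""
        with open(self.path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).digest()
        if digest == self._digest:
            return False
        self._digest, self.pem = digest, data
        self.reloads += 1
        return True


with tempfile.TemporaryDirectory() as tmp:
    cert_path = os.path.join(tmp, "tls.crt")
    with open(cert_path, "wb") as f:
        f.write(b"OLD-CERT")
    cert = ReloadingCert(cert_path)

    unchanged = cert.refresh()          # nothing changed on disk
    with open(cert_path, "wb") as f:    # simulate the Secret being rotated
        f.write(b"NEW-CERT")
    changed = cert.refresh()            # picks up the new bytes
    current = cert.pem
```

In a real server, `refresh` would be driven by a file watcher or a timer. In this sketch, the old certificate stays cached until the next `refresh` call, so there is a window in which the file on disk and the cert in memory disagree.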

@iAnomaly
Contributor Author

iAnomaly commented Sep 28, 2023

Chart versions:

linkerd-control-plane: 1.16.2
linkerd-jaeger: 30.13.0-edge
linkerd-viz: 30.12.2

Steps being performed:

  1. I am performing a helm upgrade of linkerd-control-plane followed by linkerd-jaeger and linkerd-viz via helmfile. I say "followed" because I am using helmfile's needs such that linkerd-jaeger and linkerd-viz Helm releases wait to upgrade until linkerd-control-plane upgrade succeeds.
  2. We pass the values-ha from here with very little modification
  3. We consistently hit errors like:
cannot patch "metrics-api" with kind AuthorizationPolicy: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "prometheus-admin" with kind AuthorizationPolicy: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "tap-injector" with kind AuthorizationPolicy: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "tap" with kind AuthorizationPolicy: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "metrics-api-web" with kind MeshTLSAuthentication: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "kubelet" with kind NetworkAuthentication: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "kube-api-server" with kind NetworkAuthentication: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "metrics-api" with kind Server: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "prometheus-admin" with kind Server: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "tap-injector-webhook" with kind Server: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "tap-api" with kind Server: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "metrics-api.linkerd-viz.svc.cluster.local" with kind ServiceProfile: Internal error occurred: failed calling webhook "linkerd-sp-validator.linkerd.io": failed to call webhook: Post "https://linkerd-sp-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-sp-validator.linkerd.svc")
cannot patch "prometheus.linkerd-viz.svc.cluster.local" with kind ServiceProfile: Internal error occurred: failed calling webhook "linkerd-sp-validator.linkerd.io": failed to call webhook: Post "https://linkerd-sp-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-sp-validator.linkerd.svc"): cannot patch "metrics-api" with kind AuthorizationPolicy: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "prometheus-admin" with kind AuthorizationPolicy: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "tap-injector" with kind AuthorizationPolicy: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "tap" with kind AuthorizationPolicy: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "metrics-api-web" with kind MeshTLSAuthentication: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "kubelet" with kind NetworkAuthentication: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "kube-api-server" with kind NetworkAuthentication: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "metrics-api" with kind Server: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "prometheus-admin" with kind Server: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "tap-injector-webhook" with kind Server: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "tap-api" with kind Server: Internal error occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": failed to call webhook: Post "https://linkerd-policy-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-policy-validator.linkerd.svc")
cannot patch "metrics-api.linkerd-viz.svc.cluster.local" with kind ServiceProfile: Internal error occurred: failed calling webhook "linkerd-sp-validator.linkerd.io": failed to call webhook: Post "https://linkerd-sp-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-sp-validator.linkerd.svc")
cannot patch "prometheus.linkerd-viz.svc.cluster.local" with kind ServiceProfile: Internal error occurred: failed calling webhook "linkerd-sp-validator.linkerd.io": failed to call webhook: Post "https://linkerd-sp-validator.linkerd.svc:443/?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "linkerd-sp-validator.linkerd.svc")

When I follow the selectors/targets for linkerd-policy-validator.linkerd.svc and linkerd-sp-validator.linkerd.svc, both target the destination controller Pods.

  4. With the checksum/config annotation fixes in this PR/on my branch I am no longer able to reproduce those errors (and can confirm the controller Pods are being recreated each time).
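
For anyone triaging similar output, a small script can summarize which webhooks are failing. This is only a sketch: the regular expression is an assumption based on the shape of the error lines quoted above, and the two sample lines are abbreviated from that output with the TLS detail elided.

```python
import re
from collections import Counter

# Matches the start of each Helm error line quoted above.
PATTERN = re.compile(
    r'cannot patch "(?P<name>[^"]+)" with kind (?P<kind>\w+): '
    r'.*?failed calling webhook "(?P<webhook>[^"]+)"'
)


def failures_by_webhook(log: str) -> Counter:
    """Count failed patches per admission webhook name."""
    return Counter(m.group("webhook") for m in PATTERN.finditer(log))


# Two lines abbreviated from the output above (the TLS detail is elided).
sample = (
    'cannot patch "metrics-api" with kind AuthorizationPolicy: Internal error '
    'occurred: failed calling webhook "linkerd-policy-validator.linkerd.io": ...\n'
    'cannot patch "prometheus.linkerd-viz.svc.cluster.local" with kind '
    'ServiceProfile: Internal error occurred: failed calling webhook '
    '"linkerd-sp-validator.linkerd.io": ...\n'
)
counts = failures_by_webhook(sample)
```

Seeing both validators in the tally is consistent with the observation that both Services are backed by the same destination controller Pods.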

@alpeb
Member

alpeb commented Sep 28, 2023

I couldn't reproduce the problem locally, but this might be an issue of Secrets not propagating quickly enough (I know that's a problem with ConfigMaps, not sure about Secrets). Could you help me check this by introducing a delay (at least a minute) between the control-plane upgrade and the extension upgrades?

@iAnomaly
Contributor Author

I can confirm that even a delay/sleep of 30s (let alone a minute, as you requested) seems to avoid the race condition. I ran 5 upgrades in a row without issue with a sleep 30 between the control-plane upgrade and the extension upgrades.

This isn't a great workaround for us, though: it will consistently make CI/CD take 30s longer for each environment/cluster we deploy Linkerd to, and if Secret/ConfigMap update propagation delay is the root cause here, it's also possible that the delay could exceed 30s or even 60s in rare cases where API/event processing is under considerable backpressure.
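
One general alternative to a fixed delay is to poll for readiness with a deadline. The sketch below shows that pattern only; the `is_ready` callable is a hypothetical placeholder for whatever check fits the deployment, and nothing here is part of Linkerd, Helm, or helmfile.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float,
               interval: float = 1.0) -> bool:
    """Poll is_ready until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical check that succeeds on the third attempt.
attempts = {"n": 0}


def fake_check() -> bool:
    attempts["n"] += 1
    return attempts["n"] >= 3


ok = wait_until(fake_check, timeout=5.0, interval=0.01)
timed_out = wait_until(lambda: False, timeout=0.05, interval=0.01)
```

Polling returns as soon as the check passes, so the common case costs less than a fixed sleep, while the timeout still bounds the worst case.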

Is there a strong technical reason for utilizing server hot reload vs. Pod replacement? I suppose hot reload is needed to support Cert Manager/external certificates for example that might update the Secret without a Helm upgrade?

Member

@alpeb alpeb left a comment

> Is there a strong technical reason for utilizing server hot reload vs. Pod replacement? I suppose hot reload is needed to support Cert Manager/external certificates for example that might update the Secret without a Helm upgrade?

Great point, yes, that hot-reload logic is necessary for external (cert-manager) driven Secret updates. After discussing with colleagues, we came to the conclusion that this PR is the right approach for cases where cert management hasn't been outsourced. But please keep in mind that upgrades in these conditions can produce temporary downtime while the workloads get rolled out with the new Secrets mounted. This is why, for production, we recommend not relying on the self-signed certificate generation that Linkerd uses by default, and instead managing the certificates explicitly with something like cert-manager.

@adleong adleong merged commit a6ea765 into linkerd:main Oct 5, 2023
36 checks passed
alpeb added a commit that referenced this pull request Oct 12, 2023
This edge release includes a fix addressing an issue during upgrades for
instances not relying on automated webhook certificate management (like
cert-manager provides).

* Added a `checksum/config` annotation to the destination and proxy injector
  deployment manifests to force restarting those workloads whenever their
  webhook secrets change during upgrade (thanks @iAnomaly!) ([#11440])
* Fixed policy controller error when deleting a gateway API HTTPRoute resource
  ([#11471])
alpeb added a commit that referenced this pull request Oct 12, 2023
## edge-23.10.2

This edge release includes a fix addressing an issue during upgrades for
instances not relying on automated webhook certificate management (like
cert-manager provides).

* Added a `checksum/config` annotation to the destination and proxy injector
  deployment manifests, to force restarting those workloads whenever their
  webhook secrets change during upgrade (thanks @iAnomaly!) ([#11440])
* Fixed policy controller error when deleting a Gateway API HTTPRoute resource
  ([#11471])

[#11440]: #11440
[#11471]: #11471
@logicminds

Is this something that would be backported to older releases? Or could there be a doc on how to add these to 2.12+? I currently just set my hook to ignore everything so I can upgrade.

hawkw pushed a commit that referenced this pull request Nov 22, 2023
…1440)

Fixes #6940

Added a `checksum/config` annotation into the destination, proxy-injector and tap-injector workloads, whose value is calculated as the SHA256 of the template file containing the TLS cert they depend on. This is necessary so that every time those other files change (they get re-generated on every upgrade or config update via `linkerd upgrade`), the workloads change as well. 
We had this in place before, but with the 2.12 helm charts migrations we dropped it by mistake.

Signed-off-by: Cameron Boulton <[email protected]>
hawkw added a commit that referenced this pull request Nov 22, 2023
## stable-2.14.5

This stable release fixes a proxy regression where bursts of TCP
connections could result in EOF errors, due to an incorrect queue
capacity. In addition, it includes fixes for the control plane,
dependency upgrades, and support for image digests in Linkerd manifests.

* Added a controlPlaneVersion override to the `linkerd-control-plane`
  Helm chart to support including SHA256 image digests in Linkerd
  manifests (thanks @cromulentbanana!) ([#11406]; fixes [#11312])
* Added a `checksum/config` annotation to the destination and proxy
  injector deployment manifests, to force restarting those workloads
  whenever their webhook secrets change during upgrade (thanks
  @iAnomaly!) ([#11440]; fixes [#6940])
* Updated the Policy controller's OpenSSL dependency to v3, as OpenSSL
  1.1.1 is EOL ([#11625])
* proxy: Increased `DEFAULT_OUTBOUND_TCP_QUEUE_CAPACITY` to prevent EOF
  errors during bursts of TCP connections (proxy PR [#2521][proxy-2521])

[#11406]: #11406
[#11312]: #11312
[#11440]: #11440
[#6940]: #6940
[#11625]: #11625
[proxy-2521]: linkerd/linkerd2-proxy#2521
@hawkw hawkw mentioned this pull request Nov 22, 2023
MaungSan added a commit to MaungSan/real-world-argo-linkerd that referenced this pull request Dec 19, 2023
This PR fixes ignoreDifferences; otherwise Argo CD will be out of sync every few minutes and continuously restart the linkerd-destination and linkerd-proxy-injector pods.

Fix the group field for the MutatingWebhookConfiguration and ValidatingWebhookConfiguration entries by removing the API version as per argocd requirements here
Adds three additional entries: one to ignore the CronJob schedule and two to ignore the checksum/config annotation in the linkerd-destination and linkerd-proxy-injector Deployments (See Restart destination, proxy-injector controllers on config change. linkerd/linkerd2#11440)
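
The effect of ignoring the annotation can be illustrated with a small comparison. This is a sketch of the idea only, not Argo CD's diffing code: the `without_annotation` and `make_deployment` helpers and the checksum values are hypothetical.

```python
import copy

IGNORED_ANNOTATION = "checksum/config"


def without_annotation(deployment: dict, key: str = IGNORED_ANNOTATION) -> dict:
    """Return a copy of the manifest with one pod-template annotation removed."""
    result = copy.deepcopy(deployment)
    annotations = (
        result.get("spec", {})
        .get("template", {})
        .get("metadata", {})
        .get("annotations", {})
    )
    annotations.pop(key, None)
    return result


def make_deployment(checksum: str) -> dict:
    """Build a tiny Deployment-like dict (hypothetical test data)."""
    return {
        "kind": "Deployment",
        "metadata": {"name": "linkerd-destination"},
        "spec": {
            "template": {"metadata": {"annotations": {"checksum/config": checksum}}}
        },
    }


live = make_deployment("aaa")
desired = make_deployment("bbb")

# Comparing the raw manifests reports a difference on every new render...
raw_diff = live != desired
# ...while comparing them with the annotation stripped does not.
ignored_diff = without_annotation(live) != without_annotation(desired)
```

Stripping the field from both sides before comparing means a checksum that changes on each render no longer registers as drift.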
Development

Successfully merging this pull request may close these issues.

Upgrading sometimes requires restarting linkerd-destination
5 participants