Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Helm HA install w/cert-manager webhook CA fails w/webhook linkerd-proxy-injector error #7699

Closed
ron1 opened this issue Jan 25, 2022 · 8 comments · Fixed by #7718
Closed

Helm HA install w/cert-manager webhook CA fails w/webhook linkerd-proxy-injector error #7699

ron1 opened this issue Jan 25, 2022 · 8 comments · Fixed by #7718

Comments

@ron1
Copy link

ron1 commented Jan 25, 2022

What is the issue?

Helm HA install w/cert-manager webhook CA fails with all 3 linkerd control plane Deployments reporting

- "Internal error occurrred: failed calling webhook "linkerd-proxy-injector.linkerd.io": Post "https://linkerd-proxy-injector.linkerd.svc.443/?timeout=10s": dial tcp 10.43.4.25:443: connect: connection refused
- Deployment does not have minimum availability.
- Progress deadline exceeded
state: updating
transitioning: true

How can it be reproduced?

Perform Helm HA install w/cert-manager managed webhook CA

Logs, error output, etc

- "Internal error occurrred: failed calling webhook "linkerd-proxy-injector.linkerd.io": Post "https://linkerd-proxy-injector.linkerd.svc.443/?timeout=10s": dial tcp 10.43.4.25:443: connect: connection refused
- Deployment does not have minimum availability.
- Progress deadline exceeded
state: updating
transitioning: true

output of linkerd check -o short

edge-22.1.1

Environment

  • Kubernetes Version: Rancher RKE2 1.21.5+rke2r1
  • CentOS 7.9
  • 3 server nodes, 6 agent nodes

Possible solution

A work-around is to install with MutatingWebhookConfiguration failurePolicy of Ignore and change the policy to Fail after install.

Additional context

No response

Would you like to work on fixing this bug?

no

@ron1 ron1 added the bug label Jan 25, 2022
@olix0r
Copy link
Member

olix0r commented Jan 26, 2022

Is the issue basically that there's a race between the MutatingWebhookConfiguration being applied and the pods being created? I.e., if the mwc is applied first, the injector pods don't start because there's no webook available?

The MutatingWebookConfiguration is configured with:

    namespaceSelector:
      matchExpressions:
        - key: config.linkerd.io/admission-webhooks
          operator: NotIn
          values:
            - disabled

When Linkerd is installed via the CLI, we get a namespace like:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    linkerd.io/inject: disabled
  labels:
    config.linkerd.io/admission-webhooks: disabled
    kubernetes.io/metadata.name: linkerd
    linkerd.io/control-plane-ns: linkerd
    linkerd.io/is-control-plane: "true"
  name: linkerd

Does your linkerd namespace have these labels?

@ron1
Copy link
Author

ron1 commented Jan 26, 2022

How is the namespace intended to acquire these annotations and labels in the helm install? Must the namespace be pre-created and pre-configured to support an HA install?

Since the crds chart is a prereq for the control-plane chart, the control-plane chart should be able to assume the namespace exists. Therefore, could the control plane chart use a pre-hook to annotate and label the linkerd namespace if needed?

In this BYO cert-manager webhook CA scenario, it seems that the cert-manager namespace also needs the disallow admission-webhooks annotation to avoid a similar MutatingWebhookConfiguration race.

Would it make sense for the control-plane chart ha-values.yaml to include an optional property containing a list of namespace names that require the disallow admission-webhooks annotation with default value [kube-system, cert-manager]? An optional job could then take care of annotating these namespaces if they exist. I create such a job today in a helm post-renderer. This job would require extensive privileges which limit its value. OTOH, installing a service mesh like linkerd is a cluster admin function anyway.

I guess I could have worked-around this issue by also using my custom job above to annotate the linkerd namespace as well. However, I just assumed the namespace annotations and labels were already being taken care of for me since the cli already does it.

Finally, how is linkerd-cni namespace supposed to acquire its annotations and labels using the Helm install?

@alpeb alpeb self-assigned this Jan 26, 2022
@alpeb
Copy link
Member

alpeb commented Jan 26, 2022

Thanks for the feedback @ron1 ! This is indeed a problem with the helm charts in edge under HA. As per helm best-practices we stopped having the chart generate the namespace so we can't rely on those labels. But if I'm not mistaking, we can address this by simply adding an objectSelector to the webhook config that would discard any pod with the label linkerd.io/control-plane-component.

In this BYO cert-manager webhook CA scenario, it seems that the cert-manager namespace also needs the disallow admission-webhooks annotation to avoid a similar MutatingWebhookConfiguration race.

When the webhook race for bringing linkerd-proxy-injector up gets resolved, I think this won't be a problem. Cert-manager might fail starting first, but should quickly converge to a running state as soon as the injector comes up. OTOH kube-system does require to be ignored, otherwise the cluster will easily get deadlocked. We recommend filtering it out somewhere in the docs, but I think it's a good idea to have it done by default in HA. I'll see if there's a standard label identifying kube-system that we can use in the objectSelector to avoid relying on a helm hook.

Finally, how is linkerd-cni namespace supposed to acquire its annotations and labels using the Helm install?

Sounds like there's an issue there too. We'll need to include it in the objectSelector as well.

@ron1
Copy link
Author

ron1 commented Jan 26, 2022

Due to Rancher Fleet bug rancher/fleet#486, I seem to be limited in my ability to control the order in which the cert-manager and linkerd workloads are deployed.

And just to confirm, you think the potential race between the linkerd-proxy-injector MutatingWebhookConfiguration resource, the linkerd-proxy-injector webhook cert, the linkerd-proxy-injector deployment, and the cert-manager-cainjector should be insignificant.

@ron1
Copy link
Author

ron1 commented Jan 26, 2022

We recommend filtering it out somewhere in the docs, but I think it's a good idea to have it done by default in HA. I'll see if there's a standard label identifying kube-system that we can use in the objectSelector to avoid relying on a helm hook.

At least on Kubernetes 1.21+, could you exclude kube-system by adding an additional namespaceSelector.matchExpressions based on label kubernetes.io/metadata.name: kube-system?

@alpeb
Copy link
Member

alpeb commented Jan 26, 2022

Due to Rancher Fleet bug rancher/fleet#486, I seem to be limited in my ability to control the order in which the cert-manager and linkerd workloads are deployed.

And just to confirm, you think the potential race between the linkerd-proxy-injector MutatingWebhookConfiguration resource, the linkerd-proxy-injector webhook cert, the linkerd-proxy-injector deployment, and the cert-manager-cainjector should be insignificant.

Ok I didn't have in mind cert-manager-cainjector. Seems like we'll have the injector bypass the entire cert-manager namespace then.

At least on Kubernetes 1.21+, could you exclude kube-system by adding an additional namespaceSelector.matchExpressions based on label kubernetes.io/metadata.name: kube-system?

That seems to be the case in all the clusters I've checked 👍

@ron1
Copy link
Author

ron1 commented Jan 26, 2022

FYI, I was able to successfully deploy linkerd workload to linkerd-cni, linkerd, and linkerd-viz namespaces with the following linkerd-control-plane helm values and kustomize post-renderer patch. Thanks much for your assistance.

helm values.yaml

cniEnabled: true
identity:
  externalCA: true
  issuer:
    scheme: "kubernetes.io/tls"
profileValidator/proxyInjector:
  externalSecret: true
  injectCaFrom: linkerd/webhook-issuer-tls
  namespaceSelector:
    matchExpressions:
    - key: config.linkerd.io/admission-webhooks
      operator: NotIn
      values:
      - disabled
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kube-system
      - linkerd-cni
      - cert-manager

kustomization.yaml

patches:
- target
  group: admissionregistration.k8s.io
  version: v1
  kind: MutatingWebhookConfiguration
  name: linkerd-proxy-injector-webhook-config
  patch: |-
    - op: replace
      path: /webhooks/0/objectSelector
      value:
        matchExpressions:
        - key: linkerd.io/control-plane-component
          operator: DoesNotExist

@alpeb
Copy link
Member

alpeb commented Jan 27, 2022

Sounds good, I'll push a fix taking this into account 👍

alpeb added a commit that referenced this issue Jan 27, 2022
Fixes #7699

The problem didn't affect 2.11, only latest edges since the Helm charts
got split into `linkerd-crds` and `linkerd-control-plane` and we stopped
creating the linkerd namespace.

With the surrendering of the creation of the namespace, we can no longer
guarantee the existence of the `config.linkerd.io/admission-webhooks`
label, so this PR creates an `objectSelector` for the injector that
filters-out control-plane components, based on the existence of the
`linkerd.io/control-plane-component` label.

Given we still want the multicluster components to be injected, we had
to be rename its `linkerd.io/control-plane-component` label to
`component`, following the same convention used by the other extensions.
The corresponding Prometheus rule for scraping the service mirrors was
updated accordingly.

A similar filter was added for the linkerd-cni DaemonSet.

Also, now that the `kubernetes.io/metadata.name` is prevalent, we're
also using it to filter out the kube-system and cert-manager namespaces.
The former namespace was already mentioned in the docs; the latter is
also included to avoid having races with cert-manager-cainjector which
can be used to provision the injector's cert.
alpeb added a commit that referenced this issue Feb 2, 2022
* Fix HA race when installing through Helm

Fixes #7699

The problem didn't affect 2.11, only latest edges since the Helm charts
got split into `linkerd-crds` and `linkerd-control-plane` and we stopped
creating the linkerd namespace.

With the surrendering of the creation of the namespace, we can no longer
guarantee the existence of the `config.linkerd.io/admission-webhooks`
label, so this PR creates an `objectSelector` for the injector that
filters-out control-plane components, based on the existence of the
`linkerd.io/control-plane-component` label.

Given we still want the multicluster components to be injected, we had
to be rename its `linkerd.io/control-plane-component` label to
`component`, following the same convention used by the other extensions.
The corresponding Prometheus rule for scraping the service mirrors was
updated accordingly.

A similar filter was added for the linkerd-cni DaemonSet.

Also, now that the `kubernetes.io/metadata.name` is prevalent, we're
also using it to filter out the kube-system and cert-manager namespaces.
The former namespace was already mentioned in the docs; the latter is
also included to avoid having races with cert-manager-cainjector which
can be used to provision the injector's cert.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 5, 2022
olix0r pushed a commit that referenced this issue Apr 13, 2022
* Fix HA race when installing through Helm

Fixes #7699

The problem didn't affect 2.11, only latest edges since the Helm charts
got split into `linkerd-crds` and `linkerd-control-plane` and we stopped
creating the linkerd namespace.

With the surrendering of the creation of the namespace, we can no longer
guarantee the existence of the `config.linkerd.io/admission-webhooks`
label, so this PR creates an `objectSelector` for the injector that
filters-out control-plane components, based on the existence of the
`linkerd.io/control-plane-component` label.

Given we still want the multicluster components to be injected, we had
to be rename its `linkerd.io/control-plane-component` label to
`component`, following the same convention used by the other extensions.
The corresponding Prometheus rule for scraping the service mirrors was
updated accordingly.

A similar filter was added for the linkerd-cni DaemonSet.

Also, now that the `kubernetes.io/metadata.name` is prevalent, we're
also using it to filter out the kube-system and cert-manager namespaces.
The former namespace was already mentioned in the docs; the latter is
also included to avoid having races with cert-manager-cainjector which
can be used to provision the injector's cert.

(cherry picked from commit 539bcce)
Signed-off-by: Oliver Gould <[email protected]>
olix0r pushed a commit that referenced this issue Apr 14, 2022
* Fix HA race when installing through Helm

Fixes #7699

The problem didn't affect 2.11, only latest edges since the Helm charts
got split into `linkerd-crds` and `linkerd-control-plane` and we stopped
creating the linkerd namespace.

With the surrendering of the creation of the namespace, we can no longer
guarantee the existence of the `config.linkerd.io/admission-webhooks`
label, so this PR creates an `objectSelector` for the injector that
filters-out control-plane components, based on the existence of the
`linkerd.io/control-plane-component` label.

Given we still want the multicluster components to be injected, we had
to be rename its `linkerd.io/control-plane-component` label to
`component`, following the same convention used by the other extensions.
The corresponding Prometheus rule for scraping the service mirrors was
updated accordingly.

A similar filter was added for the linkerd-cni DaemonSet.

Also, now that the `kubernetes.io/metadata.name` is prevalent, we're
also using it to filter out the kube-system and cert-manager namespaces.
The former namespace was already mentioned in the docs; the latter is
also included to avoid having races with cert-manager-cainjector which
can be used to provision the injector's cert.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants