
Issue with Certificates HTTP-01 LE challenge with ingress-nginx #354

Closed
ashokspeelyaal opened this issue Oct 18, 2022 · 17 comments

@ashokspeelyaal

ashokspeelyaal commented Oct 18, 2022

I have the cluster set up and running. Below is my configuration for the ingress controller and cert-manager:


  enable_cert_manager = true
  enable_nginx        = true
  enable_traefix      = false

Then I created a ClusterIssuer as below:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # You must replace this email address with your own.
    # Let's Encrypt will use this to contact you about expiring
    # certificates, and issues related to your account.
    email: [email protected]
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      # Secret resource that will be used to store the account's private key.
      name: staging-issuer-account-key
    # Add a single challenge solver, HTTP01 using nginx
    solvers:
      - http01:
          ingress:
            class: nginx

But the certificates are not being issued. I see the below status:

  Conditions:
    Last Transition Time:        2022-10-18T07:50:22Z
    Message:                     Issuing certificate as Secret does not exist
    Observed Generation:         1
    Reason:                      DoesNotExist
    Status:                      False
    Type:                        Ready
    Last Transition Time:        2022-10-18T07:50:22Z
    Message:                     Issuing certificate as Secret does not exist
    Observed Generation:         1
    Reason:                      DoesNotExist
    Status:                      True
    Type:                        Issuing
  Next Private Key Secret Name:  tls-grafana-nonprod-2t6c7
Events:
  Type    Reason     Age   From                                       Message
  ----    ------     ----  ----                                       -------
  Normal  Issuing    60s   cert-manager-certificates-trigger          Issuing certificate as Secret does not exist
  Normal  Generated  59s   cert-manager-certificates-key-manager      Stored new private key in temporary Secret resource "tls-grafana-nonprod-2t6c7"
  Normal  Requested  59s   cert-manager-certificates-request-manager  Created new CertificateRequest resource "tls-grafana-nonprod-vmnbr"

When I check the logs of the cert-manager pod, I see this error:

E1018 07:56:31.619347 1 sync.go:190] cert-manager/challenges "msg"="propagation check failed" "error"="failed to perform self check GET request 'http://<mydomain>/.well-known/acme-challenge/TFjxhOtvqJusiURDD5OZHATbUuchvFdfmZNQ': Get \"http://<mydomain>/.well-known/acme-challenge/TFjxhOtvqJugBOEfl9siURDD5OZHATbUuchvFdfmZNQ\": EOF" "dnsName"="<mydomain>" "resource_kind"="Challenge" "resource_name"="tls-grafana-nonprod-vmnbr-468113688-1685101689" "resource_namespace"="grafana-stack" "resource_version"="v1" "type"="HTTP-01"

As per the cert-manager community, the URL http://<mydomain>/.well-known/acme-challenge/TFjxhOtvqJusiURDD5OZHATbUuchvFdfmZNQ should be accessible from within the cluster.

I am able to access this URL from a browser, but when I try to curl it from a pod I get curl error 52: empty reply from server.
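For reference, the in-cluster check can be reproduced with a throwaway pod like the one below (just a sketch: the pod name and image are arbitrary, and the challenge URL is a placeholder to be replaced with the one from the cert-manager error):

apiVersion: v1
kind: Pod
metadata:
  name: acme-selfcheck   # arbitrary name, only used for this test
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl   # any image that ships curl works
      # Replace <mydomain> and <token> with the URL from the cert-manager error
      args: ["-v", "http://<mydomain>/.well-known/acme-challenge/<token>"]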

If any of you have done this successfully, can you let me know what needs to be fixed?

Note: I have tried with the production ClusterIssuer as well, but the issue remains the same.

@mysticaltech
Collaborator

Weird - Please post your full kube.tf without the sensitive values.

@mysticaltech
Collaborator

And are you 100% positive that your DNS points to the generated ingress LB IPs? Both A and AAAA records?

If you just made the change, give the DNS some time to propagate, or set the same DNS servers you have on the cluster; there is a variable for that.

You can also use the dig command inside your pods/containers to see if the name resolution is correct.

@ashokspeelyaal
Author

ashokspeelyaal commented Oct 18, 2022

@mysticaltech, Yes, I am sure about the DNS, because I use external-dns and have verified that the A records are updated. As I said before, the challenge URL is accessible from a browser.

I created a module called kube-environment to add more modules (like external-dns and grafana).

Here is the default variables list:

variable "hcloud_token" {
  default = ""
}
variable "agent_nodepools" {
  default = [
    {
      name        = "agent-small-fsn1",
      server_type = "cpx21",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 2
    },
    {
      name        = "agent-large-nbg1",
      server_type = "cpx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 2
    },
    {
      name        = "agent-large-hel1",
      server_type = "cpx21",
      labels      = [],
      taints      = [],
      location    = "hel1",
      count = 2
      # In the case of using Longhorn, you can use Hetzner volumes instead of using the node's own storage by specifying a value from 10 to 10000 (in GB)
      # It will create one volume per node in the nodepool, and configure Longhorn to use them.
      # longhorn_volume_size = 20
    },
    {
      name        = "storage",
      server_type = "cpx21",
      location    = "fsn1",
      # Fully optional, just a demo
      labels = [
        "node.kubernetes.io/server-usage=storage"
      ],
      taints = [
        "server-usage=storage:NoSchedule"
      ],
      count = 1
      # In the case of using Longhorn, you can use Hetzner volumes instead of using the node's own storage by specifying a value from 10 to 10000 (in GB)
      # It will create one volume per node in the nodepool, and configure Longhorn to use them.
      # longhorn_volume_size = 20
    }
  ]
}
variable "ssh_public_key_path" {
  default = ""
}
variable "ssh_private_key_path" {
  default = ""
}
variable "k8s_network_region" {
  default = "eu-central"
}
variable "load_balancer_type" {
  default = "lb11"
}
variable "load_balancer_location" {
  default = "nbg1"
}
variable "control_plane_nodepools" {
  default =  [

    {
      name        = "control-plane-nbg1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-fsn1",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-hel1",
      server_type = "cpx11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]
}
variable "enable_nginx" {
  default = false
}
variable "enable_traefix" {
  default = false
}
variable "cluster_name" {
  default = "app-factory-nonprod"
}
variable "extra_firewall_rules" {
  default = []
}
variable "enable_cert_manager" {
  default = false
}
variable "use_control_plane_lb" {
  default = true
}

And this is how I am using it:

module "kube_environment_nonprod" {
  source = "../../../modules/hetzner/kube-environment"


  cluster_name = var.cluster_name
  ssh_private_key_path = var.ssh_private_key_path
  ssh_public_key_path = var.ssh_public_key_path
  use_control_plane_lb = true
  enable_cert_manager = true
  enable_nginx = true
  enable_traefix = false
  hcloud_token = var.hcloud_token
  agent_nodepools = [

    {
      name        = "agent-medium-fsn1",
      server_type = "cpx21",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 2
    },
    {
      name        = "agent-medium-nbg1",
      server_type = "cpx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 2
    },
    {
      name        = "agent-medium-hel1",
      server_type = "cpx21",
      labels      = [],
      taints      = [],
      location    = "hel1",
      count = 2

    },
    {
      name        = "agent-small-nbg1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 2
    }
  ]

  providers = {
    hcloud = hcloud
  }
}


Here is the ingress: (I have used this ingress before in other managed k8s clusters)

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-dashboard-ingress
  namespace: {{.Values.namespace}}
  annotations:
    kubernetes.io/ingress.class: nginx
    ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: letsencrypt-staging
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/use-regex: "true"
    external-dns.alpha.kubernetes.io/access: "public"
    #nginx.ingress.kubernetes.io/ssl-redirect: "false"
spec:
  tls:
    - hosts:
        - {{.Values.dashboardHost}}
      secretName: tls-grafana-{{.Values.environment}}
  rules:
    - host: {{.Values.dashboardHost}}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: {{ .Values.grafana.port }}




@mysticaltech
Collaborator

mysticaltech commented Oct 18, 2022

@ashokspeelyaal Thanks for sharing those details, it's the first time I've seen it done that way for the module, interesting!

Your physical location and the location of the cluster in the cloud are not the same; the DNS servers used in the cloud may not have picked up your changes yet at the time you tried. Please retry and let me know!

@WolfspiritM
Contributor

I'm actually having exactly the same issue setting up a cluster right now. I have the feeling it has something to do with the load balancer not being able to route traffic from inside the cluster back to the cluster for some reason, or it's an issue with Hetzner in general right now.

@WolfspiritM
Contributor

WolfspiritM commented Oct 18, 2022

I found the reason why that happens.

It seems like the Hetzner load balancer handles internal traffic differently from external traffic. In particular, it apparently doesn't send the PROXY protocol header for internal traffic, so ingress-nginx doesn't know what to do with it; ingress-nginx only accepts either the proxy header or no header, depending on its configuration. Not sure how Traefik handles that.

I've now turned off the proxy protocol by setting this in the nginx config:

"use-proxy-protocol": "false"

and this to the service annotations:

"load-balancer.hetzner.cloud/uses-proxyprotocol": "false"

Now both external and internal traffic works as expected. However, that means we only see the IP of the load balancer inside the cluster, I guess, as ingress-nginx no longer knows the real client IP.
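Roughly, those two settings end up in the following places (a sketch only; the ConfigMap/Service names and the namespace are assumptions that depend on how ingress-nginx was installed, so merge the fields into the existing objects rather than applying this as-is):

# ingress-nginx controller ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  use-proxy-protocol: "false"   # stop expecting the PROXY protocol header
---
# LoadBalancer Service of the controller (annotations only, partial)
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    load-balancer.hetzner.cloud/uses-proxyprotocol: "false"   # tell the Hetzner LB to send plain traffic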

@mysticaltech
Collaborator

Interesting issue, thanks for contributing a solution @WolfspiritM!

@mysticaltech
Collaborator

@phaer Any ideas on how to fix this for good?

@mysticaltech changed the title from "Issue with Certificates" to "Issue with Certificates HTTP-01 LE challenge with ingress-nginx" on Oct 19, 2022
@mysticaltech
Collaborator

mysticaltech commented Oct 19, 2022

@ashokspeelyaal @WolfspiritM I found the issue. It's related to cert-manager/cert-manager#466.

Could you folks please try setting this annotation on the nginx service? It should fix it, but I need confirmation:

load-balancer.hetzner.cloud/hostname

And it should be given the value of an FQDN that points to your LB. Please let me know how it goes, folks! 🙏
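For example, something along these lines (a sketch; the Service name and namespace are assumptions, and lb.example.com stands in for an FQDN whose A/AAAA records point to your load balancer):

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller   # assumed name of the controller Service
  namespace: ingress-nginx
  annotations:
    load-balancer.hetzner.cloud/hostname: "lb.example.com"   # FQDN pointing at the LB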

(screenshot: ksnip_20221020-013828)

If this works, we could just explain the procedure in the docs.

Sources:
https://github.com/hetznercloud/hcloud-cloud-controller-manager#kube-proxy-mode-ipvs-and-hcloud-loadbalancer
https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/internal/annotation/load_balancer.go

@WolfspiritM
Contributor

This seems to have fixed it!

I'm still not exactly sure what that changed, because even curl requests directly from the agent machines didn't work, but setting the hostname resolved it somehow.

So what I understand is that traffic to the external load balancer IP is somehow redirected directly to the nginx ingress service, which then doesn't have the proxy IP.

My guess is that setting the hostname somehow makes nginx ingress aware that the request is coming from the load balancer instead of from inside the cluster?!

@mysticaltech
Collaborator

Thanks for trying, @WolfspiritM; yes, exactly... It takes the "external route" somehow. The problem with the "internal route," from my understanding, was that it lacks the PROXY header.

By setting the hostname, we bypass that issue!

@mysticaltech
Collaborator

README.md updated; you can find the new recommendation in Examples > Ingress with TLS.

@ashokspeelyaal
Author

Thanks a lot @mysticaltech @WolfspiritM

@RudlTier

Thanks @mysticaltech and @WolfspiritM!
How can multiple different subdomains be handled with this setup?
e.g. app1.demo.com + app2.demo.com
What value would need to be set for load-balancer.hetzner.cloud/hostname?

@WolfspiritM
Contributor

@RudlTier Just a public subdomain that maps to the load balancer; in our case we use lb.demo.com. It doesn't mean it's using only that domain for TLS.

@RudlTier

@WolfspiritM thanks for the quick answer. That worked! :)

@aleksasiriski
Member

aleksasiriski commented Feb 13, 2023

If you use nginx_values, then lb_hostname is ignored. Don't use nginx_values; if you need the HTTP-01 challenge, leaving the default values and setting lb_hostname is the correct solution.

This is a known problem across EVERY cloud provider, here's the source:

Reason: I enable the proxy protocol on the load balancers so that my ingress controller and applications can "see" the real IP address of the client. However when this is enabled, there is a problem where cert-manager fails http01 challenges; you can find an explanation of why here but the easy fix provided by some providers - including Hetzner - is to configure the load balancer so that it uses a hostname instead of an IP. Again, read the explanation for the reason but if you care about seeing the actual IP of the client then I recommend you use these two annotations.
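In other words, the combination that quote recommends looks roughly like this on the ingress-nginx Service (a sketch; the names and lb.example.com are placeholders, and with this module you would normally get the same result by setting lb_hostname rather than editing the Service by hand):

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    load-balancer.hetzner.cloud/uses-proxyprotocol: "true"   # keep the real client IP
    load-balancer.hetzner.cloud/hostname: "lb.example.com"   # so in-cluster HTTP-01 self-checks still work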
