
Bump the terraform module for AWS EKS (and consequences) #3305

Closed
dduportal opened this issue Dec 20, 2022 · 15 comments

@dduportal
Contributor

dduportal commented Dec 20, 2022

This issue is related to a major bump of the Terraform EKS module that we use in https://github.com/jenkins-infra/aws to manage the two EKS clusters of our infrastructure (cik8s and eks-public).

This issue serves as an audit trail of the whole operation.

@dduportal
Contributor Author

On the Terraform code side, handled with "usual" PRs:

@dduportal
Contributor Author

Note that, with the 19.x changes, the EKS clusters are now private by default.

Hotfix for the cik8s cluster: jenkins-infra/aws@27d4f74
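
For reference, the public endpoint can also be re-enabled outside of Terraform with the AWS CLI; a minimal sketch, where the cluster name is a placeholder (the real EKS cluster names differ from our short aliases):

# Re-enable the public API endpoint of an EKS cluster (async operation);
# <cluster-name> is a placeholder, not the actual cluster name.
aws eks update-cluster-config \
  --name <cluster-name> \
  --resources-vpc-config endpointPublicAccess=true,endpointPrivateAccess=true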


@dduportal
Contributor Author

Status: 2 new problems to fix:

  • The certificate for https://repo.aws.jenkins.io/ is not issued. We need to check cert-manager (is it rate-limited by Let's Encrypt?)
  • The terratests fail with the latest EKS module (see the main branch build of jenkins-infra/aws).
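
A sketch of the first diagnostic commands for the certificate problem, assuming the standard cert-manager CRDs (Let's Encrypt rate-limit errors usually surface in the Order and Challenge resources):

# List the ACME resources across all namespaces, then inspect the
# pending challenges for rate-limit or validation errors.
kubectl get certificate,certificaterequest,order,challenge -A
kubectl describe challenge -A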

@dduportal dduportal self-assigned this Dec 20, 2022
@dduportal
Contributor Author

[10.0.0.38] - - [22/Dec/2022:14:47:42 +0000] "GET /.well-known/acme-challenge/<redacted> HTTP/1.1" 401 172 "http://repo.aws.jenkins.io/.well-known/acme-challenge/<redacted>" "cert-manager-challenges/v1.9.1 (linux/amd64) cert-manager/<redacted>" 377 0.000 [artifact-caching-proxy-artifact-caching-proxy-8080] - - - - <redacted>

This is weird: the /.well-known location should not require authentication, as per https://github.com/kubernetes/ingress-nginx/blob/f9cce5a4ed7ef372a18bc826e395ff5660b7a444/docs/user-guide/nginx-configuration/configmap.md#no-auth-locations

But since we define a custom configmap, the default might be overwritten: https://github.com/jenkins-infra/kubernetes-management/blob/8c6d91f9a02048f3b9e8fb4a444106f5a08fcfe6/config/ext_public-nginx-ingress__common.yaml#L25-L36 🤔
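
If that is the cause, re-declaring the chart default next to our custom keys should unblock the ACME challenge. A minimal sketch, assuming the controller ConfigMap follows the Helm release naming (the durable fix belongs in the custom values in jenkins-infra/kubernetes-management):

# Restore the ingress-nginx default so the ACME challenge path stays
# unauthenticated; the ConfigMap name is assumed from the release name.
kubectl -n public-nginx-ingress patch configmap \
  public-nginx-ingress-ingress-nginx-controller \
  --type merge \
  -p '{"data":{"no-auth-locations":"/.well-known/acme-challenge"}}'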

@smerle33
Contributor

smerle33 commented Jan 12, 2023

Just discovered a second EKS cluster named jenkins-infra-eks-ENRZrfwf, which will probably need to be cleaned up if not used by jenkins-infra.


@dduportal
Contributor Author

Just discovered a second EKS cluster named jenkins-infra-eks-ENRZrfwf, which will probably need to be cleaned up if not used by jenkins-infra.

Checked during a team working session: this cluster is cik8s (used by ci.jenkins.io for its builds). We did not find any dangling resources.

@dduportal
Contributor Author

We had an issue with this cluster after the ingress rules were successfully updated with a valid certificate:

the public IPs (the 3 public IPs associated with the 3 availability zones of the public load balancer) weren't reachable at all (even from inside the cluster), while Kubernetes reported everything as healthy.

Here are my (raw) notes:

  • With the command kubectl get svc -A we can see that the public-nginx Ingress controller has an AWS load balancer associated with a valid DNS name (using dig we see 3 A records pointing to the 3 public IPs):
public-nginx-ingress     public-nginx-ingress-ingress-nginx-controller             LoadBalancer   172.20.240.59    k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com   80:31868/TCP,443:32267/TCP   38d
  • From any pod of the cluster (for instance with kubectl -n artifact-caching-proxy exec -ti artifact-caching-proxy-0 -- sh), we try to reach both the private and the public IP of the public Service LB listed above:
curl -v 172.20.48.207 -o /dev/null
*   Trying 172.20.48.207:80...
* Connected to 172.20.48.207 (172.20.48.207) port 80 (#0)
> GET / HTTP/1.1
> Host: 172.20.48.207
> User-Agent: curl/7.83.1
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.22.1
< Date: Fri, 27 Jan 2023 16:45:18 GMT
< Content-Type: text/html
< Content-Length: 1826
< Last-Modified: Mon, 23 Jan 2023 01:36:05 GMT
< Connection: keep-alive
< ETag: "63cde485-722"
< Accept-Ranges: bytes
< 
* Connection #0 to host 172.20.48.207 left intact
curl -v k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com
*   Trying 18.116.6.230:80...

# Stuck, need to wait 60s for timeout or issue a Ctrl-C cancellation

=> The private IP works as expected, but the public IPs of the LB do not answer. It means the issue is with the LB itself.

  • Checking the LB in the AWS UI (section EC2 -> "Load Balancing" -> "Load Balancers"): select the LB and, in the "Listeners" tab, click on the "Default routing rule" of the "TCP:80" line (for example).
    The list of "target groups" (i.e. the backend IPs of the LB) is empty: that confirms the observed behavior.

  • This list of backend IPs is specified by Kubernetes, in particular by the "AWS LB Controller" that we installed in this cluster. The role of this component is to scan the Kubernetes API for Service resources of type LoadBalancer and to create/update/delete the corresponding load balancers through the AWS API.
    Checking the logs of this component (kubectl -n aws-load-balancer logs -l app.kubernetes.io/instance=aws-load-balancer-controller) shows the error:

{"level":"error","ts":1674838128.6175954,"logger":"controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-publicng-publicng-7482972d25","namespace":"public-nginx-ingress","error":"expect exactly one securityGroup tagged with kubernetes.io/cluster/public-happy-polliwog for eni eni-0970f1ec0888c2d65, got: [sg-0c0d669a830f6e013 sg-0ca36e364f5491978] (clusterName: public-happy-polliwog)"}
  • Hotfix: from the AWS UI, remove the tag kubernetes.io/cluster/public-happy-polliwog=true from the security group of the cluster itself (eks-cluster-sg-public-happy-polliwog-1884802038, usually the first in the list) and keep this tag on the SG public-happy-polliwog-node, because this second SG is applied to the Kubernetes node VMs, which host the floating private IPs of the public Service LB. See the CLI sketch below.
    => After 5 min, the whole system was working again.

=> Next step: find out how to avoid this duplicate tagging in the jenkins-infra/aws Terraform code.
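
For the record, a CLI equivalent of the UI hotfix above; the SG ID comes from the controller error message, and which of the two SGs is the cluster one should be double-checked before removing the tag:

# Remove the duplicate cluster tag from the cluster security group
# (sg-0c0d669a830f6e013 is taken from the error above; verify it is the
# eks-cluster-sg one before running this).
aws ec2 delete-tags \
  --resources sg-0c0d669a830f6e013 \
  --tags Key=kubernetes.io/cluster/public-happy-polliwog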

@dduportal
Contributor Author

New error: we cannot update the ACP (artifact-caching-proxy) statefulset:

 Normal   NotTriggerScaleUp  3m58s (x32461 over 3d18h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
  Warning  FailedScheduling   2m30s (x5365 over 3d18h)   default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict.

=> the autoscaler pod's logs for this cluster show that autoscaling cannot be done:

I0207 10:58:37.369492       1 binder.go:791] "Could not get a CSINode object for the node" node="template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512" err="csinode.storage.k8s.io \"template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512\" not found"
I0207 10:58:37.369532       1 binder.go:811] "PersistentVolume and node mismatch for pod" PV="pvc-173ee3c5-22ec-4444-bee0-fe7b8ece01fa" node="template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512" pod="artifact-caching-proxy/artifact-caching-proxy-0" err="no matching NodeSelectorTerms"
I0207 10:58:37.369561       1 scale_up.go:300] Pod artifact-caching-proxy-0 can't be scheduled on eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0207 10:58:37.369819       1 scale_up.go:449] No pod can fit to eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44
I0207 10:58:37.369836       1 scale_up.go:453] No expansion options

It looks like kubernetes/autoscaler#4811: the PVCs are pinned to a single AZ, and the autoscaler fails to scale up nodes in the matching AZ, so it's stuck 🤦
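
The usual mitigation for new volumes is a topology-aware storage class, so provisioning waits until the pod is scheduled and the PV lands in the same AZ as its node (existing PVs stay pinned to their original AZ). A minimal sketch; the class name and the gp3 type are assumptions, not necessarily what jenkins-infra/aws will use:

# Topology-aware StorageClass for the EBS CSI driver:
# WaitForFirstConsumer delays volume provisioning until the pod is
# scheduled, so the volume is created in the node's AZ.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3-retain
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF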

@dduportal
Contributor Author

jenkins-infra/aws#333 was merged: we are watching the effect

@dduportal
Contributor Author

Temporarily unblocking the kubernetes-management builds: jenkins-infra/kubernetes-management@0288bb0 (this commit will have to be reverted once repo.aws is fixed)

@dduportal
Contributor Author

It seems we have found a working setup:

=> we have to:

  • Define custom storage classes with topology awareness (despite what the AWS documentation says, the CSI driver does not seem to automatically generate the expected classes).
    • Nice to have: add both retain and delete classes
    • Scope: automate in the jenkins-infra/aws Terraform project, as was done for jenkins-infra/azure
  • Update the AWS autoscaler configuration to be highly available AND to take the topology into account (it is not by default: https://github.com/kubernetes/autoscaler/blob/9158196a3c06ed754fc4333ac67417e66a4ec274/charts/cluster-autoscaler/values.yaml#L180); see the sketch after this list.
    • Scope: jenkins-infra/kubernetes-management, for both cik8s and eks-public
  • Clean up the additional node pools added yesterday: the autoscaler copes with the current node pool (let's keep it simple)
    • We are using a node pool spanning multiple AZs: it's worth switching to one of the 3 new single-AZ pools. Tests in progress
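
For the autoscaler item above, a minimal sketch of what the change could look like (release name, namespace, and the stdin values are assumptions based on the upstream chart, not our actual deployment):

# Hypothetical helm upgrade reading values from stdin: replicaCount > 1
# plus a zone spread constraint make the autoscaler itself highly
# available, and balance-similar-node-groups spreads scale-ups across
# equivalent per-AZ node groups.
helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace autoscaler --reuse-values -f - <<'EOF'
replicaCount: 2
extraArgs:
  balance-similar-node-groups: true
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/instance: cluster-autoscaler
EOF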

@dduportal
Contributor Author

Closing as the problem is now fixed \o/
