
Bump the terraform module for AWS EKS (and consequences) #3305

Closed
dduportal opened this issue Dec 20, 2022 · 15 comments

@dduportal
Contributor

dduportal commented Dec 20, 2022

This issue is related to a major bump of the Terraform EKS module that we use in https://github.com/jenkins-infra/aws to manage the two EKS clusters of our infrastructure (cik8s and eks-public).

This issue serves as an audit trail of the whole operation.

@dduportal
Contributor Author

On the Terraform code side, handled with "usual" PRs:

@dduportal
Contributor Author

Note that, with the 19.x changes, the EKS clusters are now private by default.

Hotfix for the cik8s cluster: jenkins-infra/aws@27d4f74
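
For reference, the public endpoint can also be re-enabled outside of Terraform with the AWS CLI; a minimal sketch, where the cluster name is a placeholder (the real EKS cluster names differ from our short aliases):

# Re-enable the public API endpoint of an EKS cluster (async operation);
# <cluster-name> is a placeholder, not the actual cluster name.
aws eks update-cluster-config \
  --name <cluster-name> \
  --resources-vpc-config endpointPublicAccess=true,endpointPrivateAccess=true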


@dduportal
Contributor Author

Status: 2 new problems to fix:

  • The certificate for https://repo.aws.jenkins.io/ is not issued. We need to check cert-manager (is it rate-limited by Let's Encrypt?)
  • The terratests fail with the latest EKS module (see the main branch build of jenkins-infra/aws).
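
A sketch of the first diagnostic commands for the certificate problem, assuming the standard cert-manager CRDs (Let's Encrypt rate-limit errors usually surface in the Order and Challenge resources):

# List the ACME resources across all namespaces, then inspect the
# pending challenges for rate-limit or validation errors.
kubectl get certificate,certificaterequest,order,challenge -A
kubectl describe challenge -A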

@dduportal dduportal self-assigned this Dec 20, 2022
@dduportal
Contributor Author

[10.0.0.38] - - [22/Dec/2022:14:47:42 +0000] "GET /.well-known/acme-challenge/<redacted> HTTP/1.1" 401 172 "http://repo.aws.jenkins.io/.well-known/acme-challenge/<redacted>" "cert-manager-challenges/v1.9.1 (linux/amd64) cert-manager/<redacted>" 377 0.000 [artifact-caching-proxy-artifact-caching-proxy-8080] - - - - <redacted>

This is weird: the /.well-known location should not require authentication, as per https://github.com/kubernetes/ingress-nginx/blob/f9cce5a4ed7ef372a18bc826e395ff5660b7a444/docs/user-guide/nginx-configuration/configmap.md#no-auth-locations

But since we define a custom configmap, the default might be overwritten: https://github.com/jenkins-infra/kubernetes-management/blob/8c6d91f9a02048f3b9e8fb4a444106f5a08fcfe6/config/ext_public-nginx-ingress__common.yaml#L25-L36 🤔
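
If that is the cause, re-declaring the chart default next to our custom keys should unblock the ACME challenge. A minimal sketch, assuming the controller ConfigMap follows the Helm release naming (the durable fix belongs in the custom values in jenkins-infra/kubernetes-management):

# Restore the ingress-nginx default so the ACME challenge path stays
# unauthenticated; the ConfigMap name is assumed from the release name.
kubectl -n public-nginx-ingress patch configmap \
  public-nginx-ingress-ingress-nginx-controller \
  --type merge \
  -p '{"data":{"no-auth-locations":"/.well-known/acme-challenge"}}'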

@smerle33
Contributor

smerle33 commented Jan 12, 2023

Just discovered a second EKS cluster named jenkins-infra-eks-ENRZrfwf, which will probably need to be cleaned up if not used by jenkins-infra.


@dduportal
Contributor Author

Just discovered a second EKS cluster named jenkins-infra-eks-ENRZrfwf, which will probably need to be cleaned up if not used by jenkins-infra.

Checked during a team working session: this cluster is cik8s (used by ci.jenkins.io for its builds). We did not find any dangling resources.

@dduportal
Contributor Author

We had an issue with this cluster after the ingress rules were successfully updated with a valid certificate:

the public IPs (the 3 public IPs associated with the 3 availability zones of the public load balancer) weren't reachable at all (even from inside the cluster), while Kubernetes reported everything as healthy.

Here are my (raw) notes:

  • With the command kubectl get svc -A we can see that the public-nginx Ingress controller has an AWS load balancer associated with a valid DNS name (using dig we see 3 A records pointing to the 3 public IPs):
public-nginx-ingress     public-nginx-ingress-ingress-nginx-controller             LoadBalancer   172.20.240.59    k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com   80:31868/TCP,443:32267/TCP   38d
  • From any pod of the cluster (for instance with kubectl -n artifact-caching-proxy exec -ti artifact-caching-proxy-0 -- sh), we try to reach both the private and the public IP of the public Service LB listed above:
curl -v 172.20.48.207 -o /dev/null
*   Trying 172.20.48.207:80...
* Connected to 172.20.48.207 (172.20.48.207) port 80 (#0)
> GET / HTTP/1.1
> Host: 172.20.48.207
> User-Agent: curl/7.83.1
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.22.1
< Date: Fri, 27 Jan 2023 16:45:18 GMT
< Content-Type: text/html
< Content-Length: 1826
< Last-Modified: Mon, 23 Jan 2023 01:36:05 GMT
< Connection: keep-alive
< ETag: "63cde485-722"
< Accept-Ranges: bytes
< 
* Connection #0 to host 172.20.48.207 left intact
curl -v k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com
*   Trying 18.116.6.230:80...

# Stuck, need to wait 60s for timeout or issue a Ctrl-C cancellation

=> The private IP works as expected, but the public IPs of the LB do not answer. It means the issue is with the LB itself.

  • Checking the LB in the AWS UI (section EC2 -> "Load Balancing" -> "Load Balancers"): select the LB and, in the "Listeners" tab, click on the "Default routing rule" of the "TCP:80" line (for example).
    The list of "target groups" (i.e. the backend IPs of the LB) is empty: that confirms the observed behavior.

  • This list of backend IPs is specified by Kubernetes, in particular by the "AWS LB Controller" that we installed in this cluster. The role of this component is to scan the Kubernetes API for Service resources of type LoadBalancer and to create/update/delete the corresponding load balancers through the AWS API.
    Checking the logs of this component (kubectl -n aws-load-balancer logs -l app.kubernetes.io/instance=aws-load-balancer-controller) shows the error:

{"level":"error","ts":1674838128.6175954,"logger":"controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-publicng-publicng-7482972d25","namespace":"public-nginx-ingress","error":"expect exactly one securityGroup tagged with kubernetes.io/cluster/public-happy-polliwog for eni eni-0970f1ec0888c2d65, got: [sg-0c0d669a830f6e013 sg-0ca36e364f5491978] (clusterName: public-happy-polliwog)"}
  • Hotfix: from the AWS UI, remove the tag kubernetes.io/cluster/public-happy-polliwog=true from the security group of the cluster itself (eks-cluster-sg-public-happy-polliwog-1884802038, usually the first in the list) and keep this tag on the SG public-happy-polliwog-node, because this second SG is applied to the Kubernetes node VMs, which host the floating private IPs of the public Service LB. See the CLI sketch below.
    => After 5 min, the whole system was working again.

=> Next step: find out how to avoid this duplicate tagging in the jenkins-infra/aws Terraform code.
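
For the record, a CLI equivalent of the UI hotfix above; the SG ID comes from the controller error message, and which of the two SGs is the cluster one should be double-checked before removing the tag:

# Remove the duplicate cluster tag from the cluster security group
# (sg-0c0d669a830f6e013 is taken from the error above; verify it is the
# eks-cluster-sg one before running this).
aws ec2 delete-tags \
  --resources sg-0c0d669a830f6e013 \
  --tags Key=kubernetes.io/cluster/public-happy-polliwog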

@dduportal
Contributor Author

New error: we cannot update the ACP (artifact-caching-proxy) statefulset:

 Normal   NotTriggerScaleUp  3m58s (x32461 over 3d18h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
  Warning  FailedScheduling   2m30s (x5365 over 3d18h)   default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict.

=> the autoscaler pod's logs for this cluster show that autoscaling cannot be done:

I0207 10:58:37.369492       1 binder.go:791] "Could not get a CSINode object for the node" node="template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512" err="csinode.storage.k8s.io \"template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512\" not found"
I0207 10:58:37.369532       1 binder.go:811] "PersistentVolume and node mismatch for pod" PV="pvc-173ee3c5-22ec-4444-bee0-fe7b8ece01fa" node="template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512" pod="artifact-caching-proxy/artifact-caching-proxy-0" err="no matching NodeSelectorTerms"
I0207 10:58:37.369561       1 scale_up.go:300] Pod artifact-caching-proxy-0 can't be scheduled on eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0207 10:58:37.369819       1 scale_up.go:449] No pod can fit to eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44
I0207 10:58:37.369836       1 scale_up.go:453] No expansion options

It looks like kubernetes/autoscaler#4811: the PVCs are pinned to a single AZ, and the autoscaler fails to scale up nodes in the matching AZ, so it's stuck 🤦
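
The usual mitigation for new volumes is a topology-aware storage class, so provisioning waits until the pod is scheduled and the PV lands in the same AZ as its node (existing PVs stay pinned to their original AZ). A minimal sketch; the class name and the gp3 type are assumptions, not necessarily what jenkins-infra/aws will use:

# Topology-aware StorageClass for the EBS CSI driver:
# WaitForFirstConsumer delays volume provisioning until the pod is
# scheduled, so the volume is created in the node's AZ.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3-retain
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF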

@dduportal
Contributor Author

jenkins-infra/aws#333 was merged: we are watching the effect

@dduportal
Contributor Author

Temporarily unblocking the kubernetes-management builds: jenkins-infra/kubernetes-management@0288bb0 (this commit will have to be reverted once repo.aws is fixed)

@dduportal
Contributor Author

It seems we have found a working setup:

=> we have to:

  • Define custom storage classes with topology awareness (despite what the AWS documentation says, the CSI driver does not seem to automatically generate the expected classes).
    • Nice to have: add both retain and delete classes
    • Scope: automate in the jenkins-infra/aws Terraform project, as was done for jenkins-infra/azure
  • Update the AWS autoscaler configuration to be highly available AND to take the topology into account (it is not by default: https://github.com/kubernetes/autoscaler/blob/9158196a3c06ed754fc4333ac67417e66a4ec274/charts/cluster-autoscaler/values.yaml#L180); see the sketch after this list.
    • Scope: jenkins-infra/kubernetes-management, for both cik8s and eks-public
  • Clean up the additional node pools added yesterday: the autoscaler copes with the current node pool (let's keep it simple)
    • We are using a node pool spanning multiple AZs: it's worth switching to one of the 3 new single-AZ pools. Tests in progress
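
For the autoscaler item above, a minimal sketch of what the change could look like (release name, namespace, and the stdin values are assumptions based on the upstream chart, not our actual deployment):

# Hypothetical helm upgrade reading values from stdin: replicaCount > 1
# plus a zone spread constraint make the autoscaler itself highly
# available, and balance-similar-node-groups spreads scale-ups across
# equivalent per-AZ node groups.
helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace autoscaler --reuse-values -f - <<'EOF'
replicaCount: 2
extraArgs:
  balance-similar-node-groups: true
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/instance: cluster-autoscaler
EOF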

@dduportal
Contributor Author

Closing as the problem is now fixed \o/
