Bump the terraform module for AWS EKS (and consequences) #3305
On the terraform code, with "usual" PRs:
Note that, with the 19.x changes, the EKS clusters are now private by default. Hotfix for the cik8s cluster: jenkins-infra/aws@27d4f74
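For context, the 19.x releases of the upstream terraform-aws-modules/eks module switched the cluster API endpoint to private-only by default. A minimal sketch of the kind of override the hotfix needs (the input names are the upstream module's; the cluster name and Kubernetes version here are illustrative, not the exact jenkins-infra/aws code):

```hcl
# Sketch: re-enable the public API endpoint after the 19.x bump.
# "cik8s" and the Kubernetes version are illustrative placeholders.
module "cik8s" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "cik8s"
  cluster_version = "1.24"

  # Defaults to false since 19.x, which makes the endpoint private-only.
  cluster_endpoint_public_access = true
}
```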
Status: 2 new problems to fix:
[10.0.0.38] - - [22/Dec/2022:14:47:42 +0000] "GET /.well-known/acme-challenge/<redacted> HTTP/1.1" 401 172 "http://repo.aws.jenkins.io/.well-known/acme-challenge/<redacted>" "cert-manager-challenges/v1.9.1 (linux/amd64) cert-manager/<redacted>" 377 0.000 [artifact-caching-proxy-artifact-caching-proxy-8080] - - - - <redacted>
It's weird: the ACME challenge request is answered with an HTTP 401. But since we define a custom configmap, it might be overwritten: https://github.com/jenkins-infra/kubernetes-management/blob/8c6d91f9a02048f3b9e8fb4a444106f5a08fcfe6/config/ext_public-nginx-ingress__common.yaml#L25-L36 🤔
Just discovered a second EKS cluster named:
Checked during a team working session: this cluster is
We had an issue with this cluster after the ingress rules were successfully updated with a valid certificate: the public IPs (the 3 public IPs associated with the 3 network zones of the public load balancer) weren't reachable at all (even from inside the cluster), while Kubernetes reported everything as fine. Here are my (raw) notes:
```
public-nginx-ingress   public-nginx-ingress-ingress-nginx-controller   LoadBalancer   172.20.240.59   k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com   80:31868/TCP,443:32267/TCP   38d

curl -v 172.20.48.207 -o /dev/null
* Trying 172.20.48.207:80...
* Connected to 172.20.48.207 (172.20.48.207) port 80 (#0)
> GET / HTTP/1.1
> Host: 172.20.48.207
> User-Agent: curl/7.83.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.22.1
< Date: Fri, 27 Jan 2023 16:45:18 GMT
< Content-Type: text/html
< Content-Length: 1826
< Last-Modified: Mon, 23 Jan 2023 01:36:05 GMT
< Connection: keep-alive
< ETag: "63cde485-722"
< Accept-Ranges: bytes
<
* Connection #0 to host 172.20.48.207 left intact

curl -v k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com
* Trying 18.116.6.230:80...
# Stuck, need to wait 60s for timeout or issue a Ctrl-C cancellation
```

=> The private IP works as expected, but the public IP(s) of the LB do not answer. It means the issue is with the LB itself.
{"level":"error","ts":1674838128.6175954,"logger":"controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-publicng-publicng-7482972d25","namespace":"public-nginx-ingress","error":"expect exactly one securityGroup tagged with kubernetes.io/cluster/public-happy-polliwog for eni eni-0970f1ec0888c2d65, got: [sg-0c0d669a830f6e013 sg-0ca36e364f5491978] (clusterName: public-happy-polliwog)"}
=> Next step: find how to avoid this duplicate "tagging" in the jenkins-infra/aws terraform code. |
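The aws-load-balancer-controller refuses to reconcile when more than one security group on the node ENIs carries the kubernetes.io/cluster/<cluster-name> tag. A hedged sketch of one workaround sometimes suggested with this module (an assumption, not necessarily what jenkins-infra/aws#333 ended up doing): keep the ownership tag off the node security group so a single SG keeps it.

```hcl
# Sketch: keep the kubernetes.io/cluster/<name> ownership tag on a single
# security group. The cluster name matches the error message above; whether
# a null value is the right way to drop the tag here is an assumption based
# on upstream discussions, not the confirmed fix.
module "eks_public" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name = "public-happy-polliwog"

  # Tags applied only to the module-managed node security group.
  node_security_group_tags = {
    "kubernetes.io/cluster/public-happy-polliwog" = null
  }
}
```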
New error: we cannot update the ACP (artifact-caching-proxy) statefulset:
=> the autoscaler pod's logs for this cluster show that autoscaling cannot be done:
It looks like kubernetes/autoscaler#4811: the PVCs are bound to a single AZ, and the autoscaler seems to fail to scale up nodes in that AZ, so it's stuck 🤦
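Since EBS-backed PersistentVolumes are zonal, the usual mitigation for this class of problem is to give the cluster-autoscaler one node group per availability zone, so it can scale up in the zone where the PVC lives. A sketch of what that could look like with this module (the node group names, sizes and the VPC reference are illustrative, not the actual jenkins-infra/aws layout):

```hcl
# Sketch: one EKS managed node group per private subnet (hence per AZ),
# so the autoscaler can target the AZ of a zonal EBS volume.
# `module.vpc.private_subnets` and the sizes below are illustrative.
module "cik8s" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name = "cik8s"

  eks_managed_node_groups = {
    for index, subnet_id in module.vpc.private_subnets :
    "applications-${index}" => {
      subnet_ids     = [subnet_id] # a single AZ per node group
      min_size       = 1
      max_size       = 10
      desired_size   = 1
      instance_types = ["m5.xlarge"]
    }
  }
}
```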
jenkins-infra/aws#333 was merged: we are watching its effect.
Temporarily unblocking the kube management builds: jenkins-infra/kubernetes-management@0288bb0 (this commit will have to be reverted once repo.aws is fixed)
It seems we have found a working setup:
=> we have to:
Closing as the problem is now fixed \o/
This issue is related to a major bump of the Terraform EKS module that we use in https://github.com/jenkins-infra/aws to manage the two EKS clusters of our infrastructure (cik8s and eks-public). This issue serves as an audit trail of the whole operation.