AWS: create build clusters with EKS #4686
Let's detail what addons/features we want (e.g. IAM Roles for Service Accounts). That can be a separate issue, or even one issue per addon.
Off the top of my head:
Maybe some of:
LGTM. We can start with this list and expand it depending on the issues/needs we face.
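For reference, a minimal sketch of what enabling IRSA at cluster creation could look like with eksctl. The cluster name, region, Kubernetes version, service account, and attached policy below are all placeholders for illustration, not decisions made in this thread:

```yaml
# Hypothetical eksctl config; names, region, and policies are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-prow-build-cluster
  region: us-east-2
  version: "1.24"
iam:
  # Creates the OIDC provider required for IAM Roles for Service Accounts (IRSA).
  withOIDC: true
  serviceAccounts:
    # Example only: grant an in-cluster service account read access to S3.
    - metadata:
        name: example-sa
        namespace: test-pods
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```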
@jeefy it would be great to have Kubermatic work on that.
cc @mfahlandt @xmudrii (Don't know the other GH handles lol)
I'll be taking care of this next week.
We also need to ensure we can use DinD (Docker-in-Docker) on the build clusters. Starting with 1.24, dockershim is no longer supported by EKS.
Does this possibly mean making/baking our own AMIs?
@sftim We should probably look into whether it's possible to get rid of the Docker dependency. Generally, it's possible to get DinD on the build clusters with containerd (we do that on our/Kubermatic Prow instance), but it requires some changes to the build image.
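To illustrate the point above: DinD on containerd nodes comes down to running a privileged Docker daemon inside the pod, instead of relying on the host's (removed) dockershim. A rough sketch of the pattern, not the actual Prow job configuration:

```yaml
# Illustrative pod showing the DinD pattern on containerd-based nodes.
apiVersion: v1
kind: Pod
metadata:
  name: dind-example
spec:
  containers:
    - name: dind
      image: docker:dind          # Docker daemon runs inside the pod itself
      securityContext:
        privileged: true          # required for DinD
      env:
        - name: DOCKER_TLS_CERTDIR
          value: ""               # disable TLS for this pod-local daemon
      volumeMounts:
        - name: docker-graph
          mountPath: /var/lib/docker
  volumes:
    - name: docker-graph
      emptyDir: {}
```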
@ameukam The link to the GKE cluster (#4685) is most likely pointing to the wrong issue/place. Can you please let me know what the current capacity of the GKE build cluster is? I'm mostly wondering:
@xmudrii Sorry, I updated the link. I would say, for a node group:
@ameukam Isn't 100-300 nodes a bit too much for the beginning? Maybe it would be better to start with 10-20 nodes and scale up as we migrate jobs and as the need arises.
@ameukam Also, do we have any preferences regarding the AWS region?
I don't think we care about size right now, and this is probably going to be the default size of the cluster when we go to production. We currently have the budget to handle this for 2023. For region, we can start with
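For illustration, the sizing discussed above could be expressed as an eksctl managed node group roughly like the sketch below. The instance type, region, and counts are placeholders, not decisions from this thread:

```yaml
# Hypothetical eksctl config illustrating the node group sizing discussion.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-prow-build-cluster
  region: us-east-2            # placeholder region
managedNodeGroups:
  - name: prow-build-pool
    instanceType: r5d.4xlarge  # placeholder instance type
    minSize: 10                # start small, per the suggestion above
    maxSize: 300               # the upper bound mentioned for production
    desiredCapacity: 10
    volumeSize: 100            # GiB root volume
    labels:
      role: prow-build
```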
I tried creating a node group based on the instructions above (100-300 nodes based on
I'm going to request an increase of the vCPU limit.
We should set up cluster autoscaling using Karpenter (it really is a good fit for cloud scaling, and it's especially good on AWS). Maybe keep a small static node group to ensure that Karpenter has somewhere to run even if things break. Karpenter automatically tracks AWS instance pricing APIs and is able to mix spot and on-demand instances. I imagine we mainly want spot instances. Does that want to be its own issue?
We have jobs that run for a long time, so I'm not sure spot instances are a good fit for the different tests we have. Also, cost optimization is not really required at the moment.
Let's see how things go with job scheduling from Prow before we start taking a look at Karpenter. @sftim, if you want to try things with Karpenter, reach out on Slack to get access.
I agree that we should give Karpenter a try, but let's come up with a working setup first and add it later (I believe the working setup is the first priority right now). Spot instances might indeed be problematic: our tests can already be flaky, and I'm worried spot instances would make that even worse.
I think @sftim is a Karpenter expert by now, but I work on Karpenter and am happy to assist if you decide to use it. I'm part of the EKS compute team, so if you run into any EKS issues, feel free to drag me in as well.
cc @ellistarn as well :)
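For context, if Karpenter is tried later, the configuration would be roughly shaped like the sketch below, using the v1alpha5 Provisioner API current at the time. Everything here is illustrative; notably, it pins capacity to on-demand, given the concerns above about long-running jobs on spot:

```yaml
# Sketch of a Karpenter v1alpha5 Provisioner; all values are illustrative.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: prow-build
spec:
  requirements:
    # On-demand only, per the flakiness concerns above; "spot" could be
    # added to this list later if it proves safe for long-running jobs.
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: "4000"              # cap on total provisioned vCPUs
  providerRef:
    name: default              # references an AWSNodeTemplate
  ttlSecondsAfterEmpty: 300    # remove empty nodes after 5 minutes
```

A matching AWSNodeTemplate (referenced by `providerRef`) would supply the subnets and security groups.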
Here's the current status regarding requirements:
Prow is configured to use the new build cluster and it works as expected. However, there are still some tasks we need to take care of before closing this issue.
This was added to replicate the GKE build clusters, but it's not actually needed. GCP doesn't actually offer the possibility to have local disks bigger than 375 GB. I think it's OK to pick a single-disk instance type (e.g.
If you're using the base EKS AMIs, you'll need custom user data to have pods use the local disk storage if you choose an instance type that has it. There is a PR at https://github.com/awslabs/amazon-eks-ami/pulls that starts to build this in, but it hasn't been merged yet.
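Until something like that lands upstream, one possible workaround is to format and mount the instance store at bootstrap via eksctl's `preBootstrapCommands`. This is a sketch only, untested, and assumes an instance type with local NVMe storage; the device name `/dev/nvme1n1` in particular varies by instance type:

```yaml
# Sketch only, untested: mount the NVMe instance store for container storage.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-prow-build-cluster
  region: us-east-2             # placeholder region
managedNodeGroups:
  - name: prow-build-pool
    instanceType: r5d.4xlarge   # placeholder type with local NVMe storage
    preBootstrapCommands:
      # Assumed device name; check the instance type's NVMe layout first.
      - mkfs.ext4 -F /dev/nvme1n1
      - mkdir -p /var/lib/containerd
      - mount /dev/nvme1n1 /var/lib/containerd
```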
I will be taking a look at the monitoring stack for EKS.
Action items to take care of before closing the issue:
eks-prow-build-cluster is created and it has been running canary jobs for a few weeks now. I think it's time to close this issue. Let's use #5169 as a tracking issue for further improvements and enhancements.
@xmudrii: Closing this issue. In response to this:
Now that we have credits for 2023, we should investigate moving some prowjobs to AWS.
Create EKS build cluster(s) that match the existing GKE clusters:
The EKS build clusters should also be able to sync secrets from AWS Secrets Manager (see the sketch below).
(I probably forgot a few things. Will update the issue.)
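A rough sketch of what that Secrets Manager sync could look like with the External Secrets Operator; the store name, namespace, and secret paths are made up for illustration:

```yaml
# Illustrative ExternalSecret; store name and secret keys are hypothetical.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: prow-github-token
  namespace: default
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # a ClusterSecretStore configured for AWS (e.g. via IRSA)
    kind: ClusterSecretStore
  target:
    name: github-token          # Kubernetes Secret to create in-cluster
  data:
    - secretKey: token
      remoteRef:
        key: prow/github-token  # hypothetical AWS Secrets Manager secret name
```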
/milestone v1.27
/area infra
/area infra/aws
/priority important-soon