
AWS: create build clusters with EKS #4686

Closed · Tracked by #5169 · ameukam opened this issue Jan 26, 2023 · 29 comments
Assignees: xmudrii
Labels: area/infra/aws, priority/important-soon, sig/k8s-infra
Milestone: v1.27

Comments

@ameukam (Member) commented Jan 26, 2023

Now that we have credits for 2023, we should investigate moving some prowjobs to AWS.

Create EKS build cluster(s) that match the existing GKE clusters:

  • CPU (Intel|AMD): 8 (minimum)
  • Memory: 52 GB (minimum)
  • Local SSD disks: 2 (minimum)
  • Boot disk: 100 GB (minimum)
  • OS: Ubuntu/Debian
  • Container runtime: containerd
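
For illustration, a node group meeting this spec could be sketched in an eksctl config along these lines (a sketch only: the cluster name, AMI family, and node counts are assumptions; r5ad.4xlarge, suggested later in this thread, has 16 vCPUs, 128 GiB of memory, and two local NVMe SSDs):

```yaml
# eksctl ClusterConfig sketch (names and sizes are assumptions)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-prow-build-cluster   # hypothetical at this point in the thread
  region: us-east-2              # region chosen later in this thread
nodeGroups:
  - name: build-pool             # hypothetical
    instanceType: r5ad.4xlarge   # 16 vCPUs, 128 GiB RAM, 2 local NVMe SSDs
    amiFamily: Ubuntu2004        # Ubuntu, per the OS requirement
    volumeSize: 100              # boot disk, GB
    privateNetworking: true
    minSize: 10
    maxSize: 300
```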

The EKS build clusters should also be able to sync secrets from AWS Secrets Manager.
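
One way to satisfy this, sketched below, is the Secrets Store CSI Driver with its AWS provider, where a SecretProviderClass maps Secrets Manager entries into the cluster (the class, namespace, and secret names are hypothetical):

```yaml
# SecretProviderClass sketch for the Secrets Store CSI Driver + AWS provider
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: prow-build-secrets              # hypothetical
  namespace: test-pods                  # hypothetical
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "prow/github-token" # hypothetical Secrets Manager entry
        objectType: "secretsmanager"
```

Pods then mount this class through a csi volume using the secrets-store.csi.k8s.io driver.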

(I've probably forgotten a few things; I'll update the issue.)

/milestone v1.27
/area infra
/area infra/aws
/priority important-soon

ameukam added the sig/k8s-infra label on Jan 26, 2023
k8s-ci-robot added the area/infra/aws and priority/important-soon labels on Jan 26, 2023
ameukam added this to the v1.27 milestone on Jan 26, 2023
@sftim (Contributor) commented Jan 26, 2023

Let's detail which addons / features we want (e.g. IAM Roles for Service Accounts).

That can be a separate issue, or even one issue per addon.
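
For reference, IRSA wiring is mostly a ServiceAccount annotation pointing at an IAM role (the account ID, role, and names below are hypothetical):

```yaml
# ServiceAccount using IAM Roles for Service Accounts (IRSA)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prow-build                # hypothetical
  namespace: test-pods            # hypothetical
  annotations:
    # hypothetical role; pods using this ServiceAccount get its IAM permissions
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/prow-build
```

The cluster also needs an IAM OIDC provider associated with it (eksctl can set this up) so the pod identity webhook can inject the credentials.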

@ameukam (Member, Author) commented Jan 27, 2023

> Let's detail which addons / features we want (e.g. IAM Roles for Service Accounts).
>
> That can be a separate issue, or even one issue per addon.

Off the top of my head:

  • Private clusters (public endpoint for the API server with private nodes; sketched after this list)
  • Node auto-scaling
  • Integration with AWS Secrets Manager
  • Integration with ALB
  • Network Dual-Stack (Optional)
  • OS Kernel configuration
  • AWS Nitro enclaves (Optional)
  • Integration with KubeCost. (Added on 01/30/2023)
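
To sketch the first item above, eksctl can express private nodes behind a public API endpoint directly (names assumed):

```yaml
# eksctl snippet: public API server endpoint, private nodes (values assumed)
vpc:
  clusterEndpoints:
    publicAccess: true      # API server reachable from outside
    privateAccess: true     # nodes talk to the API server over the VPC
nodeGroups:
  - name: build-pool        # hypothetical
    privateNetworking: true # nodes get no public IPs
```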

@sftim (Contributor) commented Jan 27, 2023

Maybe some of:

  • metrics-server
  • Prometheus (or as part of a bigger component)
  • Node exporter
  • Node problem reporter
  • persistent storage
  • AWS VPC network plugin (as an example)
  • kube-state-metrics

@ameukam (Member, Author) commented Jan 27, 2023

> Maybe some of:
>
>   • metrics-server
>   • Prometheus (or as part of a bigger component)
>   • Node exporter
>   • Node problem reporter
>   • persistent storage
>   • AWS VPC network plugin (as an example)
>   • kube-state-metrics

LGTM. We can start with this list and expand it depending on the issues and needs we run into.

@ameukam (Member, Author) commented Jan 27, 2023

@jeefy it would be great to have Kubermatic work on that.

@jeefy (Member) commented Feb 7, 2023

cc @mfahlandt @xmudrii (Don't know the other GH handles lol)

@xmudrii (Member) commented Feb 10, 2023

I'll be taking care of this next week.
/assign

@ameukam (Member, Author) commented Feb 17, 2023

We also need to ensure we can use Docker-in-Docker (DinD) on the build clusters. Starting with 1.24, dockershim is no longer supported by EKS.

@sftim (Contributor) commented Feb 17, 2023

> We also need to ensure we can use Docker-in-Docker (DinD) on the build clusters. Starting with 1.24, dockershim is no longer supported by EKS.

Does this possibly mean making / baking our own AMIs?

@xmudrii (Member) commented Feb 17, 2023

@sftim We should probably look into whether we can get rid of the Docker dependency entirely. Generally, it's possible to run DinD on the build clusters with containerd (we do that on our Kubermatic Prow instance), but it requires some changes to the build image.
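
For context, a rough sketch of the usual containerd-era DinD pattern: run the Docker daemon as a privileged sidecar and point the build container at it (names and images here are placeholders, not the actual Prow job config):

```yaml
# DinD-on-containerd sketch (names and images are placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: dind-example
spec:
  containers:
    - name: build
      image: alpine:3.17          # placeholder for the build image
      command: ["sleep", "3600"]
      env:
        - name: DOCKER_HOST       # point the docker CLI at the sidecar
          value: tcp://localhost:2375
    - name: dind
      image: docker:dind          # Docker daemon runs inside this container
      securityContext:
        privileged: true          # required for DinD
      env:
        - name: DOCKER_TLS_CERTDIR
          value: ""               # plain TCP on localhost only
```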

@xmudrii (Member) commented Feb 17, 2023

@ameukam The link to the GKE cluster (#4685) is most likely pointing to the wrong issue/place. Can you please let me know the current capacity of the GKE build cluster? I'm mostly wondering:

  • How many min/max nodes should we have in a node group?
  • Are the resources in the issue description meant to be totals or per node?

    > CPU (Intel|AMD): 8 (minimum)
    > Memory: 52 GB (minimum)

@ameukam (Member, Author) commented Feb 17, 2023

@xmudrii Sorry, I updated the link. For a node group, I would say:
min: 100
max: 300
The resources are per node. We can use the r5ad.4xlarge instance type (or the r5 instance family).

@xmudrii (Member) commented Feb 17, 2023

@ameukam Isn't 100-300 nodes a bit too much for the beginning?

> min: 100
> max: 300

Maybe it would be better to start with 10-20 nodes and scale up as we migrate jobs and as the need arises?

@xmudrii (Member) commented Feb 17, 2023

@ameukam Also, do we have any preferences regarding the AWS region?

@ameukam (Member, Author) commented Feb 17, 2023

> Isn't 100-300 nodes a bit too much for the beginning? Maybe it would be better to start with 10-20 nodes and scale up as we migrate jobs and as the need arises?

I don't think we need to worry about size right now, and this is probably going to be the default size of the cluster when we go to production. We currently have the budget to handle this for 2023.

For the region, we can start with us-east-2.

@xmudrii (Member) commented Feb 27, 2023

I tried creating a node group based on the instructions above (100-300 nodes, r5ad.4xlarge), but I'm getting this error:

Launching a new EC2 instance. Status Reason: Could not launch On-Demand Instances. VcpuLimitExceeded - You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.

I'm going to request the vCPU limit to be increased.

@sftim (Contributor) commented Feb 27, 2023

We should set up cluster autoscaling using Karpenter (it really is a good fit for cloud scaling, and it's especially good on AWS), with maybe a small static node group to ensure that Karpenter has somewhere to run even if things break.

Karpenter automatically tracks AWS instance pricing APIs and can mix spot and on-demand instances. I imagine we mainly want spot instances.

Does that want to be its own issue?
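
For reference, a Karpenter Provisioner along these lines could express the spot/on-demand mix (a sketch against the v1alpha5 API current at the time; names and limits are assumptions):

```yaml
# Karpenter Provisioner sketch (karpenter.sh/v1alpha5; values are assumptions)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: build-pool                      # hypothetical
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]     # let Karpenter mix both
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["r5ad.4xlarge", "r5d.4xlarge"]
  limits:
    resources:
      cpu: "4800"                       # roughly 300 x 16-vCPU nodes
  ttlSecondsAfterEmpty: 300             # reclaim empty nodes after 5 minutes
```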

@ameukam (Member, Author) commented Feb 27, 2023

We have jobs that run for a long time, so I'm not sure spot instances are a good fit for the different tests we have. Also, cost optimization is not really required at the moment.

> Does that want to be its own issue?

Let's see how job scheduling from Prow goes before we start taking a look at Karpenter.

@sftim if you want to try things with Karpenter, reach out on Slack to get access.

@xmudrii (Member) commented Feb 27, 2023

I agree that we should give Karpenter a try, but let's come up with a working setup first and add it later (I believe that's the top priority right now). Spot instances might indeed be problematic: our tests can already be flaky, and I'm worried spot instances would make that even worse.

@tzneal commented Mar 3, 2023

I think @sftim is a Karpenter expert by now, but I work on Karpenter and am happy to assist if you decide to use it. I'm part of the EKS compute team, so if you run into any EKS issues, feel free to drag me in as well.

@dims (Member) commented Mar 3, 2023

cc @ellistarn as well :)

@xmudrii (Member) commented Mar 6, 2023

Here's the current status regarding requirements:

  • private clusters (Public endpoint for the API server with private nodes) - done, checked
  • Node auto-scaling - done (Cluster Autoscaler), checked
  • Integration with AWS Secrets Manager - done, checked
  • Integration with ALB - done, checked
  • Network Dual-Stack (Optional) - VPC is dual-stack, but nodes might not be getting IPv6 addresses
  • OS Kernel configuration - done, checked
  • AWS Nitro enclaves (Optional) - done, not sure how to check if it's working properly
  • Integration with KubeCost. (Added on 01/30/2023) - TBD
  • metrics-server - done, checked
  • Prometheus (or as part of a bigger component) - N/A
  • Node exporter - N/A
  • Node problem reporter - N/A
  • persistent storage - done, checked
  • AWS VPC network plugin (as an example) - done, checked
  • kube-state-metrics - N/A

@xmudrii (Member) commented Mar 10, 2023

Prow is configured to use the new build cluster and it works as expected. However, there are still some tasks we need to take care of before closing this issue.

@ameukam (Member, Author) commented Mar 15, 2023

> Local SSD disks: 2 (minimum)

This was added to replicate the GKE build clusters, but it's not actually needed: GCP doesn't offer local SSDs bigger than 375 GB, which is why multiple disks were used there. I think it's OK to pick a single-disk instance type (e.g. r6id.4xlarge).

@tzneal commented Mar 15, 2023

If you use the base EKS AMIs, you'll need custom user data to have pods use the local disk storage if you choose an instance type that has it. There is a PR at https://github.com/awslabs/amazon-eks-ami/pulls that starts to build this in, but it hasn't been merged yet.
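
With eksctl, for example, that user data could be injected through preBootstrapCommands; a rough sketch (the device path and mount point are assumptions, and real AMIs may enumerate NVMe devices differently):

```yaml
# eksctl nodegroup snippet: put container storage on the local NVMe
# instance store (device path and mount point are assumptions)
nodeGroups:
  - name: build-pool                    # hypothetical
    instanceType: r6id.4xlarge          # single local NVMe SSD
    preBootstrapCommands:
      - mkfs.ext4 -F /dev/nvme1n1       # format the instance-store disk
      - mkdir -p /var/lib/containerd
      - mount /dev/nvme1n1 /var/lib/containerd  # back containerd with the local SSD
```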

@pkprzekwas (Contributor) commented:
>   • Prometheus (or as part of a bigger component) - N/A
>   • Node exporter - N/A
>   • Node problem reporter - N/A

I will be taking a look at the monitoring stack for EKS.

@xmudrii (Member) commented Apr 25, 2023

eks-prow-build-cluster is created and it has been running canary jobs for a few weeks now. I think it's time to close this issue. Let's use #5169 as a tracking issue for further improvements and enhancements.
/close

@k8s-ci-robot (Contributor) commented:
@xmudrii: Closing this issue.

In response to this:

> eks-prow-build-cluster is created and it has been running canary jobs for a few weeks now. I think it's time to close this issue. Let's use #5169 as a tracking issue for further improvements and enhancements.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
