AWS: create build clusters with EKS #4686
Let's detail what addons/features we want (e.g. IAM Roles for Service Accounts). That can be a separate issue, or even one issue per addon.
Off the top of my head:
Maybe some of:
LGTM. We can start with this list and expand it depending on the issues/needs we face.
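For reference, a minimal sketch of what enabling IRSA at cluster creation could look like with eksctl. The cluster name, region, Kubernetes version, service account, and attached policy below are all placeholders for illustration, not decisions made in this thread:

```yaml
# Hypothetical eksctl config; names, region, and policies are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-prow-build-cluster
  region: us-east-2
  version: "1.24"
iam:
  # Creates the OIDC provider required for IAM Roles for Service Accounts (IRSA).
  withOIDC: true
  serviceAccounts:
    # Example only: grant an in-cluster service account read access to S3.
    - metadata:
        name: example-sa
        namespace: test-pods
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```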
@jeefy it would be great to have Kubermatic work on that.
cc @mfahlandt @xmudrii (Don't know the other GH handles lol)
I'll be taking care of this next week.
We also need to ensure we can use DinD (Docker-in-Docker) on the build clusters. Starting with 1.24, dockershim is no longer supported by EKS.
Does this possibly mean making/baking our own AMIs?
@sftim We should probably look into whether it's possible to get rid of the Docker dependency. Generally, it's possible to get DinD on the build clusters with containerd (we do that on our/Kubermatic Prow instance), but it requires some changes to the build image.
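To illustrate the point above: DinD on containerd nodes comes down to running a privileged Docker daemon inside the pod, instead of relying on the host's (removed) dockershim. A rough sketch of the pattern, not the actual Prow job configuration:

```yaml
# Illustrative pod showing the DinD pattern on containerd-based nodes.
apiVersion: v1
kind: Pod
metadata:
  name: dind-example
spec:
  containers:
    - name: dind
      image: docker:dind          # Docker daemon runs inside the pod itself
      securityContext:
        privileged: true          # required for DinD
      env:
        - name: DOCKER_TLS_CERTDIR
          value: ""               # disable TLS for this pod-local daemon
      volumeMounts:
        - name: docker-graph
          mountPath: /var/lib/docker
  volumes:
    - name: docker-graph
      emptyDir: {}
```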
@ameukam The link to the GKE cluster (#4685) is most likely pointing to the wrong issue/place. Can you please let me know what the current capacity of the GKE build cluster is? I'm mostly wondering:
@xmudrii Sorry, I updated the link. I would say, for a node group:
@ameukam Isn't 100-300 nodes a bit too much for the beginning? Maybe it would be better to start with 10-20 nodes and scale up as we migrate jobs and as the need arises.
@ameukam Also, do we have any preferences regarding the AWS region?
I don't think we care about size right now, and this is probably going to be the default size of the cluster when we go to production. We currently have the budget to handle this for 2023. For region, we can start with
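For illustration, the sizing discussed above could be expressed as an eksctl managed node group roughly like the sketch below. The instance type, region, and counts are placeholders, not decisions from this thread:

```yaml
# Hypothetical eksctl config illustrating the node group sizing discussion.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-prow-build-cluster
  region: us-east-2            # placeholder region
managedNodeGroups:
  - name: prow-build-pool
    instanceType: r5d.4xlarge  # placeholder instance type
    minSize: 10                # start small, per the suggestion above
    maxSize: 300               # the upper bound mentioned for production
    desiredCapacity: 10
    volumeSize: 100            # GiB root volume
    labels:
      role: prow-build
```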
I tried creating a node group based on the instructions above (100-300 nodes based on
I'm going to request an increase of the vCPU limit.
We should set up cluster autoscaling using Karpenter (it really is a good fit for cloud scaling, and it's especially good on AWS). Maybe keep a small static node group to ensure that Karpenter has somewhere to run even if things break. Karpenter automatically tracks AWS instance pricing APIs and is able to mix spot and on-demand instances. I imagine we mainly want spot instances. Does that want to be its own issue?
We have jobs that run for a long time, so I'm not sure spot instances are a good fit for the different tests we have. Also, cost optimization is not really required at the moment.
Let's see how things go with job scheduling from Prow before we start taking a look at Karpenter. @sftim, if you want to try things with Karpenter, reach out on Slack to get access.
I agree that we should give Karpenter a try, but let's come up with a working setup first and add it later (I believe the working setup is the first priority right now). Spot instances might indeed be problematic: our tests can already be flaky, and I'm worried spot instances would make that even worse.
I think @sftim is a Karpenter expert by now, but I work on Karpenter and am happy to assist if you decide to use it. I'm part of the EKS compute team, so if you run into any EKS issues, feel free to drag me in as well.
cc @ellistarn as well :)
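For context, if Karpenter is tried later, the configuration would be roughly shaped like the sketch below, using the v1alpha5 Provisioner API current at the time. Everything here is illustrative; notably, it pins capacity to on-demand, given the concerns above about long-running jobs on spot:

```yaml
# Sketch of a Karpenter v1alpha5 Provisioner; all values are illustrative.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: prow-build
spec:
  requirements:
    # On-demand only, per the flakiness concerns above; "spot" could be
    # added to this list later if it proves safe for long-running jobs.
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: "4000"              # cap on total provisioned vCPUs
  providerRef:
    name: default              # references an AWSNodeTemplate
  ttlSecondsAfterEmpty: 300    # remove empty nodes after 5 minutes
```

A matching AWSNodeTemplate (referenced by `providerRef`) would supply the subnets and security groups.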
Here's the current status regarding requirements:
Prow is configured to use the new build cluster and it works as expected. However, there are still some tasks we need to take care of before closing this issue.
This was added to replicate the GKE build clusters, but it's not actually needed. GCP doesn't actually offer the possibility to have local disks bigger than 375 GB. I think it's OK to pick a single-disk instance type (e.g.
If you're using the base EKS AMIs, you'll need custom user data to have pods use the local disk storage if you choose an instance type that has it. There is a PR at https://github.com/awslabs/amazon-eks-ami/pulls that starts to build this in, but it hasn't been merged yet.
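Until something like that lands upstream, one possible workaround is to format and mount the instance store at bootstrap via eksctl's `preBootstrapCommands`. This is a sketch only, untested, and assumes an instance type with local NVMe storage; the device name `/dev/nvme1n1` in particular varies by instance type:

```yaml
# Sketch only, untested: mount the NVMe instance store for container storage.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-prow-build-cluster
  region: us-east-2             # placeholder region
managedNodeGroups:
  - name: prow-build-pool
    instanceType: r5d.4xlarge   # placeholder type with local NVMe storage
    preBootstrapCommands:
      # Assumed device name; check the instance type's NVMe layout first.
      - mkfs.ext4 -F /dev/nvme1n1
      - mkdir -p /var/lib/containerd
      - mount /dev/nvme1n1 /var/lib/containerd
```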
I will be taking a look at the monitoring stack for EKS.
Action items to take care of before closing the issue:
eks-prow-build-cluster is created and it has been running canary jobs for a few weeks now. I think it's time to close this issue. Let's use #5169 as a tracking issue for further improvements and enhancements.
@xmudrii: Closing this issue. In response to this:
Now that we have credits for 2023, we should investigate moving some prowjobs to AWS.
Create EKS build cluster(s) that match the existing GKE clusters:
The EKS build clusters should also be able to sync secrets from AWS Secrets Manager (see the sketch below).
(I probably forgot a few things. Will update the issue.)
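A rough sketch of what that Secrets Manager sync could look like with the External Secrets Operator; the store name, namespace, and secret paths are made up for illustration:

```yaml
# Illustrative ExternalSecret; store name and secret keys are hypothetical.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: prow-github-token
  namespace: default
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # a ClusterSecretStore configured for AWS (e.g. via IRSA)
    kind: ClusterSecretStore
  target:
    name: github-token          # Kubernetes Secret to create in-cluster
  data:
    - secretKey: token
      remoteRef:
        key: prow/github-token  # hypothetical AWS Secrets Manager secret name
```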
/milestone v1.27
/area infra
/area infra/aws
/priority important-soon