[EKS] [request]: Cluster init job to bootstrap new clusters #1666
Comments
@bryantbiggs I think #1559 would be an alternative to this, assuming the required behaviours were provided for. If I ever get some free time it's near the top of my project list, although I might create a more generic operator suitable for all kubeadm (style) clusters. For EKS my use case is managing the default kubeadm services (CoreDNS & kube-proxy) and the CNI (aws-vpc-cni). I'd also like to see it optionally support the "required" services (aws-load-balancer-controller, cluster-autoscaler, aws-node-termination-handler, metrics-server, etc.).
Not quite - I am looking to avoid this statement from #1559. The overall idea I am pitching here is two parts, mostly centered on the 2nd:
@bryantbiggs I suspect that any solution to this is going to require compute somewhere; this issue would need something to run the bootstrap script, which IMHO makes it equivalent to #1559.
I'm pretty sure you could do this already with a Lambda function - for TF, an aws_lambda_invocation resource. You could then run further config once this function has completed. I don't think you'd have a dependency on #923, as your bootstrap could clean all of this up before any nodes are connected. This is something that could be added to the OSS EKS TF module: expose a variable to pass in a function to execute on create, and maybe a peer module to create the Lambda and execute some shell script with the correct dependencies.
For #1559, the operator could be run in Fargate (like Karpenter can be, for the same reason) and would be expected to use host networking so it can precede CNI configuration. Obviously, with an operator it wouldn't matter if it was run on a cluster with incorrect config and attached nodes, as it's running a desired-state control loop with eventual consistency.
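A minimal sketch of that aws_lambda_invocation idea, assuming a bootstrap function (named eks_bootstrap here, purely hypothetical) already exists with network access and permissions to reach the new cluster's API:

```hcl
# Invoke the bootstrap Lambda once the control plane is active.
# aws_lambda_function.eks_bootstrap and the payload shape are hypothetical.
resource "aws_lambda_invocation" "cluster_init" {
  function_name = aws_lambda_function.eks_bootstrap.function_name

  input = jsonencode({
    cluster_name = aws_eks_cluster.this.name
    action       = "bootstrap" # e.g. strip default addons, install the desired CNI / GitOps controller
  })

  # Only invoke after the cluster exists.
  depends_on = [aws_eks_cluster.this]
}
```

Node groups (or anything else that needs the bootstrapped state) could then declare a dependency on the invocation so they only launch once it has completed.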
Couldn't you use Terraform to execute the commands you want to run at cluster creation time? For example:
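A minimal sketch of that kind of example, assuming the cluster, pod execution role, and private subnets are defined elsewhere (all names are placeholders):

```hcl
# A Fargate profile so Karpenter's own pods can run before any EC2 nodes exist.
# aws_eks_cluster.this, aws_iam_role.fargate_pod_execution and aws_subnet.private
# are assumed to be defined elsewhere.
resource "aws_eks_fargate_profile" "karpenter" {
  cluster_name           = aws_eks_cluster.this.name
  fargate_profile_name   = "karpenter"
  pod_execution_role_arn = aws_iam_role.fargate_pod_execution.arn
  subnet_ids             = aws_subnet.private[*].id

  selector {
    namespace = "karpenter"
  }
}
```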
In this example they are creating a Fargate profile to host the Karpenter app.
yes, I am familiar with those since I created them 😬 - this request was more about how best to transition from infrastructure provisioning over to cluster provisioning smoothly and seamlessly. The somewhat ideal, high-level flow being:
There are of course other, more complex scenarios, such as standing up a cluster using a different CNI without chaining, managing core addons such as CoreDNS through the GitOps controller, etc. But that's the gist of it - provision the cluster, somehow get a controller installed and pointed at manifests, and have said controller reconcile the intended cluster state. No hacks, no manual intervention steps, no multi-step applies required (ideally).
Any progress on this issue? Or is everyone using the workaround?
Community Note
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
This is somewhat related to #923 (reliant on it, actually), but I didn't want to muck up that feature request with additional noise since on its own it's quite good and would be a great addition.
When provisioning new clusters, there are a number of repetitious steps that unfortunately have to be performed manually, especially when treating clusters as temporary/ephemeral (i.e. blue/green cluster upgrades, sandbox clusters, etc.). An example would be:
a. One route is to remove the VPC CNI permissions from the node group and rely on IRSA to provide the VPC CNI the necessary permissions, where the VPC CNI is a managed addon. I cannot launch a node group without the VPC CNI permissions and rely only on the IRSA role for the addon, because it's a chicken-and-egg problem: I have to provision first with the node having the permissions, then come back, remove them, and rely on the IRSA only (a sketch of this IRSA-based addon configuration follows after this list)
b. Another route is where I don't want to use the VPC CNI and I need to delete it and provision the desired CNI/configuration. Again, this can only be done manually since EKS makes the default assumption to use the VPC CNI (I assume for a good "out of the box" experience, which makes sense)
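As a rough illustration of route (a), the sketch below shows the managed addon being handed an IRSA role; it assumes the OIDC provider and the IRSA role (with the CNI policy attached) are created elsewhere, and all names are placeholders:

```hcl
# Manage aws-vpc-cni as an EKS managed addon and give it an IRSA role so the
# node IAM role no longer needs the CNI permissions.
# aws_iam_role.vpc_cni_irsa is assumed to exist with AmazonEKS_CNI_Policy attached.
resource "aws_eks_addon" "vpc_cni" {
  cluster_name             = aws_eks_cluster.this.name
  addon_name               = "vpc-cni"
  service_account_role_arn = aws_iam_role.vpc_cni_irsa.arn
}
```

The chicken-and-egg part is that nodes launched before this wiring exists still need the CNI permissions on the node role, which is why it currently takes a second pass to remove them.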
✨ The Request ✨
The ability to provide a cluster init job that is run once upon cluster creation (think user data, but for clusters). The changes in #923 are required to make this work without race conditions. If default components are installed, it's a race to get the default components removed from the cluster before nodes come up using them, and the desired components installed (i.e. CoreDNS, CNI, etc.). If they are not on the cluster to begin with, users simply provide their desired configurations via the init job. The init job needs to run in a loop for those tools that are not designed as operators; but an operator (which itself is its own control loop) is the preferred approach. The longer form for ArgoCD is shown below - the preferred alternative would be to use the ArgoCD operator instead.
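As a rough illustration of that longer form (not necessarily the exact shape intended), today this tends to be approximated from outside the cluster, e.g. with Terraform's Helm provider; a minimal sketch, assuming the Helm provider is already configured against the new cluster:

```hcl
# Install Argo CD itself (the "longer form") rather than going through the
# Argo CD operator. Repository/chart are the Argo project's published chart;
# the namespace and release name are illustrative.
resource "helm_release" "argo_cd" {
  name             = "argo-cd"
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  namespace        = "argocd"
  create_namespace = true
}
```

A further step (another release, kubectl apply, etc.) would then point Argo CD at the repository of manifests it should reconcile - exactly the kind of multi-step apply the requested init job is meant to fold into cluster creation.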
This init job also alleviates the shortcomings of running a Fargate-only cluster, where users have to manually intervene to configure CoreDNS for Fargate.
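For context, that manual intervention has typically meant patching the CoreDNS deployment so its pods can be scheduled onto Fargate (at least on older platform versions); a hedged sketch of how this is often scripted from Terraform today - the null_resource/local-exec wrapper and kubeconfig handling are assumptions:

```hcl
# Remove the compute-type annotation that pins CoreDNS to EC2 nodes, then restart
# it so the pods are rescheduled onto Fargate. Assumes kubectl is on the PATH and
# the current kubeconfig already points at the new cluster.
# aws_eks_fargate_profile.kube_system (a profile matching the kube-system
# namespace) is assumed to exist.
resource "null_resource" "patch_coredns_for_fargate" {
  provisioner "local-exec" {
    command = <<-EOT
      kubectl patch deployment coredns -n kube-system --type json \
        -p '[{"op":"remove","path":"/spec/template/metadata/annotations/eks.amazonaws.com~1compute-type"}]'
      kubectl rollout restart deployment coredns -n kube-system
    EOT
  }

  depends_on = [aws_eks_fargate_profile.kube_system]
}
```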
This lifecycle roughly looks like the following:
The init job is retried x number of times, for up to a y timeout (standard job specs within the manifest). The key will be how to properly surface failures to users when the init job does not succeed - because nodes will most likely fail to join the cluster as well in that scenario.
Which service(s) is this request for?
EKS
Are you currently working around this issue?
A mix of hacky scripts and manual intervention - "playbooks" if you will
Additional context
We need the ability to set custom CNI networking upon creating a cluster - terraform-aws-modules/terraform-aws-eks#1822
[EKS] [request]: API flag to initialize completely bare EKS cluster #923 (comment)
[EKS] [CNI]: Optional Default CNI Plugin Installation #71
[EKS] [request]: 1-click upgrades support blue/green or reversibility #1592
[EKS] [addon]: Support Flux V2 in EKS addons #1593
Opt-Out AWS VPC CNI (And Any Other EKS "Magics") awslabs/amazon-eks-ami#117
Ability to choose another CNI amazon-vpc-cni-k8s#214
What are consequences of disabling the CNI plugin in EKS? amazon-vpc-cni-k8s#176
https://github.com/aws-quickstart/quickstart-amazon-eks/issues/51
[EKS] [request]: Manage IAM identity cluster access with EKS API #185 is tangentially related due to the various hacks that people have to employ to initially provision access into the cluster via a codified process. The approach here can help alleviate some of those issues - instead of giving access to the role that created the cluster (usually a CI/CD role that is quite elevated and heavily protected), users can offload the aws-auth configmap to be managed by their GitOps operator and/or update the configmap upon cluster creation.