Instance Rolling-Update Discussion #1398
Hello! We're using managed workers and we're also experiencing this behavior. Existing instances from the old ASG start being terminated before the new ASG is created and its nodes have joined the cluster. Example output with module version: 15.2.0
With older versions of the module this didn't happen: first the new ASG was created, the nodes joined the cluster and entered Ready state, and only then was the old ASG terminated, one instance at a time. |
In your example, @lgg42, you are talking about node groups, but instance refresh is used for self-managed workers. Still, interesting. I will test the new commit that removes random_pets to see how it behaves. I updated the issue description with more info. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
maybe this is of interest hashicorp/terraform-provider-aws#20009 |
This issue has been automatically closed because it has not had recent activity since being marked as stale. |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further. |
I want to review the different alternatives and configs available for setting up workers and the ability to seamlessly do rolling updates.
Let's start with self-managed workers.
For those, we have two main options: instance refresh and the rolling-update script.
My issue with instance refresh (as you can read below, in the "Original Question" section) is that, apparently, it terminates instances before creating new ones. I'm not sure when this design is desirable, but that is how it works.
I suppose instance refresh can still work fine, but only when the ASG has at least 2 instances and your pods always run with a replica count of at least 2, given that one node at a time is terminated before its replacement is ready. In my experience, the node is terminated well before the ~120 seconds a new (Amazon Linux) instance takes to become Ready.
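For reference, this is roughly what enabling instance refresh on a plain self-managed worker ASG looks like when declared directly on the resource — a minimal sketch, with resource names and values that are illustrative (not taken from the module):

```hcl
resource "aws_autoscaling_group" "workers" {
  name                = "eks-workers"
  min_size            = 2 # keep >= 2 so one node can be replaced at a time
  max_size            = 4
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.workers.id
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      # Keep at least half of the group InService during the refresh,
      # so capacity never drops to zero while nodes are replaced.
      min_healthy_percentage = 50
      # Seconds to wait for a replacement instance before moving on.
      instance_warmup = 300
    }
  }
}
```

Note that even with `min_healthy_percentage`, the refresh still terminates before replacing within each batch, which is exactly the concern above.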
The rolling-update script is not bad, but I have had some issues with it in the past. I feel it is not safe to use in production in a completely unattended way, as any failure might leave you in a bad state that requires manual correction. Again, this is my experience; I suppose for some people it works great.
Then, for managed node groups:
AWS provides an out-of-the-box feature for doing rolling updates. Before going forward, I must say that managed node groups used to have some disadvantages, although, currently, I think the only one left is that you cannot scale the groups to zero.
The rolling-update system is automatically triggered in most desired situations (AMI change, template change, etc.). The only issue I have is with the way the updates are performed. The author of this issue explains it well, graphically, in this PDF.
In my experience, this results in a case where a single-node ASG ends up with 7 instances before it is scaled down to 1 again. This is extremely inefficient. Unless there was an issue on my side, or some kind of bug I am not aware of.
This is explained here (although not very clearly; I'm not sure what "Increments the Auto Scaling group maximum size and desired size by one up to twice the number of Availability Zones in the Region that the Auto Scaling group is deployed in" is supposed to mean): https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html
For a single-node ASG it took me more than 30 minutes, during which Terraform kept running, waiting for the process to complete.
I would assume that in certain cases, like a production environment with at least 3 nodes (min=3), this is acceptable. But I may have single-node ASGs in prod for very specific applications, and if each of those ASGs is scaled up to 7 and back down to 1, a full-cluster update would take a very long time (not to mention that we would be paying for more instances than needed, possibly even expensive GPU instances).
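For completeness, the one knob managed node groups do expose for rolling updates is `update_config`, which bounds how many nodes are replaced in parallel (it does not prevent the surge above desired size). A minimal sketch, with names and values that are illustrative:

```hcl
resource "aws_eks_node_group" "workers" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    desired_size = 1
    min_size     = 1
    max_size     = 3
  }

  # Bound how many nodes EKS takes down in parallel during a managed
  # rolling update; EKS still surges extra instances above desired_size.
  update_config {
    max_unavailable = 1
  }
}
```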
A small optimization for certain edge cases could be to create more ASGs for the same worker roles, one for each AZ. This can be helpful to reduce the high amount of instances during the updates.
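Since the surge scales with the number of AZs a group spans, the per-AZ split could be sketched like this (resource and variable names are hypothetical):

```hcl
# One single-AZ ASG per availability zone for the same worker role,
# so each group's update surge stays small and independent.
resource "aws_autoscaling_group" "workers_per_az" {
  for_each = toset(var.availability_zones)

  name             = "eks-workers-${each.key}"
  min_size         = 1
  max_size         = 2
  desired_capacity = 1

  # A single subnet (one AZ) per group limits the surge multiplier,
  # which grows with the number of AZs the group is deployed in.
  vpc_zone_identifier = [var.subnet_id_by_az[each.key]]

  launch_template {
    id      = aws_launch_template.workers.id
    version = "$Latest"
  }
}
```

The trade-off is more resources to manage and per-AZ capacity that the cluster-autoscaler must balance.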
What do you think about it?
------ Original Question --------
I have tested the instance refresh example and overall it looks good and works fine. I would like to know if anyone has been using this in production, and what are your thoughts about it.
Specifically, I want to ask whether this solution is a good method for doing "rolling updates" of nodes.
My main concern is that according to their documentation:
And
I conclude that there is no way to reverse the order (first create, then terminate), which is how this software https://github.com/hellofresh/eks-rolling-update works.
Should we assume that, for this to work without downtime, we need a minimum of 2 instances in each Auto Scaling group? And, in that case, should I assume that the remaining instances (total - 1) need enough spare capacity to absorb the pods from the just-terminated instance before the new one is created? Also, the cluster-autoscaler does not have native support for overprovisioning nodes.
Can I assume that using managed workers is a better alternative than self-managed worker groups at this point?