
Instance Rolling-Update Discussion #1398

Closed
jaimehrubiks opened this issue May 26, 2021 · 7 comments

Comments

@jaimehrubiks
Contributor

jaimehrubiks commented May 26, 2021

I want to review the different alternatives and configs available for setting up workers and the ability to seamlessly do rolling updates.

  • Self-managed workers vs managed node groups
  • Instance refresh vs managed node group updates vs eks-rolling-update vs manual updates
  • One ASG spanning all AZs vs one ASG per AZ

Let's start with self-managed workers:

For that we have:

  • instance refresh + node-termination-handler in queue mode
  • eks-rolling-update script
  • manual (won't be discussed)

My issue with instance refresh (as you can read below, in the "original question" section) is that it apparently terminates instances before creating new ones. I'm not sure when this design is desirable, but that is how it behaves.

I suppose instance refresh can still work fine, but only when the ASG size is at least 2 and your pods always have a replica count of at least 2, given that one node at a time is terminated before its replacement is ready. In my experience, the node is terminated well before the ~120 seconds a new Amazon Linux instance takes to become ready.
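
For reference, this is roughly how instance refresh is enabled on a self-managed worker ASG in plain Terraform (a minimal sketch, not the module's actual wiring; resource and variable names are illustrative). Note that min_healthy_percentage only limits how many instances may be out of service at once; it does not change the terminate-before-launch ordering described above.

# Minimal sketch: instance refresh on a self-managed worker ASG.
# Resource and variable names are illustrative, not the module's internals.
resource "aws_autoscaling_group" "workers" {
  name_prefix         = "eks-workers-"
  min_size            = 2
  max_size            = 4
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.workers.id
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      # At most 50% of the group may be out of service at a time,
      # i.e. one node at a time in a 2-node group.
      min_healthy_percentage = 50
      # Seconds to wait after a replacement reaches InService before
      # continuing; roughly the boot + kubelet join time mentioned above.
      instance_warmup = 300
    }
  }
}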

The eks-rolling-update script is not bad, but I have had some issues with it in the past. I feel it is not safe to run in production completely unattended, as any failure can leave you in a bad state that requires manual correction. Again, this is just my experience; I suppose for some people it works great.

Then, for managed node groups:

AWS provides an out-of-the-box feature for doing rolling updates. Before going further, I should say that managed node groups used to have several disadvantages, although currently I think the only one left is that you cannot scale the groups to zero.

The rolling update is triggered automatically in most of the desired situations (AMI change, launch template change, ...). The only issue I have is with the way the update is performed. The author of this issue explains it well graphically in this PDF.

In my experience, this results in a single-node ASG ending up with 7 instances before it is scaled back down to 1, which is extremely inefficient, unless there was an issue on my side or some bug I am not aware of.

This is explained here (although not very clearly; it says "Increments the Auto Scaling group maximum size and desired size by one up to twice the number of Availability Zones in the Region that the Auto Scaling group is deployed in", and I'm not sure what that means): https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html

For a single-node ASG it took me more than 30 minutes, during which Terraform was running and waiting for the process to complete.

Given this, I would assume that in certain cases, like a production environment where you have at least 3 nodes (min=3), this is acceptable. But I may also have single-node ASGs in prod for very specific applications, and if each of those ASGs is going to be scaled up to 7 and then back down to 1, a full-cluster update would take a huge amount of time (not to mention that we would be paying for more instances than needed, possibly even expensive GPU instances).

A small optimization for certain edge cases could be to create more ASGs for the same worker role, one per AZ. This helps reduce the number of extra instances running during updates; see the sketch below.
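
A sketch of that idea with plain aws_eks_node_group resources, one per subnet/AZ (a minimal illustration; names and variables are hypothetical, and the module exposes this differently): each group then only surges within its own AZ during an update.

# Illustrative: one managed node group per subnet/AZ, so a version update
# only surges within that AZ instead of across the whole worker pool.
resource "aws_eks_node_group" "per_az" {
  for_each = toset(var.private_subnet_ids)

  cluster_name    = var.cluster_name
  node_group_name = "workers-${each.value}"
  node_role_arn   = var.node_role_arn
  subnet_ids      = [each.value]

  scaling_config {
    desired_size = 1
    min_size     = 1
    max_size     = 3
  }
}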

What do you think about it?

------ Original Question --------
I have tested the instance refresh example and overall it looks good and works fine. I would like to know if anyone has been using this in production, and what your thoughts on it are.

Specifically, I want to ask whether this solution is a good way to do "Rolling Updates" for nodes.

My main concern is that according to their documentation:

During an instance refresh, Amazon EC2 Auto Scaling takes a set of instances out of service, terminates them, and then launches a set of instances with the new configuration

And

Instances terminated before launch: When there is only one instance in the Auto Scaling group, starting an instance refresh can result in an outage because Amazon EC2 Auto Scaling terminates an instance and then launches a new instance

I conclude that there is no way to reverse the order (first create, then terminate), which is how this software https://github.com/hellofresh/eks-rolling-update works.

Should we assume that, for this to work without downtime, we need a minimum of 2 instances in each Auto Scaling group? And, in that case, should I assume that the remaining instances (total - 1) need enough spare capacity to absorb the pods from the just-terminated instance before the new one is created? Also, the cluster-autoscaler does not have native support for over-provisioning of nodes.
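
On the over-provisioning point: the usual workaround (not native to cluster-autoscaler, and not something this thread settled on) is a placeholder deployment of pause pods under a negative-priority PriorityClass; real workloads preempt the placeholders, and the evicted placeholders then trigger the autoscaler to add a node. A rough sketch with the Terraform kubernetes provider; all names, the image tag and the resource requests are placeholders.

# Rough sketch of capacity over-provisioning via low-priority pause pods.
resource "kubernetes_priority_class" "overprovisioning" {
  metadata {
    name = "overprovisioning"
  }
  description = "Placeholder pods that real workloads can preempt"
  value       = -1
}

resource "kubernetes_deployment" "overprovisioning" {
  metadata {
    name      = "overprovisioning"
    namespace = "kube-system"
  }
  spec {
    replicas = 1
    selector {
      match_labels = { run = "overprovisioning" }
    }
    template {
      metadata {
        labels = { run = "overprovisioning" }
      }
      spec {
        priority_class_name = kubernetes_priority_class.overprovisioning.metadata[0].name
        container {
          name  = "reserve-resources"
          image = "registry.k8s.io/pause:3.9"
          resources {
            requests = {
              cpu    = "1"
              memory = "1Gi"
            }
          }
        }
      }
    }
  }
}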

Can I assume that using managed workers is a better alternative than using self-managed worker groups at this point?

@lgg42

lgg42 commented May 27, 2021

Hello!

We're using managed workers and we're also experiencing this behavior: existing instances from the old ASG start being terminated before the new ASG has been created and its nodes have joined the cluster.

Example output with module version 15.2.0:

module.eks.module.node_groups.random_pet.node_groups["main-v5"]: Creating...
module.eks.module.node_groups.aws_eks_node_group.workers["main-v4"]: Destroying... [id=company-cluster-qa:company-cluster-qa-main-v4-usable-mole]
module.eks.module.node_groups.random_pet.node_groups["main-v5"]: Creation complete after 0s [id=modern-squid]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v5"]: Creating...
module.eks.module.node_groups.aws_eks_node_group.workers["main-v4"]: Still destroying... [id=company-cluster-qa:company-cluster-qa-main-v4-usable-mole, 10s elapsed]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v5"]: Still creating... [10s elapsed]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v4"]: Still destroying... [id=company-cluster-qa:company-cluster-qa-main-v4-usable-mole, 20s elapsed]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v5"]: Still creating... [20s elapsed]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v4"]: Still destroying... [id=company-cluster-qa:company-cluster-qa-main-v4-usable-mole, 30s elapsed]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v5"]: Still creating... [30s elapsed]

With older versions of the module this didn't happen: first the new ASG was created, its nodes joined the cluster and entered the Ready state, and then the old ASG was terminated one instance at a time.

@jaimehrubiks jaimehrubiks changed the title Instance Refresh Discussion Instance Rolling-Update Discussion May 28, 2021
@jaimehrubiks
Contributor Author

jaimehrubiks commented May 28, 2021

In your example, @lgg42, you are talking about node groups, while instance refresh is used for self-managed workers. Still, interesting. I will test the new commit that removes random_pet to see how it behaves.

I updated the issue description with more info

@stale

stale bot commented Aug 26, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 26, 2021
@bryantbiggs
Member

maybe this is of interest hashicorp/terraform-provider-aws#20009
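
For anyone landing here later: the AWS provider exposes an update_config block on aws_eks_node_group, which caps how many nodes can be replaced in parallel during a managed node group version update. A minimal, illustrative sketch (values and variable names are not from this thread):

# Illustrative: update_config limits the parallelism of node replacement
# during a managed node group update.
resource "aws_eks_node_group" "main" {
  cluster_name    = var.cluster_name
  node_group_name = "main"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 4
  }

  update_config {
    # Replace at most one node at a time.
    max_unavailable = 1
    # Alternatively: max_unavailable_percentage = 33
  }
}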

@stale stale bot removed the stale label Aug 26, 2021
@stale

stale bot commented Sep 25, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 25, 2021
@stale

stale bot commented Oct 3, 2021

This issue has been automatically closed because it has not had recent activity since being marked as stale.

@stale stale bot closed this as completed Oct 3, 2021
@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 18, 2022