
Instance Rolling-Update Discussion #1398

Closed
jaimehrubiks opened this issue May 26, 2021 · 7 comments

Comments

@jaimehrubiks
Contributor

jaimehrubiks commented May 26, 2021

I want to review the different alternatives and configs available for setting up workers and the ability to seamlessly do rolling updates.

  • Self-managed workers vs managed node groups
  • Instance refresh vs managed node group updates vs eks-rolling-update vs manual updates
  • One ASG spanning all AZs vs one ASG per AZ

Let's start with self-managed workers:

For that we have:

  • instance refresh + node-termination-handler in queue mode
  • eks-rolling-update script
  • manual (won't be discussed)

My issue with instance refresh (as you can read below, in the "original question" section) is that it apparently terminates instances before creating new ones. I'm not sure when this design is desirable, but that is how it behaves.

I suppose instance refresh can still work fine, but only when the ASG size is at least 2 and your pods always have a replica count of at least 2, given that one node at a time is terminated before its replacement is ready. In my experience, the node is terminated well before the ~120 seconds a new Amazon Linux instance takes to become ready.
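
For reference, this is roughly how instance refresh is enabled on a self-managed worker ASG in plain Terraform (a minimal sketch, not the module's actual wiring; resource and variable names are illustrative). Note that min_healthy_percentage only limits how many instances may be out of service at once; it does not change the terminate-before-launch ordering described above.

# Minimal sketch: instance refresh on a self-managed worker ASG.
# Resource and variable names are illustrative, not the module's internals.
resource "aws_autoscaling_group" "workers" {
  name_prefix         = "eks-workers-"
  min_size            = 2
  max_size            = 4
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.workers.id
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      # At most 50% of the group may be out of service at a time,
      # i.e. one node at a time in a 2-node group.
      min_healthy_percentage = 50
      # Seconds to wait after a replacement reaches InService before
      # continuing; roughly the boot + kubelet join time mentioned above.
      instance_warmup = 300
    }
  }
}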

The eks-rolling-update script is not bad, but I have had some issues with it in the past. I feel it is not safe to run in production completely unattended, as any failure can leave you in a bad state that requires manual correction. Again, this is just my experience; I suppose for some people it works great.

Then, for managed node groups:

AWS provides an out-of-the-box feature for doing rolling updates. Before going further, I should say that managed node groups used to have several disadvantages, although currently I think the only one left is that you cannot scale the groups to zero.

The rolling update is triggered automatically in most of the desired situations (AMI change, launch template change, ...). The only issue I have is with the way the update is performed. The author of this issue explains it well graphically in this PDF.

In my experience, this results in a single-node ASG ending up with 7 instances before it is scaled back down to 1, which is extremely inefficient, unless there was an issue on my side or some bug I am not aware of.

This is explained here (although not very clearly; it says "Increments the Auto Scaling group maximum size and desired size by one up to twice the number of Availability Zones in the Region that the Auto Scaling group is deployed in", and I'm not sure what that means): https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html

For a single-node ASG it took me more than 30 minutes, during which Terraform was running and waiting for the process to complete.

Given this, I would assume that in certain cases, like a production environment where you have at least 3 nodes (min=3), this is acceptable. But I may also have single-node ASGs in prod for very specific applications, and if each of those ASGs is going to be scaled up to 7 and then back down to 1, a full-cluster update would take a huge amount of time (not to mention that we would be paying for more instances than needed, possibly even expensive GPU instances).

A small optimization for certain edge cases could be to create more ASGs for the same worker role, one per AZ. This helps reduce the number of extra instances running during updates; see the sketch below.
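
A sketch of that idea with plain aws_eks_node_group resources, one per subnet/AZ (a minimal illustration; names and variables are hypothetical, and the module exposes this differently): each group then only surges within its own AZ during an update.

# Illustrative: one managed node group per subnet/AZ, so a version update
# only surges within that AZ instead of across the whole worker pool.
resource "aws_eks_node_group" "per_az" {
  for_each = toset(var.private_subnet_ids)

  cluster_name    = var.cluster_name
  node_group_name = "workers-${each.value}"
  node_role_arn   = var.node_role_arn
  subnet_ids      = [each.value]

  scaling_config {
    desired_size = 1
    min_size     = 1
    max_size     = 3
  }
}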

What do you think about it?

------ Original Question --------
I have tested the instance refresh example and overall it looks good and works fine. I would like to know if anyone has been using this in production, and what your thoughts on it are.

Specifically, I want to ask whether this solution is a good way to do "Rolling Updates" for nodes.

My main concern is that according to their documentation:

During an instance refresh, Amazon EC2 Auto Scaling takes a set of instances out of service, terminates them, and then launches a set of instances with the new configuration

And

Instances terminated before launch: When there is only one instance in the Auto Scaling group, starting an instance refresh can result in an outage because Amazon EC2 Auto Scaling terminates an instance and then launches a new instance

I conclude that there is no way to reverse the order (first create, then terminate), which is how this software https://github.com/hellofresh/eks-rolling-update works.

Should we assume that, for this to work without downtime, we need a minimum of 2 instances in each Auto Scaling group? And, in that case, should I assume that the remaining instances (total - 1) need enough spare capacity to absorb the pods from the just-terminated instance before the new one is created? Also, the cluster-autoscaler does not have native support for over-provisioning of nodes.
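
On the over-provisioning point: the usual workaround (not native to cluster-autoscaler, and not something this thread settled on) is a placeholder deployment of pause pods under a negative-priority PriorityClass; real workloads preempt the placeholders, and the evicted placeholders then trigger the autoscaler to add a node. A rough sketch with the Terraform kubernetes provider; all names, the image tag and the resource requests are placeholders.

# Rough sketch of capacity over-provisioning via low-priority pause pods.
resource "kubernetes_priority_class" "overprovisioning" {
  metadata {
    name = "overprovisioning"
  }
  description = "Placeholder pods that real workloads can preempt"
  value       = -1
}

resource "kubernetes_deployment" "overprovisioning" {
  metadata {
    name      = "overprovisioning"
    namespace = "kube-system"
  }
  spec {
    replicas = 1
    selector {
      match_labels = { run = "overprovisioning" }
    }
    template {
      metadata {
        labels = { run = "overprovisioning" }
      }
      spec {
        priority_class_name = kubernetes_priority_class.overprovisioning.metadata[0].name
        container {
          name  = "reserve-resources"
          image = "registry.k8s.io/pause:3.9"
          resources {
            requests = {
              cpu    = "1"
              memory = "1Gi"
            }
          }
        }
      }
    }
  }
}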

Can I assume that using managed workers is a better alternative than using self-managed worker groups at this point?

@lgg42

lgg42 commented May 27, 2021

Hello!

We're using managed workers and we're also experiencing this behavior: existing instances from the old ASG start being terminated before the new ASG has been created and its nodes have joined the cluster.

Example output with module version 15.2.0:

module.eks.module.node_groups.random_pet.node_groups["main-v5"]: Creating...
module.eks.module.node_groups.aws_eks_node_group.workers["main-v4"]: Destroying... [id=company-cluster-qa:company-cluster-qa-main-v4-usable-mole]
module.eks.module.node_groups.random_pet.node_groups["main-v5"]: Creation complete after 0s [id=modern-squid]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v5"]: Creating...
module.eks.module.node_groups.aws_eks_node_group.workers["main-v4"]: Still destroying... [id=company-cluster-qa:company-cluster-qa-main-v4-usable-mole, 10s elapsed]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v5"]: Still creating... [10s elapsed]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v4"]: Still destroying... [id=company-cluster-qa:company-cluster-qa-main-v4-usable-mole, 20s elapsed]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v5"]: Still creating... [20s elapsed]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v4"]: Still destroying... [id=company-cluster-qa:company-cluster-qa-main-v4-usable-mole, 30s elapsed]
module.eks.module.node_groups.aws_eks_node_group.workers["main-v5"]: Still creating... [30s elapsed]

With older versions of the module this didn't happen: first the new ASG was created, its nodes joined the cluster and entered the Ready state, and then the old ASG was terminated one instance at a time.

@jaimehrubiks jaimehrubiks changed the title Instance Refresh Discussion Instance Rolling-Update Discussion May 28, 2021
@jaimehrubiks
Contributor Author

jaimehrubiks commented May 28, 2021

In your example, @lgg42, you are talking about node groups, while instance refresh is used for self-managed workers. Still, interesting. I will test the new commit that removes random_pet to see how it behaves.

I updated the issue description with more info

@stale

stale bot commented Aug 26, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 26, 2021
@bryantbiggs
Member

maybe this is of interest hashicorp/terraform-provider-aws#20009
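
For anyone landing here later: the AWS provider exposes an update_config block on aws_eks_node_group, which caps how many nodes can be replaced in parallel during a managed node group version update. A minimal, illustrative sketch (values and variable names are not from this thread):

# Illustrative: update_config limits the parallelism of node replacement
# during a managed node group update.
resource "aws_eks_node_group" "main" {
  cluster_name    = var.cluster_name
  node_group_name = "main"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 4
  }

  update_config {
    # Replace at most one node at a time.
    max_unavailable = 1
    # Alternatively: max_unavailable_percentage = 33
  }
}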

@stale stale bot removed the stale label Aug 26, 2021
@stale

stale bot commented Sep 25, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 25, 2021
@stale

stale bot commented Oct 3, 2021

This issue has been automatically closed because it has not had recent activity since being marked as stale.

@stale stale bot closed this as completed Oct 3, 2021
@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 18, 2022