A workaround for the AZ rebalancing feature of AWS Auto Scaling Groups: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#common-notes-and-gotchas. See https://chrisdodds.net/kubernetes-ec2-autoscaling-for-fun-and-profit/ for more on why the official "fix" sucks.
- Based on the kube-aws deployment and kube-aws resources.
- Transformed into a Helm chart so it can be used in any EKS cluster.
- The final solution gracefully evicts pods, even single replicas or deployments without a PDB, with `kubectl rollout restart`.
The ASG must provide a lifecycle hook named `kube-asg-node-drainer` so that the node drainer script can notify the ASG after completing the pod eviction:

```sh
aws cloudformation deploy --template-file cf/lifecyclehook.yml --stack-name kube-asg-node-drainer-hook --parameter-overrides AsgName=<YOUR_ASG_NAME>
```
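Once eviction is finished, the notification the drainer sends boils down to a call like the following (a sketch only; the chart's script performs this itself, assuming the hook is named `kube-asg-node-drainer` as above):

```sh
# Tell the ASG it may proceed with terminating the drained instance.
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name kube-asg-node-drainer \
  --auto-scaling-group-name <YOUR_ASG_NAME> \
  --instance-id <instance-id> \
  --lifecycle-action-result CONTINUE
```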
This chart assumes that the worker node is provided with an IAM role to access ASG resources:

```sh
aws cloudformation deploy --template-file cf/noderole.yml --stack-name kube-asg-node-worker-role
```
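A quick sanity check, assuming the role grants `autoscaling:Describe*` permissions and is attached to the node's instance profile (run it from a worker node; it should return data rather than an AccessDenied error):

```sh
# Should list an instance if the node's IAM role can read ASG resources.
aws autoscaling describe-auto-scaling-instances --max-items 1
```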
If a project uses kube2iam, one can use `iamRole` in `values.yml` to assign an IAM role to the `kube-asg-node-drainer` pods.
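The same value can also be set at install time; a minimal sketch, assuming the chart is installed from a local checkout and using a placeholder role name:

```sh
helm upgrade --install kube-asg-node-drainer ./chart \
  --set iamRole=kube-asg-node-drainer-role   # placeholder: your kube2iam role
```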
How to test that things work:
- terminate an instance in the desired ASG: `aws autoscaling terminate-instance-in-auto-scaling-group --no-should-decrement-desired-capacity --instance-id <instance-id>`
- the instance/node is marked with `Terminating:Wait`
- `kube-asg-node-drainer` will start gracefully evicting the pods
- the autoscaler replaces the node `instance-id` with the new one
- pods move from the terminating instance to the new ones
During that period one can verify that the app didn't go down with something like:

```sh
while true; do date; curl <app_health_check>; echo ''; sleep 5; done
```
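One can also watch the pods relocate to the replacement node while the old one drains (a sketch; `<your_namespace>` is a placeholder):

```sh
# -o wide shows which node each pod is scheduled on; -w streams the changes.
kubectl get pods -o wide -w -n <your_namespace>
```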
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#common-notes-and-gotchas, third comment:

> On creation time, the ASG will have the AZRebalance process enabled, which means it will actively work to balance the number of instances between AZs, and possibly terminate instances. If your applications could be impacted from sudden termination, you can either suspend the AZRebalance feature, or use a tool for automatic draining upon ASG scale-in such as the https://github.com/aws-samples/amazon-k8s-node-drainer.
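If suspending AZRebalance is an acceptable trade-off for a given cluster, it can be done directly on the ASG (substitute the ASG name):

```sh
# Stops the ASG from terminating instances purely to rebalance across AZs.
aws autoscaling suspend-processes --auto-scaling-group-name <YOUR_ASG_NAME> --scaling-processes AZRebalance
```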
As of 2020-05-22 known workaround solutions are:
Any of these solutions will face an issue when draining a node/evicting a pod of a single-replica deployment: there will still be downtime for the app. A PodDisruptionBudget won't help in this case because of kubernetes/kubernetes#66811. This limitation is best described in kubernetes/kubernetes#66811 (comment).
Bottom line: as of today (2020-05-22) the best solution we can provide is:
- cordon the node: `kubectl cordon <node>`
- trigger a restart for pods requiring graceful eviction: `kubectl rollout restart deployment/<deployment_name>`
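Put together, a minimal sketch of that sequence (node and deployment names are placeholders):

```sh
kubectl cordon <node>
kubectl rollout restart deployment/<deployment_name>
# Wait until the restarted pods are ready on other nodes.
kubectl rollout status deployment/<deployment_name>
```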