Merge pull request #53 from helgi/feature/env-handling
deitch authored Jun 7, 2021
2 parents d93b32e + e6e6dd6 commit f2fd590
Showing 13 changed files with 221 additions and 178 deletions.
25 changes: 14 additions & 11 deletions README.md
@@ -226,23 +226,26 @@ Several key areas of potential modification:
## Configuration
ASG Roller takes its configuration via environment variables. All environment variables that affect ASG Roller begin with `ROLLER_`.

* `ROLLER_ASG`: comma-separated list of auto-scaling groups that should be managed.
* `ROLLER_KUBERNETES`: If set to `true`, will check if a new node is ready vis-a-vis Kubernetes before declaring it "ready", and will drain an old node before eliminating it. Defaults to `true` when running in Kubernetes as a pod, `false` otherwise.
* `ROLLER_IGNORE_DAEMONSETS`: If set to `false`, will not reclaim a node until there are no DaemonSets running on the node; if set to `true` (default), will reclaim node when all regular pods are drained off, but will ignore the presence of DaemonSets, which should be present on every node anyways. Normally, you want this set to `true`, which is the default.
* `ROLLER_DELETE_LOCAL_DATA`: If set to `false` (default), will not reclaim a node until there are no pods with [emptyDir](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir) running on the node; if set to `true`, will continue to terminate the pod and delete the local data before reclaiming the node. The default is `false` to maintain backward compatibility.
* `ROLLER_CHECK_DELAY`: Time, in seconds, between checks of ASG status.
* `ROLLER_CAN_INCREASE_MAX`: If set to `true`, will increase the ASG maximum size to accommodate the increase in desired count. If set to `false`, will instead error when desired is higher than max.
* `ROLLER_ORIGINAL_DESIRED_ON_TAG`: If set to `true`, will store the original desired value of the ASG as a tag on the ASG, with the key `aws-asg-roller/OriginalDesired`. This helps maintain state in the situation where the process terminates.
* `ROLLER_VERBOSE`: If set to `true`, will increase verbosity of logs.
* `KUBECONFIG`: Path to kubernetes config file for authenticating to the kubernetes cluster. Required only if `ROLLER_KUBERNETES` is `true` and we are not operating in a kubernetes cluster.
* `ROLLER_ASG` [`string`, required]: comma-separated list of auto-scaling groups that should be managed.
* `ROLLER_KUBERNETES` [`bool`, default: `true`]: If set to `true`, will check if a new node is ready vis-a-vis Kubernetes before declaring it "ready", and will drain an old node before eliminating it. Defaults to `true` when running in Kubernetes as a pod, `false` otherwise.
* `ROLLER_DRAIN` [`bool`, default: `true`]: If set to `true`, will handle draining of pods and other Kubernetes resources. Consider setting it to `false` if your distribution has a built-in drain on terminate.
* `ROLLER_DRAIN_FORCE` [`bool`, default: `true`]: If set to `true`, draining will force-delete Kubernetes resources even when doing so violates a PodDisruptionBudget (PDB) or a grace period.
* `ROLLER_IGNORE_DAEMONSETS` [`bool`, default: `true`]: If set to `false`, will not reclaim a node until there are no DaemonSets running on the node; if set to `true` (default), will reclaim node when all regular pods are drained off, but will ignore the presence of DaemonSets, which should be present on every node anyways. Normally, you want this set to `true`.
* `ROLLER_DELETE_LOCAL_DATA` [`bool`, default: `false`]: If set to `false` (default), will not reclaim a node until there are no pods with [emptyDir](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir) running on the node; if set to `true`, will continue to terminate the pod and delete the local data before reclaiming the node. The default is `false` to maintain backward compatibility.
* `ROLLER_INTERVAL` [`time.Duration`, default: `30s`]: Time between roller runs. A decimal number with a unit suffix, such as "10s", "10m", "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h"; there is no day unit, so a value such as "10d" is rejected. Internally uses [time.ParseDuration](https://golang.org/pkg/time/#ParseDuration).
* `ROLLER_CHECK_DELAY` [`int`]: Time, in seconds, between checks of ASG status. **Deprecated**, use `ROLLER_INTERVAL`. If both `ROLLER_CHECK_DELAY` and `ROLLER_INTERVAL` are specified then `ROLLER_INTERVAL` is used.
* `ROLLER_CAN_INCREASE_MAX` [`bool`, default: `false`]: If set to `true`, will increase the ASG maximum size to accommodate the increase in desired count. If set to `false`, will instead error when desired is higher than max.
* `ROLLER_ORIGINAL_DESIRED_ON_TAG` [`bool`, default: `false`]: If set to `true`, will store the original desired value of the ASG as a tag on the ASG, with the key `aws-asg-roller/OriginalDesired`. This helps maintain state in the situation where the process terminates.
* `ROLLER_VERBOSE` [`bool`, default: `false`]: If set to `true`, will increase verbosity of logs.
* `KUBECONFIG` [`string`]: Path to kubernetes config file for authenticating to the kubernetes cluster. Required only if `ROLLER_KUBERNETES` is `true` and we are not operating in a kubernetes cluster.
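
The precedence between `ROLLER_INTERVAL` and the deprecated `ROLLER_CHECK_DELAY` can be illustrated with a minimal sketch. The helper `resolveInterval` below is hypothetical and is not taken from the ASG Roller source; it only assumes the behavior described in the list above (interval wins, check delay is whole seconds, 30s is the fallback).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// resolveInterval is a hypothetical helper, not part of aws-asg-roller.
// It applies the documented precedence: ROLLER_INTERVAL wins over the
// deprecated ROLLER_CHECK_DELAY, and 30s is used when neither is set.
func resolveInterval(interval, checkDelay string) (time.Duration, error) {
	if interval != "" {
		// time.ParseDuration accepts "ns", "us", "ms", "s", "m", "h" only.
		return time.ParseDuration(interval)
	}
	if checkDelay != "" {
		// The deprecated variable is a plain integer number of seconds.
		secs, err := strconv.Atoi(checkDelay)
		if err != nil {
			return 0, fmt.Errorf("invalid ROLLER_CHECK_DELAY %q: %v", checkDelay, err)
		}
		return time.Duration(secs) * time.Second, nil
	}
	return 30 * time.Second, nil
}

func main() {
	d, err := resolveInterval(os.Getenv("ROLLER_INTERVAL"), os.Getenv("ROLLER_CHECK_DELAY"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("roller interval:", d)
}
```

Running it with `ROLLER_INTERVAL=10d` exits with an error, because `time.ParseDuration` does not recognize a day unit.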

## Interaction with cluster-autoscaler

[cluster-autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) is a tool commonly used to automatically adjust the size of a Kubernetes cluster. However, conflicts can arise (see [#19](https://github.com/deitch/aws-asg-roller/issues/19) for more details) between cluster-autoscaler and aws-asg-roller when both try to manage the ASG. aws-asg-roller works around this by annotating all managed nodes with `cluster-autoscaler.kubernetes.io/scale-down-disabled` whenever a rolling update is required.
[cluster-autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) is a tool commonly used to automatically adjust the size of a Kubernetes cluster. However, conflicts can arise (see [#19](https://github.com/deitch/aws-asg-roller/issues/19) for more details) between cluster-autoscaler and aws-asg-roller when both try to manage the ASG. aws-asg-roller works around this by annotating all managed nodes with `cluster-autoscaler.kubernetes.io/scale-down-disabled` whenever a rolling update is required.

The general flow can be summarized as follows:
* Check if any nodes in the ASG need to be updated.
* If there are nodes that need to be updated, annotate all up-to-date or new nodes with `cluster-autoscaler.kubernetes.io/scale-down-disabled`
* If there are nodes that need to be updated, annotate all up-to-date or new nodes with `cluster-autoscaler.kubernetes.io/scale-down-disabled`
* Update asg to spin up a new node before draining any old nodes.
* Sleep and repeat (i.e. annotate new unutilized node to prevent it from being scaled-down).
* If all nodes are up-to-date, remove `cluster-autoscaler.kubernetes.io/scale-down-disabled` if any from all the nodes - i.e. normal cluster-autoscaler management resumes.
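
The annotation part of this flow can be sketched as a pure function over an in-memory node list. This is a simplified illustration, not the roller's actual code: the `node` type and `reconcileAnnotations` are hypothetical, and a real implementation would patch nodes through the Kubernetes API rather than mutate a map.

```go
package main

import "fmt"

const scaleDownDisabled = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// node is a hypothetical stand-in for a Kubernetes node managed by the roller.
type node struct {
	Name        string
	UpToDate    bool
	Annotations map[string]string
}

// reconcileAnnotations is a hypothetical helper. While any node is outdated,
// it marks every up-to-date node as protected from scale-down; once all nodes
// are up to date, it removes the annotation everywhere. It reports whether a
// rolling update is still in progress.
func reconcileAnnotations(nodes []*node) bool {
	rolling := false
	for _, n := range nodes {
		if !n.UpToDate {
			rolling = true
			break
		}
	}
	for _, n := range nodes {
		if n.Annotations == nil {
			n.Annotations = map[string]string{}
		}
		switch {
		case rolling && n.UpToDate:
			n.Annotations[scaleDownDisabled] = "true"
		case !rolling:
			delete(n.Annotations, scaleDownDisabled)
		}
	}
	return rolling
}

func main() {
	nodes := []*node{
		{Name: "old-1", UpToDate: false},
		{Name: "new-1", UpToDate: true},
	}
	fmt.Println("rolling:", reconcileAnnotations(nodes))
	for _, n := range nodes {
		fmt.Println(n.Name, n.Annotations)
	}
}
```

Calling the function once per roller interval matches the "sleep and repeat" step: a newly launched node gets annotated on the next pass, and the annotation is cleared on the first pass where no outdated node remains.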
14 changes: 7 additions & 7 deletions aws.go
@@ -12,10 +12,10 @@ import (
"log"
)

func setAsgDesired(svc autoscalingiface.AutoScalingAPI, asg *autoscaling.Group, count int64) error {
func setAsgDesired(svc autoscalingiface.AutoScalingAPI, asg *autoscaling.Group, count int64, canIncreaseMax, verbose bool) error {
if count > *asg.MaxSize {
if canIncreaseMax {
err := setAsgMax(svc, asg, count)
err := setAsgMax(svc, asg, count, verbose)
if err != nil {
return err
}
@@ -43,17 +43,17 @@ func setAsgDesired(svc autoscalingiface.AutoScalingAPI, asg *autoscaling.Group,
default:
return fmt.Errorf("%s - unexpected and unknown AWS error: %v", errMsg, aerr.Error())
}
} else {
return fmt.Errorf("%s - unexpected and unknown non-AWS error: %v", errMsg, err.Error())
}

return fmt.Errorf("%s - unexpected and unknown non-AWS error: %v", errMsg, err.Error())
}
if verbose {
log.Printf("increased ASG %s desired count to %d", *asg.AutoScalingGroupName, count)
}
return nil
}

func setAsgMax(svc autoscalingiface.AutoScalingAPI, asg *autoscaling.Group, count int64) error {
func setAsgMax(svc autoscalingiface.AutoScalingAPI, asg *autoscaling.Group, count int64, verbose bool) error {
if verbose {
log.Printf("increasing ASG %s max size to %d to accommodate desired count", *asg.AutoScalingGroupName, count)
}
Expand All @@ -72,9 +72,9 @@ func setAsgMax(svc autoscalingiface.AutoScalingAPI, asg *autoscaling.Group, coun
default:
return fmt.Errorf("%s - unexpected and unknown AWS error: %v", errMsg, aerr.Error())
}
} else {
return fmt.Errorf("%s - unexpected and unknown non-AWS error: %v", errMsg, err.Error())
}

return fmt.Errorf("%s - unexpected and unknown non-AWS error: %v", errMsg, err.Error())
}
if verbose {
log.Printf("increased ASG %s max size to %d to accommodate desired count", *asg.AutoScalingGroupName, count)
37 changes: 19 additions & 18 deletions aws_internal_test.go
@@ -309,24 +309,24 @@ func TestAwsSetAsgDesired(t *testing.T) {
canIncreaseMax bool
setErr error
err error
verbose bool
}{
{3, 3, true, nil, nil},
{2, 2, true, nil, nil},
{15, 15, true, awserr.New(autoscaling.ErrCodeResourceContentionFault, "", nil), fmt.Errorf("unable to increase ASG mygroup desired count to 15 - ResourceContention")},
{1, 1, true, awserr.New("testabc", "", nil), fmt.Errorf("unable to increase ASG mygroup desired count to 1 - unexpected and unknown AWS error")},
{25, 25, true, fmt.Errorf("testabc"), fmt.Errorf("unable to increase ASG mygroup desired count to 25 - unexpected and unknown non-AWS error")},
{31, 30, false, nil, fmt.Errorf("unable to increase ASG mygroup desired size to 31 as greater than max size 30")},
{31, 30, true, nil, nil},
{3, 3, true, nil, nil, false},
{2, 2, true, nil, nil, false},
{15, 15, true, awserr.New(autoscaling.ErrCodeResourceContentionFault, "", nil), fmt.Errorf("unable to increase ASG mygroup desired count to 15 - ResourceContention"), false},
{1, 1, true, awserr.New("testabc", "", nil), fmt.Errorf("unable to increase ASG mygroup desired count to 1 - unexpected and unknown AWS error"), false},
{25, 25, true, fmt.Errorf("testabc"), fmt.Errorf("unable to increase ASG mygroup desired count to 25 - unexpected and unknown non-AWS error"), false},
{31, 30, false, nil, fmt.Errorf("unable to increase ASG mygroup desired size to 31 as greater than max size 30"), false},
{31, 30, true, nil, nil, false},
}
for i, tt := range tests {
asg := &autoscaling.Group{
AutoScalingGroupName: &groupName,
MaxSize: &tt.max,
}
canIncreaseMax = tt.canIncreaseMax
err := setAsgDesired(&mockAsgSvc{
err: tt.setErr,
}, asg, tt.desired)
}, asg, tt.desired, tt.canIncreaseMax, tt.verbose)
switch {
case (err == nil && tt.err != nil) || (err != nil && tt.err == nil) || (err != nil && tt.err != nil && !strings.HasPrefix(err.Error(), tt.err.Error())):
t.Errorf("%d: Mismatched error, actual then expected", i)
@@ -339,23 +339,24 @@ func TestAwsSetAsgDesired(t *testing.T) {
func TestAwsSetAsgMax(t *testing.T) {
groupName := "mygroup"
tests := []struct {
max int64
setErr error
err error
max int64
setErr error
err error
verbose bool
}{
{3, nil, nil},
{2, nil, nil},
{15, awserr.New(autoscaling.ErrCodeResourceContentionFault, "", nil), fmt.Errorf("unable to increase ASG mygroup max size to 15 - ResourceContention")},
{1, awserr.New("testabc", "", nil), fmt.Errorf("unable to increase ASG mygroup max size to 1 - unexpected and unknown AWS error: testabc")},
{25, fmt.Errorf("testabc"), fmt.Errorf("unable to increase ASG mygroup max size to 25 - unexpected and unknown non-AWS error: testabc")},
{3, nil, nil, false},
{2, nil, nil, false},
{15, awserr.New(autoscaling.ErrCodeResourceContentionFault, "", nil), fmt.Errorf("unable to increase ASG mygroup max size to 15 - ResourceContention"), false},
{1, awserr.New("testabc", "", nil), fmt.Errorf("unable to increase ASG mygroup max size to 1 - unexpected and unknown AWS error: testabc"), false},
{25, fmt.Errorf("testabc"), fmt.Errorf("unable to increase ASG mygroup max size to 25 - unexpected and unknown non-AWS error: testabc"), false},
}
for i, tt := range tests {
asg := &autoscaling.Group{
AutoScalingGroupName: &groupName,
}
err := setAsgMax(&mockAsgSvc{
err: tt.setErr,
}, asg, tt.max)
}, asg, tt.max, tt.verbose)
switch {
case (err == nil && tt.err != nil) || (err != nil && tt.err == nil) || (err != nil && tt.err != nil && !strings.HasPrefix(err.Error(), tt.err.Error())):
t.Errorf("%d: Mismatched error, actual then expected", i)
18 changes: 18 additions & 0 deletions configs.go
@@ -0,0 +1,18 @@
package main

import "time"

// Configs struct deals with env configuration
type Configs struct {
	Interval             time.Duration `env:"ROLLER_INTERVAL" envDefault:"30s"`
	CheckDelay           int           `env:"ROLLER_CHECK_DELAY" envDefault:"30"`
	Drain                bool          `env:"ROLLER_DRAIN" envDefault:"true"`
	DrainForce           bool          `env:"ROLLER_DRAIN_FORCE" envDefault:"true"`
	IncreaseMax          bool          `env:"ROLLER_CAN_INCREASE_MAX" envDefault:"false"`
	IgnoreDaemonSets     bool          `env:"ROLLER_IGNORE_DAEMONSETS" envDefault:"true"`
	DeleteLocalData      bool          `env:"ROLLER_DELETE_LOCAL_DATA" envDefault:"false"`
	OriginalDesiredOnTag bool          `env:"ROLLER_ORIGINAL_DESIRED_ON_TAG" envDefault:"false"`
	ASGS                 []string      `env:"ROLLER_ASG,required" envSeparator:","`
	KubernetesEnabled    bool          `env:"ROLLER_KUBERNETES" envDefault:"true"`
	Verbose              bool          `env:"ROLLER_VERBOSE" envDefault:"false"`
}
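
The struct tags above declare three behaviors: a required variable, a default for unset variables, and a separator for list values. The following standard-library sketch shows roughly what those declarations mean. It is an approximation for illustration only, not how `github.com/caarlos0/env` is implemented; `parseList` and `parseBool` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseList is a hypothetical helper approximating `env:"NAME,required"`
// combined with `envSeparator:","`: the variable must be set, and its value
// is split on the separator.
func parseList(name, raw string, set bool) ([]string, error) {
	if !set {
		return nil, fmt.Errorf("required environment variable %q is not set", name)
	}
	return strings.Split(raw, ","), nil
}

// parseBool is a hypothetical helper approximating `envDefault` on a bool
// field: the default applies only when the variable is unset.
func parseBool(raw string, set bool, def bool) (bool, error) {
	if !set {
		return def, nil
	}
	return strconv.ParseBool(raw)
}

func main() {
	rawASG, asgSet := os.LookupEnv("ROLLER_ASG")
	asgs, err := parseList("ROLLER_ASG", rawASG, asgSet)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	rawDrain, drainSet := os.LookupEnv("ROLLER_DRAIN")
	drain, err := parseBool(rawDrain, drainSet, true)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("ASGs:", asgs, "drain:", drain)
}
```

Declaring this once in struct tags keeps defaults and requirements next to the fields they govern, instead of scattering `os.Getenv` calls and default handling through the program.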
9 changes: 6 additions & 3 deletions go.mod
@@ -3,8 +3,8 @@ module github.com/deitch/aws-asg-roller
go 1.12

require (
github.com/alexkohler/nakedret v1.0.0 // indirect
github.com/aws/aws-sdk-go v1.21.8
github.com/caarlos0/env/v6 v6.6.0
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/go-log/log v0.2.0 // indirect
github.com/gogo/protobuf v0.0.0-20170330071051-c0656edd0d9e // indirect
@@ -13,19 +13,22 @@ require (
github.com/googleapis/gnostic v0.0.0-20170729233727-0c5108395e2d // indirect
github.com/gregjones/httpcache v0.0.0-20170728041850-787624de3eb7 // indirect
github.com/imdario/mergo v0.3.6 // indirect
github.com/kr/pretty v0.1.0 // indirect
github.com/modern-go/reflect2 v0.0.0-20180701023420-4b7aa43c6742 // indirect
github.com/openshift/kubernetes-drain v0.0.0-20180831174519-c2e51be1758e
github.com/peterbourgon/diskv v2.0.1+incompatible // indirect
github.com/spf13/pflag v1.0.3 // indirect
github.com/stretchr/testify v1.3.0 // indirect
github.com/stretchr/testify v1.7.0
golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550 // indirect
golang.org/x/net v0.0.0-20200226121028-0de0cce0169b // indirect
golang.org/x/oauth2 v0.0.0-20170412232759-a6bd8cefa181 // indirect
golang.org/x/sync v0.0.0-20190911185100-cd5d95a43a6e // indirect
golang.org/x/time v0.0.0-20161028155119-f51c12702a4d // indirect
google.golang.org/appengine v1.3.0 // indirect
gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15 // indirect
gopkg.in/inf.v0 v0.9.0 // indirect
k8s.io/api v0.0.0-20181004124137-fd83cbc87e76
k8s.io/apimachinery v0.0.0-20180913025736-6dd46049f395
k8s.io/client-go v9.0.0+incompatible
k8s.io/kube-openapi v0.0.0-20190426233423-c5d3b0f4bee0 // indirect
mvdan.cc/unparam v0.0.0-20200314162735-0ac8026f7d06 // indirect
)