
Add deleteLocalData flag, basic logging, and cluster-autoscaler annotation #23

Merged 5 commits into deitch:master on Oct 22, 2019

Conversation

@eterna2 (Contributor) commented on Oct 16, 2019

Quick workaround for #19 and #21.

Originally, I wanted to refine the fix a bit (use proper logging, handle the node annotation better, etc.), but I just couldn't find the time, so here is the PR. It is not ideal, but it is working in my cluster.

Changes

  • add a deleteLocalData flag for node draining, because some of my pods use local data volume mounts
  • add some basic logging, which I needed to debug and monitor the aws-asg-roller state
  • add a basic function to set a node annotation in Kubernetes that prevents cluster-autoscaler from scaling down the nodes, and remove the annotation when the update is done

Important Note

Summary of how the annotation is handled:

  • if the ASG needs an update, annotate all old nodes with the flag to prevent scale-down
  • if the ASG does not need an update, remove the flag from all nodes in the ASG

Hence, if aws-asg-roller is improperly killed or is removed without a restart or redeployment, the old instances will be left annotated with the no-scale-down flag!
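
For reference, here is a minimal sketch (not the PR's actual code) of what the annotation toggle could look like with client-go. It assumes the standard cluster-autoscaler annotation cluster-autoscaler.kubernetes.io/scale-down-disabled and a hypothetical helper name; the real implementation may differ.

```go
package roller // hypothetical package name, for illustration only

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Annotation recognised by cluster-autoscaler to skip a node during scale-down.
const scaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// setScaleDownDisabled adds or removes the no-scale-down annotation on a node,
// mirroring the behaviour described above: annotate old nodes while an update
// is in progress, and strip the annotation once the ASG no longer needs one.
func setScaleDownDisabled(ctx context.Context, client kubernetes.Interface, nodeName string, disabled bool) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	if disabled {
		node.Annotations[scaleDownDisabledAnnotation] = "true"
	} else {
		delete(node.Annotations, scaleDownDisabledAnnotation)
	}
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```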

@eterna2 force-pushed the william.teo/autoscaler branch from ce1dd71 to eccc757 on October 16, 2019 10:05
@deitch (Owner) commented on Oct 20, 2019

Thanks @eterna2; looking right now...

@deitch (Owner) left a comment

Overall pretty good, thank you for doing this. As you said, nice and quick, but that doesn't make it bad. The logging could be better overall, but again, not something to fix in this PR.

I made a minor suggestion or two and asked a question; could you please also add documentation to the README covering:

  • the new env var that is supported and why
  • how the interaction between CAPI and this tool works

Thanks!

@deitch (Owner) commented on Oct 20, 2019

Also, you should rebase, as I just switched it over to GitHub Actions.

@eterna2 force-pushed the william.teo/autoscaler branch from ba786c6 to 623d378 on October 20, 2019 16:05
@eterna2 (Contributor, Author) commented on Oct 20, 2019

@deitch

Updated as requested.

Just to add:
There is still an issue somewhere, most likely a logic issue; I haven't had time to think it over properly. You might want to look at it if you have time.

Based on my observations, depending on the timing of the check, you might end up with desired == original while the number of nodes needing an update is > 0 and no new nodes are in the process of being created.

This results in an infinite loop, because no new nodes will be created and no nodes will be terminated.

I suspect a count somewhere goes wrong when cluster-autoscaler creates even more nodes due to demand, or when some nodes are in the process of being scaled down while the check is being done.

I have to restart aws-asg-roller to get the update running again.

@eterna2 (Contributor, Author) commented on Oct 20, 2019

Forgot to add: I made some additional changes. I changed the ROLLER_DELETE_LOCAL_DATA default to false to maintain backward compatibility (instead of true, which I am currently using for my cluster).
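
For anyone skimming later, the intent is roughly the following (a sketch only, not the roller's actual config code; the helper name is made up): the flag is read from the environment and treated as false unless explicitly set to a true value.

```go
package roller // hypothetical package name, for illustration only

import (
	"os"
	"strconv"
)

// deleteLocalData reports whether node draining should also delete pods that
// use local data (emptyDir) volumes. The default is false, so existing
// deployments keep their current behaviour unless the env var is set.
func deleteLocalData() bool {
	v, err := strconv.ParseBool(os.Getenv("ROLLER_DELETE_LOCAL_DATA"))
	if err != nil {
		return false
	}
	return v
}
```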

@deitch (Owner) commented on Oct 21, 2019

This looks pretty good. Thanks for the updates. Ready to merge it in?

There is still an issue somewhere, most likely a logic issue; I haven't had time to think it over properly.

Does this block us?

@eterna2 (Contributor, Author) commented on Oct 21, 2019

Nope. It is working but might fail in edge cases.

It will require more work.

The issue is not directly related to the new features. It is more that this PR cannot completely resolve all the issues between aws-asg-roller and cluster-autoscaler, because the current update strategy does not account for ASG modifications made by cluster-autoscaler.

@deitch (Owner) commented on Oct 22, 2019

OK, I will hold off then.

@eterna2 (Contributor, Author) commented on Oct 22, 2019

Erm, I won't be working on it for the moment though.

What I mean is that this PR

  • adds support for the deleteLocalData flag for node draining, and
  • solves the issue of cluster-autoscaler spinning down the nodes brought up by aws-asg-roller.

I don't think I have the bandwidth to completely solve the edge cases with CAS. It should probably be done in a separate PR by someone?

@deitch (Owner) commented on Oct 22, 2019

I don't think I have the bandwidth to completely solve the edge cases with CAS

Heh, me neither for now. I am hoping to have more OSS time by the end of the year. I am bogged down on diskfs as well. I really need extra days in my week.

It should probably be done in a separate PR by someone?

I am fine with that. If you are comfortable that this PR is an improvement over the current state without solving everything, I am fine merging it in and then treating the rest as a separate issue.

Your thoughts?

@eterna2 (Contributor, Author) commented on Oct 22, 2019

Yup, we can do that. Let's raise an issue for that.

@deitch merged commit 685b313 into deitch:master on Oct 22, 2019
@deitch (Owner) commented on Oct 22, 2019

Done. Now we find out if the GitHub Actions deploy process to Docker Hub actually works...

@eterna2 (Contributor, Author) commented on Oct 23, 2019

I just had a great idea for a quick workaround.

The issue I observe in production is that the roller gets stuck in an infinite loop because the desired and current counts are equal and yet there are no new pending nodes.

The way I work around this in production is to monitor for this state and restart the aws-asg-roller pod when it happens.

We could probably fix this by adding a check for this condition (a rough Go sketch follows):

if (desired == original) and (num_ready_nodes == desired) and (num_old_nodes > 0) and (time since last increase > x minutes) then ++desired
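
Rendered as Go, the check might look something like this (a sketch only; the variable names are illustrative and do not correspond to the roller's actual code, and the threshold is left as a parameter):

```go
package roller // hypothetical package name, for illustration only

import "time"

// shouldBumpDesired reports whether the roller appears stuck: the ASG is back
// at its original size, every node is ready, old nodes still need replacing,
// and nothing has changed for a while. In that case the desired count could
// be incremented to kick the rolling update forward again.
func shouldBumpDesired(desired, original, readyNodes, oldNodes int64, lastIncrease time.Time, stuckAfter time.Duration) bool {
	return desired == original &&
		readyNodes == desired &&
		oldNodes > 0 &&
		time.Since(lastIncrease) > stuckAfter
}
```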

What do you think?

@deitch (Owner) commented on Oct 25, 2019

I don't think I properly understood the problem. :-)

Do you mind opening a new issue for this with a fresh explanation? And then your proposed workaround as a separate comment?

@TwiN mentioned this pull request on Feb 7, 2020