
Add deleteLocalData flag, basic logging, and cluster-autoscaler annotation #23

Merged 5 commits into deitch:master on Oct 22, 2019

Conversation

@eterna2 (Contributor) commented on Oct 16, 2019

Quick workaround for #19 and #21.

Originally, I wanted to refine the fix a bit (use proper logging, handle the node annotation better, etc.), but I just couldn't find the time, so here is the PR. It is not ideal, but it is working in my cluster.

Changes

  • add a deleteLocalData flag for node draining, because some of my pods use local data volume mounts
  • add some basic logging, which I needed to debug and monitor the aws-asg-roller state
  • add a basic function to set a node annotation in Kubernetes that prevents cluster-autoscaler from scaling down the nodes, and remove the annotation when the update is done

Important Note

Summary of how the annotation is handled:

  • if the ASG needs an update, annotate all old nodes with the flag to prevent scale-down
  • if the ASG does not need an update, remove the flag from all nodes in the ASG

Hence, if aws-asg-roller is improperly killed or is removed without a restart or redeployment, the old instances will be left annotated with the no-scale-down flag!
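
For reference, here is a minimal sketch (not the PR's actual code) of what the annotation toggle could look like with client-go. It assumes the standard cluster-autoscaler annotation cluster-autoscaler.kubernetes.io/scale-down-disabled and a hypothetical helper name; the real implementation may differ.

```go
package roller // hypothetical package name, for illustration only

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Annotation recognised by cluster-autoscaler to skip a node during scale-down.
const scaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// setScaleDownDisabled adds or removes the no-scale-down annotation on a node,
// mirroring the behaviour described above: annotate old nodes while an update
// is in progress, and strip the annotation once the ASG no longer needs one.
func setScaleDownDisabled(ctx context.Context, client kubernetes.Interface, nodeName string, disabled bool) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	if disabled {
		node.Annotations[scaleDownDisabledAnnotation] = "true"
	} else {
		delete(node.Annotations, scaleDownDisabledAnnotation)
	}
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```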

@eterna2 force-pushed the william.teo/autoscaler branch from ce1dd71 to eccc757 on October 16, 2019 10:05
@deitch (Owner) commented on Oct 20, 2019

Thanks @eterna2; looking right now...

@deitch (Owner) left a comment

Overall pretty good, thank you for doing this. As you said, nice and quick, but that doesn't make it bad. The logging could be better overall, but again, not something to fix in this PR.

I made a minor suggestion or two and asked a question; could you please also add documentation to the README covering:

  • the new env var that is supported and why
  • how the interaction between CAPI and this tool works

Thanks!

@deitch (Owner) commented on Oct 20, 2019

Also, you should rebase, as I just switched it over to GitHub Actions.

@eterna2 force-pushed the william.teo/autoscaler branch from ba786c6 to 623d378 on October 20, 2019 16:05
@eterna2 (Contributor, Author) commented on Oct 20, 2019

@deitch

Updated as requested.

Just to add:
There is still an issue somewhere, most likely a logic issue; I haven't had time to think it over properly. You might want to look at it if you have time.

Based on my observations, depending on the timing of the check, you might end up with desired == original while the number of nodes needing an update is > 0 and no new nodes are in the process of being created.

This results in an infinite loop, because no new nodes will be created and no nodes will be terminated.

I suspect a count somewhere goes wrong when cluster-autoscaler creates even more nodes due to demand, or when some nodes are in the process of being scaled down while the check is being done.

I have to restart aws-asg-roller to get the update running again.

@eterna2 (Contributor, Author) commented on Oct 20, 2019

Forgot to add: I made some additional changes. I changed the ROLLER_DELETE_LOCAL_DATA default to false to maintain backward compatibility (instead of true, which I am currently using for my cluster).
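
For anyone skimming later, the intent is roughly the following (a sketch only, not the roller's actual config code; the helper name is made up): the flag is read from the environment and treated as false unless explicitly set to a true value.

```go
package roller // hypothetical package name, for illustration only

import (
	"os"
	"strconv"
)

// deleteLocalData reports whether node draining should also delete pods that
// use local data (emptyDir) volumes. The default is false, so existing
// deployments keep their current behaviour unless the env var is set.
func deleteLocalData() bool {
	v, err := strconv.ParseBool(os.Getenv("ROLLER_DELETE_LOCAL_DATA"))
	if err != nil {
		return false
	}
	return v
}
```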

@deitch (Owner) commented on Oct 21, 2019

This looks pretty good. Thanks for the updates. Ready to merge it in?

There is still an issue somewhere, most likely a logic issue; I haven't had time to think it over properly.

Does this block us?

@eterna2 (Contributor, Author) commented on Oct 21, 2019

Nope. It is working but might fail in edge cases.

It will require more work.

The issue is not directly related to the new features. It is more that this PR cannot completely resolve all the issues between aws-asg-roller and cluster-autoscaler, because the current update strategy does not account for ASG modifications made by cluster-autoscaler.

@deitch (Owner) commented on Oct 22, 2019

OK, I will hold off then.

@eterna2 (Contributor, Author) commented on Oct 22, 2019

Erm, I won't be working on it for the moment though.

What I mean is that this PR

  • adds support for the deleteLocalData flag for node draining, and
  • solves the issue of cluster-autoscaler spinning down the nodes brought up by aws-asg-roller.

I don't think I have the bandwidth to completely solve the edge cases with CAS. It should probably be done in a separate PR by someone?

@deitch (Owner) commented on Oct 22, 2019

I don't think I have the bandwidth to completely solve the edge cases with CAS

Heh, me neither for now. I am hoping to have more OSS time by the end of the year. I am bogged down on diskfs as well. I really need extra days in my week.

It should probably be done in a separate PR by someone?

I am fine with that. If you are comfortable that this PR is an improvement over the current state without solving everything, I am fine merging it in and then treating the rest as a separate issue.

Your thoughts?

@eterna2 (Contributor, Author) commented on Oct 22, 2019

Yup, we can do that. Let's raise an issue for that.

@deitch merged commit 685b313 into deitch:master on Oct 22, 2019
@deitch (Owner) commented on Oct 22, 2019

Done. Now we find out if the GitHub Actions deploy process to Docker Hub actually works...

@eterna2 (Contributor, Author) commented on Oct 23, 2019

I just had a great idea for a quick workaround.

The issue I observe in production is that the roller gets stuck in an infinite loop because the desired and current counts are equal and yet there are no new pending nodes.

The way I work around this in production is to monitor for this state and restart the aws-asg-roller pod when it happens.

We could probably fix this by adding a check for this condition (a rough Go sketch follows):

if (desired == original) and (num_ready_nodes == desired) and (num_old_nodes > 0) and (time since last increase > x minutes) then ++desired
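
Rendered as Go, the check might look something like this (a sketch only; the variable names are illustrative and do not correspond to the roller's actual code, and the threshold is left as a parameter):

```go
package roller // hypothetical package name, for illustration only

import "time"

// shouldBumpDesired reports whether the roller appears stuck: the ASG is back
// at its original size, every node is ready, old nodes still need replacing,
// and nothing has changed for a while. In that case the desired count could
// be incremented to kick the rolling update forward again.
func shouldBumpDesired(desired, original, readyNodes, oldNodes int64, lastIncrease time.Time, stuckAfter time.Duration) bool {
	return desired == original &&
		readyNodes == desired &&
		oldNodes > 0 &&
		time.Since(lastIncrease) > stuckAfter
}
```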

What do you think?

@deitch (Owner) commented on Oct 25, 2019

I don't think I properly understood the problem. :-)

Do you mind opening a new issue for this with a fresh explanation? And then your proposed workaround as a separate comment?

@TwiN mentioned this pull request on Feb 7, 2020