Add deleteLocalData flag, basic logging, and cluster-autoscaler annotation #23
Conversation
Force-pushed from ce1dd71 to eccc757
Thanks @eterna2; looking right now...
Overall pretty good, thank you for doing this. As you said, nice and quick, but that doesn't make it bad. The logging could be better overall, but again, not something to fix in this PR.
I made a minor suggestion or two and asked a question. Also, could you please add documentation to the README covering:
- the new env var supported and why
- how the interaction between CAPI and this works

Thanks!
Also, you should rebase, as I just switched the repo over to GitHub Actions.
Force-pushed from ba786c6 to 623d378
Updated as requested. Just to add: based on my observations, depending on the timing of the check, you might end up with a state where the desired and current counts are equal yet no new nodes are pending. This results in an infinite loop, because no new nodes will be created and no nodes will be terminated. I suspect some count somewhere goes wrong, either when cluster-autoscaler creates even more nodes due to demand, or when some nodes are in the process of being scaled down while the check is being done. I have to restart aws-asg-roller to get the update running again.
Forgot to add: I made some additional changes. Changed …
This looks pretty good. Thanks for the updates. Ready to merge it in?
Does this block us?
Nope. It is working, but it might fail in edge cases and will require more work. The issue is not directly related to the new features; it's more that this PR is unable to completely resolve all the issues between aws-asg-roller and cluster-autoscaler, because the current update strategy does not account for ASG modifications made by cluster-autoscaler.
OK, I will hold off then.
Erm, I won't be working on it for the moment, though. What I mean is that this PR …
I don't think I have the bandwidth to completely solve the edge cases with CAS, and that should probably be done in a separate PR by someone?
Heh, me neither for now. Hoping to have more OSS time by end of year. I am bogged down on diskfs as well. I really need extra days in my week.
I am fine with that. If you are comfortable that this PR is an improvement over the current state, without solving everything, I am fine merging it in and then treating that as a separate issue. Your thoughts?
Yup, we can do that. Raise an issue on that.
Done. Now we find out if the GitHub Actions deploy process to Docker Hub actually works...
I just had a great idea for a quick workaround. The issue I observe in production is that the roller gets stuck in an infinite loop because the desired and current counts are equal and yet there are no new pending nodes. The way I work around this in production is to monitor for that state and restart the aws-asg-roller pod when it happens. We could probably fix this by having a check for this condition:
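Something like this minimal sketch, assuming the roller tracks the ASG's desired/current counts and its pending replacement nodes (isRollStuck and every identifier here are placeholders, not the actual aws-asg-roller names):

```go
// isRollStuck reports the stuck state observed in production: the ASG
// says desired == current, yet no replacement nodes are pending, so the
// roller would otherwise wait forever for capacity that never arrives.
func isRollStuck(desired, current int64, pendingNodes int) bool {
	return desired == current && pendingNodes == 0
}
```

When the check fires, the roller could log the state and re-issue its desired-capacity adjustment instead of waiting, which is effectively what the manual pod restart achieves today.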
What do you think?
I don't think I properly understood the problem. :-) Do you mind opening a new issue for this with a fresh explanation? And then adding your proposed workaround as a separate comment?
Quick workaround for #19 and #21.
Originally, I wanted to refine the fix a bit (use proper logging, handle the node annotation better, etc.), but I just couldn't find the time, so here is the PR - not ideal, but it is working in my cluster.
Changes
- `deleteLocalData` flag for node draining, because some of my pods use the local data volume mount (see the sketch below)
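A minimal sketch of how such a flag could feed into node draining, assuming the k8s.io/kubectl/pkg/drain helper and an illustrative `ROLLER_DELETE_LOCAL_DATA` env var (neither is confirmed to be exactly what this PR uses):

```go
package roller

import (
	"os"
	"strconv"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// newDrainHelper builds a drain helper whose local-data behaviour is
// toggled by an env var; ROLLER_DELETE_LOCAL_DATA is an assumed name.
func newDrainHelper(client kubernetes.Interface) *drain.Helper {
	// When true, drain will also evict pods that use emptyDir (local
	// data) volumes; without it, such pods block the drain entirely.
	deleteLocalData, _ := strconv.ParseBool(os.Getenv("ROLLER_DELETE_LOCAL_DATA"))
	return &drain.Helper{
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		// This field was named DeleteLocalData in older kubectl releases.
		DeleteEmptyDirData: deleteLocalData,
		Timeout:            5 * time.Minute,
		Out:                os.Stdout,
		ErrOut:             os.Stderr,
	}
}
```

Draining then amounts to `drain.RunNodeDrain(helper, nodeName)` after cordoning the node.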
Important Note

Summary of how the annotation is handled:
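Presumably this refers to the documented `cluster-autoscaler.kubernetes.io/scale-down-disabled` annotation, which stops cluster-autoscaler from scaling down a node. A minimal sketch with client-go, assuming that annotation; the function name and flow are illustrative, not taken from this PR:

```go
package roller

import (
	"context"
	"strconv"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// The cluster-autoscaler annotation that prevents a node from being
// considered for scale-down while the roller is replacing nodes.
const scaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// setScaleDownDisabled toggles the annotation on a node.
func setScaleDownDisabled(ctx context.Context, client kubernetes.Interface, nodeName string, disabled bool) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[scaleDownDisabledAnnotation] = strconv.FormatBool(disabled)
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

The annotation would be set before the roll starts and cleared (or restored to its prior value) once the node is terminated or the roll completes.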