ROLLER_ORIGINAL_DESIRED_ON_TAG is not compatible with cluster-autoscaler #43
Honestly, I'm a bit stumped. I can't think of any way to make this configurable cleanly other than reverting the entire change, as it is now used in every core function. On one hand, it provides a way to survive application restarts; on the other hand, if new nodes are scheduled by cluster-autoscaler while aws-asg-roller tries to get back to the desired amount, it will cause availability issues, which is not a risk I can take. @deitch @outofcoffee do you have any suggestions?
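For context, "on tag" here means persisting the pre-roll desired capacity as a tag on the ASG itself, which is what lets the value survive restarts. A minimal sketch of that idea, assuming aws-sdk-go v1; the tag key is illustrative, not necessarily the one the project uses:

```go
package main

import (
	"strconv"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// saveOriginalDesired persists the pre-roll desired capacity as a tag on the
// ASG itself, so the value survives a restart of the roller process.
// The tag key is illustrative, not necessarily the one the project uses.
func saveOriginalDesired(svc *autoscaling.AutoScaling, asgName string, desired int64) error {
	_, err := svc.CreateOrUpdateTags(&autoscaling.CreateOrUpdateTagsInput{
		Tags: []*autoscaling.Tag{{
			ResourceId:        aws.String(asgName),
			ResourceType:      aws.String("auto-scaling-group"),
			Key:               aws.String("aws-asg-roller/original-desired"),
			Value:             aws.String(strconv.FormatInt(desired, 10)),
			PropagateAtLaunch: aws.Bool(false),
		}},
	})
	return err
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	_ = saveOriginalDesired(svc, "my-asg", 3)
}
```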
Yeah, that's not going to work. The problem is that it's not really possible for aws-asg-roller to know whether a given node was scaled up as a result of its own change to the ASG's desired count or as a result of cluster-autoscaler's change, because nodes are not scaled up instantly; it takes some time. As a result, it's not possible to update the "real" original desired count with the new nodes from cluster-autoscaler, because they can't be differentiated from the ones spun up because of aws-asg-roller. Technically, we could listen to the events created by cluster-autoscaler, but cluster-autoscaler actually takes the nodes spun up by aws-asg-roller into consideration, meaning it may send an event that was in fact caused by aws-asg-roller. What was nice about the previous way of calculating the "original" desired count is that it was recalculated on each run, thus taking cluster-autoscaler's changes into consideration.
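By contrast, here is a hypothetical sketch (not the project's actual code) of the stateless recalculation described above: because the baseline is re-derived from live ASG state on every loop, a concurrent scale-up by cluster-autoscaler simply raises the baseline instead of being rolled back later.

```go
// Hypothetical sketch, not the project's actual code: recompute the
// "original" desired capacity on every run instead of storing it.
// inFlightSurplus is the number of extra nodes the roller itself added
// for replacements still in progress, which it does know about.
func originalDesired(currentDesired, inFlightSurplus int64) int64 {
	// Nodes cluster-autoscaler added since the last run are absorbed into
	// the baseline, because they are not part of the roller's own surplus.
	return currentDesired - inFlightSurplus
}
```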
Exactly. Two distinct systems fighting each other. This works fine when each works in isolation - not every ASG being rolled belongs to a k8s cluster, so not everyone has a separate autoscaler working in parallel (or, more accurately, against it) - but when you put the two together, it is a problem. They each need to find a way to do one of:
The problem is that each just operates on its own. Do you recall how cluster-autoscaler does its own calculations, i.e. whether it stores the "original" count, and how?
@deitch As far as I know, cluster-autoscaler publishes its current status in a configmap named cluster-autoscaler-status in the kube-system namespace, but as far as I can tell it doesn't persist an "original" desired count anywhere. To summarize, CA does this on every scan interval (10 seconds by default):

1. Look for pods that cannot be scheduled and, if any are found, increase the desired count of an ASG that can host them.
2. Look for underutilized nodes whose pods could be rescheduled elsewhere and mark them as unneeded.
3. Scale down nodes that have remained unneeded for the configured amount of time.
The process above varies based on the parameters passed, but that's the gist of it. If CA restarts, the entire process starts over, which is acceptable because the worst-case scenario is that there are too many nodes for a while (scaling up is instant, while scaling down requires the node to remain unneeded for the configured delay).

As of #37, this project is no longer compatible with cluster-autoscaler, because it no longer only scales down old nodes: it can also scale down new nodes in order to get back to its original desired count.
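Incidentally, if listening to CA ever becomes an option, reading the cluster-autoscaler-status configmap mentioned above is straightforward. A minimal sketch, assuming client-go and CA's default configmap name and namespace:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// readCAStatus fetches cluster-autoscaler's published status. The configmap
// name and namespace are CA's defaults; the status text lives under the
// "status" data key.
func readCAStatus(ctx context.Context) (string, error) {
	cfg, err := rest.InClusterConfig() // assumes the roller runs in-cluster
	if err != nil {
		return "", err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return "", err
	}
	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(ctx, "cluster-autoscaler-status", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	return cm.Data["status"], nil
}

func main() {
	status, err := readCAStatus(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Println(status)
}
```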
Hi folks, the original version of the #37 PR had the new behaviour hidden behind a configuration option and retained the original behaviour to preserve existing semantics. See the references to that option in the PR.

If it would help in the interim whilst the approach is figured out, we could do the same again.
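For reference, gating it could be as small as checking the environment variable from the issue title; a sketch, with the accepted values being an assumption:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// originalDesiredOnTagEnabled gates the tag-based behaviour behind an
// environment variable, keeping the stateless recalculation as the default.
// Treating only "true" (case-insensitive) as enabled is an assumption.
func originalDesiredOnTagEnabled() bool {
	return strings.EqualFold(os.Getenv("ROLLER_ORIGINAL_DESIRED_ON_TAG"), "true")
}

func main() {
	fmt.Println("tag-based original desired enabled:", originalDesiredOnTagEnabled())
}
```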
@outofcoffee Ah, I had seen c1e8e98 but assumed that the configurable portion of it had been removed due to added complexity (specifically for testing).
Bit of an update on this, reintroducing
aws-asg-roller fights with cluster-autoscaler when the latter tries to scale up and the former tries to return to the original desired count.
This should not have been enabled by default, and should've been configurable through an environment variable.
After looking at how this was implemented, I don't think making it configurable is going to be a simple task at all.