
How do I gain access to the cluster autoscaler on GKE? #966

Closed
chrissound opened this issue Jun 14, 2018 · 20 comments
Labels
area/cluster-autoscaler, area/provider/gcp

Comments

@chrissound

I am looking to modify some of the auto scaling options, but this does not seem to be possible on GKE?

It's not clear where to set the 'flags' mentioned in the FAQ, or even which component these command-line flags should be passed to.

Similar issue is brought up here:
https://stackoverflow.com/questions/48963625/where-to-config-the-kubernetes-cluster-autoscaler-on-google-cloud

@aleksandra-malinowska added the area/cluster-autoscaler and area/provider/gcp labels on Jun 14, 2018
@aleksandra-malinowska
Contributor

You're correct. On GKE, Cluster Autoscaler is always configured automatically. If you run your own cluster on GCE and have access to the master machine, you can change them in the Cluster Autoscaler pod's manifest.
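
For anyone who does run their own CA on GCE, here is a minimal sketch of what such an edit might look like; the manifest path, image tag, and flag values are assumptions and will vary with how the cluster was brought up:

# Hypothetical fragment of the Cluster Autoscaler static pod manifest on a
# self-managed GCE master, e.g. /etc/kubernetes/manifests/cluster-autoscaler.manifest
# (path is an assumption). The kubelet restarts the pod when this file changes.
spec:
  containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.3  # example tag
    command:
    - ./cluster-autoscaler
    - --cloud-provider=gce
    - --scale-down-unneeded-time=5m         # example: shorten the 10m default
    - --skip-nodes-with-local-storage=false # example of another commonly tuned flag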

@chrissound
Author

Thanks!

@joshwand

joshwand commented Dec 7, 2018

It would be good to be able to do some configuration of CA on GKE. As in the referenced issue, I'd like to reduce --scale-down-unneeded-time so as not to waste money on 10 minutes of unneeded capacity.

@glapark

glapark commented Jan 7, 2019

I wonder if emptying a node completely helps the autoscaler to quickly remove it. For interactive analytic applications, the default value of 10 minutes for --scale-down-unneeded-time seems too large.

@aleksandra-malinowska
Contributor

> I wonder if emptying a node completely helps the autoscaler to quickly remove it. For interactive analytic applications, the default value of 10 minutes for --scale-down-unneeded-time seems too large.

It helps by eliminating drain time, and also increases throughput by allowing bulk deletes.

As for the default 10-minute wait, it's a compromise of sorts: we don't want users to end up waiting for nodes to be added back because we removed them too quickly between jobs. That being said, we haven't revised this value in a while, so if you have any feedback regarding this behavior, especially production experience with it, please let us know.

@glapark

glapark commented Jan 7, 2019

Thanks for the reply. At the moment, we are still implementing a new service and don't have any production-level experience with it yet (but will publish the result when it is ready).

@joshwand

joshwand commented Jan 7, 2019

We spin up expensive high-memory instances on demand as slaves for our integration tests. The load is intermittent, so that extra 10 minutes for 10-30 instances, multiple times a day, gets quite expensive.

@glapark

glapark commented Jun 28, 2019

I wonder if there is any update on the default value of --scale-down-unneeded-time. I think the default of 10 minutes is fine, but I hope GKE will let users change the value for their own clusters; anyone who sets --scale-down-unneeded-time to a new value should understand what that change actually means.

In our case, we would like to implement autoscaling logic for an analytics system based on Apache Hive, and to remove nodes as soon as possible once that logic decides to retire them.

@Luke-Vear

It would be nice to be able to configure things like skip-nodes-with-system-pods or skip-nodes-with-local-storage; there's tons of config that we can't touch.

@MaciekPytel
Contributor

You can now choose a predefined config for more aggressive scale-down: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#autoscaling_profiles (this doesn't help with the flags you listed, but it is what was requested in the comments above).
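
For reference, switching profiles is a single cluster update; the cluster name and zone below are placeholders, and on older gcloud releases the flag may only be available under gcloud beta:

# Switch an existing GKE cluster to the more aggressive scale-down profile.
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --autoscaling-profile optimize-utilization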

@seeruk

seeruk commented Jul 15, 2020

I've been using the optimize-utilization profile, and unfortunately, as you've said, it doesn't solve this issue. Linkerd, like Istio, creates an emptyDir volume on every pod that has a sidecar injected, which blocks the cluster autoscaler from scaling down nodes for pretty much every application we have in the cluster.

The current workaround I've had to resort to is this: #3322

The other solution I've been considering, so we don't have to maintain a fork of the autoscaler, is building some kind of admission controller that adds the safe-to-evict annotation to every pod unless an annotation (unsafe-to-evict?) is present, since in the cluster I'm working with, local storage should be an extremely exceptional scenario. Using PDBs for the kube-system pods is fine; I'd rather know that those pods are being migrated gracefully.

Being able to just configure the GKE autoscaler would completely solve this though. Perhaps configuration could be exposed in a ConfigMap instead, allowing the solution to be more platform agnostic.
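
For anyone else hitting this, a minimal sketch of the annotation in question on a pod template; the Deployment name and image are placeholders:

# Tell Cluster Autoscaler this pod may be evicted during scale-down even though
# it mounts an emptyDir (e.g. a Linkerd/Istio sidecar volume).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # placeholder
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
      - name: app
        image: nginx:1.25      # placeholder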

@adinhodovic

adinhodovic commented Aug 22, 2020

Also, I'd like to be able to monitor the cluster autoscaler using Prometheus.
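
This only works if you run cluster-autoscaler yourself (the GKE-managed one runs on Google's control plane, so there is nothing to scrape), but for a self-hosted CA a minimal Prometheus scrape sketch might look like this; the namespace, pod label, and metrics port (the default --address=:8085) are assumptions:

# prometheus.yml fragment: scrape a self-hosted cluster-autoscaler's /metrics endpoint.
scrape_configs:
  - job_name: cluster-autoscaler
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [kube-system]              # adjust to wherever your CA pod runs
    relabel_configs:
      # keep only pods labelled app=cluster-autoscaler (label name is an assumption)
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: cluster-autoscaler
        action: keep
      # point the scrape at the metrics port (default --address=:8085)
      - source_labels: [__meta_kubernetes_pod_ip]
        replacement: "$1:8085"
        target_label: __address__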

@danielyaa5

Not sure why this is closed; I guess Google doesn't prioritize features that save people money.

@zenyui

zenyui commented Jan 6, 2022

@seeruk Curious, did you solve this? Did you end up writing that admission controller? Funny, I was thinking of writing the same thing.

@seeruk

seeruk commented Jan 7, 2022

Yeah, it was a really simple one in the end and it's still working to this day! Unfortunately it's closed source currently.

@MaciekPytel
Contributor

In 1.22+, GKE no longer blocks scale-down on pods with local storage (https://cloud.google.com/kubernetes-engine/docs/release-notes#October_27_2021), so an admission controller may no longer be needed.

@zenyui

zenyui commented Jan 11, 2022

I just open-sourced our pod labeler in case it's useful for anyone. You can use it to add the safe-to-evict annotation. @seeruk let me know your thoughts!

https://github.com/troop-dev/k8s-pod-labeler

@vadasambar
Member

I use a custom cluster-autoscaler on GKE to test my PRs.

If anyone's interested, I wrote a blog post on how to deploy your own cluster-autoscaler on GKE: https://vadasambar.com/post/kubernetes/how-to-deploy-custom-ca-on-gcp/

If you don't want to jump to the blogpost, here's a summarized version:

  1. Enable Workload Identity for the GKE cluster
  2. Deploy your cluster-autoscaler helm chart in a non-kube-system namespace
helm install custom-ca autoscaler/cluster-autoscaler \
--set "autoscalingGroupsnamePrefix[0].name=gke-cluster-1,autoscalingGroupsnamePrefix[0].maxSize=10,autoscalingGroupsnamePrefix[0].minSize=1" \
--set autoDiscovery.clusterName=cluster-1 \
--set "rbac.serviceAccount.annotations.iam\.gke\.io\/gcp-service-account=cluster-autoscaler@my-project-123456.iam.gserviceaccount.com" \
--set cloudProvider=gce \
--version=9.25.0 \
--namespace=default
  3. Create a ResourceQuota for the system-cluster-critical PriorityClass in your namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
  name: gcp-critical-pods
  namespace: default
spec:
  hard:
    pods: 2 # 2 because we need it only for cluster-autoscaler
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-cluster-critical # cluster-autoscaler priority class

  4. Create a GCP service account and grant it the role you want
  5. Bind the Kubernetes ServiceAccount to the GCP IAM service account
  6. Annotate your Kubernetes ServiceAccount so cluster-autoscaler can use Workload Identity federation (a rough sketch of steps 4-6 is below)
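
A rough sketch of the gcloud/kubectl side of steps 4-6; the project, role, and the ServiceAccount name created by the chart are assumptions and will differ per setup:

# 4. Create the GCP service account and grant it a role that can manage instance groups
#    (roles/compute.instanceAdmin is just an example; pick what fits your setup).
gcloud iam service-accounts create cluster-autoscaler --project my-project-123456
gcloud projects add-iam-policy-binding my-project-123456 \
  --member "serviceAccount:cluster-autoscaler@my-project-123456.iam.gserviceaccount.com" \
  --role roles/compute.instanceAdmin

# 5. Allow the Kubernetes ServiceAccount to impersonate the GCP service account
#    (replace default/custom-ca-gce-cluster-autoscaler with the ServiceAccount the chart actually creates).
gcloud iam service-accounts add-iam-policy-binding \
  cluster-autoscaler@my-project-123456.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project-123456.svc.id.goog[default/custom-ca-gce-cluster-autoscaler]"

# 6. Annotate the Kubernetes ServiceAccount (the helm --set above already adds this
#    annotation; shown here only for completeness).
kubectl annotate serviceaccount custom-ca-gce-cluster-autoscaler --namespace default \
  iam.gke.io/gcp-service-account=cluster-autoscaler@my-project-123456.iam.gserviceaccount.com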

@Nickmman

@vadasambar Do your cluster-autoscaler logs show that it manages only one instance group, and that the others should not be processed by cluster autoscaler (no node group config)?

@vadasambar
Member

> @vadasambar Do your cluster-autoscaler logs show that it manages only one instance group, and that the others should not be processed by cluster autoscaler (no node group config)?

@Nickmman, not sure if this answers your question, but I have used the custom cluster-autoscaler (multiple times) to manage only one instance group and it works fine for me. If you check this comment, you will see I have two instance-group-backed node pools called default-pool and pool-1, but I use cluster-autoscaler to manage only pool-1 (you can see the gke-cluster-1-pool-1 value in the flags in the screenshot).

yaroslava-serdiuk pushed a commit to yaroslava-serdiuk/autoscaler that referenced this issue Feb 22, 2024