AKS issue for taint removal #2934

Closed

pedroamaralm opened this issue May 5, 2022 · 77 comments

@pedroamaralm

What happened:

Hi, until last month I was able to remove the taint from my nodes, but since then I can't anymore. I haven't made any changes, either to the nodes or to the policies; I just can't remove it.

The following problem occurs:

p@p:~$ kubectl taint nodes aks-lxnode1-xxxxxxx-vmss00000y kubernetes.azure.com/scalesetpriority=spot:NoSchedule-
Error from server: admission webhook "aks-node-validating-webhook.azmk8s.io" denied the request: (UID: XXXXXX-XXXXX-XXXX-XXXX) Taint delete request "kubernetes.azure.com/scalesetpriority=spot:NoSchedule" refused. User is attempting to delete a taint configured on aks node pool "lxnode1".

What you expected to happen:

If I run the same commands on a cluster I have elsewhere, they work normally:

k8s@node1:$ kubectl taint nodes node2 kubernetes.azure.com/scalesetpriority=spot:NoSchedule
node/node2 tainted
k8s@node1:$ kubectl taint nodes node2 kubernetes.azure.com/scalesetpriority=spot:NoSchedule-
node/node2 untainted

How to reproduce it (as minimally and precisely as possible):

kubectl taint nodes aks-lxnode1-xxxxxxxx-vmss00000y kubernetes.azure.com/scalesetpriority=spot:NoSchedule-

Anything else we need to know?:

I use spot.io to manage spot instances. When an instance is created, it comes up with the NoSchedule taint, and we have a job that removed this taint, but now it doesn't remove it anymore, and as a user we can't remove it either. I also have Rancher installed on this cluster, where I'm an administrator, but it still won't let me remove the taint. This problem is not related to spot.io or to the machines being spot, because I brought up a new cluster from scratch with on-demand machines, and after adding the taint I couldn't remove it either.

@ghost ghost added the triage label May 5, 2022
@ghost

ghost commented May 5, 2022

Hi pedroamaralm, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@atykhyy

atykhyy commented May 6, 2022

This issue also interferes with deploying and scaling AKS clusters that use Cilium for networking, as Cilium (and likely other similar solutions) needs a taint on fresh nodes to prevent pods from being scheduled there until Cilium has configured the node.

@ghost ghost added the action-required label May 9, 2022
@ghost

ghost commented May 9, 2022

Triage required from @Azure/aks-pm

@allyford
Contributor

allyford commented May 9, 2022

Hi @pedroamaralm,
Thank you for your feedback. This is a known issue that occurs only when a non-AKS request (for example, via kubectl) attempts to change a node taint that was set via the AKS API. Could you provide a bit more information to help us understand your use case?

Instead of using the AKS api and then changing the taints on the node, you can use an admission controller to add taints to new node objects.

@do0ominik

do0ominik commented May 10, 2022

Hi @allyford ,
we are facing the same issue with our AKS. We use spot nodepools for our testing environment and don't want to add affinity rules (like mentioned here) for this purpose, otherwise we need to maintain different configurations for different stages (production, testing, dev, ...).

Until yesterday, the taint removal worked like a charm. As of today, our cluster is broken (pods can't be scheduled any more...). Our AKS is deployed to "West Europe".

How is this supposed to work in the future? Do we need affinity rules for that? Can we remove taints again?

Cheers!

@lkjangir

Until yesterday, it was working fine. I have a DaemonSet installed in my AKS clusters which takes care of removing spot taints from all nodes. As of today, it's failing on all the clusters. There's no communication from Microsoft regarding this.

@brudnyhenry

Hi, the AKS cluster is unusable for us. All our test environments run on spot instances, so now we are blocked. Was this change announced anywhere?

@allyford
Contributor

Hi all,

Thank you for all of your questions and feedback.

A couple solutions here for the above issues:

  1. Removal of taints: taints can be removed via the AKS API.
  2. Need taints on nodes at init time to avoid pod scheduling: use a mutating admission webhook instead of relying on the AKS API and the Kubernetes API in a conflicting order (see the Kubernetes docs, and the sketch after this list). Mutating admission webhooks can be invoked first and can modify node objects sent to the API server to enforce custom defaults.
  3. Removal of the spot pool taint on a spot pool: spot pool taints are an AKS system taint and cannot be modified or removed.

This behavior change was listed in the 2022-04-24 AKS release notes.
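For anyone exploring option 2, here is a minimal sketch of a MutatingWebhookConfiguration that intercepts Node CREATE requests so a webhook service can inject a startup taint. The webhook name, service, path, and caBundle are placeholders, and the webhook server that returns the JSONPatch is not shown:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: add-startup-taint
webhooks:
  - name: add-startup-taint.example.com      # hypothetical webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore                     # don't block node registration if the webhook is down
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["nodes"]
    clientConfig:
      service:
        name: taint-webhook                   # hypothetical service implementing the mutation
        namespace: kube-system
        path: /mutate
      caBundle: "<base64-encoded CA certificate>"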

@alkarp

alkarp commented May 11, 2022

This is a breaking change for us too. Clusters where we run Cilium are now unusable.

Cilium uses the taint to ensure the Cilium agent pods are deployed before any other workload and that workloads use the Cilium CNI:

  • Cilium does not support mutating admission webhooks;
  • AKS does not support bring-your-own CNI;

Extremely disappointing.

@JRolfe-Gen

What is the purpose of this feature? It takes away key functionality of k8s. I don't understand what direction you are trying to take this or how taking this away benefits users of AKS. Are there any plans to make it so that taints can be controlled down to the individual node? Managing taints at the nodepool level is a terrible idea.

@akostic-kostile

I just found out about this change when our production deployment failed due to Cilium not being able to remove the taint any longer. The cluster was provisioned using Terraform and the taint was set in the same way.

Cilium was chosen because it supports something that neither of the NetworkPolicy solutions MS offers does: DNS-based egress policies. In this day and age I consider IP-based network policies practically useless for anything residing outside the cluster.

@hacst

hacst commented May 12, 2022

This change is producing issues for us too. We use an initial taint on nodes in certain pools to prevent scheduling of pods on new nodes until the required provisioning, done by a DaemonSet on the node, is completed and the taint is removed. The mutating admission webhook sounds like a very complicated and brittle workaround for what used to be a simple, Kubernetes-native way of dealing with this.

@gara-MI

gara-MI commented May 12, 2022

Hello @hacst ,
I found a workaround for this issue using https://github.com/open-policy-agent/gatekeeper from OPA.
The installation of Gatekeeper can be found here.
The commands I used:

kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/release-3.8/deploy/gatekeeper.yaml

cat <<EOF | kubectl apply -f -
apiVersion: mutations.gatekeeper.sh/v1beta1
kind: ModifySet
metadata:
  name: remove-node-taints
  namespace: gatekeeper
spec:
  location: "spec.taints"
  applyTo:
  - groups: [""]
    kinds: ["Node"]
    versions: ["v1"]
  parameters:
    operation: prune
    values:
      fromList:
        - effect: NoSchedule
          key: kubernetes.azure.com/scalesetpriority
          value: spot
EOF

I tested this on k8s version 1.21.7.

@hacst

hacst commented May 12, 2022

@gara-MI thanks. Unfortunately it is not really applicable to my use case, though it might be an option for people specifically wanting to get rid of the spot instance taint throughout the cluster. For us, as the node pool scales up/down we would have to add/remove these mutation resources dynamically for each node once it becomes ready/goes away to replicate what we had. Also, we are not running OPA in our cluster.

@gara-MI

gara-MI commented May 12, 2022

Hi @hacst, I see your use case is different. I think it's more useful for others who are using a DaemonSet to remove taints on spot nodes.

@do0ominik

@gara-MI I think this will not work if you have Azure Policy enabled, because Azure manages Gatekeeper and no "mutation" resources are deployed.

@akostic-kostile

I managed to fix our Cilium clusters, here's a brief explanation how:
cilium/cilium#19788 (comment)

@camposdelima

I think a simpler workaround for most cases is to delete the aks-node-validating-webhook:

 kubectl delete ValidatingWebhookConfiguration aks-node-validating-webhook

But if you are more conservative, you can just disable it:

 kubectl get ValidatingWebhookConfiguration aks-node-validating-webhook -o yaml | sed -e 's/\(objectSelector: \){}/\1{"matchLabels": {"disable":"true"}}/g' | kubectl apply -f -

To ensure the modification stays in place, I also created a job that re-applies the deactivation every minute and removes all spot node taints:
https://gist.github.com/camposdelima/c77a4f23a9a831188b88ca67650cf011

@akostic-kostile

@camposdelima Excellent find! Had I known about this I wouldn't have bothered with an admission controller.

@palma21
Member

palma21 commented Jun 9, 2022

I think a simpler workaround for most cases is to delete the aks-node-validating-webhook:

 kubectl delete ValidatingWebhookConfiguration aks-node-validating-webhook

But if you are more conservative, you can just disable it:

 kubectl get ValidatingWebhookConfiguration aks-node-validating-webhook -o yaml | sed -e 's/\(objectSelector: \){}/\1{"matchLabels": {"disable":"true"}}/g' | kubectl apply -f -

To ensure the modification stays in place, I also created a job that re-applies the deactivation every minute and removes all spot node taints: https://gist.github.com/camposdelima/c77a4f23a9a831188b88ca67650cf011

This is going to start failing in future releases, FYI. You cannot disable system validating webhooks.

What is the purpose of this feature? It takes away key functionality of k8s. I don't understand what direction you are trying to take this or how taking this away benefits users of AKS. Are there any plans to make it so that taints can be controlled down to the individual node? Managing taints at the nodepool level is a terrible idea.

What functionality of k8s is this removing? You're not forced to manage taints at the nodepool level, though I'd love to know more about why it's a terrible idea. You can add any taint via the k8s API to any individual node, and also remove it. What you can't do is add nodepool taints via the AKS API and then remove them from individual nodes via the k8s API.

We use an initial taint on nodes in certain pools to prevent scheduling of pods on new nodes until the required provisioning, done by a DaemonSet on the node, is completed and the taint is removed. The mutating admission webhook sounds like a very complicated and brittle workaround for what used to be a simple, Kubernetes-native way of dealing with this.

If it were a native k8s way, you could do it all via the k8s API, no? :) But in this case you were using the AKS API to set them and the k8s API to remove them, right? This causes conflicts at reconciliation time for the service.

@palma21
Member

palma21 commented Jun 9, 2022

Listening to the feedback on this thread, it seems all the use cases that relied on the prior behavior are along the lines of node initialization. We can add startup taints, passed to the kubelet, to meet this use case. They will not be enforced by AKS or set via the AKS API, but via the kubelet custom settings, and you can then remove them via a k8s API call because they won't be reconciled by the AKS API. Does anyone have a use case that this wouldn't meet?

Otherwise we can provide this asap
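For context, the kubelet-side mechanism behind such startup taints is the registerWithTaints field of the KubeletConfiguration (equivalently the --register-with-taints flag). The sketch below uses a hypothetical taint key and assumes nothing about how AKS would eventually expose this:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Taints registered this way are applied once at node registration and are not
# reconciled by the AKS API, so a DaemonSet can later remove them via the k8s API.
registerWithTaints:
  - key: example.com/node-initializing      # hypothetical startup taint key
    value: "true"
    effect: NoSchedule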

@palma21 palma21 self-assigned this Jun 9, 2022
@aservedio

aservedio commented Aug 21, 2023

To summarize and be more constructive about this, since it's a major blocker for an important project I've been working on at work:

For my project requirements, I have the interesting and fun challenge of making it work on all top providers. I have 4 so far, which all work the same. Recently the project and other projects using it have had much more incentive to use AKS, so we have been starting to work on supporting it. So far it looks promising and 95% working, but we are blocked on one small part.

Basically, we package our solution in a Helm chart, and it has DaemonSets that configure any current and new nodes joining the cluster that match certain labels, so they are ready to pull a variety of large game server images in an optimized way. The only super reliable, cross-provider, and very simple way we found, after trying many different things in live environments, was to add a temporary taint ignore-taint.cluster-autoscaler.kubernetes.io/whatever-you-like to nodepools and have our DaemonSet remove it once the node is ready.

This works very well on AWS, GCP, Rancher, Linode and Tencent so far. But on AKS we get that error saying we cannot remove our own taint using the k8s API. (Not sure if the AKS autoscaler has that ignore-taint logic, though.)

I'm pretty sure any project that wants to set up special nodes uses this method too, which those providers allow for exactly these node-setup purposes.


So yeah, here's my rough suggestion:

At the AKS nodepool API level:

  • Taints should be treated as: create nodes with X taints, then leave them alone to be fully managed via k8s, even if the user messes them up. Almost like a do_not_babysit boolean option.
  • For spot instances, this is not a blocker for me but more something I feel should be done while at it:
    • If that same boolean option above is true, it should not add spot taints to nodepools with priority=Spot. Projects should be responsible for this and have the flexibility of defining their own constraints.
    • Not forcing everything that could run on a spot instance to have that toleration would be really helpful. A label should definitely be added so we can target those nodes, but tolerations are a pain to add if you have many different Helm charts to install and they have multiple components within.

@MarkDeckert

This is actually very simple, but it requires a certain respect for the customer.

Microsoft, don't put arbitrary taints on nodepools, ever. If you think a nodepool should have a taint as a best practice - note it in the documentation, or if it's so critical, put a warning somewhere during provisioning.

@andreiZi

Are there any updates on this? I'm looking to utilize spot instances for my customers' dev and test environments. Unfortunately, I rely heavily on resources like Consul, which perform numerous side tasks. To configure them, I currently have to fork the original Helm chart, which isn't ideal.

A big thank you to @7wingfly. Your workaround was a lifesaver! However, it would be great if Microsoft could address this issue, as it presents a significant limitation for some of us.

I've opened a case with Microsoft about it, but I'm not really expecting a resolution at this point based on what I've read in this thread. For now (or more likely forever) I will get around it by adding a default toleration at the namespace level, for every namespace.

apiVersion: v1
kind: Namespace
metadata:
  name: thenamespacename
  annotations:
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"Key": "kubernetes.azure.com/scalesetpriority", "Operator": "Equal", "Value": "spot", "Effect": "NoSchedule"}]'

@aservedio-perso

Looking at the next "obvious" alternative to work around this behavior difference between AKS and all the other major and medium k8s cluster providers:

Does anyone know whether we would be able to create a core Node mutating webhook to achieve this, or is there another mechanism, in either k8s or AKS, that will block that too?

Regards,

@allyford
Contributor

Listening to the feedback on this thread, it seems all the use cases that relied on the prior behavior are along the lines of node initialization. We can add startup taints, passed to the kubelet, to meet this use case. They will not be enforced by AKS or set via the AKS API, but via the kubelet custom settings, and you can then remove them via a k8s API call because they won't be reconciled by the AKS API. Does anyone have a use case that this wouldn't meet?

Otherwise we can provide this asap

See #3464 for the startup taints feature currently in development. I'll post updates on that issue with release timelines and am happy to answer additional questions there.

@aservedio

Startup taints passed to the kubelet?

As in: create a cluster, then edit some configmap somewhere or patch it, and somehow target a nodepool by what, labels set using the API?

While everywhere else you just create a cluster and pass the relevant options to achieve this?

It sounds like an AKS-specific workaround, to be honest :/

@aservedio

In the end, what would be absolutely amazing is doing this through the API and the various toolings around it, like Terraform or Pulumi: create clusters with taints and labels that can be fully managed, deleted, or modified using the k8s API later, without this AKS-specific piece of logic that blocks k8s actions on them, contrary to the other 6 providers I'm working with every week.

Doing otherwise means, at least for our project and its users, that we and they would need to start modifying their cluster creation flow with something well outside the usual boxes / a new extra flow, specifically for AKS.

Also, from what I'm seeing in recent GCP updates on this subject and their API, they might be going the route of two sets of options in their API for both labels and taints: one that can't be changed later and one that can.

Not an unreasonable way, IMO.

@aservedio

Positive update: a Node mutating webhook on AKS and AWS is so far able to intercept the node create event and mutate/patch its labels and/or taints.

It sets the bar high for something that is so trivial elsewhere, but at least it seems to work.

@hacst

hacst commented Feb 2, 2024

This issue is neither stale nor solved. @allyford, maybe you can give an update on the implementation progress? We are still waiting for a fix for this regression, and some of the workarounds have an ongoing cost for us.

@microsoft-github-policy-service microsoft-github-policy-service bot removed the stale Stale issue label Feb 2, 2024
@redaER7

redaER7 commented Feb 5, 2024

az aks nodepool update -g $RESOURCE_GROUP -n $NODEPOOL --cluster-name $CLUSTER --node-taints "" worked for me, from https://learn.microsoft.com/es-es/cli/azure/aks/nodepool?view=azure-cli-latest

Pass the empty string "" to remove all taints.

@ritesh-makerble

Any update on this issue? How can the taint be removed at the nodepool level?

@mattiasb

az aks nodepool update -g $RESOURCE_GROUP -n $NODEPOOL --cluster-name $CLUSTER --node-taints "" worked for me, from https://learn.microsoft.com/es-es/cli/azure/aks/nodepool?view=azure-cli-latest

That solves a subset of the issues talked about in this thread.

What @hacst, I, and many more want is the ability to assign a taint to a node pool such that new nodes start with that taint, while still being able to remove the taint from individual nodes at a later point (for example as part of a DaemonSet that does some preparation on the node, like warming up a cache); a rough sketch of that pattern follows.
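(The following is only an illustrative sketch of such a DaemonSet, not an AKS feature: the taint key, image, and ServiceAccount are assumptions, and the ServiceAccount needs RBAC permission to patch nodes, which is not shown.)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: remove-startup-taint
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: remove-startup-taint
  template:
    metadata:
      labels:
        app: remove-startup-taint
    spec:
      serviceAccountName: taint-remover          # hypothetical; needs RBAC to patch nodes
      tolerations:
        - key: example.com/node-initializing     # hypothetical startup taint
          operator: Exists
          effect: NoSchedule
      containers:
        - name: untaint
          image: bitnami/kubectl:latest          # any image containing kubectl
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          command: ["/bin/sh", "-c"]
          args:
            # Do the node preparation first (cache warm-up, etc.), then drop the taint.
            - |
              kubectl taint nodes "$NODE_NAME" example.com/node-initializing:NoSchedule- || true
              sleep infinity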

@hterik

hterik commented Mar 25, 2024

The ability to set taints at the nodepool level has the potential to create scale-up loops if the taints are not removed quickly enough by the user, because the autoscaler keeps trying to start new nodes as long as pods cannot be scheduled.

The autoscaler provides a feature to prevent this scenario, the --ignore-taints argument, but this option is not exposed by AKS today; opening it up needs to be part of any feature to set nodepool taints. See #3276

@charlienewey-odin

charlienewey-odin commented May 17, 2024

My application is slightly different from others in this thread, but I have had success with @megakid's approach of using Kyverno and suspect it can generalise quite well.

I am using NVIDIA's GPU Operator to configure GPU drivers and MIGs (multi-instance GPUs) - but the GPU Operator requires a node reboot before MIG can be used. The problem arose because the scheduler wouldn't notice that MIG was not configured and would therefore schedule jobs on the node just before it rebooted (this caused an elevated rate of job failure).

I worked around this by adding 2 Kyverno ClusterPolicy definitions that apply or remove the Unschedulable property based on the presence/absence of labels indicating the node's MIG state. This should be possible to generalise to other situations.

Note: I had to configure the Kyverno deployment to be able to manage Node objects by removing them from resourceFilters in the kyverno config map (can be done via values.yaml in the Helm chart).

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: aks-mig-apply-unready-taint
  annotations:
    policies.kyverno.io/title: Apply Taint to Unready MIG Nodes
    policies.kyverno.io/category: Other
    policies.kyverno.io/subject: Node
    kyverno.io/kyverno-version: 1.7.2
    policies.kyverno.io/minversion: 1.7.0
    policies.kyverno.io/description: >-
      AKS nodes that are configured with the NVIDIA GPU Operator and have a MIG profile
      defined need to be rebooted in order to use the feature. This policy adds the taint
      `mig=notReady:NoSchedule` if the node is configured with a MIG profile and the label
      `nvidia.com/mig.config.state=success` is not present.
spec:
  rules:
    - name: aks-mig-apply-unready-taint
      match:
        resources:
          kinds:
            - Node
          operations:
            - CREATE
            - UPDATE
          selector:
            matchLabels:
              nvidia.com/mig.config: "*"
      exclude:
        resources:
          kinds:
            - Node
          selector:
            matchLabels:
              nvidia.com/mig.config.state: "success"
      mutate:
        patchesJson6902: |-
          - path: /spec/unschedulable
            op: add
            value: true
          - path: /metadata/labels/mig
            op: add
            value: notReady
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: aks-mig-remove-unready-taint
  annotations:
    policies.kyverno.io/title: Remove Taint From Ready MIG Nodes
    policies.kyverno.io/category: Other
    policies.kyverno.io/subject: Node
    kyverno.io/kyverno-version: 1.7.2
    policies.kyverno.io/minversion: 1.7.0
    policies.kyverno.io/description: >-
      AKS nodes that are configured with the NVIDIA GPU Operator and have a MIG profile
      defined need to be rebooted in order to use the feature. This policy removes the
      `mig=notReady:NoSchedule` taint if the label `nvidia.com/mig.config.state=success`
      is present.
spec:
  rules:
    - name: aks-mig-remove-unready-taint
      match:
        resources:
          kinds:
            - Node
          operations:
            - CREATE
            - UPDATE
          selector:
            matchLabels:
              nvidia.com/mig.config: "*"
              nvidia.com/mig.config.state: "success"
      mutate:
        patchesJson6902: |-
          - path: /spec/unschedulable
            op: add
            value: false
          - path: /metadata/labels/mig
            op: replace
            value: ready

@microsoft-github-policy-service microsoft-github-policy-service bot added the stale Stale issue label Jun 7, 2024
Contributor

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.

Contributor

This issue will now be closed because it hasn't had any activity for 7 days after going stale. pedroamaralm, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question/issue or suggestion.

@pranaysahith

az aks nodepool update -g $RESOURCE_GROUP -n $NODEPOOL --cluster-name $CLUSTER --node-taints ""
This does not remove the taint from the node pool.

@microsoft-github-policy-service microsoft-github-policy-service bot removed the stale Stale issue label Oct 29, 2024
@ams0

ams0 commented Oct 30, 2024

Just found this closed issue. As @pranaysahith said, this still doesn't work; taints are not removed from spot pools. I want to use those pools just like any other pools, without forcing tolerations or using OPA/Kyverno to modify every single deployment; please make the taint optional.
