
KEDA identity is not authenticating to Service Bus after a few hours #3977

Closed
tshaiman opened this issue Dec 7, 2022 · 77 comments
Labels
bug Something isn't working

Comments

@tshaiman

tshaiman commented Dec 7, 2022

Report

While running KEDA 2.8.1 on AKS 1.24.6 in a few regions, we see that after some time (it could be days or hours)
KEDA loses the managed identity and logs many authentication errors.

Restarting the KEDA pod fixes the issue.
We wonder whether this is a bug in Workload Identity, an Azure VM restart issue, or a KEDA issue.
We would also like to know whether the KEDA pod health check can be tied to these logs so the pod restarts itself when these errors occur.

Expected Behavior

  • Let KEDA restart itself when these errors start to fire

Actual Behavior

  • KEDA keeps emitting these logs with no option to self-heal

Steps to Reproduce the Problem

  1. AKS 1.24.6
  2. 17 Service Bus queues
  3. Workload Identity federation enabled on the cluster and on the TriggerAuthentication
  4. KEDA has a service account and the federated credentials are working
  5. KEDA runs fine for a few hours and then suddenly loses the token

Logs from KEDA operator

2022-12-07T14:36:38Z    ERROR   scalehandler    Error getting scale decision    {"scaledobject.Name": "cb", "scaledObject.Namespace": "vi-be-map", "scaleTarget.Name": "vi-cb-api", "error": "error reading service account token - open : no such file or directory"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:278
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:149

KEDA Version

2.8.1

Kubernetes Version

1.24

Platform

Microsoft Azure

Scaler Details

Azure Service Bus

Anything else?

  • The logs say the service account token cannot be found, but the service account token is there.
  • Workload Identity Federation is active
@tshaiman tshaiman added the bug Something isn't working label Dec 7, 2022
@JorTurFer
Member

Hmm... weird.
Could this be an error in AKS itself that for some reason isn't injecting the token? @v-shenoy ?

@v-shenoy
Contributor

v-shenoy commented Dec 9, 2022

Hmm... weird. Could this be an error in AKS itself that for some reason isn't injecting the token? @v-shenoy ?

@tshaiman Is this consistently reproducible? Did you set a custom token expiration period using --set azureWorkload.tokenExpiration while installing?

If this error aligns with the time that the token is about to expire, then maybe something is wrong with the way we're reading the token. If it's not, then there is some error in the updated token being injected into the pod. But I can't tell why this would happen, @JorTurFer.
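
For reference, that value is set at install time; a minimal sketch assuming the kedacore/keda Helm chart and the value path quoted above (the exact key may differ between chart versions, so treat this as an illustration):

    # install/upgrade KEDA with a 1-hour projected token expiration
    helm upgrade --install keda kedacore/keda --namespace keda --create-namespace \
      --set azureWorkload.tokenExpiration=3600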

@tshaiman
Author

tshaiman commented Dec 9, 2022

@v-shenoy : this was not happening more than once per region we are deployed in (23 regions in total), but the trigger was moving from Ubuntu VMs to Mariner-based images on AKS.
It is important to say that the KEDA controller lost its auth token around 2 days after the VM deployment.
When we looked at the pod age it was 2d, whereas the token expiration is set to 3600 (!),
meaning something is not aligned between the expiration schedule, which is every hour, and the fact that we lost the identity after 2 days.

It is also important to know (even if it may be irrelevant) that we ran with replicas=2, and 1 pod was always idle (an active/passive kind of setup), so they did not share the workload between them.

@tshaiman
Author

tshaiman commented Dec 9, 2022

@v-shenoy : Karma is a bitch. I was just talking about not being able to reproduce this item, and then I saw it again in many of our regions after the pods had been up for around 10 hours.

We have decided to increase the token expiration to the maximum value (86400) and forcibly restart KEDA every hour (a rollout-restart sketch follows the logs below).
Do you happen to have a contact on the Workload Identity team?

╰─ k logs -f  -l app=keda-operator -n keda --since 5m
        /workspace/pkg/scaling/cache/scalers_cache.go:94
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:278
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:149
2022-12-09T19:07:54Z    ERROR   scalehandler    Error getting scale decision    {"scaledobject.Name": "tm", "scaledObject.Namespace": "vi-be-map", "scaleTarget.Name": "vi-tm-api", "error": "error reading service account token - open : no such file or directory"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:278
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:149
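
For the forced hourly restart mentioned above, a minimal sketch (assuming the default operator Deployment name and namespace used by the Helm chart; schedule it hourly with your tooling of choice):

    kubectl rollout restart deployment/keda-operator -n keda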

@v-shenoy
Contributor

v-shenoy commented Dec 9, 2022

I haven't talked to people on the workload identity team myself before, but I think @aramase can help.

@v-shenoy
Contributor

v-shenoy commented Dec 9, 2022

On an unrelated but tangential note about workload identity: we need to refactor some parts there. The structs here embed a context, which is not a recommended Go practice. Should we open an issue for this? @JorTurFer
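
A minimal illustration of the pattern in question (type and method names here are hypothetical, not KEDA's actual code):

    package wi

    import "context"

    // Anti-pattern: a context stored in a struct outlives the call it belongs to
    // and hides cancellation and deadlines from callers.
    type azureWorkloadConfig struct {
        ctx      context.Context // embedded context (not recommended)
        clientID string
    }

    // Preferred shape: accept the context explicitly on each call.
    func (c *azureWorkloadConfig) token(ctx context.Context) (string, error) {
        _ = ctx // the token request would use ctx here
        return "", nil
    }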

@JorTurFer
Member

On an unrelated but tangential note about workload identity: we need to refactor some parts there. The structs here embed a context, which is not a recommended Go practice. Should we open an issue for this? @JorTurFer

I think so. With an issue we could get help from contributors; without it, we will be the ones who do it.

@tshaiman
Author

@JorTurFer : I want to become a contributor, but my Go is a bit rusty after the last 2 years (I moved to other languages).
Do you think the ramp-up is reasonable?

@JorTurFer
Member

You'll never know if you don't try xD. We are here to help if you have any question or doubt :)

@tshaiman
Author

I'm starting to migrate to KEDA 2.9 and I see a difference in the Deployment YAML, which now contains an additional
annotation to use workload identity.
Usually this annotation needs to be placed on the Service Account, so I wonder if that change has anything to do with the issue we are seeing.

In addition, I want to emphasize that we ran with replicas=2, ensuring each KEDA pod runs on a different VM with
pod anti-affinity rules. In that case they were acting as active/passive and only 1 pod actually handled traffic.
Thinking out loud whether this could also cause the issue we saw.

@tshaiman
Author

@v-shenoy / @JorTurFer :

I have some updates from a new experiment done on KEDA 2.9.1:

Replicas: 2
Token expiration timeout: 3600
Node pool: Mariner 2.0
Is the TriggerAuthentication annotated with azure-workload: Yes
Is KEDA using aad-pod-identity: No
Time running: 16h
Number of errors: 1
Is it recurring: No

ct": {"name":"op","namespace":"vi-be-map"}, "namespace": "vi-be-map", "name": "op", "reconcileID": "5ea568d6-6cb6-4e00-9102-766b8d4f1298"}
2022-12-18T02:41:33Z    ERROR   azure_servicebus_scaler error getting service bus entity length {"type": "ScaledObject", "namespace": "vi-be-map", "name": "aed", "error": "ChainedTokenCredential: failed to acquire a token.\nAttempted credentials:\n\tAzureCLICredential: fork/exec /bin/sh: no such file or directory\n\terror acquiring aad token - unable to resolve an endpoint: server response error:\n Get \"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/v2.0/.well-known/openid-configuration\": context deadline exceeded"}
github.com/kedacore/keda/v2/pkg/scalers.(*azureServiceBusScaler).GetMetricsAndActivity
        /workspace/pkg/scalers/azure_servicebus_scaler.go:266
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetScaledObjectState
        /workspace/pkg/scaling/cache/scalers_cache.go:136
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:360
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:162
       

@v-shenoy
Contributor

I'm starting to migrate to KEDA 2.9 and I see a difference in the Deployment YAML, which now contains an additional
annotation to use workload identity.
Usually this annotation needs to be placed on the Service Account, so I wonder if that change has anything to do with the issue we are seeing.

In addition, I want to emphasize that we ran with replicas=2, ensuring each KEDA pod runs on a different VM with
pod anti-affinity rules. In that case they were acting as active/passive and only 1 pod actually handled traffic.
Thinking out loud whether this could also cause the issue we saw.

As per Workload Identity docs, both the pods and the service account require the label now.
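
Roughly, the wiring per the updated docs looks like this; a sketch where the client ID and names are placeholders:

    # ServiceAccount used by the workload
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: keda-operator
      namespace: keda
      annotations:
        azure.workload.identity/client-id: <managed-identity-client-id>
    ---
    # and on the pod template of the Deployment (fragment):
    metadata:
      labels:
        azure.workload.identity/use: "true"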

@tshaiman
Author

@v-shenoy this is inaccurate.
there is no "azure.workload.identity/use" label on the pod
https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview

@v-shenoy
Contributor

@tshaiman
Author

@v-shenoy : well, 2 different docs from the same provider, not surprising. Anyhow, I don't think that is the root cause of the "file not found" issue in this bug, as we are running without this label and it works for a few hours and then loses the file.

@v-shenoy
Contributor

@v-shenoy : well, 2 different docs from the same provider, not surprising. Anyhow, I don't think that is the root cause of the "file not found" issue in this bug, as we are running without this label and it works for a few hours and then loses the file.

I think the docs on learn.microsoft.com have not been updated yet.

Yup. I don't think these are related, just wanted to clarify on it.

@v-shenoy
Contributor

@v-shenoy / @JorTurFer :

I have some updates from a new experiment done on KEDA 2.9.1:

Replicas: 2
Token expiration timeout: 3600
Node pool: Mariner 2.0
Is the TriggerAuthentication annotated with azure-workload: Yes
Is KEDA using aad-pod-identity: No
Time running: 16h
Number of errors: 1
Is it recurring: No

ct": {"name":"op","namespace":"vi-be-map"}, "namespace": "vi-be-map", "name": "op", "reconcileID": "5ea568d6-6cb6-4e00-9102-766b8d4f1298"}
2022-12-18T02:41:33Z    ERROR   azure_servicebus_scaler error getting service bus entity length {"type": "ScaledObject", "namespace": "vi-be-map", "name": "aed", "error": "ChainedTokenCredential: failed to acquire a token.\nAttempted credentials:\n\tAzureCLICredential: fork/exec /bin/sh: no such file or directory\n\terror acquiring aad token - unable to resolve an endpoint: server response error:\n Get \"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/v2.0/.well-known/openid-configuration\": context deadline exceeded"}
github.com/kedacore/keda/v2/pkg/scalers.(*azureServiceBusScaler).GetMetricsAndActivity
        /workspace/pkg/scalers/azure_servicebus_scaler.go:266
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetScaledObjectState
        /workspace/pkg/scaling/cache/scalers_cache.go:136
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:360
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:162
       

The only thing I wonder here is whether this is somehow specific to AKS and Mariner, or whether it happens on all OSes.

@tshaiman
Author

@v-shenoy : only AKS on Mariner. It never happened when we ran on Ubuntu.

@v-shenoy
Contributor

That's interesting. This makes it more of an AKS issue than KEDA, doesn't it?

@tshaiman
Author

Absolutely.
They have a bug, as does Workload Identity.

I'm trying everything now 😔

@v-shenoy
Contributor

Absolutely. They have a bug, as does Workload Identity.

I'm trying everything now 😔

Is there an issue created on AKS / Workload Identity for this? If so can you link it here?

@tshaiman
Author

Sure:
Azure/azure-workload-identity#665

For the AKS team I've used the internal Microsoft bug reporting system.

I have another concern related to the KEDA docs.


Looking at how to set up the TriggerAuthentication, it is not clear what the identityId field represents.
Is it the label name (a.k.a. the selector)?
Is it the client ID? (It is for Workload Identity, and in that case the name should be identityClientId, not identityId.)
Is it the full ARM resource ID of the managed identity ("/subscription/...../resourcegroup/my-rg/my-managed-identity-name")?

The term "Id" is misleading, since in the case of Workload Identity it's not the ID, it's the client ID, and in the case of Pod Identity it's also not the ID,
it's the label/selector, if I'm not mistaken.
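
For reference, this is where the field in question sits; a sketch with placeholder values (for the azure-workload provider it is read as the managed identity's client ID, as far as I can tell):

    apiVersion: keda.sh/v1alpha1
    kind: TriggerAuthentication
    metadata:
      name: azure-servicebus-auth
    spec:
      podIdentity:
        provider: azure-workload
        identityId: <managed-identity-client-id>   # placeholder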

@tshaiman
Author

tshaiman commented Dec 18, 2022

@v-shenoy
More new logs from KEDA 2.9.1:

keda-operator-d5464cdd6-zvdw2 keda-operator 2022-12-18T15:03:48Z        ERROR   azure_servicebus_scaler error getting service bus entity length {"type": "ScaledObject", "namespace": "vi-be-map-dev11", "name": "rc-visolo", "error": "ChainedTokenCredential: failed to acquire a token.\nAttempted credentials:\n\tAzureCLICredential: fork/exec /bin/sh: no such file or directory\n\terror reading service account token - open : no such file or directory"}
keda-operator-d5464cdd6-zvdw2 keda-operator github.com/kedacore/keda/v2/pkg/scalers.(*azureServiceBusScaler).GetMetricsAndActivity
keda-operator-d5464cdd6-zvdw2 keda-operator     /workspace/pkg/scalers/azure_servicebus_scaler.go:266
keda-operator-d5464cdd6-zvdw2 keda-operator github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetScaledObjectState
keda-operator-d5464cdd6-zvdw2 keda-operator     /workspace/pkg/scaling/cache/scalers_cache.go:136
keda-operator-d5464cdd6-zvdw2 keda-operator github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
^Ckeda-operator-d5464cdd6-zvdw2 keda-operator   /workspace/pkg/scaling/scale_handler.go:360
keda-operator-d5464cdd6-zvdw2 keda-operator github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
keda-operator-d5464cdd6-zvdw2 keda-operator     /workspace/pkg/scaling/scale_handler.go:162
keda-operator-d5464cdd6-zvdw2 keda-operator 2022-12-18T15:03:48Z        ERROR   azure_servicebus_scaler error getting service bus entity length {"type": "ScaledObject", "namespace": "vi-be-map-dev11", "name": "celebs", "error": "ChainedTokenCredential: failed to acquire a token.\nAttempted credentials:\n\tAzureCLICredential: fork/exec /bin/sh: no such file or directory\n\terror reading service account token - open : no such file or directory"}
keda-operator-d5464cdd6-zvdw2 keda-operator github.com/kedacore/keda/v2/pkg/scalers.(*azureServiceBusScaler).GetMetricsAndActivity
keda-operator-d5464cdd6-zvdw2 keda-operator     /workspace/pkg/scalers/azure_servicebus_scaler.go:266
keda-operator-d5464cdd6-zvdw2 keda-operator github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetScaledObjectState
keda-operator-d5464cdd6-zvdw2 keda-operator     /workspace/pkg/scaling/cache/scalers_cache.go:136


So what we see here is something never seen before, as this is a ChainedTokenCredential error: I guess the AzureCLICredential was not removed from the chain list, and it causes lots of errors.

@tshaiman
Author

closing this thread and opening a dedicated 2.9.1 issue

@tshaiman
Author

@v-shenoy @JorTurFer @tomkerkhove
I have some questions from the workload identity team:

“@tshaiman Is KEDA using the workload identity webhook? The webhook is not part of the runtime of the pod; it only mutates the pod at deploy time to add the volume for the projected service account token. The token is generated by the kubelet and written to the volume, so there is no workload identity webhook in the picture during runtime. The token is renewed at 80% expiry by the kubelet process (xref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#serviceaccount-token-volume-projection)

Please share more details on how the pod is reading the service account token, what environment variables it's using, and the pod spec, for us to provide some pointers.”

@tshaiman tshaiman reopened this Dec 20, 2022
@tshaiman
Author

the workload identity bug thread is here :
Azure/azure-workload-identity#665

@tomkerkhove
Member

Thank you @tshaiman!

@jmos5156

jmos5156 commented Jan 9, 2023

Hello, we seem to be getting similar errors from AKS:

2023-01-09T19:45:39Z	ERROR	azure_servicebus_scaler	error getting service bus entity length	{"type": "ScaledObject", "namespace": "do", "name": "azure-servicebus-queue-scaled-eventparser", "error": "ChainedTokenCredential: failed to acquire a token.\nAttempted credentials:\n\tAzureCLICredential: fork/exec /bin/sh: no such file or directory\n\terror reading service account token - open : no such file or directory"}
github.com/kedacore/keda/v2/pkg/scalers.(*azureServiceBusScaler).GetMetricsAndActivity
	/workspace/pkg/scalers/azure_servicebus_scaler.go:266
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetScaledObjectState
	/workspace/pkg/scaling/cache/scalers_cache.go:140
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
	/workspace/pkg/scaling/scale_handler.go:356
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
	/workspace/pkg/scaling/scale_handler.go:162
2023-01-09T19:45:39Z	ERROR	scalers_cache	error getting scale decision	{"scaledobject.Name": "azure-servicebus-queue-scaled-eventparser", "scaledObject.Namespace": "do", "scaleTarget.Name": "eventparser", "error": "ChainedTokenCredential: failed to acquire a token.\nAttempted credentials:\n\tAzureCLICredential: fork/exec /bin/sh: no such file or directory\n\terror reading service account token - open : no such file or directory"}
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetScaledObjectState
	/workspace/pkg/scaling/cache/scalers_cache.go:154
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
	/workspace/pkg/scaling/scale_handler.go:356
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
	/workspace/pkg/scaling/scale_handler.go:162

We're running
AKS - version 1.23.5
aad-pod - chart version 4.1.15
keda - chart version: 2.9.0

We have tried several different versions, including 2.9.1, and all exhibit the same error.

TIA

@tshaiman
Author

tshaiman commented Jan 9, 2023

@jmos5156 : actually we ran on
workload identity, not on pod identity.
Have you configured pod identity correctly with
the “az aks pod-identity” command? There are many permissions you need to grant to the VM/VMSS if you are using pod identity, and the CLI does that for you.

So I'm just making sure this is the same issue; I'm not sure 🤔
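
For reference, a sketch of the command in question (all names and the identity resource ID are placeholders):

    az aks pod-identity add --resource-group <rg> --cluster-name <aks-name> \
      --namespace <app-namespace> --name <pod-identity-name> \
      --identity-resource-id /subscriptions/<sub>/resourcegroups/<rg>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name>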

@JorTurFer
Member

I think @jmos5156's issue isn't related to this one; it's related to this other issue

@Dragonsong3k

I came to +1 this.

I just updated my AKS cluster from 1.21 to 1.24.6, as well as updating AAD Pod Identity.

No ScaledJobs are kicking off, and here is the keda-operator log:

2023-01-10T06:28:23Z    ERROR    azure_servicebus_scaler    error getting service bus entity length    {"type": "ScaledJob", "namespace": "gaccc", "name": "provision-svc-ctrl-job", "error": "ChainedTokenCredential: failed to acquire a token.\nAttempted credentials:\n\tAzureCLICredential: fork/exec /bin/sh: no such file or directory\n\terror reading service account token - open : no such file or directory"}
github.com/kedacore/keda/v2/pkg/scalers.(*azureServiceBusScaler).GetMetricsAndActivity
    /workspace/pkg/scalers/azure_servicebus_scaler.go:266
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).getScaledJobMetrics
    /workspace/pkg/scaling/cache/scalers_cache.go:306
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).IsScaledJobActive
    /workspace/pkg/scaling/cache/scalers_cache.go:178
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
    /workspace/pkg/scaling/scale_handler.go:372
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
    /workspace/pkg/scaling/scale_handler.go:162
2023-01-10T06:28:23Z    INFO    scaleexecutor    Scaling Jobs    {"scaledJob.Name": "provision-svc-ctrl-job", "scaledJob.Namespace": "gaccc", "Number of running Jobs": 0}
2023-01-10T06:28:23Z    INFO    scaleexecutor    Scaling Jobs    {"scaledJob.Name": "provision-svc-ctrl-job", "scaledJob.Namespace": "gaccc", "Number of pending Jobs ": 0}

Are there any workarounds, by any chance?

Thanks.

@JorTurFer
Member

Are there any workarounds, by any chance?

No, there isn't any workaround because it's caused by a bug. This PR fixes it

@tomkerkhove
Member

Are you sure that is the issue? It looks like there is another underlying issue as well wrt the missing files

@tomkerkhove
Member

Are there any workarounds, by any chance?

No, there isn't any workaround because it's caused by a bug. This PR fixes it

To be clear, this is for the pod identity problem reported by @jmos5156 & @Dragonsong3k, which is tracked in #4026, not the issue @tshaiman is having.

@Dragonsong3k

@tomkerkhove thank you for the clarification.

@tshaiman
Author

tshaiman commented Jan 12, 2023

@tomkerkhove / @JorTurFer : I have another important update.
I have managed to reproduce the issue on an Ubuntu node pool, meaning this is not a Mariner issue after all.

The good news: I have step-by-step instructions for how you can easily recreate the bug
(I did that during my ski vacation on a private subscription ;-) )

  1. Create an AKS cluster, onboard it to Workload Identity, and make sure you choose version 1.24 (this is important for later steps).
  2. Use a single system node pool with D4as_v5 and 2 nodes (just a suggestion).
  3. Install KEDA 2.9.1, enable workload identity and assign a user-assigned managed identity.
  4. Deploy a simple Service Bus queue and create a basic ScaledObject for that queue.
  5. Sanity check: ensure everything works by placing a message on the queue and verifying your pods scale up.

===> now for the fun part
6. Upgrade your cluster from 1.24 to 1.25 using the Azure Portal, for both the control plane and the node pools (a command-line sketch of these steps follows the list).
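
A command-line sketch of the above (resource names, node sizes and exact patch versions are placeholders; --enable-workload-identity is the flag used elsewhere in this thread, and the OIDC issuer flag may vary by CLI version):

    az aks create -g <rg> -n <cluster> --kubernetes-version 1.24.6 \
      --enable-oidc-issuer --enable-workload-identity \
      --node-vm-size Standard_D4as_v5 --node-count 2
    helm install keda kedacore/keda --namespace keda --create-namespace
    # ... deploy the queue, TriggerAuthentication and ScaledObject, then verify scaling works ...
    az aks upgrade -g <rg> -n <cluster> --kubernetes-version 1.25.4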

As I always suspected, something happens when the KEDA pods need to move from one node to another due to system updates or node upgrades. This is the easiest way to recreate the issue, and it can also explain why it sometimes took days or more for the error to occur: the pods need to be evicted from one node to another.

Result: the same old "file was not found" issue.

@JorTurFer
Member

JorTurFer commented Jan 12, 2023

WOW!!
Thanks for the clear explanation 🙇
I'm trying to reproduce it just by draining the node; if draining alone doesn't work, tomorrow I'll try with the upgrade.
@kedacore/keda-core-contributors , I still think that this isn't related to KEDA itself, as we don't manage the token file at all; it's the workload identity webhook that does it. Should we escalate it within MSFT somehow?

@JorTurFer
Member

JorTurFer commented Jan 12, 2023

One thing @tshaiman : how many instances of azure-wi-webhook do you have? Is the pod properly mutated by the hook? I mean, the token isn't mounted, but does the pod manifest have the token defined?
Could it be that the KEDA pod is moved together with the workload identity webhook pods during the node draining? If this is the case, the mutating webhook could be skipped (because there isn't any wi webhook instance ready to execute it), not mounting the volume (which is our case), but I'd say that in that case the pod manifest wouldn't have the token volume section either.

@JorTurFer
Member

JorTurFer commented Jan 12, 2023

I have reproduced the behaviour by draining a node in my cluster with 2 nodes, with 1 instance of the workload identity webhook and without a PDB. When I drain the node where KEDA and the wi webhooks are, the new KEDA pod is scheduled without the volume because, when the KEDA pods are scheduled, the wi mutating webhook is down.
I guess this isn't your case, because the default workload identity helm chart deploys 2 replicas + a PDB to ensure at least 1 replica is always available, but I prefer to ask just in case.
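
For completeness, the kind of PDB the chart ships looks roughly like this (namespace and label selector here are assumptions; match them to the actual labels on the webhook pods in your cluster):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: azure-wi-webhook
      namespace: kube-system          # assumption: the AKS add-on's namespace; the chart uses its own
    spec:
      minAvailable: 1
      selector:
        matchLabels:
          app: azure-wi-webhook       # assumption: use the real pod labels from your deployment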

@tshaiman
Author

@JorTurFer : I deployed workload identity as an AKS add-on (using az aks update --enable-workload-identity), not with the helm chart. This installs version 0.14, which isn't the latest (0.15 is the latest),
so I get 2 replicas of the azure-wi-webhook but no PDB is defined, which is strange since I do see that the PDB is an integral part of the helm chart.


Could that be the issue you've mentioned?
Regarding your question "Is the pod properly mutated by the hook?": I'm not sure I understand which pod you are referring to, the KEDA pod or the webhook pod? Can you provide some short examples of the commands that need to be run here?

@v-shenoy
Contributor

Very interesting discussion here. @JorTurFer I had a thought regarding your point about the lack of PDBs for the Workload Identity webhook pods, making them potentially unavailable during a node drain to mount the token volume onto the KEDA pods. The webhook is responsible for injecting the right environment variables as well as mounting the volume. How is it that the environment variables are present but not the volume?

@JorTurFer
Member

JorTurFer commented Jan 13, 2023

Okay, let me share my theory (it could be crazy):
As you have 2 nodes and the webhooks don't have PDBs, during the 2nd node draining you lose both instances at the same time that you move the KEDA pods. Why?
Let's say that you have 1 instance on node A and another on node B. During the node A update, both instances are moved to node B, together with the KEDA pods, to keep node A empty. During the node B draining, all KEDA instances are evicted, but so are the webhook instances, so there is a period during the process without any webhook available to mutate the pods.
How to validate my theory:
If you get the pods' YAML in the beginning, when they are working well, you should see one volume defined like this:

- name: azure-identity-token
  projected:
    sources:
    - serviceAccountToken:
        audience: api://AzureADTokenExchange
        expirationSeconds: 3600
        path: azure-identity-token
    defaultMode: 420

which is mounted in the pod like this:

- name: azure-identity-token
  readOnly: true
  mountPath: /var/run/secrets/azure/tokens

At this moment, the pods also have some environment variables defined (there may be extra ones or they may have different values):

- name: AZURE_CLIENT_ID
- name: AZURE_TENANT_ID
- name: AZURE_FEDERATED_TOKEN_FILE
  value: /var/run/secrets/azure/tokens/azure-identity-token
- name: AZURE_AUTHORITY_HOST
  value: https://login.microsoftonline.com/

These 3 things are added by the mutating webhook; this means that the deployment doesn't have them, only the pods, because it is the workload identity mutating webhook that adds them.
If my theory is correct and there is a period with the workload identity webhooks unavailable at the same time as the KEDA pods are being moved, you should check the new pods' YAML and these things should be missing.

As the lack of these things is the cause of the KEDA issues when using managed identities (because the SDKs use them), if my theory is correct, this can be the root cause.
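
A quick way to check this on a running cluster, reusing the operator label already shown in the logs above:

    kubectl get pod -n keda -l app=keda-operator -o yaml \
      | grep -E 'AZURE_FEDERATED_TOKEN_FILE|azure-identity-token'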

@tshaiman
Author

@JorTurFer : Your theory is CONFIRMED!
I re-created the scenario and have before/after YAML files.
Indeed, the pod YAML after the node drain has no AZURE_CLIENT_ID/TOKEN_FILE/TENANT_ID and does not have the volume.
Well done @JorTurFer .
Now -> let's push this forward as a critical bug to the WIF people.

Here is the pod definition AFTER:

    env:
    - name: WATCH_NAMESPACE
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: OPERATOR_NAME
      value: keda-operator
    - name: KEDA_HTTP_DEFAULT_TIMEOUT
      value: "3000"
    image: ghcr.io/kedacore/keda:2.9.1
    imagePullPolicy: Always
 
 ....
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-p5qf6
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true

So as you can see, both the volume and the ENV variables are missing.
Thanks a lot.

@v-shenoy
Contributor

Okay, let me share my theory (it could be crazy): as you have 2 nodes and the webhooks don't have PDBs, during the 2nd node draining you lose both instances at the same time that you move the KEDA pods. Why? Let's say that you have 1 instance on node A and another on node B. During the node A update, both instances are moved to node B, together with the KEDA pods, to keep node A empty. During the node B draining, all KEDA instances are evicted, but so are the webhook instances, so there is a period during the process without any webhook available to mutate the pods. How to validate my theory: if you get the pods' YAML in the beginning, when they are working well, you should see one volume defined like this:

- name: azure-identity-token
  projected:
    sources:
    - serviceAccountToken:
        audience: api://AzureADTokenExchange
        expirationSeconds: 3600
        path: azure-identity-token
    defaultMode: 420

which is mounted in the pod like this:

- name: azure-identity-token
  readOnly: true
  mountPath: /var/run/secrets/azure/tokens

At this moment, the pods also have some environment variables defined (there may be extra ones or they may have different values):

- name: AZURE_CLIENT_ID
- name: AZURE_TENANT_ID
- name: AZURE_FEDERATED_TOKEN_FILE
  value: /var/run/secrets/azure/tokens/azure-identity-token
- name: AZURE_AUTHORITY_HOST
  value: https://login.microsoftonline.com/

These 3 things are added by the mutating webhook; this means that the deployment doesn't have them, only the pods, because it is the workload identity mutating webhook that adds them. If my theory is correct and there is a period with the workload identity webhooks unavailable at the same time as the KEDA pods are being moved, you should check the new pods' YAML and these things should be missing.

As the lack of these things is the cause of the KEDA issues when using managed identities (because the SDKs use them), if my theory is correct, this can be the root cause.

Yeah, I understood this. My question (which in hindsight is dumb) was how it was possible for only the volume mount to be missing while the env variables were present. Clearly the env variables are also absent. And the file path read by KEDA is "", which obviously has no mounted token.

I think we should print the file path in the error message from our end when we are unable to read the token.

@v-shenoy
Contributor

@JorTurFer : Your theory is CONFIRMED! I re-created the scenario and have before/after YAML files. Indeed, the pod YAML after the node drain has no AZURE_CLIENT_ID/TOKEN_FILE/TENANT_ID and does not have the volume. Well done @JorTurFer . Now -> let's push this forward as a critical bug to the WIF people.

Here is the pod definition AFTER:

    env:
    - name: WATCH_NAMESPACE
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: OPERATOR_NAME
      value: keda-operator
    - name: KEDA_HTTP_DEFAULT_TIMEOUT
      value: "3000"
    image: ghcr.io/kedacore/keda:2.9.1
    imagePullPolicy: Always
 
 ....
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-p5qf6
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true

So as you can see, both the volume and the ENV variables are missing. Thanks a lot.

So, essentially WI needs to have PDBs deployed as part of the AKS add-on, which is already done when it is installed via the Helm chart.

@v-shenoy
Contributor

So, now that we have pinpointed this issue to node drains / missing PDBs, do you have any ideas as to why it was initially only happening on Mariner? @tshaiman @JorTurFer

@tshaiman
Author

@v-shenoy : it wasn't always happening on Mariner; we came to the wrong conclusion because the trigger that caused the node drain was the switch to Mariner.

My colleague @yehiam livneh has added the following suggestion for the KEDA team:

“It's also a KEDA issue since they don't have any liveness probes. They don't know that they lost their identity; if they had a liveness probe they would have restarted and everything would be okay.”

I agree with his statement.

@JorTurFer
Member

“It's also a KEDA issue since they don't have any liveness probes. They don't know that they lost their identity; if they had a liveness probe they would have restarted and everything would be okay.”

I agree with his statement.

Maybe we could somehow include a dynamic livenessProbe based on the required resources, e.g. if there is any TriggerAuthentication with workload identity (Azure/GCP) or AWS role assumptions, we could check those files as part of the probe.
TBH, I don't like this idea, because a single failing trigger could force a crash loop while KEDA itself works; the problem isn't that KEDA fails or is dead, KEDA works but the trigger fails.
Maybe we could add a log trace saying that the file doesn't exist, but that message is already available: error reading service account token - open : no such file or directory

@tshaiman
Author

@JorTurFer I see your point, but as a reminder, the pod does go back to healthy once it is restarted, since the volume mount works then.
I do believe this is the exact role of health checks: to
report that the pod cannot operate if it really can't.
Today the health check is always “true”, leaving it with no real value.

Since we already have the log + Azure alerts on top of the logs, I think the real impact here would be to add a metric for such a scenario, so that Prometheus alerts can leverage it.

Having said that, this is indeed not a KEDA error.
Can we now join hands to work with the WIF people on it?

@JorTurFer
Member

JorTurFer commented Jan 13, 2023

I do believe this is the exact role of health checks: to
report that the pod cannot operate if it really can't.
Today the health check is always “true”, leaving it with no real value.

You are right, but let's say you have an error with workload identity (any persistent issue): KEDA won't start because it will be restarting all the time. If it's a transient error, nothing happens, but with other errors it could be a pain. If all your workloads use WI this can make sense, but if you have multiple triggers with and without WI it doesn't. For example, we have a product with 15-20 Prometheus triggers and only 1 case with WI integration, for reading Azure Event Hubs topics; in our case this could be a problem.

The operator is the one that requests the metrics from the upstreams since 2.9, so we cannot restart it all the time unless we are 100% sure, because the metrics server won't respond to the HPA controller.

Maybe we could add a new parameter in the TriggerAuthentication for including the podIdentity in the health probes, but this is risky because one development team could impact other teams. In a multi-tenant scenario I see this use case better.

@kedacore/keda-core-contributors WDYT?

@v-shenoy
Contributor

I do believe this is the exact role of health checks: to
report that the pod cannot operate if it really can't.
Today the health check is always “true”, leaving it with no real value.

You are right, but let's say you have an error with workload identity (any persistent issue): KEDA won't start because it will be restarting all the time. If it's a transient error, nothing happens, but with other errors it could be a pain. If all your workloads use WI this can make sense, but if you have multiple triggers with and without WI it doesn't. For example, we have a product with 15-20 Prometheus triggers and only 1 case with WI integration, for reading Azure Event Hubs topics; in our case this could be a problem.

The operator is the one that requests the metrics from the upstreams since 2.9, so we cannot restart it all the time unless we are 100% sure, because the metrics server won't respond to the HPA controller.

Maybe we could add a new parameter in the TriggerAuthentication for including the podIdentity in the health probes, but this is risky because one development team could impact other teams. In a multi-tenant scenario I see this use case better.

@kedacore/keda-core-contributors WDYT?

I agree with this. This needs to be discussed thoroughly.

@tshaiman
Author

@v-shenoy @JorTurFer : another input that caused me to look at things differently, yet again.

  1. I have installed Workload Identity with the latest helm chart -> a PDB is now installed with min=1 on the webhook.
  2. There is only 1 system node pool, which is where KEDA is installed.
  3. Stopped the cluster and restarted it again using the Portal.
  4. KEDA -> loses the token again: the infamous "no such file" error, no volume and no environment variables.
  5. Restarting the KEDA pod restores everything: the volume and the env variables.

So this makes me think that there is something wrong in the way the KEDA pod starts itself; it will happen in all scenarios where the KEDA pod is evicted from the node, regardless of whether there is a PDB or not.

I mean, the fact that the KEDA pod does not have the ENV variables when it is evicted from one node pool to the other
is a big warning sign here, don't you think?

@tshaiman
Author

tshaiman commented Jan 15, 2023

Since the workload identity team has shared the following information, I'm closing this ticket as it's not a KEDA issue:

“… the webhook uses the fail policy Ignore so as not to block pod admission. So if the KEDA pods are run before the webhook is running, they'll not be mutated. As mentioned in the thread before… we are adding an object selector and mandating a label on pods to enforce pod admission through the webhook.”

I will open another issue to request the ability to do health checks based on the existence of the token file; since the base image is distroless, we cannot use the “cat” command (see the sketch below).
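
A hypothetical sketch of what such a check could look like inside the operator, with no shell dependency (this is not an existing KEDA API, just an illustration):

    package health

    import (
        "fmt"
        "os"
    )

    // tokenFileHealthy fails the probe when a federated token file is configured
    // for the pod (via AZURE_FEDERATED_TOKEN_FILE) but is missing on disk.
    func tokenFileHealthy() error {
        path := os.Getenv("AZURE_FEDERATED_TOKEN_FILE")
        if path == "" {
            return nil // workload identity is not configured for this pod
        }
        if _, err := os.Stat(path); err != nil {
            return fmt.Errorf("federated token file %q is not readable: %w", path, err)
        }
        return nil
    }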
