Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workload Identity on Azure stopped working on 2.9.2 #4123

Closed
tshaiman opened this issue Jan 16, 2023 · 21 comments · Fixed by #4125
Closed

Workload Identity on Azure stopped working on 2.9.2 #4123

tshaiman opened this issue Jan 16, 2023 · 21 comments · Fixed by #4125
Assignees
Labels
bug Something isn't working

Comments

@tshaiman
Copy link

Report

Took 2.9.2 for a quick ride .
configuration is :

  • workload identity enabled : true
  • Pod Identity enabled : false

Expected Behavior

identity should work as it did for Version 2.8.1

Actual Behavior

there is a regression in the code where Azure workload Identity does not work any more without Pod Identity.

Steps to Reproduce the Problem

  1. install Keda 2.9.2 with Workload identity enabled but without Pod Identity :

podIdentity:
activeDirectory:
identity: ""
azureWorkload:
enabled: true
clientId: "some_client_id"
tenantId: "some_tenant_id

you will see the error below , keda cannot authenticate.

  1. Go Back to Keda 2.8.1 , all works fine

Logs from KEDA operator

2023-01-16T20:18:09Z	ERROR	scalers_cache	error getting scale decision	{"scaledobject.Name": "app-scaler", "scaledObject.Namespace": "app", "scaleTarget.Name": "app-deploy", "error": "GET https://workload.servicebus.windows.net/app-queue\n--------------------------------------------------------------------------------\nRESPONSE 401: 401 Unauthorized\nERROR CODE: 401\n--------------------------------------------------------------------------------\n<Error><Code>401</Code><Detail>Manage,EntityRead claims required for this operation. TrackingId:e7629186-6733-4ff9-b463-840bda671846_G31, SystemTracker:workload.servicebus.windows.net:app-queue, Timestamp:2023-01-16T20:18:09</Detail></Error>\n--------------------------------------------------------------------------------\n"}
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetScaledObjectState
	/workspace/pkg/scaling/cache/scalers_cache.go:155
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
	/workspace/pkg/scaling/scale_handler.go:360
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
	/workspace/pkg/scaling/scale_handler.go:162

KEDA Version

2.9.2

Kubernetes Version

1.24

Platform

Microsoft Azure

Scaler Details

Azure Service Bus

Anything else?

I know there was a bug fix to merge some Pod Identity + Workload Identity , make sure the tests covers cases where :

  • WIF is on , Pod Identity is Off,
  • Pod Identity is on , WIF is Off.
  • Both Are on.
@tshaiman tshaiman added the bug Something isn't working label Jan 16, 2023
@zroubalik
Copy link
Member

@JorTurFer PTAL ^

Thanks for the quick report!

@JorTurFer
Copy link
Member

Do you have both tools installed with different identities assigned?

@JorTurFer
Copy link
Member

We introduced a chained token provider to simplify the code (as Microsoft does with DefaultAzureCrendential), basically it does a fallback from one to another in case of failure, but it's true they can conflict between them (and I didn't notice it, sorry).
Does split them again into 2 chained token credential make sense?

@JorTurFer
Copy link
Member

Anything else?

I know there was a bug fix to merge some Pod Identity + Workload Identity , make sure the tests covers cases where :

  • WIF is on , Pod Identity is Off,
  • Pod Identity is on , WIF is Off.
  • Both Are on.

It's not easy to test all the scenarios because we install both identity providers as dependencies in the beginning, we cannot remove them because that can affect to other tests, I tested them during the development of the change, but it's true I used the same identity for both, and that can hide the failure

@JorTurFer
Copy link
Member

The only option I can see is to split them again, adding a note explaining why we keep them split. Pod identity will be deprecated at the end of this year, so the duplication "has a removal date". WDTY @kedacore/keda-core-contributors @v-shenoy ?

@JorTurFer JorTurFer self-assigned this Jan 17, 2023
@JorTurFer JorTurFer moved this from Proposed to In Review in Roadmap - KEDA Core Jan 17, 2023
@github-project-automation github-project-automation bot moved this from In Review to Ready To Ship in Roadmap - KEDA Core Jan 17, 2023
@tshaiman
Copy link
Author

this is a critical bug, any ETA for its shipping ?
waiting for 2.10 may be in long time
any chance this will come out as hot fix ?

@JorTurFer
Copy link
Member

JorTurFer commented Jan 23, 2023

hi @tshaiman
We are discussion if we will do a hotfix release or not, but just for being 100% sure if it will work, I'm building and pushing the multi-arch images with the same code that would be released (I have just cherry picked the commit into release/v2.9 branch and build the images from there).
Are you willing to test them? at least to double-check if they work or not, if they solve the issue, we can proceed with another hotfix release soon, but just to be 100% sure this time, as this is the only issue for the hotfix release.
tags are these:

  • docker.io/jorturfer/keda:2.9.3
  • docker.io/jorturfer/keda-metrics-apiserver:2.9.3

As I have said, both have been generated from #4156

@tshaiman
Copy link
Author

@JorTurFer : gladly , I'm on it now , will let you know soon

@tshaiman
Copy link
Author

@JorTurFer / @tomkerkhove : For now it seems to be working with AzureIdentity , I would like to park it for 12-24H on Test cluster , and another deploy with only Pod-Identity for some more time -> and then I can say for sure its finalized.
sounds good ?

p.s: the helm chart i took is 2.9.3 chart , but the appVersion there is 2.9.2 causing it to be with misallignment with the tag :

image: {{ .Values.image.keda.repository }}:{{ .Values.image.keda.tag | default .Chart.AppVersion }}

wil translate to : docker.io/jorturfer/keda:2.9.2 not docker.io/jorturfer/keda:2.9.3

@JorTurFer
Copy link
Member

Hey,
Sure, take your time! as I said, if we finally release another hotfix, we'd like to be 100% sure about the solution.

p.s: the helm chart i took is 2.9.3 chart , but the appVersion there is 2.9.2 causing it to be with misallignment with the tag :

We have different versioning for them because we have different release flows and periods, but it's true that the misallignment is confusing, I can regenerate the image with the new tag temporally if it's easier for you

@tshaiman
Copy link
Author

@JorTurFer : no need , I have it ready already.

any updates on our ask to use CAT / Healthcheck ?

@JorTurFer
Copy link
Member

any updates on our ask to use CAT / Healthcheck ?

Not from my side, but you can ask in that issue directly :)

@pinkfloydx33
Copy link

The image docker.io/jorturfer/keda:2.9.3 fixed this for us. Oddly it was only our azure-servicebus scaler that was having the issue. All other azure scalers (mostly azure-queue) were working just fine. We were not using both mechanisms for authentication, azure-workload only

I see this is closed... will there be a hotfix release for it? We're ok using unofficial image for validation/test purposes... but we can't operate that way normally.

@JorTurFer
Copy link
Member

@tshaiman is also testing the solution in his environment. Once we know if this works or not, we will decide about the hotfix release

@tshaiman
Copy link
Author

tshaiman commented Jan 25, 2023

@JorTurFer : when the Pod runs with WorkloadIdentity -> all is fine after 24H no errors detected
but when moving to Pod-Identity Deployment we get many errors :

keda-operator-747869597-w4zcp keda-operator 2023-01-25T14:16:13Z        ERROR   scalers_cache   error getting scale decision    {"scaledobject.Name": "frameextraction", "scaledObject.Namespace": "vi-be-map", "scaleTarget.Name": "vi-frameextraction", "error": "ChainedTokenCredential: failed to acquire a token.\nAttempted credentials:\n\tmanaged identity timed out"}
keda-operator-747869597-w4zcp keda-operator github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetScaledObjectState
keda-operator-747869597-w4zcp keda-operator     /workspace/pkg/scaling/cache/scalers_cache.go:155
keda-operator-747869597-w4zcp keda-operator github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
keda-operator-747869597-w4zcp keda-operator     /workspace/pkg/scaling/scale_handler.go:360
keda-operator-747869597-w4zcp keda-operator github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
keda-operator-747869597-w4zcp keda-operator     /workspace/pkg/scaling/scale_handler.go:162
keda-operator-747869597-w4zcp keda-operator 2023-01-25T14:16:13Z        INFO    scaleexecutor   Successfully set ScaleTarget replicas count to ScaledObject fallback.replicas   {"scaledobject.Name": "frameextraction", "scaledObject.Namespace": "vi-be-map", "scaleTarget.Name": "vi-frameextraction", "Original Replicas Count": 5, "New Replicas Count": 5}

So we saw that on Version 2.9.2 as well , meaning problem is not fixed

@JorTurFer
Copy link
Member

Do you have any log in the pod identity pods? The error says that after 3 seconds the pod identity hasn't responded (or is unreachable) but it has tried the connection. This timeout is managed as an HTTP Timeout
, could you try increasing from 3 seconds to 10-15? just to double-check if it works.
I'll check it again on my own cluster and also with automated e2e tests

@tshaiman
Copy link
Author

False Alarm !
sorry for this misleading information - cluster was configured incorrectly.
experiment begins now for next 24H and it looks good !

stay tuned

@tomkerkhove
Copy link
Member

SweatyHeatGIF

@tshaiman
Copy link
Author

@tomkerkhove @JorTurFer : I can confirm now after several days with this version that its stable and working on all modes:

  • workload identity turned on
  • pod Identity Turned On
  • having mixed mode : keda has workload identity+pod identity and the trigger authenticatiuon uses PodIdentity.

all combinations worked with no issues.

@tomkerkhove
Copy link
Member

🎉🎉🎉

@JorTurFer
Copy link
Member

@tshaiman , KEDA v2.9.3 has just been released.

@JorTurFer JorTurFer moved this from Ready To Ship to Done in Roadmap - KEDA Core Mar 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants