Keda 2.5 does not cleanly update from 2.4 #2381
Comments
Hi, |
This issue resurfaced for me after a delay of several hours. This is the second time it has surfaced, each time failing several hours after the initial release. After the first incident, I deleted and recreated all ScaledObjects and HPA objects while already deployed to keda 2.5, to ensure that no potentially stale references were left over. Since the issue has now occurred a second time in multiple environments, this did not help. I support many environments, and across them I see two sets of behaviours:
For scenario 1) I have the following example where I DO see a mismatch between the enumeration of the available resources and what's actually queryable. This behaviour is consistent and reproducible across
Just as part of writing this up, I note that if I restart the (already 2.5) keda metrics API server, it begins returning the correctly named metrics when enumerating; data works properly just the same
In 2) where keda actually fails I receive the following error messages and an inability to query the metrics.
Unfortunately I don't have a live example of the API output from today, but when I was investigating this for the first time I had the following unusual output. I now wonder if there is a cache expiring, or maybe a leader election changing of some sort that is causing a revert of the metric names. I do believe the two metrics below of the new format were fixed as the result of deleting and re-creating the ScaledObject definition. I wonder if restarting the metrics API server would have again refreshed the metrics to the point they resolve correctly. But needing to restart keda every few hours remains undesirable behaviour :)
|
But KEDA doesn't expose the same metric in different ways depending on the lifetime. Could you maybe have more than 1 instance? The index is calculated inside the scaler in all
Even if the index had been wrong, you should have all of them with s0-xxxx 🤔 I have tried on an EKS v1.21 cluster and it works correctly, and the behavior is the same on an AKS v1.20 cluster. |
Just to double-check, you have tried deleting and recreating the ScaledObjects, right? I mean, in the clusters where you have problems, did you also delete and recreate them? |
Could you run |
|
The images look good :/ It's weird because the difference is not at trigger level, it's at ScaledObject level, and I updated KEDA in our (company) clusters without any issues: basically the new version of the operator updated the HPAs and the new version of the metrics server exposes them. We use RabbitMQ triggers but, as I said, the change is at ScaledObject level 🤔 Maybe there is some specific behavior in the Prometheus scaler, but I don't think so... Let's see tomorrow |
Hi there. Has there been any update on this issue? We're also seeing a similar issue. We have 3 clusters where we have upgraded from 2.4.0 to 2.5.0, and two of them are producing errors.
|
🤦 |
For this particular metric, when querying the metrics manually, we're seeing s0 instead of s1 in the metric name...
FYI, we are seeing issues with multiple metrics |
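The s0-versus-s1 symptom above can be spotted mechanically. The following is only a minimal sketch: it assumes the new-style names follow the `s<index>-<rest>` shape suggested by the `s0-xxxx` example in this thread, and the helper names and the sample metric name are hypothetical, not part of KEDA.

```python
import re

# Assumption: new-style metric names look like "s<index>-<rest>",
# as suggested by the "s0-xxxx" example in this thread.
_INDEX_PREFIX = re.compile(r"^s(\d+)-(.+)$")


def split_indexed_name(name):
    """Return (index, base) for "s<index>-<base>", or (None, name) if unprefixed."""
    match = _INDEX_PREFIX.match(name)
    if match is None:
        return None, name
    return int(match.group(1)), match.group(2)


def index_mismatches(expected, served):
    """Map each expected name to served names sharing its base but not its full name.

    `expected` holds the names the HPA queries; `served` holds the names the
    metrics server enumerates. Names that match exactly are not reported.
    """
    served_by_base = {}
    for name in served:
        _, base = split_indexed_name(name)
        served_by_base.setdefault(base, []).append(name)

    mismatches = {}
    for name in expected:
        _, base = split_indexed_name(name)
        candidates = served_by_base.get(base, [])
        if candidates and name not in candidates:
            mismatches[name] = sorted(candidates)
    return mismatches


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(index_mismatches(["s1-example-metric"], ["s0-example-metric"]))
```

Feeding it the names the HPA asks for and the names the metrics server enumerates lists only the metrics whose index differs, which narrows down which ScaledObjects are affected.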
Could your problem be related to this? |
^ That actually does sound somewhat related. I definitely observed mismatches between what the metricNames were according to the metricsServer and what the HPA loop was actually querying. |
okay, there are 2 different workarounds if you are affected by this error:
Could you check whether these workarounds mitigate the problem? That would tell us whether this is the root cause before we invest time digging into the problem. |
First I'll need to see if I can actually find a way to reproduce this issue reliably: So far I've only been able to trigger it in my largest 4 environments :) I'll see about trying to reproduce this today in a dev space. If I find a way to reproduce, what I'll actually do is |
nice! |
Our team will be testing this in the new year. |
@glassnick Did you manage to give this a try? |
I tried to reproduce the error that I had originally been seeing, but I was unable to. I had observed the error only in my largest environments, so I suspected that it was related to the number of objects in the cluster. Unfortunately, just throwing large numbers of objects at a dev environment was not enough to reproduce at least the error cases I had seen. My next step is kind of the nuclear one: running keda under the delve debugger in my largest environments, where I can reproduce the bug, and breakpointing exactly what's failing. It might be a while before I have time to continue at that level of debugging for my own particular issues though. |
Report
There appears to be a bug that prevents a clean and safe upgrade from keda 2.4 to keda 2.5, possibly related to this PR which changed metric names or this one. This affects pre-existing ScaledObjects that are present at the time of the upgrade from 2.4.
The symptom would be that the HPA loop would be attempting to evaluate a metric which did not actually exist within the Kubernetes external metrics API. Below is a snippet of the output of a kubectl describe hpa where the new keda 2.5 format of metrics would be queried and the log output of the external-metrics API
Reverting to Keda 2.4 would immediately fix the issue and resume using the old names.
When in an errored state, remediation was possible by deleting all ScaledObjects and recreating them. This appeared to cause a reconciliation for the recreated scaledObject to the point the new-style metric becomes available.
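A mismatch like the one described above can be checked offline by diffing two things: the external metric names an HPA asks for, and the names listed in the discovery document of the external metrics API (the `APIResourceList` served at `/apis/external.metrics.k8s.io/v1beta1`). The sketch below is an assumption-laden illustration rather than anything from KEDA itself: the function names are hypothetical, the HPA is assumed to be in the `autoscaling/v2beta2` shape, and the sample metric names are made up.

```python
import json


def served_metric_names(api_resource_list_json):
    """Names listed in an APIResourceList JSON document (external metrics discovery)."""
    doc = json.loads(api_resource_list_json)
    return {resource["name"] for resource in doc.get("resources", [])}


def hpa_metric_names(hpa):
    """External metric names an HPA object (autoscaling/v2beta2 shape) will query."""
    names = set()
    for metric in hpa.get("spec", {}).get("metrics", []):
        if metric.get("type") == "External":
            names.add(metric["external"]["metric"]["name"])
    return names


def missing_metrics(hpa, api_resource_list_json):
    """Metric names the HPA queries that the metrics server does not enumerate."""
    return sorted(hpa_metric_names(hpa) - served_metric_names(api_resource_list_json))


# Hypothetical sample data, for illustration only.
SAMPLE_DISCOVERY = json.dumps(
    {"kind": "APIResourceList", "resources": [{"name": "old-style-metric"}]}
)
SAMPLE_HPA = {
    "spec": {
        "metrics": [
            {"type": "External", "external": {"metric": {"name": "s0-new-style-metric"}}}
        ]
    }
}

if __name__ == "__main__":
    print(missing_metrics(SAMPLE_HPA, SAMPLE_DISCOVERY))
```

In practice the two inputs could come from `kubectl get hpa -o json` and `kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"`; a non-empty result means the HPA loop is asking for something the metrics server does not list.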
Expected Behavior
Expect Keda 2.5 to immediately and reliably work out of the box for existing scaledObject definitions.
Actual Behavior
Upgrading from Keda 2.4 to Keda 2.5 is disruptive for pre-existing scaledObject-managed HPAs. New-style metrics are inaccessible from the keda metrics API server.
Steps to Reproduce the Problem
This issue may be difficult to reproduce. It occurred in only 2 of my 30 kubernetes clusters, but it happened consistently within those 2. I am entirely unclear as to why those 2 clusters persistently had the issue: they should be configured identically to the rest.
Logs from KEDA operator
KEDA Version
2.5.0
Kubernetes Version
1.20
Platform
Other
Scaler Details
Prometheus
Anything else?
No response