Keda 2.5 does not cleanly update from 2.4 #2381

bpinske · 2021-12-03T22:58:06Z

Report

There appears to be a bug that prevents a clean and safe upgrade from keda 2.4 to keda 2.5, possibly related to this PR which changed metric names, or this one. It affects pre-existing ScaledObjects that are present at the time of the upgrade from 2.4.

The symptom is that the HPA loop attempts to evaluate a metric which does not actually exist within the Kubernetes external metrics API. Below is a snippet of the output of kubectl describe hpa, where the new keda 2.5 metric name format is queried, followed by the output of the external metrics API.

Metrics:                                               ( current / target )
  "s1-prometheus-burrow_lag" (target average value):   <unknown> / 120M
  resource cpu on pods  (as a percentage of request):  108% (3267m) / 100%

  Warning  FailedGetExternalMetric  3m6s (x791 over 3h22m)  horizontal-pod-autoscaler  unable to get external metric s001/s1-prometheus-burrow_lag/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name:,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-prometheus-burrow_lag

kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1/namespaces/s001/s1-prometheus-burrow_lag?labelSelector=app=myApp' | jq
Error from server: No matching metrics found for s1-prometheus-burrow_lag
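A minimal sketch of how the namespace and metric name can be pulled out of a FailedGetExternalMetric message and turned into the raw API path queried above. The helper names (parse_failed_metric, raw_metric_path) are hypothetical, and the regular expression only assumes the message layout shown in this report; none of this is part of KEDA or kubectl.

```python
import re

# Assumed layout: "unable to get external metric <namespace>/<metric>/&LabelSelector{...}"
EVENT_RE = re.compile(
    r"unable to get external metric (?P<namespace>[^/\s]+)/(?P<metric>[^/\s]+)/"
)


def parse_failed_metric(message):
    """Return (namespace, metric) from an HPA warning message, or None if it does not match."""
    match = EVENT_RE.search(message)
    if match is None:
        return None
    return match.group("namespace"), match.group("metric")


def raw_metric_path(namespace, metric, label_selector=None):
    """Build the external metrics API path in the same form as the kubectl --raw query above."""
    path = f"/apis/external.metrics.k8s.io/v1beta1/namespaces/{namespace}/{metric}"
    if label_selector:
        # The selector is appended unescaped, matching how it is written in this report.
        path += f"?labelSelector={label_selector}"
    return path
```

The resulting path can be passed to kubectl get --raw to check whether the metrics server resolves the exact name the HPA is asking for.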

Reverting to Keda 2.4 immediately fixed the issue, and the old names were used again.
While in the errored state, remediation was possible by deleting all ScaledObjects and recreating them. This appeared to trigger a reconciliation for each recreated ScaledObject, after which the new-style metric became available.

Expected Behavior

Expect Keda 2.5 to immediately and reliably work out of the box for existing scaledObject definitions.

Actual Behavior

Upgrading from Keda 2.4 to Keda 2.5 is disruptive for pre-existing ScaledObject-managed HPAs. New-style metrics are inaccessible from the keda metrics API server.

Steps to Reproduce the Problem

Deploy Keda 2.4
Create ScaledObjects that use Prometheus triggers
Upgrade to Keda 2.5

This issue may be difficult to reproduce. It only occurred in 2 out of my 30 Kubernetes clusters, but it happened consistently within those 2. I am entirely unclear as to why those 2 clusters persistently had the issue: they should be configured identically to the rest.

Logs from KEDA operator

 Warning  FailedGetExternalMetric  3m6s (x791 over 3h22m)  horizontal-pod-autoscaler  unable to get external metric s001/s1-prometheus-burrow_lag/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name=myApp:,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-prometheus-burrow_lag

KEDA Version

2.5.0

Kubernetes Version

1.20

Platform

Other

Scaler Details

Prometheus

Anything else?

No response

JorTurFer · 2021-12-05T22:46:43Z

Hi,
The change in the name is the expected behavior. Is the metric server updated to v2.5 too?
If you query the available metrics manually, are you getting the old names? The update should be automatic in both cases: the operator should update the HPA and the metrics server should expose the metric without any extra action on your side.

bpinske · 2021-12-05T23:58:28Z

This issue resurfaced again, after a delay of several hours. This is the second time it has surfaced, each time failing several hours after the initial release. After the first incident, I deleted and recreated all ScaledObjects and HPA objects while already running keda 2.5, to ensure that no potentially stale references were left over. Since the issue has now occurred a second time in multiple environments, that has not helped.

I support many environments, and across them I see two sets of behaviours:

1) One where keda 2.5 works without issue
2) One where keda 2.5 works for approximately 9 hours before the new-style metrics begin failing to resolve. This has happened 5 times now across 3 days and 3 environments.

For scenario 1) I have the following example where I DO see a mismatch between the enumeration of the available resources and what's actually queryable. This behaviour is consistent and reproducible across

(⎈)➜  ~ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "prometheus-https---thanos-example-com-burrow_lag",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    },
}

(⎈)➜  ~ kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1/namespaces/pool/s0-prometheus-burrow_lag?labelSelector=scaledobject.keda.sh/name=myApp' | jq
{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metricName": "s0-prometheus-burrow_lag",
      "metricLabels": null,
      "timestamp": "2021-12-05T23:30:10Z",
      "value": "0"
    }
  ]
}

While writing this up, I noticed that if I restart the (already 2.5) keda metrics API server, it begins returning the correctly named metrics when enumerating. Querying the data works properly just the same.

(⎈)➜  ~ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "s0-prometheus-burrow_lag",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }

In 2), where keda actually fails, I receive the following error messages and am unable to query the metrics.

apiVersion="autoscaling/v2beta2" type="Warning" reason="FailedGetExternalMetric" message="unable to get external metric s001/s2-prometheus-burrow_lag_sensor/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s2-prometheus-burrow_lag_sensor"

Unfortunately I don't have a live example of the API output from today, but when I was investigating this for the first time I had the following unusual output. I now wonder if there is a cache expiring, or maybe a leader election change of some sort, that is causing the metric names to revert. I do believe the two new-format metrics below were fixed as a result of deleting and re-creating the ScaledObject definition. I wonder if restarting the metrics API server would again have refreshed the metrics to the point where they resolve correctly. But needing to restart keda every few hours remains undesirable behaviour :)

Friday morning example (broken):

kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1' | jq '.resources[].name'
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag_sensor"
"prometheus-https---thanos-example-com-burrow_lag"
"s0-prometheus-burrow_lag_sensor"
"s1-prometheus-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
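To quantify the mix of old-style and new-style names in an enumeration like the one above, a rough sketch follows. It assumes that new 2.5 names start with an s<index>- prefix (as in s0-prometheus-burrow_lag) and treats everything else as old-style. The function name count_name_styles is hypothetical, and the input is the JSON text returned by the /apis/external.metrics.k8s.io/v1beta1 query.

```python
import json
import re
from collections import Counter

# Assumption: new-style names carry an "s<index>-" prefix; anything else is old-style.
INDEXED_PREFIX = re.compile(r"^s\d+-")


def count_name_styles(api_resource_list_json):
    """Count old-style and new-style metric names in an APIResourceList document."""
    document = json.loads(api_resource_list_json)
    counts = Counter({"old": 0, "new": 0})
    for resource in document.get("resources", []):
        style = "new" if INDEXED_PREFIX.match(resource["name"]) else "old"
        counts[style] += 1
    return counts
```

A result with both counts non-zero would indicate the kind of mixed enumeration shown in the broken example.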

JorTurFer · 2021-12-06T00:30:57Z

But KEDA doesn't expose the same metric in different ways depending on its lifetime. Could you perhaps have more than one instance running? The index is calculated inside the scaler in every GetMetricSpecForScaling and it's evaluated internally, so I can't understand, for example, this output:

kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1' | jq '.resources[].name'
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag_sensor"
"prometheus-https---thanos-example-com-burrow_lag"
"s0-prometheus-burrow_lag_sensor"
"s1-prometheus-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"

Even if the index had been wrong, all of the names should have the s0-xxxx form 🤔
I'm not sure whether the metrics are cached at the k8s level (maybe they are, and that's the problem). I know that the KEDA Metrics Server caches the metric name, but again, the metric name should contain sx-xxxx.
Do you know more about this @zroubalik @coderanger?

I have tried it on EKS v1.21 and it works correctly, and I see the same behavior on AKS v1.20.
I'm not able to reproduce it :(

JorTurFer · 2021-12-06T00:34:31Z

Just to double-check, you have tried deleting and recreating the ScaledObjects, right? I mean, in the clusters where you have problems, did you also delete and recreate them?

bpinske · 2021-12-06T00:36:33Z

Yes, I had deleted and recreated the ScaledObjects after upgrading to keda 2.5.

Here is a screenshot of the metric values over time. Note the flatline occurring in the middle of the night. During that flatline we were receiving the error message below, which continued until we reverted to keda 2.4, at which point metrics began to flow again using the old naming convention.

During the flatline, from the moment the issue starts, the metrics server throws 500s because it's unable to resolve requests for the metric name.

  Warning  FailedGetExternalMetric  3m6s (x791 over 3h22m)  horizontal-pod-autoscaler  unable to get external metric s001/s1-prometheus-burrow_lag/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name:,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-prometheus-burrow_lag

Given that I seem pretty capable of reproducing this issue in production :(, is there a documented guide anywhere for how to attach gdb or delve to the metricsServer right on the server when the issue arises? Or where can I find symbols to do that?

JorTurFer · 2021-12-06T00:48:12Z

Could you run kubectl get pods -n {KEDA_NAMESPACE} -o jsonpath="{..imageID}" and paste the output please?

bpinske · 2021-12-06T00:53:28Z

kedacore/keda@sha256:8fba3ab792c0e9d14ab046cda739e0925a39277c991122fd40474a59958bbd19 
ghcr.io/kedacore/keda-metrics-apiserver@sha256:77e4967dc13cb8b3c6f1dcf0b6c5ad9e44e09daa621b567d73f7318627551756

JorTurFer · 2021-12-06T01:09:06Z

The images look good :/
Tomorrow I will prepare an environment with KEDA v2.4 and I will try to reproduce the problem with any ScaledObject with Prometheus triggers updating to v2.5 (here it's 2h00 now).

It's weird because the difference is not at trigger level, it's at ScaledObject level and I updated KEDA in our (company) clusters without any issues, basically the new version of the operator updated the HPAs and the new version of the metrics server exposes them. We use RabbitMQ triggers but as I said, the change is at ScaledObject level 🤔

Maybe there is some specific behavior in the Prometheus scaler, but I don't think so... Let's see tomorrow.

glassnick · 2021-12-23T10:56:15Z

Hi there.

Has there been any update on this issue?

We're also seeing a similar issue. We have 3 clusters where we have upgraded from 2.4.0 to 2.5.0, and two of them are producing errors.
We are using Azure Service Bus for the events, and we are getting the output below.
As you can see, the metric "s1-azure-servicebus-st-xxx" is showing as 'unknown'.

kubectl describe hpa keda-hpa-file-xxx

Name:                                                              keda-hpa-file-xxx
Namespace:                                                         default
Labels:                                                            app.kubernetes.io/managed-by=Helm
                                                                   scaledobject.keda.sh/name=file-xxx
Annotations:                                                       <none>
CreationTimestamp:                                                 Fri, 10 Dec 2021 11:51:54 +0000
Reference:                                                         Deployment/file-xxx
Metrics:                                                           ( current / target )
  "s1-azure-servicebus-st-xxx" (target average value):   <unknown> / 5
  "s1-azure-servicebus-mi-xxx" (target average value):  0 / 5
Min replicas:                                                      1
Max replicas:                                                      15
Deployment pods:                                                   1 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from external metric s1-azure-servicebus-mi-xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: file-xxx,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:
  Type     Reason                   Age                    From                       Message
  ----     ------                   ----                   ----                       -------
  Warning  FailedGetExternalMetric  34s (x782 over 3h18m)  horizontal-pod-autoscaler  unable to get external metric default/s1-azure-servicebus-st-xxx/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: file-xxx,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-azure-servicebus-st-xxx

JorTurFer · 2021-12-23T11:20:44Z

🤦
I have to apologize, I didn't have much time and I totally forgot about this issue :(
I will take a look during the next week
Sorry

glassnick · 2021-12-23T11:46:01Z

For this particular metric, when querying the metrics manually, we're seeing s0 instead of s1 in the metric name...

"s0-azure-servicebus-st-xxx"

FYI, we are seeing issues with multiple metrics
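A small sketch of how this kind of mismatch could be detected automatically: it compares the metric names an HPA requests with the names the metrics server exposes, ignoring the s<index>- prefix. The function names and the input lists are hypothetical; in practice the two lists would come from the HPA spec and from the external metrics API enumeration.

```python
import re

# Assumption: indexed names look like "s<index>-<base name>".
INDEXED_NAME = re.compile(r"^s(\d+)-(.+)$")


def base_name(metric_name):
    """Strip the s<index>- prefix if present; otherwise return the name unchanged."""
    match = INDEXED_NAME.match(metric_name)
    return match.group(2) if match else metric_name


def find_index_mismatches(hpa_metrics, exposed_metrics):
    """Return (requested, candidates) pairs where only the index prefix differs."""
    exposed_by_base = {}
    for name in exposed_metrics:
        exposed_by_base.setdefault(base_name(name), []).append(name)

    mismatches = []
    for requested in hpa_metrics:
        if requested in exposed_metrics:
            continue  # exact match, nothing to report
        candidates = exposed_by_base.get(base_name(requested))
        if candidates:
            mismatches.append((requested, sorted(candidates)))
    return mismatches
```

Requested names with no exposed counterpart at all are deliberately not reported here, since they point to a different problem than an index mismatch.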

JorTurFer · 2021-12-23T12:34:15Z

Could your problem be related to this?

bpinske · 2021-12-23T17:11:34Z

^ That actually does sound somewhat related. I definitely observed mismatches between what the metricNames were according to the metricsServer and what the HPA loop was actually querying.

JorTurFer · 2021-12-23T17:42:48Z

Okay, there are 2 different workarounds if you are affected by this error:

Update the SO to bump the generation (update the manifest, don't delete and recreate it)
Restart the KEDA pods to recreate the cache

Could you check if these workarounds mitigate the problem? That would tell us whether this is the root cause before we invest time digging into the problem.

bpinske · 2021-12-23T17:48:11Z

First I'll need to see if I can actually find a way to reproduce this issue reliably. So far I've only been able to trigger it in my largest 4 environments :)

I'll see about trying to reproduce this today in a dev space. If I find a way to reproduce, what I'll actually do is

build keda myself and cherrypick in your cache deletion commit to see if that fixes it.
If the above doesn't fix it, I'm going to revert this and this one at a time to try and bisect down to the real cause.

JorTurFer · 2021-12-23T17:52:47Z

nice!
Thanks for your help ❤️
Thinking about it, updating the SO name is probably enough to avoid the cache, because the cache key is generated from the name and namespace.
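As an illustration only, the sketch below shows why a cache keyed by namespace and name can keep serving old metric names after a ScaledObject changes, and why invalidating the entry or renaming the object sidesteps it. Every class and function here is hypothetical and heavily simplified; it is not KEDA's code, and it does not claim to be the confirmed root cause of this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledObject:
    namespace: str
    name: str
    generation: int
    triggers: tuple  # base metric names, in trigger order (hypothetical model)


def metric_names(scaled_object):
    """Derive indexed metric names from the trigger order (assumed s<index>- scheme)."""
    return [f"s{i}-{trigger}" for i, trigger in enumerate(scaled_object.triggers)]


class NameKeyedCache:
    """Caches metric names per ScaledObject, keyed only by namespace and name."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(scaled_object):
        return f"{scaled_object.namespace}.{scaled_object.name}"

    def get(self, scaled_object):
        key = self._key(scaled_object)
        if key not in self._entries:
            self._entries[key] = metric_names(scaled_object)
        # The generation is not part of the key, so an updated object hits the old entry.
        return self._entries[key]

    def invalidate(self, scaled_object):
        self._entries.pop(self._key(scaled_object), None)
```

In this model, restarting the process corresponds to starting with an empty cache, and renaming the object produces a new key, so both return freshly computed names.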

glassnick · 2021-12-24T13:26:34Z

Our team will be testing this in the new year.

tomkerkhove · 2022-01-18T16:33:09Z

@glassnick Did you manage to give this a try?

bpinske · 2022-01-18T17:02:10Z

I tried myself to reproduce the error that I had originally been seeing, but I was unable to.

I had observed the error only in my largest environments, so I suspected that it was related to the number of objects in the cluster. Unfortunately, just throwing large numbers of objects at a dev environment was not enough to reproduce the error cases I had seen.

My next step is kind of the nuclear one: running keda under the delve debugger in my largest environments, where I can reproduce the bug, and breakpointing exactly what's failing. It might be a while before I have time to continue at that level of debugging for my own particular issues though.

bpinske added the bug Something isn't working label Dec 3, 2021

bpinske changed the title ~~Keda 2.5 does not cleaning update from 2.4 prometheus~~ Keda 2.5 does not cleanly update from 2.4 prometheus Dec 4, 2021

bpinske changed the title ~~Keda 2.5 does not cleanly update from 2.4 prometheus~~ Keda 2.5 does not cleanly update from 2.4 Dec 23, 2021

jamesdowning-chip mentioned this issue Jan 13, 2022

HPA FailedGetExternalMetric Warning while using Prometheus Scalar #2480

Closed

tomkerkhove mentioned this issue Feb 7, 2022

use correct index in ScalerBuilder.Factory #2593

Merged

2 tasks

JorTurFer closed this as completed in #2593 Feb 8, 2022

vujadeyoon mentioned this issue Feb 14, 2022

KEDA 2.6.1 Unable to get external metric for AWS SQS. #2632

Closed
