
Keda-Operator OOM problem after upgrade to Keda v2.11.* #4789

Closed

andreb89 opened this issue Jul 13, 2023 · 9 comments · Fixed by kedacore/keda-docs#1250
Labels
bug Something isn't working

Comments

@andreb89

Hi,
we have an OOM problem with the keda-operator in Kubernetes (AKS 1.26.3) that appeared with version 2.11.*. We are using Postgres and Prometheus triggers for scaled jobs. For now, we have downgraded to 2.10.1 again, where we do not have this issue.

Grafana metrics for the keda-operator pod with 2.11.1:
[image]

After the downgrade to 2.10.1:
[image]

I added some keda-operator pod logs below, but nothing useful shows up around the time the OOM happens.

We are using the default resource requests/limits, e.g. for the keda-operator:

    Limits:
      cpu:     1
      memory:  1000Mi
    Requests:
      cpu:      100m
      memory:   100Mi

We have about 500 ScaledJob instances and 1 ScaledObject instance. Most of the jobs use a Prometheus trigger with the following template:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  annotations:
    meta.helm.sh/release-name: worker
    meta.helm.sh/release-namespace: ks-ns
  creationTimestamp: "2022-01-26T17
  finalizers:
  - finalizer.keda.sh
  generation: 21
  labels:
    ...
  name: worker
  namespace: ks-ns
  resourceVersion: "647584627"
  uid: 057f6b4b-8cc0-43aa-a16c-fe9ab7611d79
spec:
  failedJobsHistoryLimit: 10
  jobTargetRef:
    activeDeadlineSeconds: 1800
    backoffLimit: 6
    template:
      metadata:
        creationTimestamp: null
        labels:
         ...
      spec:
        containers:
        ...
    ttlSecondsAfterFinished: 3600
  maxReplicaCount: 20
  pollingInterval: 5
  rolloutStrategy: default
  scalingStrategy: {}
  successfulJobsHistoryLimit: 1
  triggers:
  - metadata:
      metricName: serverless_pendingjobs
      query: max(serverless_pendingjobs{queue="queue", namespace="ks-ns"})
      serverAddress: http://[cluster]:9090
      threshold: "1"
    type: prometheus

Expected Behavior

Memory consumption should stay the same after the Keda version update.

Actual Behavior

Huge jump in memory consumption after the upgrade.

Steps to Reproduce the Problem

Have a larger cluster with a lot of different ScaledJobs and try the KEDA version upgrade from 2.10.* to 2.11.*.

Maybe this will happen for you, too. Honestly, it is unclear.

Logs from KEDA operator

  | Jul 11, 2023 @ 01:10:54.809 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appg-lucmerge-appf", "scaledJob.Namespace": "staging-appg-application", "Number of pending Jobs ": 0}
  | Jul 11, 2023 @ 01:10:54.809 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appf-bundle-finalizer-m", "scaledJob.Namespace": "staging-appf-application", "Number of pending Jobs ": 0}
  | Jul 11, 2023 @ 01:10:54.827 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "appf-import-data-xl", "scaledJob.Namespace": "staging-appf-application", "Effective number of max jobs": 0}
  | Jul 11, 2023 @ 01:10:54.827 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "appf-import-data-xl", "scaledJob.Namespace": "staging-appf-application", "Number of jobs": 0}
  | Jul 11, 2023 @ 01:10:54.827 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "appf-import-data-xl", "scaledJob.Namespace": "staging-appf-application", "Number of jobs": 0}
  | Jul 11, 2023 @ 01:10:55.749 | 2023-07-10T23:10:55Z	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Starting manager
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Git Commit: b8dbd298cf9001b1597a2756fd0be4fa4df2059f
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	KEDA Version: 2.11.1
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Running on Kubernetes 1.26	{"version": "v1.26.3"}
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Go Version: go1.20.5
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Go OS/Arch: linux/amd64
  | Jul 11, 2023 @ 01:10:55.874 | 2023-07-10T23:10:55Z	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
  | Jul 11, 2023 @ 01:10:55.874 | 2023-07-10T23:10:55Z	INFO	starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
  | Jul 11, 2023 @ 01:10:55.874 | I0710 23:10:55.874494       1 leaderelection.go:245] attempting to acquire leader lease staging-keda-serverless/operator.keda.sh...
  | Jul 11, 2023 @ 01:10:56.668 | I0710 23:10:56.668438       1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="158.111µs" userAgent="kube-probe/1.26" audit-ID="b69846c6-e714-4b2e-8109-460408fc4fa0" srcIP="10.4.8.122:49694" resp=200
  | Jul 11, 2023 @ 01:10:56.668 | I0710 23:10:56.668040       1 httplog.go:132] "HTTP" verb="GET" URI="/readyz" latency="213.915µs" userAgent="kube-probe/1.26" audit-ID="280df468-0e0a-4222-84ee-0aed41f7c566" srcIP="10.4.8.122:49710" resp=200
  | Jul 11, 2023 @ 01:11:01.274 | I0710 23:11:01.274250       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="11.654072ms" userAgent="Go-http-client/2.0" audit-ID="7c0694eb-ae9d-4f70-b035-452bbd726728" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.274 | I0710 23:11:01.274315       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="11.800979ms" userAgent="Go-http-client/2.0" audit-ID="5d45c41a-6417-4d00-9e08-41b10b1477c1" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.274 | I0710 23:11:01.274552       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="12.225696ms" userAgent="Go-http-client/2.0" audit-ID="f1594dda-145d-4acb-b768-b74e1608460a" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.284 | I0710 23:11:01.283973       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="21.652078ms" userAgent="Go-http-client/2.0" audit-ID="11e6d398-2328-4f5a-abdc-8301c366b3b4" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.284 | I0710 23:11:01.283992       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="21.631177ms" userAgent="Go-http-client/2.0" audit-ID="3c64c045-a66f-4f51-9c09-a43a411cf3fc" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:02.144 | I0710 23:11:02.144046       1 httplog.go:132] "HTTP" verb="GET" URI="/openapi/v2" latency="18.651856ms" userAgent="" audit-ID="3d4f8945-2587-42c2-9703-3a10a6862d03" srcIP="10.4.1.72:52592" resp=304
  | Jul 11, 2023 @ 01:11:02.144 | I0710 23:11:02.144171       1 httplog.go:132] "HTTP" verb="GET" URI="/openapi/v3" latency="17.086093ms" userAgent="" audit-ID="d2813a34-fd4a-4bf0-a79e-d434a98a8cba" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:05.252 | I0710 23:11:05.252401       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="13.751949ms" userAgent="kube-controller-manager/v1.26.3 (linux/amd64) kubernetes/f18584a/system:serviceaccount:kube-system:resourcequota-controller" audit-ID="e25b658c-cdc2-4eb9-9b42-8e3c5dde004f" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:06.670 | I0710 23:11:06.670464       1 httplog.go:132] "HTTP" verb="GET" URI="/readyz" latency="221.919µs" userAgent="kube-probe/1.26" audit-ID="710f9c70-3312-4855-a44d-d88e5d548618" srcIP="10.4.8.122:51914" resp=200
  | Jul 11, 2023 @ 01:11:06.673 | I0710 23:11:06.673761       1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="164.413µs" userAgent="kube-probe/1.26" audit-ID="77c18635-cf67-4ae2-a901-6e9ed6568e06" srcIP="10.4.8.122:51928" resp=200
  | Jul 11, 2023 @ 01:11:06.848 | I0710 23:11:06.848846       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="14.929647ms" userAgent="kube-controller-manager/v1.26.3 (linux/amd64) kubernetes/f18584a/system:serviceaccount:kube-system:generic-garbage-collector" audit-ID="af777025-5f23-413c-8672-1ed69c616df0" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:08.502 | I0710 23:11:08.502363       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="14.735956ms" userAgent="kube-controller-manager/v1.26.3 (linux/amd64) kubernetes/f18584a/controller-discovery" audit-ID="c8b38271-99a6-4c7b-9fb7-dab401da7004" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:12.084 | I0710 23:11:12.084112       1 leaderelection.go:255] successfully acquired lease staging-keda-serverless/operator.keda.sh
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
  | Jul 11, 2023 @ 01:11:12.095 | 2023-07-10T23:11:12Z	INFO	grpc_server	Starting Metrics Service gRPC Server	{"address": ":9666"}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Reconciling ScaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-leak-detection-m","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-leak-detection-m", "reconcileID": "5bb6525a-17bb-4149-a142-1970bf22d248"}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	RolloutStrategy is deprecated, please us Rollout.Strategy in order to define the desired strategy for job rollouts	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-leak-detection-m","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-leak-detection-m", "reconcileID": "5bb6525a-17bb-4149-a142-1970bf22d248"}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
  | Jul 11, 2023 @ 01:11:12.489 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
  | Jul 11, 2023 @ 01:11:12.490 | 2023-07-10T23:11:12Z	INFO	"metricName" is deprecated and will be removed in v2.12, please do not set it anymore	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"appac-analysis-calculation-worker","namespace":"staging-appac-application"}, "namespace": "staging-appac-application", "name": "appac-analysis-calculation-worker", "reconcileID": "23148f01-9201-4373-a770-6d7d4b5bbcf7", "trigger.type": "prometheus"}
  | Jul 11, 2023 @ 01:11:12.490 | 2023-07-10T23:11:12Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"appac-analysis-calculation-worker","namespace":"staging-appac-application"}, "namespace": "staging-appac-application", "name": "appac-analysis-calculation-worker", "reconcileID": "23148f01-9201-4373-a770-6d7d4b5bbcf7"}
  | Jul 11, 2023 @ 01:11:13.296 | 2023-07-10T23:11:13Z	INFO	Initializing Scaling logic according to ScaledJob Specification	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-leak-detection-m","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-leak-detection-m", "reconcileID": "5bb6525a-17bb-4149-a142-1970bf22d248"}
  | Jul 11, 2023 @ 01:11:13.309 | 2023-07-10T23:11:13Z	INFO	Initializing Scaling logic according to ScaledObject Specification	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"appac-analysis-calculation-worker","namespace":"staging-appac-application"}, "namespace": "staging-appac-application", "name": "appac-analysis-calculation-worker", "reconcileID": "23148f01-9201-4373-a770-6d7d4b5bbcf7"}
  | Jul 11, 2023 @ 01:11:13.310 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-leak-detection-m", "scaledJob.Namespace": "staging-appd-application", "Number of running Jobs": 0}
  | Jul 11, 2023 @ 01:11:13.310 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-leak-detection-m", "scaledJob.Namespace": "staging-appd-application", "Number of pending Jobs ": 0}
  | Jul 11, 2023 @ 01:11:13.315 | 2023-07-10T23:11:13Z	INFO	Reconciling ScaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-toll-qa-routing-l","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-toll-qa-routing-l", "reconcileID": "1fac22ef-d1df-4c3b-b7bf-ea6196839a2e"}
  | Jul 11, 2023 @ 01:11:13.315 | 2023-07-10T23:11:13Z	INFO	RolloutStrategy is deprecated, please us Rollout.Strategy in order to define the desired strategy for job rollouts	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-toll-qa-routing-l","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-toll-qa-routing-l", "reconcileID": "1fac22ef-d1df-4c3b-b7bf-ea6196839a2e"}
  | Jul 11, 2023 @ 01:11:13.315 | 2023-07-10T23:11:13Z	INFO	Initializing Scaling logic according to ScaledJob Specification	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-toll-qa-routing-l","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-toll-qa-routing-l", "reconcileID": "1fac22ef-d1df-4c3b-b7bf-ea6196839a2e"}
  | Jul 11, 2023 @ 01:11:13.320 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-toll-qa-routing-l", "scaledJob.Namespace": "staging-appd-application", "Number of running Jobs": 0}
  | Jul 11, 2023 @ 01:11:13.320 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-toll-qa-routing-l", "scaledJob.Namespace": "staging-appd-application", "Number of pending Jobs ": 0}

KEDA Version

2.11.1

Kubernetes Version

1.26

Platform

Microsoft Azure

Scaler Details

Prometheus & Postgres

Anything else?

No response

@andreb89 andreb89 added the bug Something isn't working label Jul 13, 2023
@JorTurFer JorTurFer mentioned this issue Jul 14, 2023
@tomkerkhove
Member

@JorTurFer If I understand it correctly, we did not patch this in 2.11.2, is that correct?

@JorTurFer
Member

JorTurFer commented Aug 17, 2023

I think that we didn't patch this. @zroubalik is going to check it.

@yuvalweber
Contributor

Hey @andreb89, could you please share a memory map of KEDA before it gets OOM-killed, plus the API server request logs you get for the keda-operator service account?

For me, it helped to investigate an OOM issue I had with an earlier version of KEDA - #4687

If you need help with how to activate profiling and get a memory map, you can go to the debugging section.
I think maybe we should add the possibility to enable this using some kind of env variable (memory profiling).
What do you think @JorTurFer @tomkerkhove ?
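As a very rough illustration of that idea (a sketch only; the KEDA_PROFILING_ENABLED and KEDA_PROFILING_PORT names below are hypothetical and do not exist in KEDA today), the toggle could be wired through the operator Deployment like this:

# Hypothetical sketch only - these env vars are not part of KEDA.
# Fragment of the keda-operator Deployment: the operator would start a
# profiling web server on the given port only when the toggle is set.
spec:
  template:
    spec:
      containers:
      - name: keda-operator
        env:
        - name: KEDA_PROFILING_ENABLED   # hypothetical on/off toggle
          value: "true"
        - name: KEDA_PROFILING_PORT      # hypothetical listen port for the profiling server
          value: "8082"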

@JorTurFer
Member

I think that enabling memory profiling on demand could be worthwhile for debugging some advanced and complex scenarios.
Are you willing to contribute that, @yuvalweber?

WDYT @tomkerkhove @zroubalik ?

@yuvalweber
Contributor

I think I can try, since I have already done this kind of thing before.
Do you think it's preferable to dump it to a file, or to run it as a web server that you can send a request to and get a memory profile back from (like Prometheus does)?

@zroubalik
Member

@yuvalweber this would be great. I think the more user-friendly the option, the better, so I am leaning towards a web server that is started only when profiling is enabled. Thanks for doing this.

@yuvalweber
Contributor

Yes, I see that the web server option is the friendlier one.
Please assign it to me and I promise I'll do it.
I'm just going on vacation for a week, so I'll try to do it afterwards.

@jonaskint

We had a similar issue after upgrading. We solved it by limiting the namespaces to watch.
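For anyone else looking at this, a minimal sketch of what that can look like, assuming the watchNamespace value of the kedacore/keda Helm chart (which scopes the operator to one namespace instead of the whole cluster); double-check the key name against your chart version's values.yaml:

# values.yaml override for the kedacore/keda Helm chart (sketch; key name may vary by chart version).
# Watching a single namespace shrinks the operator's informer caches and thus its memory usage.
watchNamespace: "ks-ns"   # example namespace taken from the ScaledJob manifest above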

@BEvgeniyS

BEvgeniyS commented Mar 5, 2024

We're upgrading to the latest version supported for k8s v1.25 (v2.11.2), and we're having the same issues with OOM... Adding the profiler doesn't really solve the issue.

EDIT:
After some digging, we've found that the issue was the default limit of 1Gi introduced in the Helm chart somewhere between 2.8.2 and 2.11.2.

After commenting out the limits (and raising the requests just in case), memory usage is back to the same levels.
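For anyone hitting the same thing, the override is roughly the following Helm values fragment (a sketch assuming the chart's resources.operator block; older chart versions may use a flat resources key, and the request value below is just an example, so check your chart's values.yaml):

# Helm values override: remove the operator memory limit and raise the request.
resources:
  operator:
    limits:
      cpu: 1
      # memory: 1000Mi   # commented out so the operator is no longer OOM-killed at 1Gi
    requests:
      cpu: 100m
      memory: 500Mi      # raised from the 100Mi default, just in case (example value)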
