Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datadog query with functions never returns healthy #2761

Closed
dalgibbard opened this issue Mar 15, 2022 · 11 comments · Fixed by #2771
Closed

Datadog query with functions never returns healthy #2761

dalgibbard opened this issue Mar 15, 2022 · 11 comments · Fixed by #2771
Assignees
Labels
bug Something isn't working

Comments

@dalgibbard
Copy link

dalgibbard commented Mar 15, 2022

Report

Edit: see comments below, also amended the Issue title to better match the issue at hand

I have a Datadog ScaledObject resource setup, which looks like the following (Secret called keda-myapp-datadog-secret already created):

---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: myapp
  namespace: myapp
spec:
  secretTargetRef:
  - parameter: apiKey
    name: keda-myapp-datadog-secret
    key: apiKey
  - parameter: appKey
    name: keda-myapp-datadog-secret
    key: appKey
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp
  namespace: myapp
spec:
  scaleTargetRef:
    name: myapp
  minReplicaCount:  1
  maxReplicaCount:  3
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600
          policies:
          - type: Pods
            value: 1
            periodSeconds: 60
  triggers:
  - type: datadog
    metadata:
      query: "per_second(sum:http.requests{service:myapp,dc:us-west-2}.rollup(avg, 60))"
      queryValue: "100"
      type: "Average"
      age: "300"
    authenticationRef:
      name: myapp
  - type: datadog
    metadata:
      query: "sum:http.backlog{service:myapp,dc:us-west-2}.rollup(avg, 60)"
      queryValue: "50"
      type: "Average"
      age: "300"
    authenticationRef:
      name: myapp

Note specifically, the two datadog triggers in the single ScaledObject.

With this setup, the scaled object reports the following Health state (from describe):

External Metric Names:                                                                                                                                                                                             
    s0-datadog-per_second(sum-http-requests
    s1-datadog-sum-http-backlog
Health:
    s0-datadog-per_second(sum-http-requests:
        Number Of Failures:  63
        Status:              Failing
    s1-datadog-sum-http-backlog:
        Number Of Failures:  0
        Status:              Happy

Note that the first Trigger is failing, whilst the second is Happy.

And the keda-operator-metrics-apiserver logs show:

E0315 11:01:52.364068       1 logr.go:270] keda_metrics_adapter/datadog_scaler "msg"="error getting metrics from Datadog" "error"="no Datadog metrics returned"
E0315 11:01:52.364168       1 scale_resolvers.go:164] keda_metrics_adapter/scalehandler "msg"="Error getting triggerAuth" "error"="context canceled" "name"="myapp" "namespace"="myapp" "type"="ScaledObject" "triggerAuthRef.Name"="myapp"
E0315 11:01:52.376993       1 provider.go:124] keda_metrics_adapter/provider "msg"="error getting metric for scaler" "error"="error parsing Datadog metadata: no api key given"  "scaledObject.Name"="myapp" "scaledObject.Namespace"="myapp" "scaler"={}
E0315 11:01:52.377048       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s0-datadog-per_second(sum-http-requests"}: no matching metrics found for s0-datadog-per_second(sum-http-requests
I0315 11:01:53.322901       1 trace.go:205] Trace[1690803741]: "List" url:/apis/external.metrics.k8s.io/v1beta1/namespaces/myapp/s1-datadog-sum-http-backlog,user-agent:kube-controller-manager/v1.21.5 (linux/amd64) kubernetes/5236faf/system:serviceaccount:kube-system:horizontal-pod-autoscaler,audit-id:694c8fc1-3d50-455d-9bb5-e22514569a2a,client:172.16.41.151,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (15-Mar-2022 11:01:52.380) (total time: 942ms):
Trace[1690803741]: ---"Listing from storage done" 941ms (11:01:53.322)
Trace[1690803741]: [942.018779ms] [942.018779ms] END

Note here, that it's reporting no api key given even though it's using the exact same authenticationRef value to the other trigger.

Expected Behavior

When:

  • Multiple Datadog triggers defined on a single ScaledObject
  • All triggers using the same TriggerAuth object
  • Trigger Auth loaded from Kubernetes Secret

All Triggers should load metrics successfully.

Actual Behavior

Only the last trigger correctly loads the configured TriggerAuth. Other triggers report no api key given

Steps to Reproduce the Problem

See report for example YAML

Logs from KEDA operator

See above.

KEDA Version

2.6.1

Kubernetes Version

1.21

Platform

Amazon Web Services

Scaler Details

Datadog

Anything else?

No response

@dalgibbard dalgibbard added the bug Something isn't working label Mar 15, 2022
@tomkerkhove tomkerkhove moved this to Proposed in Roadmap - KEDA Core Mar 15, 2022
@dalgibbard
Copy link
Author

Actually; two edits on this:
When I dropped it back to just the single query:

      query: "per_second(sum:http.requests{service:myapp,dc:us-west-2}.rollup(avg, 60))"

I found that this constantly errors anyway. Dropping per_second() fixes it; though it is a valid query in the Datadog UI.

Changing this to not have per_second():

      query: "sum:http.requests{service:myapp,dc:us-west-2}.rollup(avg, 60)"

Everything works fine for a while, but I still intermittently see both:

"error"="error getting metrics from Datadog: no Datadog metrics returned" 

and

"error"="error parsing Datadog metadata: no api key given"

For context, I'm in the process of trying to switch fully to Keda instead of using the Datadog Cluster Agent, in light of #470

I see there was a recent issue/merge for: #2694 // #2657
Please let me know if these fixes are likely to resolve the issues here, and when a keda release is likely to happen?

Thanks!

@zroubalik
Copy link
Member

@arapulido FYI

@dalgibbard
Copy link
Author

dalgibbard commented Mar 16, 2022

With regards to the problem where Keda thinks my query is invalid- i've done a little digging, and it seems that this part is causing me issues:

if !aggregator.MatchString(q) {

which points to
aggregator = regexp.MustCompile(`^(avg|sum|min|max):.*`)

In the datadog query for:

per_second(sum:http.requests{service:myapp,dc:us-west-2}.rollup(max, 2))

This code section is prefixing this with avg: which makes it an invalid query. I have to wrap this query with the function per_second(), because it's a counter metric, and I'm trying to derive requests-per-second from it.

The per_second() query is valid, and is how datadog defines it when using their UI:
image
image

There is a lot of possible functions that could be set here. I'm not sure that auto-prefixing a user's query is good practice, especially without notifying the user that it's happening.
Some other example functions which 'wrap' the entire query include: default_zero, outliers, anomolies, abs, log2, log10, forecast, autosmooth, trend_line, and many more.

Fixing this should solve the first issue at least.
Maybe we can change the matcher to something like:

aggregator = regexp.MustCompile(`^(\w+\()?(avg|sum|min|max):.*`)

This would allow for an optional function wrapper (formed of letters, numbers or underscores) before the aggregator value.
eg: my_func(avg: matches.

I'm kind of hoping that the recent commits to this scaler will solve the last remaining problem for "error"="error parsing Datadog metadata: no api key given"; but we'll see :)

@arapulido
Copy link
Contributor

@dalgibbard Thanks a lot for digging a bit deeper into the issue! And yes, this should be allowed. When querying a single metric, Datadog uses the avg aggregator by default, that's why I was prepending it. Let me try some requests directly to the API first, to see if those are required or not.

Also, feel free to contribute to a patch :) Take into account that the Datadog Cluster Agent is for now the official way to HPA using Datadog Metrics, maintained by a whole team, while the Datadog KEDA scaler is just me, on my own time :)

@dalgibbard
Copy link
Author

dalgibbard commented Mar 16, 2022

@arapulido Appreciate you working on this - and yeah, we were using the Cluster Agent initially, but this blocks our ability to scale on other metrics which is important for us :)

I've submitted a PR, but I haven't fully tested it, as I haven't quite worked out how I could inject a custom build into our EKS Clusters (they're quite tightly controlled). The tests pass at least!
It should be a matter of deploying it, and using a query with default_zero() or something, and making sure it shows "Happy" in the ScaledObject :)

@arapulido
Copy link
Contributor

@dalgibbard Fantastic! Thanks a lot! I will test this out this afternoon. Thanks a lot for debugging the issue and contribute to a patch

@dalgibbard
Copy link
Author

dalgibbard commented Mar 16, 2022

@arapulido I've just thought... My PR is a bit over simplified, because a user could actually wrap multiple functions; for example:

top(per_second(abs(sum:http.requests{service:myapp,dc:us-west-2}.rollup(max, 2))), 5, 'mean', 'desc')

Some take params, some don't etc.
I'm really not sure how to work around that :/

@arapulido
Copy link
Contributor

Sounds good, let me try some tests against the API directly, and I will update here.

@dalgibbard
Copy link
Author

I did amend the regex to support multiple functions; but ultimately the source of truth for if a query is valid or not, is datadog's API :) I'll leave it with you for now anyway! And thanks :)

@dalgibbard dalgibbard changed the title Datadog Scaler reports "no api key given" when multiple triggers defined Datadog query with functions never returns healthy Mar 16, 2022
@arapulido
Copy link
Contributor

@dalgibbard Thanks, and I agree, I am making some changes to rely more on the API for validation, rather than trying to cover all potential use cases. Feel free to close your PR, as mine will supersede that one. Thanks!

@dalgibbard
Copy link
Author

Awesome, have closed my PR :D

@JorTurFer JorTurFer moved this from Proposed to In Progress in Roadmap - KEDA Core Mar 17, 2022
@JorTurFer JorTurFer assigned JorTurFer and arapulido and unassigned JorTurFer Mar 17, 2022
Repository owner moved this from In Progress to Ready To Ship in Roadmap - KEDA Core Mar 22, 2022
@tomkerkhove tomkerkhove moved this from Ready To Ship to Done in Roadmap - KEDA Core Aug 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
4 participants