Datadog scaler is not able to find matching metrics #2657
Comments
This sounds similar to #2632; can you please double-check?
It seems issue #2632 was related to a missing IAM permission. In my case we don't interact with any IAM service. Authentication with Datadog works fine; after 10 to 15 minutes it suddenly starts throwing errors, and then it auto-resolves. Maybe it is tied to Datadog rate limiting. One thing we definitely need to get better at is logging the HTTP response in this scenario. I tried log level DEBUG but didn't see any useful info.
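For illustration, here is a minimal Go sketch of how the scaler could surface the HTTP status and Datadog's documented rate-limit response headers when a query fails, so a 429 is visible in the operator logs. The helper name is hypothetical and it assumes the raw *http.Response from the Datadog client is available; this is not current KEDA code.

```go
package scalers

import (
	"fmt"
	"net/http"
)

// wrapDatadogError is a hypothetical helper: given the raw *http.Response from a
// Datadog API call and the error returned by the client, it builds an error that
// includes the HTTP status and the X-RateLimit-* headers, so rate limiting shows up
// in the logs instead of a generic "unable to find metrics" message.
func wrapDatadogError(resp *http.Response, err error) error {
	if err == nil {
		return nil
	}
	if resp == nil {
		return fmt.Errorf("error when retrieving Datadog metrics: %w", err)
	}
	return fmt.Errorf("error when retrieving Datadog metrics (status %q, rate limit remaining %q, reset in %qs): %w",
		resp.Status,
		resp.Header.Get("X-RateLimit-Remaining"),
		resp.Header.Get("X-RateLimit-Reset"),
		err)
}
```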
@arapulido (who is an active contributor to the Datadog scaler) also confirmed she is able to reproduce this issue in a local environment after running it for a few minutes.
We are seeing the same behavior. Here's an excerpt from the HPA:
Definitely something to improve. Is anyone open to contributing this?
Yes, I am already looking into this, and I am working on some other improvements. I will work on a patch that makes this more resilient and also makes it clearer in the error when the user hits rate limiting.
Awesome, thank you!
I just created a PR to fix this (and other improvements) in 2.6. The problem described in this issue is partly caused by the selection of a time window that is too small (15 seconds). In many cases Datadog doesn't have that metric yet and returns an empty value, which is logged as a warning in the HPA. In general, always try a bigger window; we will use the last (most recent) point returned. Ideally, we should discourage this behaviour (selecting a specific window) altogether, so in 2.7 we should probably introduce a breaking change and remove the "age" parameter.
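As a rough illustration of the "wider window, last point" approach described above, here is a minimal Go sketch. It assumes the Datadog series is available as [timestamp, value] pairs; the helper name is hypothetical and this is not KEDA's actual implementation.

```go
package scalers

// lastPoint returns the value of the most recent point in a Datadog series given as
// [timestamp, value] pairs. Combined with a wider query window (e.g. 90s instead of
// 15s), this avoids failing just because the newest bucket has no data yet on the
// Datadog side.
func lastPoint(pointlist [][]float64) (float64, bool) {
	if len(pointlist) == 0 {
		return 0, false // nothing returned for this window; the caller decides how to react
	}
	latest := pointlist[len(pointlist)-1]
	if len(latest) < 2 {
		return 0, false // malformed point; treat it as missing
	}
	return latest[1], true // index 0 is the timestamp, index 1 the value
}
```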
Makes sense, thanks!
We can't do that though, so we will have to wait for v3.0 and document this. Thoughts @kedacore/keda-maintainers?
That's OK, we can wait. Or we could even make other changes that don't break the API, e.g. enforce a minimum of 90 seconds for the age parameter: would that be allowed as a 2.7 change? Thanks!
Waiting is always better if we can, but in the worst case: the Datadog scaler was only introduced 2 months ago, so I guess the user base is not huge yet (though I'd prefer to wait if possible).
I believe there is a legitimate use case where Datadog would return an empty value. For example, imagine a query counting requests with a 500 status. In our case, we have such a trigger that scales an object when a certain threshold (of 500s) is reached. Ideally there would be times when there are no 500s at all (even over a large window of time). Am I understanding the problem correctly? If so, would it be possible to interpret an empty value as 0? Or is that difficult because the HPA is where the root problem lies?
Agreed, that's a valid concern and it will definitely happen. I would argue 0 is a bit misleading because you cannot separate it from a real 0 value, but -1 is also not ideal, so 0 is fine I guess.
Yeah, I agree 0 could be misleading. Perhaps a default value could be provided as an argument? In my scenario I could safely set it to 0; others may want a different default value in the event that the metric is null. One thing is for sure: the existing behavior is undesirable under most (all?) circumstances. Logging a warning that the metric is null seems appropriate, but breaking the trigger is not ideal. In my case, the HPA scaled pods up because of another trigger and then never scaled them down, because the null metric broke the comparison.
Fully agree. @arapulido Would you mind incorporating the following:
Thoughts @kedacore/keda-maintainers?
I agree with the first point. For the second point, I think we should raise an error if the metric is not available. KEDA already has a fallback system for this; only a raised error is needed. If we add a local fallback system here, I think we are duplicating the responsibilities of the existing fallback system.
I think there are some cases in which not having a metric doesn't mean that it should be 0. I will be doing some more testing on my side, and I will post the findings here.
I think that filling in a number in all cases is misleading, and we should only do it if the user explicitly asks for it. So, what about the following: add a new
Sounds good to me, but I would call it metricUnavailableValue.
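For illustration, a minimal Go sketch of how an optional metricUnavailableValue trigger parameter could be parsed and applied when a query returns no points. The name and behaviour follow the proposal in this thread and are not necessarily the final implementation.

```go
package scalers

import (
	"fmt"
	"strconv"
)

// parseMetricUnavailableValue reads the optional metricUnavailableValue parameter from
// the trigger metadata. If it is absent, an empty Datadog response stays an error and
// KEDA's existing fallback mechanism can take over.
func parseMetricUnavailableValue(metadata map[string]string) (value float64, set bool, err error) {
	raw, ok := metadata["metricUnavailableValue"]
	if !ok {
		return 0, false, nil
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, false, fmt.Errorf("metricUnavailableValue must be a number: %w", err)
	}
	return v, true, nil
}

// resolveEmptyResult decides what to report when the query returned no points.
func resolveEmptyResult(fallback float64, set bool) (float64, error) {
	if set {
		return fallback, nil
	}
	return 0, fmt.Errorf("no Datadog metrics returned for the given query and time window")
}
```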
Report
I have the Datadog scaler configured on an AWS EKS cluster with keda-2.6.1.
I am using the Nginx requests-per-second metric for scaling, and it works fine.
The setup works as expected for a few minutes. After that it starts throwing errors about not being able to find metrics, and it auto-recovers after a few minutes. It stays unstable continuously.
Error events on HPA
Expected Behavior
Once it is able to fetch metrics from Datadog, it should keep working in a steady state.
Actual Behavior
It throws errors about not being able to fetch metrics, and then auto-recovers.
Steps to Reproduce the Problem
Logs from KEDA operator
Error logs on keda-operator-metrics-api
KEDA Version
2.6.1
Kubernetes Version
1.21
Platform
Amazon Web Services
Scaler Details
Datadog
Anything else?
cc : @arapulido