[Infrastructure UI] Refactor rate aggregations for Metric Threshold Rule and move evaluations to ES #125585
Closed
simianhacker wants to merge 8 commits into elastic:main from
💔 Build Failed: Failed CI Steps, Test Failures
Closing in favor of #126214
Summary
This PR closes #118820 by refactoring the rate aggregations to use 2 filter buckets with a range and a bucket script to calculate the derivative, instead of using a `date_histogram`.

Along with this change, I also refactored the evaluations to happen inside Elasticsearch instead of in Kibana. This is done using a combination of `bucket_script`s and a `bucket_selector`. The `bucket_selector` is only employed when the user has unchecked "Alert me if a group stops reporting data". If a user doesn't need to track missing groups, they will get a performance boost because the query only returns the groups that match the conditions. For high cardinality datasets, this will significantly reduce the load that tracking missing groups and sending notifications for them places on the alerting framework.

If the user does need to track the missing groups, this PR will still give them a modest performance boost because the code requires fewer iterations over the data to calculate the evaluations.

Here is a sample query with the rate aggregation, a group by on `host.name`, and "Alert me if a group stops reporting data" unchecked:

There is a caveat with this approach: when there is "no data" for the time range and we are using a document count, the `shouldTrigger` and `shouldWarn` bucket scripts will be missing. For "non group by" queries, this means we need to treat the document count as ZERO, and the evaluation must be done in Kibana in case the user has `doc_count < 1` or `doc_count == 0` for the condition. Fortunately, the performance cost is negligible in this scenario since we are only looking at a single bucket.

This PR also includes a change to the way we report missing groups in a document count condition. Prior to this PR, we would backfill missing groups with ZERO for document count rules and NULL for aggregated metrics. This is actually a bug because the user asked "Alert me if a group stops reporting data". When we backfill with ZERO but the condition is `doc_count > 1`, the user would not get any notification for the missing groups. With this change, we trigger a NO DATA alert for missing groups regardless of the condition or metric, which matches the intent of the "Alert me if a group stops reporting data" option.

This PR also removes the "Drop Partial Buckets" functionality since we've moved away from using the `date_histogram` for rate aggregations.

Checklist
Delete any items that are not applicable to this PR.
Risk Matrix
Delete this section if it is not applicable to this PR.
Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.
When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:
`doc_count` aggregations will produce a NO DATA alert instead of the triggered alert
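
For reviewers who want a concrete picture of the query shape described in the Summary, here is a minimal sketch of how the aggregations could be assembled. It is not the code from this PR: `RateQueryParams`, `buildRateAggs`, the aggregation names (`first_bucket`, `second_bucket`, `rate`, `selectedBucket`, `groupings`), the choice of a `terms` aggregation for grouping, and splitting the window into two equal halves are all assumptions made for illustration. Only `shouldTrigger`, `bucket_script`, `bucket_selector`, and the `host.name` group by come from the description above, and a `shouldWarn` script would presumably follow the same pattern. The sketch has not been run against a cluster.

```typescript
// Hypothetical parameter shape; none of these names come from the PR.
interface RateQueryParams {
  metricField: string; // counter field whose rate is evaluated
  timeField: string; // timestamp field used by the range filters
  groupByField: string; // e.g. "host.name"
  from: number; // window start, epoch millis
  to: number; // window end, epoch millis
  comparator: ">" | ">=" | "<" | "<=";
  threshold: number;
  alertOnNoData: boolean; // "Alert me if a group stops reporting data"
}

function buildRateAggs(p: RateQueryParams): Record<string, unknown> {
  // Assumption: the window is split into two equal halves.
  const mid = p.from + Math.floor((p.to - p.from) / 2);
  const halfWindowSeconds = (p.to - mid) / 1000;

  // One filter bucket per half, each holding the max of the counter.
  const half = (gte: number, lt: number) => ({
    filter: { range: { [p.timeField]: { gte, lt, format: "epoch_millis" } } },
    aggs: { max_value: { max: { field: p.metricField } } },
  });

  const groupAggs: Record<string, unknown> = {
    first_bucket: half(p.from, mid),
    second_bucket: half(mid, p.to),
    // Derivative: change between the two halves, per second.
    rate: {
      bucket_script: {
        buckets_path: {
          first: "first_bucket>max_value",
          second: "second_bucket>max_value",
        },
        script: `params.second > params.first ? (params.second - params.first) / ${halfWindowSeconds} : null`,
      },
    },
    // The condition is evaluated in Elasticsearch; 1 means "fire".
    shouldTrigger: {
      bucket_script: {
        buckets_path: { value: "rate" },
        script: `params.value ${p.comparator} ${p.threshold} ? 1 : 0`,
      },
    },
  };

  // Only drop non-matching groups when missing groups need not be tracked.
  if (!p.alertOnNoData) {
    groupAggs["selectedBucket"] = {
      bucket_selector: {
        buckets_path: { shouldTrigger: "shouldTrigger" },
        script: "params.shouldTrigger > 0",
      },
    };
  }

  return {
    groupings: { terms: { field: p.groupByField, size: 10000 }, aggs: groupAggs },
  };
}
```

The returned object would sit under `aggs` in a search request with `size: 0`. When `alertOnNoData` is true the `bucket_selector` is left out, so every group comes back and the caller can compare the result against previously seen groups to detect missing ones.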