[Infrastructure UI][Rules] Refactor Metric Threshold rule to push evaluations to Elasticsearch #126214
Conversation
retest

@elasticmachine merge upstream

merge conflict between base and head

retest

retest
@stevedodson I'm not very familiar with the mechanics of

@simianhacker - as long as 'Build and Deploy to Cloud' succeeds, the tests don't need to pass. I've now got this PR running in cloud. Thank you!
```js
.join('\n');
/*
 * Custom recovery actions aren't yet available in the alerting framework
 * Uncomment the code below once they've been implemented
```
@simianhacker I see you removed the commented code for recovery actions. Do we have a ticket to implement custom recovery actions and build recovered alert reason? I don't want us to forget implementing this, now that we removed this comment.
```js
.filter((result) => result[group].isNoData)
.map((result) => buildNoDataAlertReason({ ...result[group], group }))
.join('\n');
} else if (nextState === AlertStates.ERROR) {
```
@simianhacker Don't we have an error state anymore?
Not in the same sense. The error condition that existed in the old implementation no longer exists in the new methodology. From this point forward, errors will be exceptions caught by the framework.
```js
const actionGroupId =
  nextState === AlertStates.OK
    ? RecoveredActionGroup.id
    : nextState === AlertStates.NO_DATA
```
@simianhacker `AlertStates` actually refers to rule states, right? The naming confuses me. Refactoring "alerts" to "rules" is probably out of scope for this PR. Shall I create another ticket to fix the incorrect uses of "alerts" where "rules" is meant?
Yes, we should create a new ticket for standardizing the variable names.
@simianhacker I did a bit of testing and alerts got triggered fine. I created a rule to

Then I started metricbeat again and I started getting the following alert:

What I was wondering, though, is whether we should have an extra recovered alert for the group that started reporting data again. Most probably the currently generated alerts are fine; this is just a thought I'm putting here for possible consideration.

On another note, I found another bug where the "Last updated" value in the flyout is wrong. Instead of showing the last updated value, it shows when the alert was started (you can see the bug in the 2 screenshots I posted above; both screenshots have the same value, whereas they shouldn't). I'll create another issue for this.
A heads up that we're seeing quite a few restarts on this Cloud instance due to it running out of memory. I haven't looked through the changes to see if it could be the cause, or if it's unrelated but wanted to raise it.
@simianhacker I added this review to our board as an External Review and it's now in our External Review queue. I bumped it up above the 3 other reviews requested from AO because it sounds more urgent; let me know if that's not the case. We're trying to do one ER at a time to limit how much effect they have on team output. cc: @smith
💚 Build Succeeded

To update your PR or re-run it, just comment with:
```js
timeframe,
100,
true,
void 0,
```
TIL on void 0 vs undefined 👍🏻
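For anyone else to whom this is new: `void <expr>` evaluates its operand and always yields `undefined`, which is why `void 0` is an un-shadowable (and shorter-to-minify) stand-in for `undefined`. A quick illustration, not from the PR:

```typescript
// `void` discards its operand's value and always produces `undefined`.
// Unlike the identifier `undefined`, `void 0` can never be shadowed
// by a local variable, which is why older code preferred it.
const a: undefined = void 0;

console.log(a === undefined); // true
console.log(typeof a); // "undefined"
```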



Summary

This PR pushes ALL the processing down to Elasticsearch, including the group-by tracking. With this PR, instead of gathering all the groupings, we only need to detect the groups that were either excluded or new between the previous run and the current run. We do this by extending the time frame of the run to cover both the previous run and the current run. Then we create two buckets that represent each period (`previousPeriod` and `currentPeriod`) and compare the document counts for each group to determine whether the group has gone missing, has returned, or is new. If a group has gone missing, we track it in the state. Once the group re-appears, we remove it from the rule state. If a group is new but hasn't triggered the conditions, we ignore it.

This PR also closes #118820 by refactoring the rate aggregations to use two filter buckets with a range and a bucket script to calculate the derivative instead of using a `date_histogram`.

Along with this change, I also refactored the evaluations to happen inside Elasticsearch instead of in Kibana. This is done using a combination of `bucket_script`s and a `bucket_selector`. The `bucket_selector` is only employed when the user has unchecked "Alert me if a group stops reporting data". If a user doesn't need to track missing groups, they get a performance boost because the query only returns the groups that match the conditions. For high-cardinality datasets, this significantly reduces the load on the alerting framework by avoiding tracking missing groups and sending notifications for them.

Here is a sample query with the rate aggregation, grouped by `host.name`, with "Alert me if a group stops reporting data" unchecked:

There is a caveat with this approach: when there is "no data" for the time range and we are using a document count, the `shouldTrigger` and `shouldWarn` bucket scripts will be missing. For "non group by" queries, this means we need to treat the document count as ZERO, and the evaluation must be done in Kibana in case the user has `doc_count < 1` or `doc_count == 0` as the condition. Fortunately, the performance cost is non-existent in this scenario since we are only looking at a single bucket.

This PR also includes a change to the way we report missing groups for a document count condition. Prior to this PR, we would backfill missing groups with ZERO for document count rules and NULL for aggregated metrics. This is actually a bug, because the user asked "Alert me if a group stops reporting data": when we backfill with ZERO but the condition is `doc_count > 1`, the user would not get any notification for the missing groups. With this change, we trigger a NO DATA alert for missing groups regardless of the condition or metric, which matches the intent of the "Alert me if a group stops reporting data" option.

This PR also removes the "Drop Partial Buckets" functionality since we've moved away from using a `date_histogram` for rate aggregations.
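The missing-group bookkeeping described in the first paragraph can be sketched as a pure function over per-group document counts. This is an illustration, not the PR's actual code: the function and parameter names are hypothetical, but the rules (track a group when its `currentPeriod` count drops to zero, stop tracking once it reports again) come from the summary above.

```typescript
// Sketch of the missing-group bookkeeping described in the summary.
// `counts` maps group key -> [previousPeriod doc count, currentPeriod doc count];
// `missingGroups` is the set carried in the rule's state between runs.
// All names here are hypothetical, for illustration only.
function trackGroups(
  counts: Map<string, [number, number]>,
  missingGroups: Set<string>
): Set<string> {
  const next = new Set(missingGroups);
  for (const [group, [prevCount, currCount]] of counts) {
    if (prevCount > 0 && currCount === 0) {
      // Reported last period but not this one: the group went missing.
      next.add(group);
    } else if (currCount > 0) {
      // The group is reporting again (or is brand new): stop tracking it.
      // A new group that hasn't triggered the conditions is simply ignored.
      next.delete(group);
    }
  }
  return next;
}
```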
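For illustration, the two-filter-bucket shape described above could be built roughly like this. This is a sketch, not the PR's actual query builder: the aggregation names `previousPeriod`, `currentPeriod`, and `shouldTrigger` come from the summary, but the function name, the trigger script, and every parameter are hypothetical.

```typescript
// Hypothetical builder for the per-group `aggs` body described above:
// two `filter` buckets (one per period) replace the old `date_histogram`,
// a `bucket_script` computes the change between them, and a
// `bucket_selector` drops non-matching groups when the user has
// unchecked "Alert me if a group stops reporting data".
type Aggs = Record<string, unknown>;

function buildGroupAggs(opts: {
  timestampField: string;
  previous: { from: number; to: number };
  current: { from: number; to: number };
  threshold: number;
  alertOnGroupDisappear: boolean;
}): Aggs {
  const aggs: Aggs = {
    previousPeriod: {
      filter: {
        range: { [opts.timestampField]: { gte: opts.previous.from, lt: opts.previous.to } },
      },
    },
    currentPeriod: {
      filter: {
        range: { [opts.timestampField]: { gte: opts.current.from, lt: opts.current.to } },
      },
    },
    // The "derivative": change in document count between the two periods,
    // compared against a (hypothetical) threshold entirely inside ES.
    shouldTrigger: {
      bucket_script: {
        buckets_path: { prev: 'previousPeriod>_count', curr: 'currentPeriod>_count' },
        script: {
          source: 'params.curr - params.prev > params.threshold ? 1 : 0',
          params: { threshold: opts.threshold },
        },
      },
    },
  };
  if (!opts.alertOnGroupDisappear) {
    // Only ship matching groups back to Kibana when missing-group
    // tracking is off -- the performance win described in the summary.
    aggs.selector = {
      bucket_selector: {
        buckets_path: { trigger: 'shouldTrigger' },
        script: 'params.trigger > 0',
      },
    };
  }
  return aggs;
}
```

In the group-by case this body would sit under a `terms` aggregation on the group-by field (e.g. `host.name`), so the pipeline aggregations run once per group.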