[Infrastructure UI][Rules] Refactor Metric Threshold rule to push evaluations to Elasticsearch #126214
Conversation
retest

@elasticmachine merge upstream

merge conflict between base and head

retest

retest
@stevedodson I'm not very familiar with the mechanics of

@simianhacker - as long as 'Build and Deploy to Cloud' succeeds, the tests don't need to pass. I've now got this PR running in cloud. Thank you!
```js
.join('\n');
/*
 * Custom recovery actions aren't yet available in the alerting framework
 * Uncomment the code below once they've been implemented
```
@simianhacker I see you removed the commented code for recovery actions. Do we have a ticket to implement custom recovery actions and build recovered alert reason? I don't want us to forget implementing this, now that we removed this comment.
```js
.filter((result) => result[group].isNoData)
.map((result) => buildNoDataAlertReason({ ...result[group], group }))
.join('\n');
} else if (nextState === AlertStates.ERROR) {
```
@simianhacker Don't we have an error state anymore?
Not in the same sense. The error condition that existed in the old implementation no longer exists in the new methodology. From this point forward, errors will be exceptions caught by the framework.
```js
const actionGroupId =
  nextState === AlertStates.OK
    ? RecoveredActionGroup.id
    : nextState === AlertStates.NO_DATA
```
@simianhacker `AlertStates` actually refers to rule states, right? The naming confuses me. Refactoring "alerts" to "rules" is probably out of scope for this PR. Shall I create another ticket to fix the incorrect uses of "alerts" where "rules" is meant?
Yes, we should create a new ticket for standardizing the variable names.
@simianhacker I did a bit of testing and alerts got triggered fine. I created a rule to

Then I started metricbeat again and I started getting the following alert:

What I was wondering, though, is whether we should have an extra recovered alert for the group that started reporting data again. Most probably the currently generated alerts are fine; this is just a thought I'm putting here for possible consideration.

On another note, I found another bug where the "Last updated" value in the flyout is wrong. Instead of showing the last updated value, it shows when the alert was started (you can see the bug in the 2 screenshots I posted above; both screenshots have the same value, whereas they shouldn't). I'll create another issue for this.
A heads up that we're seeing quite a few restarts on this Cloud instance due to it running out of memory. I haven't looked through the changes to see if it could be the cause, or if it's unrelated but wanted to raise it.
@simianhacker I added this review to our board as an External Review and it's now in our External Review queue. I bumped it up above the 3 other reviews requested from AO because it sounds more urgent; let me know if that's not the case. We're trying to do one ER at a time to limit how much effect they have on team output. cc: @smith
💚 Build Succeeded

To update your PR or re-run it, just comment with:
```js
timeframe,
100,
true,
void 0,
```
TIL on void 0 vs undefined 👍🏻
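For anyone else to whom this is new: `void <expr>` evaluates its operand and always yields `undefined`, which is why `void 0` is an un-shadowable (and shorter-to-minify) stand-in for `undefined`. A quick illustration, not from the PR:

```typescript
// `void` discards its operand's value and always produces `undefined`.
// Unlike the identifier `undefined`, `void 0` can never be shadowed
// by a local variable, which is why older code preferred it.
const a: undefined = void 0;

console.log(a === undefined); // true
console.log(typeof a); // "undefined"
```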



Summary

This PR pushes ALL the processing down to Elasticsearch, including the group-by tracking. With this PR, instead of gathering all the groupings, we only need to detect the groups that were either excluded or new between the previous run and the current run. We do this by extending the time frame of the run to cover both the previous run and the current run. Then we create two buckets that represent each period (`previousPeriod` and `currentPeriod`) and compare the document counts for each group to determine whether the group has gone missing, has returned, or is new. If a group has gone missing, we track it in the state. Once the group re-appears, we remove it from the rule state. If a group is new but hasn't triggered the conditions, we ignore it.

This PR also closes #118820 by refactoring the rate aggregations to use two filter buckets with a range and a bucket script to calculate the derivative instead of using a `date_histogram`.

Along with this change, I also refactored the evaluations to happen inside Elasticsearch instead of in Kibana. This is done using a combination of `bucket_script`s and a `bucket_selector`. The `bucket_selector` is only employed when the user has unchecked "Alert me if a group stops reporting data". If a user doesn't need to track missing groups, they get a performance boost because the query only returns the groups that match the conditions. For high-cardinality datasets, this significantly reduces the load on the alerting framework by avoiding tracking missing groups and sending notifications for them.

Here is a sample query with the rate aggregation, grouped by `host.name`, with "Alert me if a group stops reporting data" unchecked:

There is a caveat with this approach: when there is "no data" for the time range and we are using a document count, the `shouldTrigger` and `shouldWarn` bucket scripts will be missing. For "non group by" queries, this means we need to treat the document count as ZERO, and the evaluation must be done in Kibana in case the user has `doc_count < 1` or `doc_count == 0` as the condition. Fortunately, the performance cost is non-existent in this scenario since we are only looking at a single bucket.

This PR also includes a change to the way we report missing groups for a document count condition. Prior to this PR, we would backfill missing groups with ZERO for document count rules and NULL for aggregated metrics. This is actually a bug, because the user asked "Alert me if a group stops reporting data": when we backfill with ZERO but the condition is `doc_count > 1`, the user would not get any notification for the missing groups. With this change, we trigger a NO DATA alert for missing groups regardless of the condition or metric, which matches the intent of the "Alert me if a group stops reporting data" option.

This PR also removes the "Drop Partial Buckets" functionality since we've moved away from using a `date_histogram` for rate aggregations.
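The missing-group bookkeeping described in the first paragraph can be sketched as a pure function over per-group document counts. This is an illustration, not the PR's actual code: the function and parameter names are hypothetical, but the rules (track a group when its `currentPeriod` count drops to zero, stop tracking once it reports again) come from the summary above.

```typescript
// Sketch of the missing-group bookkeeping described in the summary.
// `counts` maps group key -> [previousPeriod doc count, currentPeriod doc count];
// `missingGroups` is the set carried in the rule's state between runs.
// All names here are hypothetical, for illustration only.
function trackGroups(
  counts: Map<string, [number, number]>,
  missingGroups: Set<string>
): Set<string> {
  const next = new Set(missingGroups);
  for (const [group, [prevCount, currCount]] of counts) {
    if (prevCount > 0 && currCount === 0) {
      // Reported last period but not this one: the group went missing.
      next.add(group);
    } else if (currCount > 0) {
      // The group is reporting again (or is brand new): stop tracking it.
      // A new group that hasn't triggered the conditions is simply ignored.
      next.delete(group);
    }
  }
  return next;
}
```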
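For illustration, the two-filter-bucket shape described above could be built roughly like this. This is a sketch, not the PR's actual query builder: the aggregation names `previousPeriod`, `currentPeriod`, and `shouldTrigger` come from the summary, but the function name, the trigger script, and every parameter are hypothetical.

```typescript
// Hypothetical builder for the per-group `aggs` body described above:
// two `filter` buckets (one per period) replace the old `date_histogram`,
// a `bucket_script` computes the change between them, and a
// `bucket_selector` drops non-matching groups when the user has
// unchecked "Alert me if a group stops reporting data".
type Aggs = Record<string, unknown>;

function buildGroupAggs(opts: {
  timestampField: string;
  previous: { from: number; to: number };
  current: { from: number; to: number };
  threshold: number;
  alertOnGroupDisappear: boolean;
}): Aggs {
  const aggs: Aggs = {
    previousPeriod: {
      filter: {
        range: { [opts.timestampField]: { gte: opts.previous.from, lt: opts.previous.to } },
      },
    },
    currentPeriod: {
      filter: {
        range: { [opts.timestampField]: { gte: opts.current.from, lt: opts.current.to } },
      },
    },
    // The "derivative": change in document count between the two periods,
    // compared against a (hypothetical) threshold entirely inside ES.
    shouldTrigger: {
      bucket_script: {
        buckets_path: { prev: 'previousPeriod>_count', curr: 'currentPeriod>_count' },
        script: {
          source: 'params.curr - params.prev > params.threshold ? 1 : 0',
          params: { threshold: opts.threshold },
        },
      },
    },
  };
  if (!opts.alertOnGroupDisappear) {
    // Only ship matching groups back to Kibana when missing-group
    // tracking is off -- the performance win described in the summary.
    aggs.selector = {
      bucket_selector: {
        buckets_path: { trigger: 'shouldTrigger' },
        script: 'params.trigger > 0',
      },
    };
  }
  return aggs;
}
```

In the group-by case this body would sit under a `terms` aggregation on the group-by field (e.g. `host.name`), so the pipeline aggregations run once per group.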