[Infrastructure UI] Refactor rate aggregations for Metric Threshold Rule and move evaluations to ES #125585

Closed

simianhacker wants to merge 8 commits into elastic:main from simianhacker:issue-118820-refactor-rate-aggs

Conversation

@simianhacker
Member

@simianhacker simianhacker commented Feb 14, 2022

Summary

This PR closes #118820 by refactoring the rate aggregations to use 2 filter buckets with a range and a bucket script to calculate the derivative instead of using a date_histogram.

Along with this change, I also refactored the evaluations to happen inside of Elasticsearch instead of in Kibana. This is done using a combination of bucket_scripts and a bucket_selector. The bucket_selector is only employed when the user has unchecked "Alert me if a group stops reporting data". If a user doesn't need to track missing groups, they will get a performance boost because the query only returns the groups that match the conditions. For high cardinality datasets, this will significantly reduce the load on the alerting framework due to tracking missing groups and sending notifications for them.

If the user does need to track missing groups, this PR still gives them a modest performance boost, because the code requires fewer iterations over the data when calculating the evaluations.

Here is a sample query for a rate aggregation with a group by on host.name, with "Alert me if a group stops reporting data" unchecked:

```json
{
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": 1644959592366,
              "lte": 1644959892366,
              "format": "epoch_millis"
            }
          }
        },
        {
          "exists": {
            "field": "system.network.in.bytes"
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "groupings": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "groupBy0": {
              "terms": {
                "field": "host.name"
              }
            }
          }
        ]
      },
      "aggs": {
        "aggregatedValue_first_bucket": {
          "filter": {
            "range": {
              "@timestamp": {
                "gte": 1644959592366,
                "lt": 1644959742366,
                "format": "epoch_millis"
              }
            }
          },
          "aggs": {
            "maxValue": {
              "max": {
                "field": "system.network.in.bytes"
              }
            }
          }
        },
        "aggregatedValue_second_bucket": {
          "filter": {
            "range": {
              "@timestamp": {
                "gte": 1644959742366,
                "lt": 1644959892366,
                "format": "epoch_millis"
              }
            }
          },
          "aggs": {
            "maxValue": {
              "max": {
                "field": "system.network.in.bytes"
              }
            }
          }
        },
        "aggregatedValue": {
          "bucket_script": {
            "buckets_path": {
              "first": "aggregatedValue_first_bucket.maxValue",
              "second": "aggregatedValue_second_bucket.maxValue"
            },
            "script": "params.second > 0.0 && params.first > 0.0 && params.second > params.first ? (params.second - params.first) / 150 : null"
          }
        },
        "shouldWarn": {
          "bucket_script": {
            "buckets_path": {},
            "script": "0"
          }
        },
        "shouldTrigger": {
          "bucket_script": {
            "buckets_path": {
              "value": "aggregatedValue"
            },
            "script": "params.value > 150000 ? 1 : 0"
          }
        },
        "selectedBucket": {
          "bucket_selector": {
            "buckets_path": {
              "shouldWarn": "shouldWarn",
              "shouldTrigger": "shouldTrigger"
            },
            "script": "params.shouldWarn > 0 || params.shouldTrigger > 0"
          }
        }
      }
    }
  }
}
```
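The aggregatedValue bucket_script above computes the rate from the max of each filter bucket, guarding against counter resets and missing data. A minimal TypeScript sketch of the same arithmetic (the function name and signature are illustrative, not the PR's actual code):

```typescript
// Sketch of the Painless bucket_script logic: rate = (secondMax - firstMax) / intervalSeconds.
// Returns null when either bucket is empty, when a value is non-positive, or when the
// counter did not increase (i.e. a likely counter reset), matching the script's guards.
function calculateRate(
  firstMax: number | null,
  secondMax: number | null,
  intervalSeconds: number
): number | null {
  if (firstMax == null || secondMax == null) return null;
  if (firstMax <= 0 || secondMax <= 0 || secondMax <= firstMax) return null;
  return (secondMax - firstMax) / intervalSeconds;
}
```

In the sample query each filter bucket covers 150 seconds, which is why the Painless script divides by 150.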

There is a caveat with this approach: when there is "no data" for the time range and the condition uses a document count, the shouldTrigger and shouldWarn bucket scripts will be missing from the response. For "non group by" queries, this means we need to treat the document count as ZERO and perform the evaluation in Kibana, in case the user's condition is doc_count < 1 or doc_count == 0. Fortunately, the performance cost is negligible in this scenario since we are only looking at a single bucket.
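The Kibana-side fallback described above can be sketched as follows (a hedged illustration, not the PR's actual code; the function and type names are assumptions):

```typescript
// When the bucket scripts are absent because no documents matched the range,
// evaluate the doc-count condition in Kibana, treating the missing count as zero.
type Comparator = '<' | '<=' | '>' | '>=' | '==';

function evaluateDocCount(
  docCount: number | undefined,
  comparator: Comparator,
  threshold: number
): boolean {
  const value = docCount ?? 0; // "no data" is treated as a document count of zero
  switch (comparator) {
    case '<':
      return value < threshold;
    case '<=':
      return value <= threshold;
    case '>':
      return value > threshold;
    case '>=':
      return value >= threshold;
    case '==':
      return value === threshold;
  }
}
```

This makes a condition like doc_count < 1 trigger correctly when the range matched nothing at all.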

This PR also changes the way we report missing groups for document count conditions. Prior to this PR, we would backfill missing groups with ZERO for document count rules and NULL for aggregated metrics. This is actually a bug, because the user asked to be alerted if a group stops reporting data: when we backfill with ZERO but the condition is doc_count > 1, the user would not get any notification for the missing groups. With this change, we trigger a NO DATA alert for missing groups regardless of the condition or metric, which matches the intent of the "Alert me if a group stops reporting data" option.
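The missing-group detection amounts to a set difference between the groups seen on a previous run and the groups in the current response. A minimal sketch (names are hypothetical, not taken from the PR):

```typescript
// Any group that reported data previously but is absent from the current
// query response gets a NO DATA alert, regardless of the condition or metric.
function findMissingGroups(previousGroups: string[], currentGroups: string[]): string[] {
  const current = new Set(currentGroups);
  return previousGroups.filter((group) => !current.has(group));
}
```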

This PR also removes the "Drop Partial Buckets" functionality since we've moved away from using the date_histogram for rate aggregations.

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
| --- | --- | --- | --- |
| Changes to how missing groups are handled in doc_count aggregations will produce a NO DATA alert instead of the triggered alert | High | Low | Users will still get notifications, but the type of alert will be different. |

@kibana-ci

kibana-ci commented Mar 1, 2022

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] Default CI Group #18 / apis MetricsUI Endpoints Metric Threshold Alerts Executor with rate data without groupBy should alert on rate

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
infra 927.8KB 927.0KB -845.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@simianhacker
Member Author

Closing in favor of #126214

@simianhacker simianhacker deleted the issue-118820-refactor-rate-aggs branch April 17, 2024 15:37

Development

Successfully merging this pull request may close these issues.

[Metrics UI] Refactor rate aggregation for Metric Threshold Alerts to eliminate "Drop Partial Buckets"