[Infrastructure UI] Refactor rate aggregations for Metric Threshold Rule and move evaluations to ES #125585

Closed

simianhacker wants to merge 8 commits into elastic:main from simianhacker:issue-118820-refactor-rate-aggs

Conversation

@simianhacker
Member

@simianhacker simianhacker commented Feb 14, 2022

Summary

This PR closes #118820 by refactoring the rate aggregations to use 2 filter buckets with a range and a bucket script to calculate the derivative instead of using a date_histogram.

Along with this change, I also refactored the evaluations to happen inside of Elasticsearch instead of in Kibana. This is done using a combination of bucket_scripts and a bucket_selector. The bucket_selector is only employed when the user has unchecked "Alert me if a group stops reporting data". If a user doesn't need to track missing groups, they will get a performance boost because the query only returns the groups that match the conditions. For high cardinality datasets, this will significantly reduce the load on the alerting framework due to tracking missing groups and sending notifications for them.

If the user does need to track missing groups, this PR still gives them a modest performance boost, because the code requires fewer iterations over the data when calculating the evaluations.

Here is a sample query for a rate aggregation with a group by on host.name, with "Alert me if a group stops reporting data" unchecked:

```json
{
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": 1644959592366,
              "lte": 1644959892366,
              "format": "epoch_millis"
            }
          }
        },
        {
          "exists": {
            "field": "system.network.in.bytes"
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "groupings": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "groupBy0": {
              "terms": {
                "field": "host.name"
              }
            }
          }
        ]
      },
      "aggs": {
        "aggregatedValue_first_bucket": {
          "filter": {
            "range": {
              "@timestamp": {
                "gte": 1644959592366,
                "lt": 1644959742366,
                "format": "epoch_millis"
              }
            }
          },
          "aggs": {
            "maxValue": {
              "max": {
                "field": "system.network.in.bytes"
              }
            }
          }
        },
        "aggregatedValue_second_bucket": {
          "filter": {
            "range": {
              "@timestamp": {
                "gte": 1644959742366,
                "lt": 1644959892366,
                "format": "epoch_millis"
              }
            }
          },
          "aggs": {
            "maxValue": {
              "max": {
                "field": "system.network.in.bytes"
              }
            }
          }
        },
        "aggregatedValue": {
          "bucket_script": {
            "buckets_path": {
              "first": "aggregatedValue_first_bucket.maxValue",
              "second": "aggregatedValue_second_bucket.maxValue"
            },
            "script": "params.second > 0.0 && params.first > 0.0 && params.second > params.first ? (params.second - params.first) / 150 : null"
          }
        },
        "shouldWarn": {
          "bucket_script": {
            "buckets_path": {},
            "script": "0"
          }
        },
        "shouldTrigger": {
          "bucket_script": {
            "buckets_path": {
              "value": "aggregatedValue"
            },
            "script": "params.value > 150000 ? 1 : 0"
          }
        },
        "selectedBucket": {
          "bucket_selector": {
            "buckets_path": {
              "shouldWarn": "shouldWarn",
              "shouldTrigger": "shouldTrigger"
            },
            "script": "params.shouldWarn > 0 || params.shouldTrigger > 0"
          }
        }
      }
    }
  }
}
```
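The aggregatedValue bucket_script above computes the rate from the max of each filter bucket, guarding against counter resets and missing data. A minimal TypeScript sketch of the same arithmetic (the function name and signature are illustrative, not the PR's actual code):

```typescript
// Sketch of the Painless bucket_script logic: rate = (secondMax - firstMax) / intervalSeconds.
// Returns null when either bucket is empty, when a value is non-positive, or when the
// counter did not increase (i.e. a likely counter reset), matching the script's guards.
function calculateRate(
  firstMax: number | null,
  secondMax: number | null,
  intervalSeconds: number
): number | null {
  if (firstMax == null || secondMax == null) return null;
  if (firstMax <= 0 || secondMax <= 0 || secondMax <= firstMax) return null;
  return (secondMax - firstMax) / intervalSeconds;
}
```

In the sample query each filter bucket covers 150 seconds, which is why the Painless script divides by 150.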

There is a caveat with this approach: when there is "no data" for the time range and the condition uses a document count, the shouldTrigger and shouldWarn bucket scripts will be missing from the response. For "non group by" queries, this means we need to treat the document count as ZERO and perform the evaluation in Kibana, in case the user's condition is doc_count < 1 or doc_count == 0. Fortunately, the performance cost is negligible in this scenario since we are only looking at a single bucket.
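The Kibana-side fallback described above can be sketched as follows (a hedged illustration, not the PR's actual code; the function and type names are assumptions):

```typescript
// When the bucket scripts are absent because no documents matched the range,
// evaluate the doc-count condition in Kibana, treating the missing count as zero.
type Comparator = '<' | '<=' | '>' | '>=' | '==';

function evaluateDocCount(
  docCount: number | undefined,
  comparator: Comparator,
  threshold: number
): boolean {
  const value = docCount ?? 0; // "no data" is treated as a document count of zero
  switch (comparator) {
    case '<':
      return value < threshold;
    case '<=':
      return value <= threshold;
    case '>':
      return value > threshold;
    case '>=':
      return value >= threshold;
    case '==':
      return value === threshold;
  }
}
```

This makes a condition like doc_count < 1 trigger correctly when the range matched nothing at all.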

This PR also changes the way we report missing groups for document count conditions. Prior to this PR, we would backfill missing groups with ZERO for document count rules and NULL for aggregated metrics. This is actually a bug, because the user asked to be alerted if a group stops reporting data: when we backfill with ZERO but the condition is doc_count > 1, the user would not get any notification for the missing groups. With this change, we trigger a NO DATA alert for missing groups regardless of the condition or metric, which matches the intent of the "Alert me if a group stops reporting data" option.
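The missing-group detection amounts to a set difference between the groups seen on a previous run and the groups in the current response. A minimal sketch (names are hypothetical, not taken from the PR):

```typescript
// Any group that reported data previously but is absent from the current
// query response gets a NO DATA alert, regardless of the condition or metric.
function findMissingGroups(previousGroups: string[], currentGroups: string[]): string[] {
  const current = new Set(currentGroups);
  return previousGroups.filter((group) => !current.has(group));
}
```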

This PR also removes the "Drop Partial Buckets" functionality since we've moved away from using the date_histogram for rate aggregations.

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
| --- | --- | --- | --- |
| Changes to how missing groups are handled in doc_count aggregations will produce a NO DATA alert instead of the triggered alert | High | Low | Users will still get notifications, but the type of alert will be different. |

@kibana-ci

kibana-ci commented Mar 1, 2022

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] Default CI Group #18 / apis MetricsUI Endpoints Metric Threshold Alerts Executor with rate data without groupBy should alert on rate

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
infra 927.8KB 927.0KB -845.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@simianhacker
Member Author

Closing in favor of #126214

@simianhacker simianhacker deleted the issue-118820-refactor-rate-aggs branch April 17, 2024 15:37

Development

Successfully merging this pull request may close these issues.

[Metrics UI] Refactor rate aggregation for Metric Threshold Alerts to eliminate "Drop Partial Buckets"