Skip to content

[Alerting] Skip running disabled rules#119239

Merged
chrisronline merged 10 commits intoelastic:mainfrom
chrisronline:alerting/do_not_run_disabled_rule
Nov 29, 2021
Merged

[Alerting] Skip running disabled rules#119239
chrisronline merged 10 commits intoelastic:mainfrom
chrisronline:alerting/do_not_run_disabled_rule

Conversation

@chrisronline
Copy link
Copy Markdown
Contributor

@chrisronline chrisronline commented Nov 19, 2021

Resolves #106292

See the comments in the issue for some background on this change.

At a high level, we are attempting to fix a bug where if you quickly enable/disable a rule in the UI (or through the HTTP api), it will eventually error out due to a missing apiKey (related to this call.

After digging a bit, I'm fairly confident this bug is caused by a race condition between when we actually update the rule saved object from the disable route and when the task is removed as a result of disabling. If the rule executes in between these two async operations, the apiKey will not exist and will result in this error scenario.

To combat this, we have a couple choices:

  1. Check if the rule is enabled earlier in the run block and bail on executing the rule

  2. Enhance the apiKey check to also look for the enabled flag and handle a disabled state there

A couple of differences with these approaches. In approach 1, we'd avoid writing any event log entries that happen before we pull the apiKey and we'd also avoid needing to change the status of the rule itself - it'd simply show as disabled in the UI as we'd "pretend" the last rule execution never happened.

In approach 2, we'd need to write an error to the event log and therefore, would need a new error type and ability to communicate this state to the user in a helpful way. This is not ideal but it can be useful to help users understand why a rule didn't execute. Approach 1 also lends itself to a new issue where we could see a "saved object not found" error earlier in the stack trace and need to account for that to ensure users have the proper visibility into the state of the rule.

We ended up going with approach 2 and adding a new error due to "disabled" status for rules. I'm up for hearing better ways to handle this, as it's important the user understand how to get out of this error state (by simply reenabling the rule)

Here is how it looks in the UI:

Screen Shot 2021-11-23 at 11 22 28 AM

Here is a sample event log document
{
  "_index": ".kibana-event-log-8.1.0-000001",
  "_id": "vNuQTX0BuBMBoQctzvXi",
  "_score": null,
  "_source": {
    "@timestamp": "2021-11-23T16:11:14.777Z",
    "event": {
      "provider": "alerting",
      "action": "execute",
      "kind": "alert",
      "category": ["stackAlerts"],
      "start": "2021-11-23T16:11:14.777Z",
      "end": "2021-11-23T16:11:14.779Z",
      "duration": 2000000,
      "reason": "disabled",
      "outcome": "failure"
    },
    "kibana": {
      "saved_objects": [
        {
          "rel": "primary",
          "type": "alert",
          "id": "f7fc6d50-4bd7-11ec-9374-d755ae9ba571",
          "type_id": ".es-query"
        }
      ],
      "task": {
        "scheduled": "2021-11-23T16:11:11.556Z",
        "schedule_delay": 3221000000
      },
      "alerting": {
        "status": "error"
      },
      "server_uuid": "5b2de169-2785-441b-ae8c-186a1936b17d",
      "version": "8.1.0"
    },
    "rule": {
      "id": "f7fc6d50-4bd7-11ec-9374-d755ae9ba571",
      "license": "basic",
      "category": ".es-query",
      "ruleset": "stackAlerts"
    },
    "error": {
      "message": "Rule failed to execute because rule ran after it was disabled."
    },
    "message": ".es-query:f7fc6d50-4bd7-11ec-9374-d755ae9ba571: execution failed",
    "ecs": {
      "version": "1.8.0"
    }
  },
  "sort": [1637683874777]
}
Here is a sample rule saved object
{
  "_index" : ".kibana_8.1.0_001",
  "_id" : "alert:f7fc6d50-4bd7-11ec-9374-d755ae9ba571",
  "_score" : 3.8148828,
  "_source" : {
    "alert" : {
      "params" : {
        "esQuery" : """{
"query":{
"match_all" : {}
}
}""",
        "size" : 100,
        "timeWindowSize" : 5,
        "timeWindowUnit" : "m",
        "threshold" : [
          0
        ],
        "thresholdComparator" : ">",
        "index" : [
          ".kibana-event-log-8.1.0"
        ],
        "timeField" : "@timestamp"
      },
      "consumer" : "alerts",
      "schedule" : {
        "interval" : "1s"
      },
      "tags" : [ ],
      "name" : "Break",
      "throttle" : null,
      "actions" : [
        {
          "actionTypeId" : ".server-log",
          "params" : {
            "level" : "info",
            "message" : "go"
          },
          "actionRef" : "preconfigured:my_server_log",
          "group" : "query matched"
        }
      ],
      "enabled" : false,
      "alertTypeId" : ".es-query",
      "notifyWhen" : "onActiveAlert",
      "apiKeyOwner" : null,
      "apiKey" : null,
      "legacyId" : null,
      "createdBy" : "elastic",
      "updatedBy" : "elastic",
      "createdAt" : "2021-11-22T21:05:50.057Z",
      "updatedAt" : "2021-11-23T16:11:14.071Z",
      "muteAll" : false,
      "mutedInstanceIds" : [ ],
      "executionStatus" : {
        "status" : "error",
        "lastExecutionDate" : "2021-11-23T16:11:14.777Z",
        "error" : {
          "reason" : "disabled",
          "message" : "Rule failed to execute because rule ran after it was disabled."
        },
        "lastDuration" : 2
      },
      "meta" : {
        "versionApiKeyLastmodified" : "8.1.0"
      },
      "scheduledTaskId" : null
    },
    "type" : "alert",
    "references" : [ ],
    "namespaces" : [
      "default"
    ],
    "migrationVersion" : {
      "alert" : "8.0.0"
    },
    "coreMigrationVersion" : "8.1.0",
    "updated_at" : "2021-11-23T16:11:14.781Z"
  }

@chrisronline chrisronline marked this pull request as ready for review November 23, 2021 17:49
@chrisronline chrisronline requested a review from a team as a code owner November 23, 2021 17:49
@chrisronline chrisronline added Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v8.0.0 labels Nov 23, 2021
@elasticmachine
Copy link
Copy Markdown
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

Copy link
Copy Markdown
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Verified that this works as expected. Nice job!

I think this is a separate issue, but when I clicked into the rule details, I expected to see the full error message but since the rule was disabled, I just see a disabled state with no error message. Since we are now showing some additional rule information in the rule details view (like last execution status and durations), maybe the disabled state should be updated? @mdefazio Any thoughts?

Screen Shot 2021-11-29 at 1 28 17 PM

@chrisronline chrisronline force-pushed the alerting/do_not_run_disabled_rule branch from 7d1083c to 9df58ce Compare November 29, 2021 19:27
Copy link
Copy Markdown
Contributor

@YulNaumenko YulNaumenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kibana-ci
Copy link
Copy Markdown

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
triggersActionsUi 778.3KB 778.6KB +360.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
alerting 36.6KB 36.7KB +22.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @chrisronline

@chrisronline chrisronline merged commit 1489426 into elastic:main Nov 29, 2021
@chrisronline chrisronline deleted the alerting/do_not_run_disabled_rule branch November 29, 2021 21:47
@kibanamachine kibanamachine added the backport:skip This PR does not require backporting label Nov 29, 2021
@chrisronline
Copy link
Copy Markdown
Contributor Author

I think this is a separate issue, but when I clicked into the rule details, I expected to see the full error message but since the rule was disabled, I just see a disabled state with no error message. Since we are now showing some additional rule information in the rule details view (like last execution status and durations), maybe the disabled state should be updated?

@ymao1 Sorry, I missed this!

I'm happy to open a separate issue to discuss this, but my original thinking is the user just wants to disable the rule and the error message feels a bit unhelpful to persist (in general) as it doesn't/shouldn't affect them re-enabling the rule in the future. Perhaps this isn't the right thinking here, so lemme know your thoughts and I'm happy to open a new issue to discuss

@mdefazio
Copy link
Copy Markdown
Contributor

Reading through the problem here I think I get a sense of the need to bubble up an error. Perhaps these are obvious statements, but this is what came to mind:

  • The current error message is likely confusing—I think I would expect it not to run if it was disabled. So having an error that it couldn't bc it was disabled feels odd. Perhaps I'm missing something here though.
    • Is it perhaps just a warning instead of an error? "This rule was pending when disabled and could not complete..." or something like that?
  • I agree with @ymao1 in that it would be nice to persist some information if I disable a rule. If the rule had run previously, I can imagine it would be helpful to see that history of executions even though its disabled.
  • @chrisronline I think it's worth opening an issue. If we are showing an error in one view and then not in another that may cause confusion. However, I think you're right in that it might not make sense to show this in the same way we would other errors. So part of this is also determining the right copy to use.

@chrisronline
Copy link
Copy Markdown
Contributor Author

@mdefazio @ymao1 Done! #119981

TinLe pushed a commit to TinLe/kibana that referenced this pull request Dec 22, 2021
* Skip running disabled rules

* Move to separate file

* Add status that reflects a rule unable to run due to not enabled

* Add better message

* PR feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport:skip This PR does not require backporting Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework release_note:fix Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v8.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[alerting] API key lost when frequently disabling / enabling a rule

7 participants