[Alerting] Skip running disabled rules by chrisronline · Pull Request #119239 · elastic/kibana

chrisronline · 2021-11-19T21:17:37Z

See the comments in the issue for some background on this change.

At a high level, we are attempting to fix a bug where if you quickly enable/disable a rule in the UI (or through the HTTP api), it will eventually error out due to a missing apiKey (related to this call.

After digging a bit, I'm fairly confident this bug is caused by a race condition between when we actually update the rule saved object from the disable route and when the task is removed as a result of disabling. If the rule executes in between these two async operations, the apiKey will not exist and will result in this error scenario.

To combat this, we have a couple choices:

Check if the rule is enabled earlier in the run block and bail on executing the rule
Enhance the apiKey check to also look for the enabled flag and handle a disabled state there

A couple of differences with these approaches. In approach 1, we'd avoid writing any event log entries that happen before we pull the apiKey and we'd also avoid needing to change the status of the rule itself - it'd simply show as disabled in the UI as we'd "pretend" the last rule execution never happened.

In approach 2, we'd need to write an error to the event log and therefore, would need a new error type and ability to communicate this state to the user in a helpful way. This is not ideal but it can be useful to help users understand why a rule didn't execute. Approach 1 also lends itself to a new issue where we could see a "saved object not found" error earlier in the stack trace and need to account for that to ensure users have the proper visibility into the state of the rule.

We ended up going with approach 2 and adding a new error due to "disabled" status for rules. I'm up for hearing better ways to handle this, as it's important the user understand how to get out of this error state (by simply reenabling the rule)

Here is how it looks in the UI:

Here is a sample event log document

{
  "_index": ".kibana-event-log-8.1.0-000001",
  "_id": "vNuQTX0BuBMBoQctzvXi",
  "_score": null,
  "_source": {
    "@timestamp": "2021-11-23T16:11:14.777Z",
    "event": {
      "provider": "alerting",
      "action": "execute",
      "kind": "alert",
      "category": ["stackAlerts"],
      "start": "2021-11-23T16:11:14.777Z",
      "end": "2021-11-23T16:11:14.779Z",
      "duration": 2000000,
      "reason": "disabled",
      "outcome": "failure"
    },
    "kibana": {
      "saved_objects": [
        {
          "rel": "primary",
          "type": "alert",
          "id": "f7fc6d50-4bd7-11ec-9374-d755ae9ba571",
          "type_id": ".es-query"
        }
      ],
      "task": {
        "scheduled": "2021-11-23T16:11:11.556Z",
        "schedule_delay": 3221000000
      },
      "alerting": {
        "status": "error"
      },
      "server_uuid": "5b2de169-2785-441b-ae8c-186a1936b17d",
      "version": "8.1.0"
    },
    "rule": {
      "id": "f7fc6d50-4bd7-11ec-9374-d755ae9ba571",
      "license": "basic",
      "category": ".es-query",
      "ruleset": "stackAlerts"
    },
    "error": {
      "message": "Rule failed to execute because rule ran after it was disabled."
    },
    "message": ".es-query:f7fc6d50-4bd7-11ec-9374-d755ae9ba571: execution failed",
    "ecs": {
      "version": "1.8.0"
    }
  },
  "sort": [1637683874777]
}

Here is a sample rule saved object

{
  "_index" : ".kibana_8.1.0_001",
  "_id" : "alert:f7fc6d50-4bd7-11ec-9374-d755ae9ba571",
  "_score" : 3.8148828,
  "_source" : {
    "alert" : {
      "params" : {
        "esQuery" : """{
"query":{
"match_all" : {}
}
}""",
        "size" : 100,
        "timeWindowSize" : 5,
        "timeWindowUnit" : "m",
        "threshold" : [
          0
        ],
        "thresholdComparator" : ">",
        "index" : [
          ".kibana-event-log-8.1.0"
        ],
        "timeField" : "@timestamp"
      },
      "consumer" : "alerts",
      "schedule" : {
        "interval" : "1s"
      },
      "tags" : [ ],
      "name" : "Break",
      "throttle" : null,
      "actions" : [
        {
          "actionTypeId" : ".server-log",
          "params" : {
            "level" : "info",
            "message" : "go"
          },
          "actionRef" : "preconfigured:my_server_log",
          "group" : "query matched"
        }
      ],
      "enabled" : false,
      "alertTypeId" : ".es-query",
      "notifyWhen" : "onActiveAlert",
      "apiKeyOwner" : null,
      "apiKey" : null,
      "legacyId" : null,
      "createdBy" : "elastic",
      "updatedBy" : "elastic",
      "createdAt" : "2021-11-22T21:05:50.057Z",
      "updatedAt" : "2021-11-23T16:11:14.071Z",
      "muteAll" : false,
      "mutedInstanceIds" : [ ],
      "executionStatus" : {
        "status" : "error",
        "lastExecutionDate" : "2021-11-23T16:11:14.777Z",
        "error" : {
          "reason" : "disabled",
          "message" : "Rule failed to execute because rule ran after it was disabled."
        },
        "lastDuration" : 2
      },
      "meta" : {
        "versionApiKeyLastmodified" : "8.1.0"
      },
      "scheduledTaskId" : null
    },
    "type" : "alert",
    "references" : [ ],
    "namespaces" : [
      "default"
    ],
    "migrationVersion" : {
      "alert" : "8.0.0"
    },
    "coreMigrationVersion" : "8.1.0",
    "updated_at" : "2021-11-23T16:11:14.781Z"
  }

…disabled_rule

elasticmachine · 2021-11-23T17:49:55Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

…disabled_rule

ymao1

LGTM! Verified that this works as expected. Nice job!

I think this is a separate issue, but when I clicked into the rule details, I expected to see the full error message but since the rule was disabled, I just see a disabled state with no error message. Since we are now showing some additional rule information in the rule details view (like last execution status and durations), maybe the disabled state should be updated? @mdefazio Any thoughts?

x-pack/plugins/alerting/server/task_runner/task_runner.ts

x-pack/plugins/alerting/server/task_runner/task_runner_cancel.test.ts

x-pack/plugins/alerting/server/task_runner/task_runner.test.ts

x-pack/plugins/triggers_actions_ui/public/application/sections/alerts_list/translations.ts

…disabled_rule

YulNaumenko

LGTM

kibana-ci · 2021-11-29T20:43:06Z

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`triggersActionsUi`	778.3KB	778.6KB	+360.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id	before	after	diff
`alerting`	36.6KB	36.7KB	+22.0B

History

💚 Build #9734 succeeded 7d1083ca4a36760e802ddd1b350b6dedf7c59e3e
💔 Build #9628 failed 681510532e0ee9ffa01643010e68996cec489016
💛 Build #8871 was flaky 040e521
💔 Build #8850 failed 933dc0e
💚 Build #8751 succeeded 446f462
💔 Build #8634 failed e2fea79

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @chrisronline

chrisronline · 2021-11-30T14:20:21Z

I think this is a separate issue, but when I clicked into the rule details, I expected to see the full error message but since the rule was disabled, I just see a disabled state with no error message. Since we are now showing some additional rule information in the rule details view (like last execution status and durations), maybe the disabled state should be updated?

@ymao1 Sorry, I missed this!

I'm happy to open a separate issue to discuss this, but my original thinking is the user just wants to disable the rule and the error message feels a bit unhelpful to persist (in general) as it doesn't/shouldn't affect them re-enabling the rule in the future. Perhaps this isn't the right thinking here, so lemme know your thoughts and I'm happy to open a new issue to discuss

mdefazio · 2021-11-30T14:34:44Z

Reading through the problem here I think I get a sense of the need to bubble up an error. Perhaps these are obvious statements, but this is what came to mind:

The current error message is likely confusing—I think I would expect it not to run if it was disabled. So having an error that it couldn't bc it was disabled feels odd. Perhaps I'm missing something here though.
- Is it perhaps just a warning instead of an error? "This rule was pending when disabled and could not complete..." or something like that?
I agree with @ymao1 in that it would be nice to persist some information if I disable a rule. If the rule had run previously, I can imagine it would be helpful to see that history of executions even though its disabled.
@chrisronline I think it's worth opening an issue. If we are showing an error in one view and then not in another that may cause confusion. However, I think you're right in that it might not make sense to show this in the same way we would other errors. So part of this is also determining the right copy to use.

chrisronline · 2021-11-30T15:52:36Z

@mdefazio @ymao1 Done! #119981

* Skip running disabled rules * Move to separate file * Add status that reflects a rule unable to run due to not enabled * Add better message * PR feedback

chrisronline added 6 commits November 19, 2021 16:07

Skip running disabled rules

ab6a9f9

Move to separate file

0b57ae9

Add status that reflects a rule unable to run due to not enabled

aa1b8f5

Merge remote-tracking branch 'elastic/main' into alerting/do_not_run_…

e2fea79

…disabled_rule

Merge remote-tracking branch 'elastic/main' into alerting/do_not_run_…

446f462

…disabled_rule

Merge remote-tracking branch 'elastic/main' into alerting/do_not_run_…

3dd11f6

…disabled_rule

chrisronline marked this pull request as ready for review November 23, 2021 17:49

chrisronline requested a review from a team as a code owner November 23, 2021 17:49

chrisronline added Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v8.0.0 labels Nov 23, 2021

chrisronline added the release_note:fix label Nov 23, 2021

chrisronline self-assigned this Nov 23, 2021

chrisronline added v8.1.0 and removed v8.0.0 labels Nov 23, 2021

chrisronline added 2 commits November 23, 2021 13:44

Add better message

933dc0e

Merge remote-tracking branch 'elastic/main' into alerting/do_not_run_…

040e521

…disabled_rule

ymao1 approved these changes Nov 29, 2021

View reviewed changes

chrisronline added 2 commits November 29, 2021 13:56

Merge remote-tracking branch 'elastic/main' into alerting/do_not_run_…

126b94c

…disabled_rule

PR feedback

9df58ce

chrisronline force-pushed the alerting/do_not_run_disabled_rule branch from 7d1083c to 9df58ce Compare November 29, 2021 19:27

YulNaumenko approved these changes Nov 29, 2021

View reviewed changes

chrisronline merged commit 1489426 into elastic:main Nov 29, 2021

chrisronline deleted the alerting/do_not_run_disabled_rule branch November 29, 2021 21:47

kibanamachine added the backport:skip This PR does not require backporting label Nov 29, 2021

chrisronline mentioned this pull request Nov 30, 2021

[Alerting] Improve messaging on rule detail page when rule is disabled #119981

Open

ymao1 mentioned this pull request Jan 4, 2022

[Security Solution] Disable legacy rules on upgrade to 8.x #121442

Merged

1 task

ymao1 mentioned this pull request Feb 11, 2022

[Security Solution] Rule is getting failed intermittently when enable or disable the rule even not able to enable the failed rule from rule details page #120872

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Alerting] Skip running disabled rules#119239

[Alerting] Skip running disabled rules#119239
chrisronline merged 10 commits intoelastic:mainfrom
chrisronline:alerting/do_not_run_disabled_rule

chrisronline commented Nov 19, 2021 •

edited

Loading

Uh oh!

elasticmachine commented Nov 23, 2021

Uh oh!

ymao1 left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

YulNaumenko left a comment

Uh oh!

kibana-ci commented Nov 29, 2021

Uh oh!

chrisronline commented Nov 30, 2021

Uh oh!

mdefazio commented Nov 30, 2021

Uh oh!

chrisronline commented Nov 30, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

Conversation

chrisronline commented Nov 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticmachine commented Nov 23, 2021

Uh oh!

ymao1 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

YulNaumenko left a comment

Choose a reason for hiding this comment

Uh oh!

kibana-ci commented Nov 29, 2021

💚 Build Succeeded

Metrics [docs]

Async chunks

Page load bundle

History

Uh oh!

chrisronline commented Nov 30, 2021

Uh oh!

mdefazio commented Nov 30, 2021

Uh oh!

chrisronline commented Nov 30, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

chrisronline commented Nov 19, 2021 •

edited

Loading