[Alerting] Skip running disabled rules #119239
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
ymao1
left a comment
LGTM! Verified that this works as expected. Nice job!
I think this is a separate issue, but when I clicked into the rule details, I expected to see the full error message but since the rule was disabled, I just see a disabled state with no error message. Since we are now showing some additional rule information in the rule details view (like last execution status and durations), maybe the disabled state should be updated? @mdefazio Any thoughts?
@ymao1 Sorry, I missed this! I'm happy to open a separate issue to discuss this, but my original thinking is the user just wants to disable the rule and the error message feels a bit unhelpful to persist (in general) as it doesn't/shouldn't affect them re-enabling the rule in the future. Perhaps this isn't the right thinking here, so lemme know your thoughts and I'm happy to open a new issue to discuss.
Reading through the problem here I think I get a sense of the need to bubble up an error. Perhaps these are obvious statements, but this is what came to mind:
|
* Skip running disabled rules
* Move to separate file
* Add status that reflects a rule unable to run due to not enabled
* Add better message
* PR feedback

Resolves #106292
See the comments in the issue for some background on this change.
At a high level, we are attempting to fix a bug where if you quickly enable/disable a rule in the UI (or through the HTTP API), it will eventually error out due to a missing `apiKey` (related to this call).
After digging a bit, I'm fairly confident this bug is caused by a race condition between when we actually update the rule saved object from the disable route and when the task is removed as a result of disabling. If the rule executes in between these two async operations, the `apiKey` will not exist, which results in this error scenario.
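To make the suspected interleaving concrete, here is a minimal sketch of the race in a simplified in-memory model. Every name in it (`FakeStore`, `updateRuleAsDisabled`, `removeTask`, `runTask`, `simulateRace`) is hypothetical and only illustrates the ordering described above; it is not the actual task runner or disable route code.

```typescript
// Hypothetical, simplified model of the race. None of these names are Kibana APIs.
interface RuleSavedObject {
  enabled: boolean;
  apiKey: string | null;
}

class FakeStore {
  rule: RuleSavedObject = { enabled: true, apiKey: "key-123" };
  taskScheduled = true;

  // Step 1 of disabling (assumed): update the rule saved object and drop its API key.
  updateRuleAsDisabled(): void {
    this.rule = { enabled: false, apiKey: null };
  }

  // Step 2 of disabling (assumed): remove the scheduled task.
  removeTask(): void {
    this.taskScheduled = false;
  }
}

// What a task run does in this model: it needs an apiKey to execute the rule.
function runTask(store: FakeStore): string {
  if (!store.taskScheduled) {
    return "skipped: no task";
  }
  if (store.rule.apiKey === null) {
    return "error: missing apiKey";
  }
  return "executed";
}

// Interleaving that reproduces the failure: the task fires between the two steps.
function simulateRace(): string[] {
  const store = new FakeStore();
  const outcomes: string[] = [];
  outcomes.push(runTask(store)); // rule enabled, task scheduled
  store.updateRuleAsDisabled();
  outcomes.push(runTask(store)); // saved object updated, task not yet removed
  store.removeTask();
  outcomes.push(runTask(store)); // task removed, nothing runs
  return outcomes;
}

const raceOutcomes: string[] = simulateRace();
```

In this model only the middle run fails, which matches the observation that the error shows up when a run lands between the saved object update and the task removal.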
To combat this, we have a couple of choices:
1. Check if the rule is enabled earlier in the run block and bail on executing the rule
2. Enhance the `apiKey` check to also look for the `enabled` flag and handle a disabled state there

There are a couple of differences between these approaches. In approach 1, we'd avoid writing any event log entries that happen before we pull the `apiKey`, and we'd also avoid needing to change the status of the rule itself: it'd simply show as disabled in the UI as we'd "pretend" the last rule execution never happened.
In approach 2, we'd need to write an error to the event log and would therefore need a new error type and the ability to communicate this state to the user in a helpful way. This is not ideal, but it can be useful to help users understand why a rule didn't execute. Approach 1 also lends itself to a new issue where we could see a "saved object not found" error earlier in the stack trace and need to account for that to ensure users have the proper visibility into the state of the rule.
We ended up going with approach 2 and adding a new error for the "disabled" status for rules. I'm up for hearing better ways to handle this, as it's important the user understands how to get out of this error state (by simply re-enabling the rule).
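As a rough illustration of approach 2, the sketch below checks the `enabled` flag alongside the `apiKey` and returns a dedicated "disabled" reason with a message telling the user how to recover. The names `RuleAttributes`, `ExecutionOutcome`, `resolveApiKey`, and the reason strings are assumptions made for this example and are not taken from the PR's actual implementation.

```typescript
// Hypothetical names throughout; this is not the code from the PR.
interface RuleAttributes {
  name: string;
  enabled: boolean;
  apiKey: string | null;
}

type ExecutionOutcome =
  | { status: "ok"; apiKey: string }
  | { status: "error"; reason: "disabled" | "missing_api_key"; message: string };

function resolveApiKey(rule: RuleAttributes): ExecutionOutcome {
  // Check the enabled flag first so a disabled rule gets its own error reason
  // instead of surfacing as a generic missing apiKey failure.
  if (!rule.enabled) {
    return {
      status: "error",
      reason: "disabled",
      message: `Rule "${rule.name}" did not run because it is disabled. Re-enable the rule to run it again.`,
    };
  }
  // An enabled rule without an apiKey is a different problem and keeps its own reason.
  if (rule.apiKey === null) {
    return {
      status: "error",
      reason: "missing_api_key",
      message: `Rule "${rule.name}" has no apiKey and cannot run.`,
    };
  }
  return { status: "ok", apiKey: rule.apiKey };
}

// Usage: a run that lands after the rule was disabled but before its task was removed.
const outcomeForDisabledRule: ExecutionOutcome = resolveApiKey({
  name: "cpu-threshold",
  enabled: false,
  apiKey: null,
});
```

Keeping the disabled case as a distinct reason is what would let the event log and the UI explain that re-enabling the rule is the way out of the error state.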
Here is how it looks in the UI:
Here is a sample event log document
Here is a sample rule saved object