Skip to content

[ResponseOps] resolve conflicts when updating alert docs after rule execution#166283

Merged
ymao1 merged 9 commits intoelastic:mainfrom
pmuellr:alerting/158403-alert-conflicts
Sep 28, 2023
Merged

[ResponseOps] resolve conflicts when updating alert docs after rule execution#166283
ymao1 merged 9 commits intoelastic:mainfrom
pmuellr:alerting/158403-alert-conflicts

Conversation

@pmuellr
Copy link
Contributor

@pmuellr pmuellr commented Sep 12, 2023

resolves: #158403

When conflicts are detected while updating alert docs after a rule runs, we'll try to resolve the conflict by mget()'ing the alert documents again, to get the updated OCC info _seq_no and _primary_term. We'll also get the current versions of "ad-hoc" updated fields (which caused the conflict), like workflow status, case assignments, etc. And then attempt to update the alert doc again, with that info, which should get it back up-to-date.

Note that the rule registry was not touched here. During this PR's development, I added the retry support to it, but then my function tests were failing because there were never any conflicts happening. Turns out rule registry mget's the alerts before it updates them, to get the latest values. So they won't need this fix.

It's also not clear to me if this can be exercised in serverless, since it requires the use of an alerting framework based AaD implementation AND the ability to ad-hoc update alerts. I think this can only be done with Elasticsearch Query and Index Threshold, and only when used in metrics scope, so it will show up in the metrics UX, which is where you can add the alerts to the case.

manual testing

It's hard! I've seen the conflict messages before, but it's quite difficult to get them to go off whenever you want. The basic idea is to get a rule that uses alerting framework AAD (not rule registry, which is not affected the same way with conflicts (they mget alerts right before updating them), set it to run on a 1s interval, and probably also configure TM to run a 1s interval, via the following configs:

xpack.alerting.rules.minimumScheduleInterval.value: "1s"
xpack.task_manager.poll_interval: 1000

You want to get the rule to execute often and generate a lot of alerts, and run for as long as possible. Then while it's running, add the generated alerts to cases. Here's the EQ rule definition I used:

image

I selected the alerts from the o11y alerts page, since you can't add alerts to cases from the stack page. Hmm. :-). Sort the alert list by low-high duration, so the newest alerts will be at the top. Refresh, select all the rules (set page to show 100), then add to case from the ... menu. If you force a conflict, you should see something like this in the Kibana logs:

[ERROR] [plugins.alerting] Error writing alerts: 168 successful, 100 conflicts, 0 errors:
[INFO ] [plugins.alerting] Retrying bulk update of 100 conflicted alerts
[INFO ] [plugins.alerting] Retried bulk update of 100 conflicted alerts succeeded

@pmuellr pmuellr force-pushed the alerting/158403-alert-conflicts branch 2 times, most recently from bd23e8d to 9e66ce9 Compare September 22, 2023 12:43
…xecution

resolves: elastic#158403

When conflicts are detected while updating alert docs after a rule
runs, we'll try to resolve the conflict by `mget()`'ing the alert
documents again, to get the updated OCC info `_seq_no` and
`_primary_term`.  We'll also get the current versions of "ad-hoc"
updated fields (which caused the conflict), like workflow status,
case assignments, etc.  And then attempt to update the alert doc again,
with that info, which should get it back up-to-date.

- hard-code the fields to refresh
- add skeletal version of a function test
- add some debugging for CI-only/not-local test failure
- change new rule type to wait for signal to finish
- a little clean up, no clue why this passes locally and fails in CI though
- dump rule type registry to see if rule type there
- remove diagnostic code from FT
- a real test that passes locally, for alerting framework
- add test for RR, but it's failing as it doesn't resolve conflicts yet
- fix some stuff, add support for upcoming untracked alert status
- change recover algorithm to retry subsequent times corectly
- remove RR support (not needed), so simplify other things
- remove more RR bits (TransportStuff) and add jest tests
- add multi-alert bulk update function test
- rebase main
@pmuellr pmuellr force-pushed the alerting/158403-alert-conflicts branch from 9e66ce9 to c4bbddb Compare September 22, 2023 12:55
@pmuellr pmuellr added Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v8.11.0 labels Sep 23, 2023
@pmuellr pmuellr marked this pull request as ready for review September 23, 2023 01:34
@pmuellr pmuellr requested a review from a team as a code owner September 23, 2023 01:34
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1
Copy link
Contributor

ymao1 commented Sep 27, 2023

@elasticmachine merge upstream

@ymao1
Copy link
Contributor

ymao1 commented Sep 28, 2023

@elasticmachine merge upstream

@kibana-ci
Copy link

kibana-ci commented Sep 28, 2023

💛 Build succeeded, but was flaky

  • Buildkite Build
  • Commit: a980767
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-166283-a980767306c3

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #27 / serverless common API Alerting APIs Alerting rules "after each" hook for "should retry when appropriate"
  • [job] [logs] FTR Configs #26 / serverless search UI empty pages should show search specific empty page in discover

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/alerting-api-integration-helpers 24 27 +3

Canvas Sharable Runtime

The Canvas "shareable runtime" is an bundle produced to enable running Canvas workpads outside of Kibana. This bundle is included in third-party webpages that embed canvas and therefor should be as slim as possible.

id before after diff
module count - 5603 +5603
total size - 5.9MB +5.9MB
Unknown metric groups

API count

id before after diff
@kbn/alerting-api-integration-helpers 24 27 +3

ESLint disabled line counts

id before after diff
alerting 89 90 +1

Total ESLint disabled count

id before after diff
alerting 91 92 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Works as expected.

@ymao1 ymao1 merged commit e6e3e2d into elastic:main Sep 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport:skip This PR does not require backporting ci:build-serverless-image Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v8.11.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Response Ops][Alerting] Discuss how to handle possible conflicts when updating alerts-as-data documents

5 participants