[ResponseOps] resolve conflicts when updating alert docs after rule execution by pmuellr · Pull Request #166283 · elastic/kibana

pmuellr · 2023-09-12T15:51:34Z

When conflicts are detected while updating alert docs after a rule runs, we'll try to resolve the conflict by mget()'ing the alert documents again, to get the updated OCC info _seq_no and _primary_term. We'll also get the current versions of "ad-hoc" updated fields (which caused the conflict), like workflow status, case assignments, etc. And then attempt to update the alert doc again, with that info, which should get it back up-to-date.

Note that the rule registry was not touched here. During this PR's development, I added the retry support to it, but then my function tests were failing because there were never any conflicts happening. Turns out rule registry mget's the alerts before it updates them, to get the latest values. So they won't need this fix.

It's also not clear to me if this can be exercised in serverless, since it requires the use of an alerting framework based AaD implementation AND the ability to ad-hoc update alerts. I think this can only be done with Elasticsearch Query and Index Threshold, and only when used in metrics scope, so it will show up in the metrics UX, which is where you can add the alerts to the case.

manual testing

It's hard! I've seen the conflict messages before, but it's quite difficult to get them to go off whenever you want. The basic idea is to get a rule that uses alerting framework AAD (not rule registry, which is not affected the same way with conflicts (they mget alerts right before updating them), set it to run on a 1s interval, and probably also configure TM to run a 1s interval, via the following configs:

xpack.alerting.rules.minimumScheduleInterval.value: "1s"
xpack.task_manager.poll_interval: 1000

You want to get the rule to execute often and generate a lot of alerts, and run for as long as possible. Then while it's running, add the generated alerts to cases. Here's the EQ rule definition I used:

I selected the alerts from the o11y alerts page, since you can't add alerts to cases from the stack page. Hmm. :-). Sort the alert list by low-high duration, so the newest alerts will be at the top. Refresh, select all the rules (set page to show 100), then add to case from the ... menu. If you force a conflict, you should see something like this in the Kibana logs:

[ERROR] [plugins.alerting] Error writing alerts: 168 successful, 100 conflicts, 0 errors:
[INFO ] [plugins.alerting] Retrying bulk update of 100 conflicted alerts
[INFO ] [plugins.alerting] Retried bulk update of 100 conflicted alerts succeeded

…xecution resolves: elastic#158403 When conflicts are detected while updating alert docs after a rule runs, we'll try to resolve the conflict by `mget()`'ing the alert documents again, to get the updated OCC info `_seq_no` and `_primary_term`. We'll also get the current versions of "ad-hoc" updated fields (which caused the conflict), like workflow status, case assignments, etc. And then attempt to update the alert doc again, with that info, which should get it back up-to-date. - hard-code the fields to refresh - add skeletal version of a function test - add some debugging for CI-only/not-local test failure - change new rule type to wait for signal to finish - a little clean up, no clue why this passes locally and fails in CI though - dump rule type registry to see if rule type there - remove diagnostic code from FT - a real test that passes locally, for alerting framework - add test for RR, but it's failing as it doesn't resolve conflicts yet - fix some stuff, add support for upcoming untracked alert status - change recover algorithm to retry subsequent times corectly - remove RR support (not needed), so simplify other things - remove more RR bits (TransportStuff) and add jest tests - add multi-alert bulk update function test - rebase main

…neric exception catcher at the top level

…-ref HEAD~1..HEAD --fix'

elasticmachine · 2023-09-23T01:34:29Z

Pinging @elastic/response-ops (Team:ResponseOps)

ymao1 · 2023-09-27T23:22:08Z

@elasticmachine merge upstream

ymao1 · 2023-09-28T11:49:43Z

@elasticmachine merge upstream

kibana-ci · 2023-09-28T12:24:44Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: a980767
Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-166283-a980767306c3

Failed CI Steps

Test Failures

[job] [logs] FTR Configs #27 / serverless common API Alerting APIs Alerting rules "after each" hook for "should retry when appropriate"
[job] [logs] FTR Configs #26 / serverless search UI empty pages should show search specific empty page in discover

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id	before	after	diff
`@kbn/alerting-api-integration-helpers`	24	27	+3

Canvas Sharable Runtime

The Canvas "shareable runtime" is an bundle produced to enable running Canvas workpads outside of Kibana. This bundle is included in third-party webpages that embed canvas and therefor should be as slim as possible.

id	before	after	diff
`module count`	-	5603	+5603
`total size`	-	5.9MB	+5.9MB

Unknown metric groups

API count

id	before	after	diff
`@kbn/alerting-api-integration-helpers`	24	27	+3

ESLint disabled line counts

id	before	after	diff
`alerting`	89	90	+1

Total ESLint disabled count

id	before	after	diff
`alerting`	91	92	+1

History

💔 Build #163228 failed e1f7b3f
💛 Build #161433 was flaky a602d7a
💔 Build #161414 failed c39c33b
💚 Build #161405 succeeded 6237a1c
💛 Build #161313 was flaky 1e47685

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

ymao1

LGTM! Works as expected.

pmuellr added the ci:build-serverless-image label Sep 12, 2023

pmuellr force-pushed the alerting/158403-alert-conflicts branch 2 times, most recently from bd23e8d to 9e66ce9 Compare September 22, 2023 12:43

pmuellr force-pushed the alerting/158403-alert-conflicts branch from 9e66ce9 to c4bbddb Compare September 22, 2023 12:55

fix alert_client jest tests

1e47685

This was referenced Sep 22, 2023

[ResponseOps] move ad-hoc alert doc updates into alerting plugin #167082

Open

[ResponseOps] add proper support for ad-hoc updates to kibana.alert.status for "untracked" values #167086

Closed

pmuellr and others added 5 commits September 22, 2023 16:25

use ALERT_ constants instead of literals, add more jest tests, add ge…

e408f5f

…neric exception catcher at the top level

fix a few wrong things while tidying up

6237a1c

comment all the things

c39c33b

revert a changed that caused tons of property name test errors

7c01cc4

[CI] Auto-commit changed files from 'node scripts/precommit_hook.js -…

a602d7a

…-ref HEAD~1..HEAD --fix'

pmuellr added Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v8.11.0 labels Sep 23, 2023

pmuellr marked this pull request as ready for review September 23, 2023 01:34

pmuellr requested a review from a team as a code owner September 23, 2023 01:34

Merge branch 'main' into alerting/158403-alert-conflicts

e1f7b3f

Merge branch 'main' into alerting/158403-alert-conflicts

a980767

ymao1 approved these changes Sep 28, 2023

View reviewed changes

ymao1 merged commit e6e3e2d into elastic:main Sep 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ResponseOps] resolve conflicts when updating alert docs after rule execution#166283

[ResponseOps] resolve conflicts when updating alert docs after rule execution#166283
ymao1 merged 9 commits intoelastic:mainfrom
pmuellr:alerting/158403-alert-conflicts

pmuellr commented Sep 12, 2023 •

edited

Loading

Uh oh!

elasticmachine commented Sep 23, 2023

Uh oh!

ymao1 commented Sep 27, 2023

Uh oh!

ymao1 commented Sep 28, 2023

Uh oh!

kibana-ci commented Sep 28, 2023 •

edited

Loading

API count

ESLint disabled line counts

Total ESLint disabled count

Uh oh!

ymao1 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

pmuellr commented Sep 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

manual testing

Uh oh!

elasticmachine commented Sep 23, 2023

Uh oh!

ymao1 commented Sep 27, 2023

Uh oh!

ymao1 commented Sep 28, 2023

Uh oh!

kibana-ci commented Sep 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

Metrics [docs]

Public APIs missing comments

Canvas Sharable Runtime

API count

ESLint disabled line counts

Total ESLint disabled count

History

Uh oh!

ymao1 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

pmuellr commented Sep 12, 2023 •

edited

Loading

kibana-ci commented Sep 28, 2023 •

edited

Loading