[Alerting v2] [Rule Doctor] Run API and deduplication workflow by dominiqueclarke · Pull Request #266668 · elastic/kibana

dominiqueclarke · 2026-04-30T13:08:14Z

Summary

Adds the Rule Doctor Run API and initial deduplication workflow, enabling users to trigger AI-powered analyses that identify duplicate or near-duplicate alerting rules and produce actionable insights.

Closes #266648

Changes

Run API (POST /api/alerting/v2/rule_doctor/run) — accepts { "type": "deduplication" }, schedules the workflow, returns 202 with an execution_id
Deduplication workflow YAML — fetches rules from the space, sends them to an AI connector for duplicate analysis, validates results, and persists findings
Workflow step types — registers alerting_v2.validate_rules and alerting_v2.persist_findings with the workflows extensions plugin
Bulk dismiss — bulkDismissInsights marks stale insights as dismissed in a single ES bulk operation
workflowsExtensions plugin dependency — added to kibana.jsonc and wired into setup
API path constant — ALERTING_V2_RULE_DOCTOR_RUN_API_PATH added to @kbn/alerting-v2-constants
Production hardening — request-scoped SO client, typed workflow patches (no as any), 4xx/5xx log level differentiation, schema validation at the trust boundary
Index readiness — ensureResourceReady called before scheduling workflow to guarantee insights index exists
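
As a rough illustration of the Run API contract in the list above, the sketch below builds the request and reads the `execution_id` out of a 202 response. The path and body come from this description. The helper names and header values are assumptions for illustration (Kibana normally requires a `kbn-xsrf` header on non-GET API calls), and none of this is the PR's own code.

```typescript
// Sketch only: helper names are hypothetical, not part of the PR.
const RULE_DOCTOR_RUN_PATH = '/api/alerting/v2/rule_doctor/run';

interface RunRequest {
  method: 'POST';
  path: string;
  headers: Record<string, string>;
  body: string;
}

// Build the request that triggers a deduplication analysis.
function buildRunRequest(type: 'deduplication'): RunRequest {
  return {
    method: 'POST',
    path: RULE_DOCTOR_RUN_PATH,
    headers: {
      'Content-Type': 'application/json',
      // Assumption: standard Kibana XSRF header for non-GET requests.
      'kbn-xsrf': 'true',
    },
    body: JSON.stringify({ type }),
  };
}

// Return execution_id from the response; any status other than 202 is an error.
function parseRunResponse(status: number, rawBody: string): string {
  if (status !== 202) {
    throw new Error(`Expected 202 Accepted, got ${status}`);
  }
  const parsed: unknown = JSON.parse(rawBody);
  if (
    typeof parsed !== 'object' ||
    parsed === null ||
    typeof (parsed as { execution_id?: unknown }).execution_id !== 'string'
  ) {
    throw new Error('Response is missing execution_id');
  }
  return (parsed as { execution_id: string }).execution_id;
}
```

Keeping request construction and response parsing as pure functions means either can be checked without a running Kibana instance.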

Schema tightening

rule_ids: now required (every insight must reference rules)
current: now required — always present as an object keyed by rule ID containing each rule's config snapshot
proposed: now required — always present as an object keyed by rule ID showing each rule's post-action state (null for rules being deleted)
Both current and proposed share the same shape: { [rule_id]: config | null }
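
To make the tightened shape concrete, here is a minimal type-guard sketch. The type and function names are hypothetical, and the PR's actual schema code is not shown here; the sketch only encodes the three rules listed above (all three fields required, `current` and `proposed` keyed by rule ID, values being a config object or `null`).

```typescript
// Hypothetical names; this mirrors the notes above, not the PR's schema code.
type RuleConfig = Record<string, unknown>;
type RuleSnapshots = Record<string, RuleConfig | null>;

interface TightenedInsight {
  rule_ids: string[];
  current: RuleSnapshots;
  proposed: RuleSnapshots;
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Every value must be a config object or null (null marks a deleted rule).
function isRuleSnapshots(value: unknown): value is RuleSnapshots {
  return (
    isPlainObject(value) &&
    Object.values(value).every((v) => v === null || isPlainObject(v))
  );
}

// rule_ids, current, and proposed must all be present with the right shape.
function isTightenedInsight(value: unknown): value is TightenedInsight {
  if (!isPlainObject(value)) return false;
  const { rule_ids: ruleIds, current, proposed } = value;
  return (
    Array.isArray(ruleIds) &&
    ruleIds.every((id) => typeof id === 'string') &&
    isRuleSnapshots(current) &&
    isRuleSnapshots(proposed)
  );
}
```

Under this sketch, `{ rule_ids: [], current: {}, proposed: {} }` passes, while a document with `current: null` or without `rule_ids` is rejected.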

Workflow fixes

Correction loop accumulation — update_validation now concatenates loop_valid_insights with revalidated results instead of overwriting, preserving all valid insights across correction iterations
Undefined dismiss_ids fallback — added | default: [] to dismiss_ids input of persist_results step, preventing errors when evaluate_existing is skipped
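
The two fixes can be pictured in TypeScript terms. The real logic lives in the workflow YAML and its template expressions, so the functions below are only an analogy with hypothetical names: one contrasts overwriting with accumulating across correction iterations, the other mirrors the `| default: []` fallback.

```typescript
// Analogy only: the actual fix is in the workflow YAML, not in TypeScript.
interface Insight {
  insight_id: string;
}

// Before the fix, as described: each iteration replaced the running list.
function overwrite(_loopValid: Insight[], revalidated: Insight[]): Insight[] {
  return revalidated;
}

// After the fix: keep earlier valid insights and append the revalidated ones.
function accumulate(loopValid: Insight[], revalidated: Insight[]): Insight[] {
  return [...loopValid, ...revalidated];
}

// Mirrors `| default: []`: an undefined input becomes an empty list.
function resolveDismissIds(dismissIds: string[] | undefined): string[] {
  return dismissIds ?? [];
}
```

With two correction iterations that each recover one insight, `accumulate` ends with both insights, whereas `overwrite` keeps only the last one.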

Sample Insight Documents

Results from running the deduplication analysis against seed rules:

Insight 1: Service error logs — severity field vs body text overlap

{
  "title": "Service error logs - severity field vs body text overlap",
  "type": "deduplication",
  "action": "merge",
  "impact": "high",
  "confidence": "high",
  "summary": "Rules 7db86a39 and 0e113c81 both detect service errors from the same data source but use different detection methods (severity_text field vs body text keyword matching). Rule 7db86a39 is more reliable as it uses the structured severity_text field, while 0e113c81 is a legacy approach prone to false positives.",
  "justification": "Both rules query the same index pattern (logs-*.otel-*), group by service name, and fire on error conditions. Rule 7db86a39 uses the structured severity_text == ERROR field which is more reliable than 0e113c81's text pattern matching. The legacy rule should be consolidated into the structured approach.",
  "rule_ids": [
    "7db86a39-c861-45db-8016-3ccb40de835c",
    "0e113c81-2a32-4095-9f8a-1293c3e7edd1"
  ],
  "current": {
    "0e113c81-2a32-4095-9f8a-1293c3e7edd1": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE body.text LIKE \"*error*\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 5",
      "name": "Service error logs (body text)",
      "tags": ["rule-doctor-seed", "logs", "errors"]
    },
    "7db86a39-c861-45db-8016-3ccb40de835c": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 10",
      "name": "Service error logs (severity)",
      "tags": ["rule-doctor-seed", "logs", "errors", "services"]
    }
  },
  "proposed": {
    "0e113c81-2a32-4095-9f8a-1293c3e7edd1": null,
    "7db86a39-c861-45db-8016-3ccb40de835c": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 10",
      "name": "Service error logs (severity)",
      "tags": ["rule-doctor-seed", "logs", "errors", "services"]
    }
  },
  "diffs": [
    { "field": "0e113c81-2a32-4095-9f8a-1293c3e7edd1", "previous": "Service error logs (body text) - active rule", "proposed": "deleted" }
  ],
  "status": "open"
}

Insight 2: Service error logs — threshold consolidation

{
  "title": "Service error logs - threshold consolidation",
  "type": "deduplication",
  "action": "merge",
  "impact": "high",
  "confidence": "high",
  "summary": "Rules 7db86a39 and 53111728 both monitor service error logs using the same severity_text field and data source, but with different thresholds (>10 vs >5) and schedules (5m vs 1m). The lower threshold rule fires more frequently and is more sensitive.",
  "justification": "Both rules use identical detection logic (severity_text == ERROR grouped by service name) on the same index. Rule 53111728 has a lower threshold (>5) and faster schedule (1m), making it more sensitive. Consolidating to the more sensitive rule with 1m schedule provides better coverage while eliminating redundancy.",
  "rule_ids": [
    "7db86a39-c861-45db-8016-3ccb40de835c",
    "53111728-d857-4d0b-97dd-fbac832cea96"
  ],
  "current": {
    "7db86a39-c861-45db-8016-3ccb40de835c": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 10",
      "name": "Service error logs (severity)",
      "tags": ["rule-doctor-seed", "logs", "errors", "services"]
    },
    "53111728-d857-4d0b-97dd-fbac832cea96": {
      "schedule": "1m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 5",
      "name": "Service error logs > 5",
      "tags": ["rule-doctor-seed", "logs", "errors", "services"]
    }
  },
  "proposed": {
    "7db86a39-c861-45db-8016-3ccb40de835c": null,
    "53111728-d857-4d0b-97dd-fbac832cea96": {
      "schedule": "1m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 5",
      "name": "Service error logs > 5",
      "tags": ["rule-doctor-seed", "logs", "errors", "services"]
    }
  },
  "diffs": [
    { "field": "7db86a39-c861-45db-8016-3ccb40de835c", "previous": "Service error logs (severity) - active rule", "proposed": "deleted" }
  ],
  "status": "open"
}

Insight 3: Pod log volume — threshold overlap

{
  "title": "Pod log volume - threshold overlap",
  "type": "deduplication",
  "action": "merge",
  "impact": "high",
  "confidence": "high",
  "summary": "Rules 11034581 and 0c895cd0 both monitor pod log volume from the same data source and group by pod name, but with different thresholds (>100k vs >50k). The lower threshold rule is more sensitive and will fire more frequently.",
  "justification": "Both rules query logs-*.otel-* grouped by k8s.pod.name with identical filtering logic. Rule 0c895cd0 has a lower threshold (>50k) making it more sensitive. The higher threshold rule (11034581) is redundant as the lower threshold rule will catch all conditions the higher threshold would detect.",
  "rule_ids": [
    "11034581-ba8f-41ef-a94f-8ea6c5f4df9d",
    "0c895cd0-c340-46b0-acfb-20526c061824"
  ],
  "current": {
    "0c895cd0-c340-46b0-acfb-20526c061824": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE k8s.pod.name IS NOT NULL | STATS log_count = COUNT(*) BY k8s.pod.name | WHERE log_count > 50000",
      "name": "Pod log flood > 50k",
      "tags": ["rule-doctor-seed", "logs", "kubernetes"]
    },
    "11034581-ba8f-41ef-a94f-8ea6c5f4df9d": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE k8s.pod.name IS NOT NULL | STATS log_count = COUNT(*) BY k8s.pod.name | WHERE log_count > 100000",
      "name": "High pod log volume",
      "tags": ["rule-doctor-seed", "logs", "kubernetes", "volume"]
    }
  },
  "proposed": {
    "0c895cd0-c340-46b0-acfb-20526c061824": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE k8s.pod.name IS NOT NULL | STATS log_count = COUNT(*) BY k8s.pod.name | WHERE log_count > 50000",
      "name": "Pod log flood > 50k",
      "tags": ["rule-doctor-seed", "logs", "kubernetes"]
    },
    "11034581-ba8f-41ef-a94f-8ea6c5f4df9d": null
  },
  "diffs": [
    { "field": "11034581-ba8f-41ef-a94f-8ea6c5f4df9d", "previous": "High pod log volume - active rule", "proposed": "deleted" }
  ],
  "status": "open"
}

Test plan

Unit tests pass (node scripts/jest for validate_rules, persist_findings, run_rule_doctor_route, rule_doctor_insights_client)
Trigger deduplication via POST /api/alerting/v2/rule_doctor/run with { "type": "deduplication" } — returns 202 with execution_id
Workflow completes and insights are persisted to .rule-doctor-insights index
Insights match the tightened schema (rule_ids, current, proposed all present)
Correction loop accumulates valid insights across iterations (not overwriting)
dismiss_ids defaults to [] when evaluate_existing step is skipped


macroscopeapp · 2026-04-30T13:09:19Z

Catch flakiness early (recommended): run the flaky test runner against this PR before merging.

New Scout API specs (rule_doctor_insights.spec.ts, run_rule_doctor.spec.ts) create/delete ES indices and issue multiple API calls in hooks, so stability is unknown.

Trigger a run with the Flaky Test Runner UI or post this comment on the PR:

/flaky scoutConfig:x-pack/platform/plugins/shared/alerting_v2/test/scout_alerting_v2/api/playwright.config.ts:30

Share feedback in the #appex-qa channel.

Posted via Macroscope - Flaky Test Runner nudge

macroscopeapp · 2026-04-30T13:39:27Z

Approvability

Verdict: Needs human review

This PR introduces a substantial new feature (Rule Doctor run API and deduplication workflow) with AI integration, new step types, and workflow orchestration. The author does not own any of the modified files, which are all owned by @elastic/rna-project-team.


…y proposed by rule_id

- rule_ids: remove .optional(); every insight must reference rules
- current: remove .optional().nullable(); always present as keyed-by-rule-id object
- proposed: remove .optional().nullable(); always present as keyed-by-rule-id object (individual values are null for deleted rules)
- Update deduplication workflow prompt and schema to instruct the LLM to key proposed by rule_id matching the current field shape
- Update test fixture to use {} instead of null for current/proposed

Made-with: Cursor

dominiqueclarke · 2026-04-30T14:53:25Z

Right now, if a rule is suggested for deletion (as part of deduplication consolidation), the agent marks it as null in the proposed object (which is keyed by rule ID). I'd like to rethink that, either in this PR or a follow-up, to make the suggested action (delete) more explicit.
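
For context, this is roughly what a consumer has to do today to recover the delete action from the `proposed` object. It is a sketch with hypothetical names, based only on the sample documents in the description, where a `null` value means the rule would be removed.

```typescript
// Hypothetical helper; shows how deletions are currently implied by null values.
type RuleConfig = Record<string, unknown>;
type Proposed = Record<string, RuleConfig | null>;

function partitionProposed(proposed: Proposed): { kept: string[]; deleted: string[] } {
  const kept: string[] = [];
  const deleted: string[] = [];
  for (const [ruleId, config] of Object.entries(proposed)) {
    // null is the only signal that a rule is proposed for deletion.
    (config === null ? deleted : kept).push(ruleId);
  }
  return { kept, deleted };
}

// Rule IDs taken from Insight 1 in the PR description.
const result = partitionProposed({
  '0e113c81-2a32-4095-9f8a-1293c3e7edd1': null,
  '7db86a39-c861-45db-8016-3ccb40de835c': { schedule: '5m' },
});
console.log(result.deleted); // the single ID whose proposed value is null
```

Because the intent is carried by a sentinel value rather than a named field, every reader of the document needs this kind of inference step.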


macroscopeapp · 2026-05-04T21:05:00Z

+  API_HEADERS,
+  RULE_DOCTOR_RUN_API_PATH,
+  ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID,
+} from '../fixtures';


Use constants for shared test values

API_HEADERS, RULE_DOCTOR_RUN_API_PATH, and ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID are not exported from ../fixtures — this will fail to compile. Additionally, apiTest should come from the local extended fixture (line 9), not directly from @kbn/scout.


The sibling test files in this directory (find_rules.spec.ts, rule_doctor_insights.spec.ts) all import the extended apiTest and use testData.COMMON_HEADERS from ../fixtures. The new test diverges from that pattern and references three exports that don't exist in fixtures/index.ts.

Suggested fix:

Add the missing constants to common/constants.ts and re-export through fixtures/index.ts:

// common/constants.ts
export { ALERTING_V2_RULE_DOCTOR_RUN_API_PATH as RULE_DOCTOR_RUN_API_PATH } from '@kbn/alerting-v2-constants';
export { ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID } from '../../../common/advanced_settings';

Update the spec imports to match the established pattern:

-import { apiTest, tags } from '@kbn/scout';
+import { tags } from '@kbn/scout';
 import type { RoleApiCredentials } from '@kbn/scout';
 import { RULE_DOCTOR_DEDUP_WORKFLOW_ID } from '../../../../server/workflows/load_workflows';
-import {
-  API_HEADERS,
-  RULE_DOCTOR_RUN_API_PATH,
-  ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID,
-} from '../fixtures';
+import { apiTest, testData } from '../fixtures';
+import {
+  RULE_DOCTOR_RUN_API_PATH,
+  ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID,
+} from '../../common/constants';

Then replace every API_HEADERS usage with testData.COMMON_HEADERS to stay consistent with the rest of the suite.


Posted via Macroscope - Scout Test Review

dominiqueclarke · 2026-05-05T13:46:07Z

+      method: GET
+      path: '/api/alerting/v2/rules'
+      query:
+        perPage: 200


There will be a separate issue for handling large numbers of rules gracefully. Experimentation is happening now.

kibanamachine · 2026-05-05T14:54:10Z

💚 Build Succeeded

Buildkite Build
Commit: f6e74de

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id	before	after	diff
`@kbn/alerting-v2-constants`	6	7	+1

Unknown metric groups

API count

id	before	after	diff
`@kbn/alerting-v2-constants`	14	15	+1

History

💔 Build #437822 failed d9fb6b4
💛 Build #437736 was flaky 3e6da64
💔 Build #437604 failed a9c56fb
💔 Build #437122 failed 368957b
💔 Build #436509 failed 4ab3b50

kdelemme

Overall this looks good to me. I have a few questions and recommendations.

kdelemme · 2026-05-06T12:15:15Z

+    private readonly resourceManager?: ResourceManagerContract,
+    private readonly rawLogger?: Logger


Why are they optional? And why are they not @injected?

Answered myself below

kdelemme · 2026-05-06T12:22:32Z

+    const { type } = this.request.body;
+    const executionId = uuidv4();
+    const spaceId = this.spaceContext.spaceId;
+    const connectorId = await this.getDefaultConnectorId();
+
+    await this.insightsClient.ensureIndex();
+    const workflow = await ensureRuleDoctorAnalysisWorkflow(
+      type,
+      this.workflowsManagement,
+      spaceId,
+      this.request,
+      this.logger
+    );
+
+    await this.workflowsManagement.scheduleWorkflow(
+      workflow,
+      spaceId,
+      { space_id: spaceId, execution_id: executionId, connector_id: connectorId },
+      this.request,
+      'rule_doctor'
+    );


This route handler is probably doing too much. I think it would be worth exposing this logic as an application service so it is decoupled from the HTTP layer and can be reused from a client if needed. It also becomes easier to test, since it is no longer bound to a request.
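
One possible shape for such a service, as a sketch only: every name below is a hypothetical stand-in for the collaborators seen in the handler, and the interfaces are deliberately minimal. The real `scheduleWorkflow` call also takes the request, presumably for scoping; how that would be threaded through a service is left open here.

```typescript
// Hypothetical interfaces standing in for the handler's collaborators.
interface InsightsIndex {
  ensureIndex(): Promise<void>;
}

interface WorkflowScheduler {
  schedule(type: string, spaceId: string, inputs: Record<string, string>): Promise<void>;
}

interface RunParams {
  type: 'deduplication';
  spaceId: string;
  connectorId: string;
}

// Factory for a service with no HTTP types in its signature.
function createRuleDoctorRunService(deps: {
  insights: InsightsIndex;
  scheduler: WorkflowScheduler;
  newId: () => string;
}) {
  return {
    // Ensures the index exists, schedules the workflow, returns the execution ID.
    async run({ type, spaceId, connectorId }: RunParams): Promise<string> {
      const executionId = deps.newId();
      await deps.insights.ensureIndex();
      await deps.scheduler.schedule(type, spaceId, {
        space_id: spaceId,
        execution_id: executionId,
        connector_id: connectorId,
      });
      return executionId;
    },
  };
}
```

The route handler would then reduce to reading `type` from the body, calling `run`, and returning 202 with the ID, while tests can pass in-memory fakes for `insights` and `scheduler`.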

kdelemme · 2026-05-06T12:27:13Z

+    const { insights = [], dismiss_ids: dismissIds = [], space_id: spaceId } = context.input;
+    const esClient = context.contextManager.getScopedEsClient();
+    const logger = adaptLogger(context.logger);
+    const client = new RuleDoctorInsightsClient(esClient, logger);


OK, I understand why we have the optional resourceManager and Logger now.
Quick question: is it possible for someone to use a workflow with this step type without ever calling the Rule Doctor Run API, so that the index is never created?
These optional parameters are a code smell IMO. Can we initialize the managed resources on plugin setup/start instead?

kdelemme · 2026-05-06T12:31:07Z

+        ...rawObj,
+        space_id: spaceId,
+        execution_id: executionId,
+        insight_id: rawObj.insight_id || `insight-${uuidv4().slice(0, 8)}`,


uuidv4().slice(0,8)

Truncating a UUIDv4 to 8 characters drastically increases the risk of collisions. A full 36-character UUIDv4 is practically unique, but 8 hex characters carry only 32 bits, so by the birthday bound the collision probability is already about 39% after 2^16 (65,536) generated IDs and passes 50% at roughly 77,000.

Are we expecting the default fallback to almost never be used? I would recommend using the full uuidv4, or at least nanoid(12).
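
The collision estimate can be checked with the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2N)), where N is the size of the ID space. The sketch below assumes the 8 retained characters are uniformly random hex digits (32 bits), which holds for the first 8 characters of a UUIDv4; the function names are illustrative.

```typescript
// Probability of at least one collision among n IDs drawn uniformly
// from a space of size spaceSize (birthday approximation).
function collisionProbability(n: number, spaceSize: number): number {
  return 1 - Math.exp(-(n * (n - 1)) / (2 * spaceSize));
}

// Approximate number of IDs at which the collision probability reaches p.
function idsForProbability(p: number, spaceSize: number): number {
  return Math.ceil(Math.sqrt(2 * spaceSize * Math.log(1 / (1 - p))));
}

const SPACE_8_HEX = 2 ** 32; // 8 hex characters = 32 bits

console.log(collisionProbability(65536, SPACE_8_HEX)); // a bit under 0.4
console.log(idsForProbability(0.5, SPACE_8_HEX)); // on the order of 77,000
```

So the risk is small for a handful of fallback IDs per run but grows quickly if the fallback is hit often over the lifetime of an index.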

kdelemme · 2026-05-06T12:32:13Z

+  const existing = await managementApi.getWorkflow(workflowId, spaceId);
+
+  if (existing) {
+    if (existing.yaml !== yaml || !existing.enabled || !existing.valid) {
+      await managementApi.updateWorkflow(workflowId, { yaml, enabled: true }, spaceId, request);
+      logger.info(`Updated workflow ${workflowId}`);
+    }
+  } else {
+    await managementApi.createWorkflow({ yaml, id: workflowId }, spaceId, request);
+    await managementApi.updateWorkflow(workflowId, { yaml, enabled: true }, spaceId, request);
+    logger.info(`Created workflow ${workflowId}`);
+  }
+
+  const workflow = (await managementApi.getWorkflow(workflowId, spaceId))!;


Should we handle errors here? Maybe retry on transient failures?
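
A minimal sketch of what retrying on transient failures could look like. `withRetry` and its options are hypothetical, not part of the PR or of any Kibana API, and deciding which errors count as transient is left to the caller.

```typescript
// Hypothetical helper: retries an async operation with exponential backoff.
async function withRetry<T>(
  operation: () => Promise<T>,
  options: {
    attempts: number;
    baseDelayMs: number;
    isTransient: (error: unknown) => boolean;
  }
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Stop on the last attempt or on errors that will not heal by retrying.
      if (attempt === options.attempts || !options.isTransient(error)) break;
      // Backoff doubles each time: baseDelayMs, 2x, 4x, ...
      const delayMs = options.baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

The create/update/get sequence above could be wrapped as a single operation, provided each step is safe to repeat; if `createWorkflow` is not idempotent, a retry would need to re-check `getWorkflow` first.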

kdelemme · 2026-05-06T12:35:26Z

Also we might want to wait for #267924 to land and use the workflow registration service it brings

dominiqueclarke and others added 14 commits April 23, 2026 14:24

add rule doctor page and flags

afdf505

add rule doctor index

1a881bc

adjust settings

214234b

Changes from node scripts/eslint_all_files --no-cache --fix

c744e40

automatic cleaning of documents

7b3cbcb

Merge branch 'feat/rule-doctor-page-and-flags' of https://github.com/…

47ed058

…dominiqueclarke/kibana into feat/rule-doctor-page-and-flags

add rule doctor indices

7bb3a01

merge main

8885e13

adjust terminology

e506602

adjust types and remove unnecessary code

7f0be25

Changes from node scripts/eslint_all_files --no-cache --fix

ce53f8c

adjust schemas

c78f916

merge origin

f127fc8

add initial rule doctor run api with deduplication workflow

b243bf2

github-actions Bot added the author:actionable-obs PRs authored by the actionable obs team label Apr 30, 2026

dominiqueclarke changed the title ~~[Alerting v2] Rule Doctor: Run API and deduplication workflow~~ [Alerting v2] [Rule Doctor] Run API and deduplication workflow Apr 30, 2026

macroscopeapp Bot reviewed Apr 30, 2026


merge main

3ccdbe3

dominiqueclarke added backport:skip This PR does not require backporting release_note:skip Skip the PR/issue when compiling release notes v9.5.0 labels Apr 30, 2026

dominiqueclarke marked this pull request as ready for review April 30, 2026 13:33

dominiqueclarke requested a review from a team as a code owner April 30, 2026 13:33

kibanamachine and others added 4 commits April 30, 2026 13:41

Changes from node scripts/lint_ts_projects --fix

5ff485a

Changes from node scripts/regenerate_moon_projects.js --update

50aa327

Changes from node scripts/eslint_all_files --no-cache --fix

3cb178e

macroscopeapp Bot reviewed Apr 30, 2026


Comment thread x-pack/platform/plugins/shared/alerting_v2/server/workflows/rule_doctor_deduplication.yaml

kibanamachine and others added 4 commits April 30, 2026 15:09

Changes from node scripts/eslint_all_files --no-cache --fix

50951c9

adjust types

58f8add

Merge branch 'feat/rule-doctor-execution' of https://github.com/domin…

4ab3b50

…iqueclarke/kibana into feat/rule-doctor-execution

adjust tests

368957b

dominiqueclarke mentioned this pull request May 4, 2026

ResourceManager reports stale 'ready' status after external resource deletion #267536

Open

dominiqueclarke and others added 3 commits May 4, 2026 11:19

merge upstream

a9c56fb

adjust tests

88ed922

Changes from node scripts/eslint_all_files --no-cache --fix

64c0ba7

macroscopeapp Bot reviewed May 4, 2026


Comment thread ...platform/plugins/shared/alerting_v2/test/scout_alerting_v2/api/tests/run_rule_doctor.spec.ts

Comment thread ...platform/plugins/shared/alerting_v2/test/scout_alerting_v2/api/tests/run_rule_doctor.spec.ts

dominiqueclarke and others added 5 commits May 4, 2026 14:35

address test feedback

2d0682d

Merge branch 'feat/rule-doctor-execution' of https://github.com/domin…

81d32c6

…iqueclarke/kibana into feat/rule-doctor-execution

Changes from node scripts/eslint_all_files --no-cache --fix

3e6da64

merge upstream

aa5d989

Merge branch 'feat/rule-doctor-execution' of https://github.com/domin…

c76ef8d

…iqueclarke/kibana into feat/rule-doctor-execution

macroscopeapp Bot reviewed May 4, 2026


dominiqueclarke added 3 commits May 4, 2026 22:40

update scout tests

d9fb6b4

adjust scout tests

30a57be

merge upstream

f6e74de

dominiqueclarke commented May 5, 2026


kdelemme reviewed May 6, 2026


kdelemme mentioned this pull request May 6, 2026

[Alerting v2] Create workflow extension service #267924

Merged
