[Alerting v2] [Rule Doctor] Run API and deduplication workflow#266668

Open
dominiqueclarke wants to merge 34 commits into elastic:main from dominiqueclarke:feat/rule-doctor-execution

Conversation

@dominiqueclarke
Contributor

@dominiqueclarke dominiqueclarke commented Apr 30, 2026

Summary

Adds the Rule Doctor Run API and initial deduplication workflow, enabling users to trigger AI-powered analyses that identify duplicate or near-duplicate alerting rules and produce actionable insights.

Closes #266648

Changes

  • Run API (POST /api/alerting/v2/rule_doctor/run) — accepts { "type": "deduplication" }, schedules the workflow, returns 202 with an execution_id
  • Deduplication workflow YAML — fetches rules from the space, sends them to an AI connector for duplicate analysis, validates results, and persists findings
  • Workflow step types — registers alerting_v2.validate_rules and alerting_v2.persist_findings with the workflows extensions plugin
  • Bulk dismiss: bulkDismissInsights marks stale insights as dismissed in a single ES bulk operation
  • workflowsExtensions plugin dependency — added to kibana.jsonc and wired into setup
  • API path constant: ALERTING_V2_RULE_DOCTOR_RUN_API_PATH added to @kbn/alerting-v2-constants
  • Production hardening — request-scoped SO client, typed workflow patches (no as any), 4xx/5xx log level differentiation, schema validation at the trust boundary
  • Index readiness: ensureResourceReady is called before scheduling the workflow to guarantee the insights index exists
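
As a rough illustration of the Run API contract listed above, the sketch below builds the request a client might send. Only the path and the `{ "type": "deduplication" }` body come from this PR; the `buildRunRequest` helper, the base URL, and the `kbn-xsrf` header value are assumptions for illustration, and authentication is omitted.

```typescript
// Hypothetical helper: only the path and the { type } body are taken from the PR description.
const RULE_DOCTOR_RUN_PATH = '/api/alerting/v2/rule_doctor/run';

type RuleDoctorRunType = 'deduplication';

interface RunRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string>; body: string };
}

function buildRunRequest(kibanaBaseUrl: string, type: RuleDoctorRunType): RunRequest {
  return {
    // Strip a trailing slash so the joined URL has exactly one separator.
    url: `${kibanaBaseUrl.replace(/\/$/, '')}${RULE_DOCTOR_RUN_PATH}`,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Assumption: Kibana HTTP APIs generally expect this header on write requests.
        'kbn-xsrf': 'true',
      },
      body: JSON.stringify({ type }),
    },
  };
}

// Assumed local development URL, not taken from the PR.
const runRequest = buildRunRequest('http://localhost:5601/', 'deduplication');
console.log(runRequest.url, runRequest.init.body);
```

Passing `runRequest.url` and `runRequest.init` to `fetch` should, per the description above, yield a 202 response whose body carries an `execution_id`.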

Schema tightening

  • rule_ids: now required (every insight must reference rules)
  • current: now required — always present as an object keyed by rule ID containing each rule's config snapshot
  • proposed: now required — always present as an object keyed by rule ID showing each rule's post-action state (null for rules being deleted)
  • Both current and proposed share the same shape: { [rule_id]: config | null }
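
The tightened shape can be sketched as plain TypeScript. This is not the schema from the PR (which is not shown here); `RuleConfig`, `InsightShape`, and `checkInsightShape` are hypothetical names, and the config fields are inferred from the sample documents below.

```typescript
// Hypothetical types: field names are inferred from the sample insights, not from the real schema.
interface RuleConfig {
  schedule: string;
  query: string;
  name: string;
  tags: string[];
}

// Shared shape of `current` and `proposed`: { [rule_id]: config | null }
type RuleStateById = Record<string, RuleConfig | null>;

interface InsightShape {
  rule_ids: string[];
  current: RuleStateById;
  proposed: RuleStateById;
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Returns a list of problems; an empty list means all three required fields are present and well-shaped.
function checkInsightShape(candidate: unknown): string[] {
  if (!isPlainObject(candidate)) return ['insight must be an object'];
  const problems: string[] = [];
  if (!Array.isArray(candidate['rule_ids'])) {
    problems.push('rule_ids is required and must be an array');
  }
  for (const field of ['current', 'proposed'] as const) {
    const value = candidate[field];
    if (!isPlainObject(value)) {
      problems.push(`${field} is required and must be an object keyed by rule ID`);
      continue;
    }
    for (const [ruleId, config] of Object.entries(value)) {
      // null is allowed and marks a rule with no config in that state (for example, deleted).
      if (config !== null && !isPlainObject(config)) {
        problems.push(`${field}.${ruleId} must be a config object or null`);
      }
    }
  }
  return problems;
}

function isInsightShape(candidate: unknown): candidate is InsightShape {
  return checkInsightShape(candidate).length === 0;
}

const wellFormed = { rule_ids: ['a', 'b'], current: { a: {}, b: {} }, proposed: { a: null, b: {} } };
const missingProposed = { rule_ids: ['a'], current: {} };
console.log(isInsightShape(wellFormed), checkInsightShape(missingProposed));
```

The checker only covers the three fields tightened here; it does not validate the contents of each rule config.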

Workflow fixes

  • Correction loop accumulation: update_validation now concatenates loop_valid_insights with revalidated results instead of overwriting, preserving all valid insights across correction iterations
  • Undefined dismiss_ids fallback — added | default: [] to dismiss_ids input of persist_results step, preventing errors when evaluate_existing is skipped
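
The two fixes amount to the following behavior, shown here as a small TypeScript sketch. The real logic lives in the workflow YAML, which is not reproduced here; the function and type names are hypothetical stand-ins.

```typescript
// Hypothetical stand-in for a validated insight; the real steps are defined in workflow YAML.
interface ValidatedInsight {
  insight_id: string;
}

// Previous behavior: each correction iteration replaced the accumulated list with the latest batch.
function overwriteValid(
  _loopValid: ValidatedInsight[],
  revalidated: ValidatedInsight[]
): ValidatedInsight[] {
  return revalidated;
}

// Fixed behavior: insights validated in earlier iterations are kept and the new batch is appended.
function accumulateValid(
  loopValid: ValidatedInsight[],
  revalidated: ValidatedInsight[]
): ValidatedInsight[] {
  return [...loopValid, ...revalidated];
}

// Mirrors `| default: []`: an undefined dismiss_ids input becomes an empty list.
function resolveDismissIds(dismissIds: string[] | undefined): string[] {
  return dismissIds ?? [];
}

const earlier = [{ insight_id: 'first' }];
const corrected = [{ insight_id: 'second' }];
console.log(overwriteValid(earlier, corrected).length, accumulateValid(earlier, corrected).length);
console.log(resolveDismissIds(undefined));
```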

Sample Insight Documents

Results from running the deduplication analysis against seed rules:

Insight 1: Service error logs — severity field vs body text overlap
{
  "title": "Service error logs - severity field vs body text overlap",
  "type": "deduplication",
  "action": "merge",
  "impact": "high",
  "confidence": "high",
  "summary": "Rules 7db86a39 and 0e113c81 both detect service errors from the same data source but use different detection methods (severity_text field vs body text keyword matching). Rule 7db86a39 is more reliable as it uses the structured severity_text field, while 0e113c81 is a legacy approach prone to false positives.",
  "justification": "Both rules query the same index pattern (logs-*.otel-*), group by service name, and fire on error conditions. Rule 7db86a39 uses the structured severity_text == ERROR field which is more reliable than 0e113c81's text pattern matching. The legacy rule should be consolidated into the structured approach.",
  "rule_ids": [
    "7db86a39-c861-45db-8016-3ccb40de835c",
    "0e113c81-2a32-4095-9f8a-1293c3e7edd1"
  ],
  "current": {
    "0e113c81-2a32-4095-9f8a-1293c3e7edd1": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE body.text LIKE \"*error*\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 5",
      "name": "Service error logs (body text)",
      "tags": ["rule-doctor-seed", "logs", "errors"]
    },
    "7db86a39-c861-45db-8016-3ccb40de835c": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 10",
      "name": "Service error logs (severity)",
      "tags": ["rule-doctor-seed", "logs", "errors", "services"]
    }
  },
  "proposed": {
    "0e113c81-2a32-4095-9f8a-1293c3e7edd1": null,
    "7db86a39-c861-45db-8016-3ccb40de835c": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 10",
      "name": "Service error logs (severity)",
      "tags": ["rule-doctor-seed", "logs", "errors", "services"]
    }
  },
  "diffs": [
    { "field": "0e113c81-2a32-4095-9f8a-1293c3e7edd1", "previous": "Service error logs (body text) - active rule", "proposed": "deleted" }
  ],
  "status": "open"
}
Insight 2: Service error logs — threshold consolidation
{
  "title": "Service error logs - threshold consolidation",
  "type": "deduplication",
  "action": "merge",
  "impact": "high",
  "confidence": "high",
  "summary": "Rules 7db86a39 and 53111728 both monitor service error logs using the same severity_text field and data source, but with different thresholds (>10 vs >5) and schedules (5m vs 1m). The lower threshold rule fires more frequently and is more sensitive.",
  "justification": "Both rules use identical detection logic (severity_text == ERROR grouped by service name) on the same index. Rule 53111728 has a lower threshold (>5) and faster schedule (1m), making it more sensitive. Consolidating to the more sensitive rule with 1m schedule provides better coverage while eliminating redundancy.",
  "rule_ids": [
    "7db86a39-c861-45db-8016-3ccb40de835c",
    "53111728-d857-4d0b-97dd-fbac832cea96"
  ],
  "current": {
    "7db86a39-c861-45db-8016-3ccb40de835c": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 10",
      "name": "Service error logs (severity)",
      "tags": ["rule-doctor-seed", "logs", "errors", "services"]
    },
    "53111728-d857-4d0b-97dd-fbac832cea96": {
      "schedule": "1m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 5",
      "name": "Service error logs > 5",
      "tags": ["rule-doctor-seed", "logs", "errors", "services"]
    }
  },
  "proposed": {
    "7db86a39-c861-45db-8016-3ccb40de835c": null,
    "53111728-d857-4d0b-97dd-fbac832cea96": {
      "schedule": "1m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 5",
      "name": "Service error logs > 5",
      "tags": ["rule-doctor-seed", "logs", "errors", "services"]
    }
  },
  "diffs": [
    { "field": "7db86a39-c861-45db-8016-3ccb40de835c", "previous": "Service error logs (severity) - active rule", "proposed": "deleted" }
  ],
  "status": "open"
}
Insight 3: Pod log volume — threshold overlap
{
  "title": "Pod log volume - threshold overlap",
  "type": "deduplication",
  "action": "merge",
  "impact": "high",
  "confidence": "high",
  "summary": "Rules 11034581 and 0c895cd0 both monitor pod log volume from the same data source and group by pod name, but with different thresholds (>100k vs >50k). The lower threshold rule is more sensitive and will fire more frequently.",
  "justification": "Both rules query logs-*.otel-* grouped by k8s.pod.name with identical filtering logic. Rule 0c895cd0 has a lower threshold (>50k) making it more sensitive. The higher threshold rule (11034581) is redundant as the lower threshold rule will catch all conditions the higher threshold would detect.",
  "rule_ids": [
    "11034581-ba8f-41ef-a94f-8ea6c5f4df9d",
    "0c895cd0-c340-46b0-acfb-20526c061824"
  ],
  "current": {
    "0c895cd0-c340-46b0-acfb-20526c061824": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE k8s.pod.name IS NOT NULL | STATS log_count = COUNT(*) BY k8s.pod.name | WHERE log_count > 50000",
      "name": "Pod log flood > 50k",
      "tags": ["rule-doctor-seed", "logs", "kubernetes"]
    },
    "11034581-ba8f-41ef-a94f-8ea6c5f4df9d": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE k8s.pod.name IS NOT NULL | STATS log_count = COUNT(*) BY k8s.pod.name | WHERE log_count > 100000",
      "name": "High pod log volume",
      "tags": ["rule-doctor-seed", "logs", "kubernetes", "volume"]
    }
  },
  "proposed": {
    "0c895cd0-c340-46b0-acfb-20526c061824": {
      "schedule": "5m",
      "query": "FROM remote_cluster:logs-*.otel-* | WHERE k8s.pod.name IS NOT NULL | STATS log_count = COUNT(*) BY k8s.pod.name | WHERE log_count > 50000",
      "name": "Pod log flood > 50k",
      "tags": ["rule-doctor-seed", "logs", "kubernetes"]
    },
    "11034581-ba8f-41ef-a94f-8ea6c5f4df9d": null
  },
  "diffs": [
    { "field": "11034581-ba8f-41ef-a94f-8ea6c5f4df9d", "previous": "High pod log volume - active rule", "proposed": "deleted" }
  ],
  "status": "open"
}
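
Given the shape of these documents, a consumer could tell which rules an insight proposes to delete by looking for null entries in `proposed`. The helper below is a sketch under that reading; `splitProposed` and `ProposedById` are hypothetical names, and the configs are abbreviated to the rule name.

```typescript
// Hypothetical helper: treats a null value in `proposed` as "this rule is proposed for deletion".
type ProposedById = Record<string, object | null>;

function splitProposed(proposed: ProposedById): { kept: string[]; deleted: string[] } {
  const kept: string[] = [];
  const deleted: string[] = [];
  for (const [ruleId, config] of Object.entries(proposed)) {
    (config === null ? deleted : kept).push(ruleId);
  }
  return { kept, deleted };
}

// The `proposed` block of Insight 3, with the surviving config abbreviated to its name.
const insight3Proposed: ProposedById = {
  '0c895cd0-c340-46b0-acfb-20526c061824': { name: 'Pod log flood > 50k' },
  '11034581-ba8f-41ef-a94f-8ea6c5f4df9d': null,
};

const proposedSplit = splitProposed(insight3Proposed);
console.log(proposedSplit.kept, proposedSplit.deleted);
```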

Test plan

  • Unit tests pass (node scripts/jest for validate_rules, persist_findings, run_rule_doctor_route, rule_doctor_insights_client)
  • Trigger deduplication via POST /api/alerting/v2/rule_doctor/run with { "type": "deduplication" } — returns 202 with execution_id
  • Workflow completes and insights are persisted to .rule-doctor-insights index
  • Insights match the tightened schema (rule_ids, current, proposed all present)
  • Correction loop accumulates valid insights across iterations (not overwriting)
  • dismiss_ids defaults to [] when evaluate_existing step is skipped

@github-actions github-actions Bot added the author:actionable-obs PRs authored by the actionable obs team label Apr 30, 2026
@dominiqueclarke dominiqueclarke changed the title [Alerting v2] Rule Doctor: Run API and deduplication workflow [Alerting v2] [Rule Doctor] Run API and deduplication workflow Apr 30, 2026
@macroscopeapp
Contributor

macroscopeapp Bot commented Apr 30, 2026

Catch flakiness early (recommended): run the flaky test runner against this PR before merging.

New Scout API specs (rule_doctor_insights.spec.ts, run_rule_doctor.spec.ts) create/delete ES indices and issue multiple API calls in hooks, so stability is unknown.

Trigger a run with the Flaky Test Runner UI or post this comment on the PR:

/flaky scoutConfig:x-pack/platform/plugins/shared/alerting_v2/test/scout_alerting_v2/api/playwright.config.ts:30

Share feedback in the #appex-qa channel.

Posted via Macroscope — Flaky Test Runner nudge

Comment thread x-pack/platform/plugins/shared/alerting_v2/server/workflows/load_workflows.ts Outdated
@dominiqueclarke dominiqueclarke added backport:skip This PR does not require backporting release_note:skip Skip the PR/issue when compiling release notes v9.5.0 labels Apr 30, 2026
@dominiqueclarke dominiqueclarke marked this pull request as ready for review April 30, 2026 13:33
@dominiqueclarke dominiqueclarke requested a review from a team as a code owner April 30, 2026 13:33
@macroscopeapp
Contributor

macroscopeapp Bot commented Apr 30, 2026

Approvability

Verdict: Needs human review

This PR introduces a substantial new feature (Rule Doctor run API and deduplication workflow) with AI integration, new step types, and workflow orchestration. The author does not own any of the modified files, which are all owned by @elastic/rna-project-team.

kibanamachine and others added 4 commits April 30, 2026 13:41
…y proposed by rule_id

- rule_ids: remove .optional() — every insight must reference rules
- current: remove .optional().nullable() — always present as keyed-by-rule-id object
- proposed: remove .optional().nullable() — always present as keyed-by-rule-id
  object (individual values are null for deleted rules)
- Update deduplication workflow prompt and schema to instruct the LLM
  to key proposed by rule_id matching the current field shape
- Update test fixture to use {} instead of null for current/proposed

Made-with: Cursor
@dominiqueclarke
Contributor Author

Right now, if the agent suggests deleting a rule (for deduplication consolidation), it marks that rule as null in the proposed object (which is keyed by rule ID). I'd like to rethink that, either in this PR or a follow-up, to make the suggested action (delete) more explicit.

API_HEADERS,
RULE_DOCTOR_RUN_API_PATH,
ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID,
} from '../fixtures';
Contributor

Use constants for shared test values

API_HEADERS, RULE_DOCTOR_RUN_API_PATH, and ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID are not exported from ../fixtures — this will fail to compile. Additionally, apiTest should come from the local extended fixture (line 9), not directly from @kbn/scout.

The sibling test files in this directory (find_rules.spec.ts, rule_doctor_insights.spec.ts) all import the extended apiTest and use testData.COMMON_HEADERS from ../fixtures. The new test diverges from that pattern and references three exports that don't exist in fixtures/index.ts.

Suggested fix:

  1. Add the missing constants to common/constants.ts and re-export through fixtures/index.ts:
// common/constants.ts
export { ALERTING_V2_RULE_DOCTOR_RUN_API_PATH as RULE_DOCTOR_RUN_API_PATH } from '@kbn/alerting-v2-constants';
export { ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID } from '../../../common/advanced_settings';
  2. Update the spec imports to match the established pattern:
-import { apiTest, tags } from '@kbn/scout';
+import { tags } from '@kbn/scout';
 import type { RoleApiCredentials } from '@kbn/scout';
 import { RULE_DOCTOR_DEDUP_WORKFLOW_ID } from '../../../../server/workflows/load_workflows';
-import {
-  API_HEADERS,
-  RULE_DOCTOR_RUN_API_PATH,
-  ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID,
-} from '../fixtures';
+import { apiTest, testData } from '../fixtures';
+import {
+  RULE_DOCTOR_RUN_API_PATH,
+  ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID,
+} from '../../common/constants';

Then replace every API_HEADERS usage with testData.COMMON_HEADERS to stay consistent with the rest of the suite.

Share feedback in the #appex-qa channel.

Posted via Macroscope — Scout Test Review

method: GET
path: '/api/alerting/v2/rules'
query:
  perPage: 200
Contributor Author

There will be a separate issue for handling large numbers of rules elegantly. Experimentation is happening now.

@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/alerting-v2-constants | 6 | 7 | +1 |
Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/alerting-v2-constants | 14 | 15 | +1 |

Contributor

@kdelemme kdelemme left a comment

Overall this looks good to me. I have a few questions and recommendations.

Comment on lines +38 to +39
private readonly resourceManager?: ResourceManagerContract,
private readonly rawLogger?: Logger
Contributor

Why are they optional? And why are they not @injected?

Contributor

Answered myself below

Comment on lines +75 to +95
const { type } = this.request.body;
const executionId = uuidv4();
const spaceId = this.spaceContext.spaceId;
const connectorId = await this.getDefaultConnectorId();

await this.insightsClient.ensureIndex();
const workflow = await ensureRuleDoctorAnalysisWorkflow(
type,
this.workflowsManagement,
spaceId,
this.request,
this.logger
);

await this.workflowsManagement.scheduleWorkflow(
workflow,
spaceId,
{ space_id: spaceId, execution_id: executionId, connector_id: connectorId },
this.request,
'rule_doctor'
);
Contributor

This route handler is probably doing too many things. I think it would be worth exposing this as an application service so it can be decoupled from the HTTP layer and reused from a client if needed. Testing also becomes easier, since the logic is no longer bound to a request.

const { insights = [], dismiss_ids: dismissIds = [], space_id: spaceId } = context.input;
const esClient = context.contextManager.getScopedEsClient();
const logger = adaptLogger(context.logger);
const client = new RuleDoctorInsightsClient(esClient, logger);
Contributor

OK, I understand why we have the optional resourceManager and Logger now.
Quick question: is it possible for someone to use a workflow with this step type but never call the Rule Doctor run API, and so never create the index?
These optional parameters are a smell IMO. Can we initialize the managed resources on plugin setup/start instead?

...rawObj,
space_id: spaceId,
execution_id: executionId,
insight_id: rawObj.insight_id || `insight-${uuidv4().slice(0, 8)}`,
Contributor

uuidv4().slice(0,8)

Truncating a UUIDv4 to 8 hex characters leaves only 32 bits of randomness, which drastically increases the risk of collisions. While a full 36-character UUIDv4 is practically unique, the birthday bound puts an 8-character prefix at roughly a 39% chance of at least one collision after 2^16 (65,536) generated IDs, and past 50% at around 77,000.

Are we expecting the default fallback to almost never be used? I would recommend using the full uuidv4, or at least nanoid(12).
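
To put numbers on the collision risk, the sketch below uses the standard birthday approximation, assuming IDs are drawn uniformly at random. An 8-character hex prefix carries 32 random bits, and a full UUIDv4 carries 122; the function names are made up for this illustration.

```typescript
// Birthday approximation: p is about 1 - exp(-n(n-1) / (2N)) for n IDs drawn uniformly from N values.
function collisionProbability(idCount: number, randomBits: number): number {
  const space = Math.pow(2, randomBits);
  const exponent = (idCount * (idCount - 1)) / (2 * space);
  // expm1 keeps precision when the probability is extremely small.
  return -Math.expm1(-exponent);
}

// Approximate number of IDs at which the collision probability reaches the target.
function idsForProbability(target: number, randomBits: number): number {
  const space = Math.pow(2, randomBits);
  return Math.ceil(Math.sqrt(2 * space * Math.log(1 / (1 - target))));
}

const truncatedBits = 32; // 8 hex characters x 4 bits each
const fullUuidBits = 122; // random bits in a UUIDv4

console.log(collisionProbability(65536, truncatedBits)); // truncated prefix at 2^16 IDs
console.log(idsForProbability(0.5, truncatedBits)); // IDs needed for a 50% chance
console.log(collisionProbability(65536, fullUuidBits)); // full UUIDv4 at 2^16 IDs
```

How much this matters depends on how often the fallback path runs and whether insight IDs only need to be unique within a single execution or space, which the snippet under review does not show.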

Comment on lines +41 to +54
const existing = await managementApi.getWorkflow(workflowId, spaceId);

if (existing) {
if (existing.yaml !== yaml || !existing.enabled || !existing.valid) {
await managementApi.updateWorkflow(workflowId, { yaml, enabled: true }, spaceId, request);
logger.info(`Updated workflow ${workflowId}`);
}
} else {
await managementApi.createWorkflow({ yaml, id: workflowId }, spaceId, request);
await managementApi.updateWorkflow(workflowId, { yaml, enabled: true }, spaceId, request);
logger.info(`Created workflow ${workflowId}`);
}

const workflow = (await managementApi.getWorkflow(workflowId, spaceId))!;
Contributor

Should we handle errors? Maybe retry on transient failures?
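
One way to make the error-handling question easier to reason about is to separate the decision from the API calls. The sketch below restates the condition from the quoted snippet as a pure function. The `ExistingWorkflow` and `WorkflowAction` names are hypothetical, only the `yaml`, `enabled`, and `valid` fields come from the snippet, and retry handling is deliberately left out.

```typescript
// Hypothetical types: only the yaml/enabled/valid fields are taken from the snippet under review.
interface ExistingWorkflow {
  yaml: string;
  enabled: boolean;
  valid: boolean;
}

type WorkflowAction = 'create' | 'update' | 'none';

// Decides what the caller should do; performs no I/O, so it can be unit tested without mocks.
function decideWorkflowAction(
  existing: ExistingWorkflow | undefined,
  desiredYaml: string
): WorkflowAction {
  // In the snippet, creation is followed by an update call that enables the workflow.
  if (!existing) return 'create';
  const drifted = existing.yaml !== desiredYaml || !existing.enabled || !existing.valid;
  return drifted ? 'update' : 'none';
}

const desired = 'name: rule-doctor-dedup';
console.log(decideWorkflowAction(undefined, desired));
console.log(decideWorkflowAction({ yaml: desired, enabled: false, valid: true }, desired));
console.log(decideWorkflowAction({ yaml: desired, enabled: true, valid: true }, desired));
```

With the decision isolated, error handling or retries could wrap only the create and update calls, leaving the comparison logic untouched.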

@kdelemme
Contributor

kdelemme commented May 6, 2026

Also we might want to wait for #267924 to land and use the workflow registration service it brings

Labels

author:actionable-obs PRs authored by the actionable obs team backport:skip This PR does not require backporting release_note:skip Skip the PR/issue when compiling release notes v9.5.0

Development

Successfully merging this pull request may close these issues.

Rule Doctor: Run API and deduplication workflow

3 participants