[Alerting v2] [Rule Doctor] Run API and deduplication workflow #266668
dominiqueclarke wants to merge 34 commits into
Conversation
…dominiqueclarke/kibana into feat/rule-doctor-page-and-flags
Catch flakiness early (recommended): this PR adds new Scout API specs, so run the flaky test runner against it before merging. Trigger a run with the Flaky Test Runner UI or post the trigger comment on the PR. Share feedback in the #appex-qa channel.
Posted via Macroscope — Flaky Test Runner nudge
Approvability verdict: Needs human review
This PR introduces a substantial new feature (Rule Doctor run API and deduplication workflow) with AI integration, new step types, and workflow orchestration. The author does not own any of the modified files, which are all owned by @elastic/rna-project-team. You can customize Macroscope's approvability policy.
…y proposed by rule_id
- rule_ids: remove .optional() — every insight must reference rules
- current: remove .optional().nullable() — always present as an object keyed by rule ID
- proposed: remove .optional().nullable() — always present as an object keyed by rule ID (individual values are null for deleted rules)
- Update deduplication workflow prompt and schema to instruct the LLM to key proposed by rule_id, matching the current field shape
- Update test fixture to use {} instead of null for current/proposed
Made-with: Cursor
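The tightened shape described by this commit can be sketched as a plain TypeScript guard. This is a sketch only: the field names come from the commit message, the type and helper names are hypothetical, and the real schema lives elsewhere in the PR.

```typescript
// Sketch of the tightened insight shape: rule_ids required, current/proposed
// always present and keyed by rule id, with proposed values null for deleted rules.
type RuleConfig = { name: string; schedule: string; query: string; tags: string[] };

interface Insight {
  rule_ids: string[];
  current: Record<string, RuleConfig>; // no longer .optional().nullable()
  proposed: Record<string, RuleConfig | null>; // values are null for deleted rules
}

function isKeyedByRuleId(insight: Insight): boolean {
  // Assumption drawn from "key proposed by rule_id matching the current field
  // shape": every referenced rule appears in both maps.
  return (
    insight.rule_ids.length > 0 &&
    insight.rule_ids.every((id) => id in insight.current && id in insight.proposed)
  );
}

const sample: Insight = {
  rule_ids: ['a'],
  current: { a: { name: 'n', schedule: '5m', query: 'FROM logs', tags: [] } },
  proposed: { a: null }, // rule "a" is proposed for deletion
};
console.log(isKeyedByRuleId(sample)); // true
```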
Right now, if a rule is suggested to be deleted (for deduplication consolidation), the agent marks that as
…iqueclarke/kibana into feat/rule-doctor-execution
…iqueclarke/kibana into feat/rule-doctor-execution
```ts
  API_HEADERS,
  RULE_DOCTOR_RUN_API_PATH,
  ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID,
} from '../fixtures';
```
Use constants for shared test values
`API_HEADERS`, `RULE_DOCTOR_RUN_API_PATH`, and `ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID` are not exported from `../fixtures` — this will fail to compile. Additionally, `apiTest` should come from the local extended fixture (line 9), not directly from `@kbn/scout`.

The sibling test files in this directory (`find_rules.spec.ts`, `rule_doctor_insights.spec.ts`) all import the extended `apiTest` and use `testData.COMMON_HEADERS` from `../fixtures`. The new test diverges from that pattern and references three exports that don't exist in `fixtures/index.ts`.

Suggested fix:

1. Add the missing constants to `common/constants.ts` and re-export them through `fixtures/index.ts`:

```ts
// common/constants.ts
export { ALERTING_V2_RULE_DOCTOR_RUN_API_PATH as RULE_DOCTOR_RUN_API_PATH } from '@kbn/alerting-v2-constants';
export { ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID } from '../../../common/advanced_settings';
```

2. Update the spec imports to match the established pattern:

```diff
-import { apiTest, tags } from '@kbn/scout';
+import { tags } from '@kbn/scout';
 import type { RoleApiCredentials } from '@kbn/scout';
 import { RULE_DOCTOR_DEDUP_WORKFLOW_ID } from '../../../../server/workflows/load_workflows';
-import {
-  API_HEADERS,
-  RULE_DOCTOR_RUN_API_PATH,
-  ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID,
-} from '../fixtures';
+import { apiTest, testData } from '../fixtures';
+import {
+  RULE_DOCTOR_RUN_API_PATH,
+  ALERTING_V2_EXPERIMENTAL_FEATURES_SETTING_ID,
+} from '../../common/constants';
```

Then replace every `API_HEADERS` usage with `testData.COMMON_HEADERS` to stay consistent with the rest of the suite.
Posted via Macroscope — Scout Test Review
```yaml
method: GET
path: '/api/alerting/v2/rules'
query:
  perPage: 200
```
There will be a separate issue for elegantly handling large amounts of rules. Experimentation is happening now.
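Until that separate issue lands, one way to handle more than a single page could look like the sketch below. The `total`/`data` response shape is an assumption (the actual find-rules response may differ), so the page fetcher is injected.

```typescript
// Page through all rules instead of relying on a single perPage: 200 request.
interface Page<T> {
  total: number;
  data: T[];
}
type FetchPage<T> = (page: number, perPage: number) => Promise<Page<T>>;

async function fetchAllRules<T>(fetchPage: FetchPage<T>, perPage = 200): Promise<T[]> {
  const all: T[] = [];
  for (let page = 1; ; page++) {
    const { total, data } = await fetchPage(page, perPage);
    all.push(...data);
    // Stop when we've collected everything or the API returns an empty page.
    if (all.length >= total || data.length === 0) break;
  }
  return all;
}
```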
💚 Build Succeeded
kdelemme left a comment
Overall this looks good to me; I have a few questions and recommendations.
```ts
private readonly resourceManager?: ResourceManagerContract,
private readonly rawLogger?: Logger
```
Why are they optional? And why are they not `@injected`?
```ts
const { type } = this.request.body;
const executionId = uuidv4();
const spaceId = this.spaceContext.spaceId;
const connectorId = await this.getDefaultConnectorId();

await this.insightsClient.ensureIndex();
const workflow = await ensureRuleDoctorAnalysisWorkflow(
  type,
  this.workflowsManagement,
  spaceId,
  this.request,
  this.logger
);

await this.workflowsManagement.scheduleWorkflow(
  workflow,
  spaceId,
  { space_id: spaceId, execution_id: executionId, connector_id: connectorId },
  this.request,
  'rule_doctor'
);
```
This route handler is probably doing too many things. I think it would be worth exposing this as an application service so it can be decoupled from the HTTP layer and reused from a client if needed. Testing-wise it also becomes easier, since the logic is no longer bound to a request.
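A sketch of that extraction: a service owns the orchestration, so the route handler only parses the request and maps the result to a 202 response. The interfaces below are trimmed stand-ins for the real contracts, not the actual Kibana types.

```typescript
import { randomUUID } from 'crypto';

// Trimmed stand-ins for the real contracts (assumptions, not the actual types).
interface InsightsIndex {
  ensureIndex(): Promise<void>;
}
interface WorkflowScheduler {
  ensureWorkflow(type: string, spaceId: string): Promise<{ id: string }>;
  schedule(wf: { id: string }, spaceId: string, inputs: Record<string, string>): Promise<void>;
}

class RuleDoctorRunService {
  constructor(
    private readonly insights: InsightsIndex,
    private readonly workflows: WorkflowScheduler,
    private readonly getDefaultConnectorId: () => Promise<string>
  ) {}

  // The same steps the route handler performs today, minus the HTTP plumbing,
  // so the flow can be unit-tested and reused without a request object.
  async run(type: string, spaceId: string): Promise<{ executionId: string }> {
    const executionId = randomUUID();
    const connectorId = await this.getDefaultConnectorId();
    await this.insights.ensureIndex();
    const workflow = await this.workflows.ensureWorkflow(type, spaceId);
    await this.workflows.schedule(workflow, spaceId, {
      space_id: spaceId,
      execution_id: executionId,
      connector_id: connectorId,
    });
    return { executionId };
  }
}
```

The route handler would then reduce to validating the body and calling `service.run(type, spaceId)`.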
```ts
const { insights = [], dismiss_ids: dismissIds = [], space_id: spaceId } = context.input;
const esClient = context.contextManager.getScopedEsClient();
const logger = adaptLogger(context.logger);
const client = new RuleDoctorInsightsClient(esClient, logger);
```
OK, I understand why we have the optional resourceManager and logger now.
Quick question: is it possible for someone to use a workflow with this step type but never call the Rule Doctor run API, and thus never create the index?
These optional parameters are a smell, IMO; can we initialize the managed resources on plugin setup/start instead?
```ts
...rawObj,
space_id: spaceId,
execution_id: executionId,
insight_id: rawObj.insight_id || `insight-${uuidv4().slice(0, 8)}`,
```
`uuidv4().slice(0, 8)`
Truncating a UUIDv4 to 8 hex characters drastically increases the risk of collisions. A full 36-character UUIDv4 is practically unique, but an 8-character truncation leaves only 16^8 = 2^32 possible values, so by the birthday bound the collision probability is already around 39% after roughly 2^16 (65,536) generated IDs, and passes 50% near 77,000.
Are we expecting the default fallback to almost never be used? I would recommend using the full uuidv4, or at least nanoid(12).
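For reference, the birthday-bound approximation behind those numbers (8 random hex characters give 16^8 = 2^32 possible IDs):

```typescript
// p(collision) ≈ 1 - exp(-n(n-1) / (2N)) for n IDs drawn uniformly from N values.
function collisionProbability(n: number, space: number): number {
  return 1 - Math.exp((-n * (n - 1)) / (2 * space));
}

const SPACE = 2 ** 32; // 8 random hex characters

console.log(collisionProbability(65_536, SPACE).toFixed(3)); // ≈ 0.393
console.log(collisionProbability(78_000, SPACE).toFixed(3)); // just past 0.5
```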
```ts
const existing = await managementApi.getWorkflow(workflowId, spaceId);

if (existing) {
  if (existing.yaml !== yaml || !existing.enabled || !existing.valid) {
    await managementApi.updateWorkflow(workflowId, { yaml, enabled: true }, spaceId, request);
    logger.info(`Updated workflow ${workflowId}`);
  }
} else {
  await managementApi.createWorkflow({ yaml, id: workflowId }, spaceId, request);
  await managementApi.updateWorkflow(workflowId, { yaml, enabled: true }, spaceId, request);
  logger.info(`Created workflow ${workflowId}`);
}

const workflow = (await managementApi.getWorkflow(workflowId, spaceId))!;
```
Should we handle errors here? Maybe retry on transient failures?

Also, we might want to wait for #267924 to land and use the workflow registration service it brings.
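One possible shape for the transient-retry suggestion. The helper and the transient check are assumptions, not an existing Kibana utility:

```typescript
// Retry an async operation, backing off exponentially on transient failures
// (HTTP 429/5xx, or errors with no status code, e.g. network hiccups).
async function withRetries<T>(
  op: () => Promise<T>,
  { maxAttempts = 3, baseDelayMs = 250 }: { maxAttempts?: number; baseDelayMs?: number } = {}
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      const status = (err as { statusCode?: number }).statusCode;
      const transient = status === undefined || status === 429 || status >= 500;
      if (!transient || attempt >= maxAttempts) throw err;
      // Exponential backoff before the next attempt.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}
```

The `getWorkflow`/`createWorkflow`/`updateWorkflow` calls above could then be wrapped, e.g. `await withRetries(() => managementApi.getWorkflow(workflowId, spaceId))`.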
Summary
Adds the Rule Doctor Run API and initial deduplication workflow, enabling users to trigger AI-powered analyses that identify duplicate or near-duplicate alerting rules and produce actionable insights.
Closes #266648
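For context, a minimal caller for the new endpoint might look like the sketch below. The fetch function is injected so the sketch can be exercised without a live Kibana; the `kbn-xsrf` header follows standard Kibana API conventions, and the 202/`execution_id` contract comes from this PR's description.

```typescript
const RULE_DOCTOR_RUN_API_PATH = '/api/alerting/v2/rule_doctor/run';

type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ status: number; json(): Promise<{ execution_id: string }> }>;

// POSTs { "type": "deduplication" } and expects 202 with an execution_id.
async function triggerDeduplicationRun(kibanaUrl: string, fetchFn: FetchLike): Promise<string> {
  const res = await fetchFn(`${kibanaUrl}${RULE_DOCTOR_RUN_API_PATH}`, {
    method: 'POST',
    headers: { 'content-type': 'application/json', 'kbn-xsrf': 'true' },
    body: JSON.stringify({ type: 'deduplication' }),
  });
  if (res.status !== 202) throw new Error(`unexpected status ${res.status}`);
  const { execution_id: executionId } = await res.json();
  return executionId; // correlates the scheduled run with its persisted insights
}
```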
Changes
- New route (`POST /api/alerting/v2/rule_doctor/run`) — accepts `{ "type": "deduplication" }`, schedules the workflow, and returns `202` with an `execution_id`
- `alerting_v2.validate_rules` and `alerting_v2.persist_findings` step types registered with the workflows extensions plugin
- `bulkDismissInsights` marks stale insights as dismissed in a single ES bulk operation
- `workflowsExtensions` plugin dependency — added to `kibana.jsonc` and wired into setup
- `ALERTING_V2_RULE_DOCTOR_RUN_API_PATH` added to `@kbn/alerting-v2-constants`
- … (`as any`), 4xx/5xx log level differentiation, schema validation at the trust boundary
- `ensureResourceReady` called before scheduling the workflow to guarantee the insights index exists

Schema tightening
- `rule_ids`: now required (every insight must reference rules)
- `current`: now required — always present as an object keyed by rule ID containing each rule's config snapshot
- `proposed`: now required — always present as an object keyed by rule ID showing each rule's post-action state (`null` for rules being deleted)
- `current` and `proposed` share the same shape: `{ [rule_id]: config | null }`

Workflow fixes
- `update_validation` now concatenates `loop_valid_insights` with revalidated results instead of overwriting, preserving all valid insights across correction iterations
- `| default: []` added to the `dismiss_ids` input of the `persist_results` step, preventing errors when `evaluate_existing` is skipped

Sample Insight Documents
Results from running the deduplication analysis against seed rules:
Insight 1: Service error logs — severity field vs body text overlap
```json
{ "title": "Service error logs - severity field vs body text overlap", "type": "deduplication", "action": "merge", "impact": "high", "confidence": "high", "summary": "Rules 7db86a39 and 0e113c81 both detect service errors from the same data source but use different detection methods (severity_text field vs body text keyword matching). Rule 7db86a39 is more reliable as it uses the structured severity_text field, while 0e113c81 is a legacy approach prone to false positives.", "justification": "Both rules query the same index pattern (logs-*.otel-*), group by service name, and fire on error conditions. Rule 7db86a39 uses the structured severity_text == ERROR field which is more reliable than 0e113c81's text pattern matching. The legacy rule should be consolidated into the structured approach.", "rule_ids": [ "7db86a39-c861-45db-8016-3ccb40de835c", "0e113c81-2a32-4095-9f8a-1293c3e7edd1" ], "current": { "0e113c81-2a32-4095-9f8a-1293c3e7edd1": { "schedule": "5m", "query": "FROM remote_cluster:logs-*.otel-* | WHERE body.text LIKE \"*error*\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 5", "name": "Service error logs (body text)", "tags": ["rule-doctor-seed", "logs", "errors"] }, "7db86a39-c861-45db-8016-3ccb40de835c": { "schedule": "5m", "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 10", "name": "Service error logs (severity)", "tags": ["rule-doctor-seed", "logs", "errors", "services"] } }, "proposed": { "0e113c81-2a32-4095-9f8a-1293c3e7edd1": null, "7db86a39-c861-45db-8016-3ccb40de835c": { "schedule": "5m", "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 10", "name": "Service error logs (severity)", "tags": ["rule-doctor-seed", "logs", "errors", "services"] } }, "diffs": [ { "field": "0e113c81-2a32-4095-9f8a-1293c3e7edd1", "previous": "Service error logs (body text) - active rule", "proposed": "deleted" } ], "status": "open" }
```

Insight 2: Service error logs — threshold consolidation
```json
{ "title": "Service error logs - threshold consolidation", "type": "deduplication", "action": "merge", "impact": "high", "confidence": "high", "summary": "Rules 7db86a39 and 53111728 both monitor service error logs using the same severity_text field and data source, but with different thresholds (>10 vs >5) and schedules (5m vs 1m). The lower threshold rule fires more frequently and is more sensitive.", "justification": "Both rules use identical detection logic (severity_text == ERROR grouped by service name) on the same index. Rule 53111728 has a lower threshold (>5) and faster schedule (1m), making it more sensitive. Consolidating to the more sensitive rule with 1m schedule provides better coverage while eliminating redundancy.", "rule_ids": [ "7db86a39-c861-45db-8016-3ccb40de835c", "53111728-d857-4d0b-97dd-fbac832cea96" ], "current": { "7db86a39-c861-45db-8016-3ccb40de835c": { "schedule": "5m", "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 10", "name": "Service error logs (severity)", "tags": ["rule-doctor-seed", "logs", "errors", "services"] }, "53111728-d857-4d0b-97dd-fbac832cea96": { "schedule": "1m", "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 5", "name": "Service error logs > 5", "tags": ["rule-doctor-seed", "logs", "errors", "services"] } }, "proposed": { "7db86a39-c861-45db-8016-3ccb40de835c": null, "53111728-d857-4d0b-97dd-fbac832cea96": { "schedule": "1m", "query": "FROM remote_cluster:logs-*.otel-* | WHERE severity_text == \"ERROR\" | STATS error_count = COUNT(*) BY resource.attributes.service.name | WHERE error_count > 5", "name": "Service error logs > 5", "tags": ["rule-doctor-seed", "logs", "errors", "services"] } }, "diffs": [ { "field": "7db86a39-c861-45db-8016-3ccb40de835c", "previous": "Service error logs (severity) - active rule", "proposed": "deleted" } ], "status": "open" }
```

Insight 3: Pod log volume — threshold overlap
```json
{ "title": "Pod log volume - threshold overlap", "type": "deduplication", "action": "merge", "impact": "high", "confidence": "high", "summary": "Rules 11034581 and 0c895cd0 both monitor pod log volume from the same data source and group by pod name, but with different thresholds (>100k vs >50k). The lower threshold rule is more sensitive and will fire more frequently.", "justification": "Both rules query logs-*.otel-* grouped by k8s.pod.name with identical filtering logic. Rule 0c895cd0 has a lower threshold (>50k) making it more sensitive. The higher threshold rule (11034581) is redundant as the lower threshold rule will catch all conditions the higher threshold would detect.", "rule_ids": [ "11034581-ba8f-41ef-a94f-8ea6c5f4df9d", "0c895cd0-c340-46b0-acfb-20526c061824" ], "current": { "0c895cd0-c340-46b0-acfb-20526c061824": { "schedule": "5m", "query": "FROM remote_cluster:logs-*.otel-* | WHERE k8s.pod.name IS NOT NULL | STATS log_count = COUNT(*) BY k8s.pod.name | WHERE log_count > 50000", "name": "Pod log flood > 50k", "tags": ["rule-doctor-seed", "logs", "kubernetes"] }, "11034581-ba8f-41ef-a94f-8ea6c5f4df9d": { "schedule": "5m", "query": "FROM remote_cluster:logs-*.otel-* | WHERE k8s.pod.name IS NOT NULL | STATS log_count = COUNT(*) BY k8s.pod.name | WHERE log_count > 100000", "name": "High pod log volume", "tags": ["rule-doctor-seed", "logs", "kubernetes", "volume"] } }, "proposed": { "0c895cd0-c340-46b0-acfb-20526c061824": { "schedule": "5m", "query": "FROM remote_cluster:logs-*.otel-* | WHERE k8s.pod.name IS NOT NULL | STATS log_count = COUNT(*) BY k8s.pod.name | WHERE log_count > 50000", "name": "Pod log flood > 50k", "tags": ["rule-doctor-seed", "logs", "kubernetes"] }, "11034581-ba8f-41ef-a94f-8ea6c5f4df9d": null }, "diffs": [ { "field": "11034581-ba8f-41ef-a94f-8ea6c5f4df9d", "previous": "High pod log volume - active rule", "proposed": "deleted" } ], "status": "open" }
```

Test plan
- Unit tests via `node scripts/jest` (for `validate_rules`, `persist_findings`, `run_rule_doctor_route`, `rule_doctor_insights_client`)
- `POST /api/alerting/v2/rule_doctor/run` with `{ "type": "deduplication" }` — returns `202` with `execution_id`
- Insight documents written to the `.rule-doctor-insights` index (`rule_ids`, `current`, `proposed` all present)
- `dismiss_ids` defaults to `[]` when the `evaluate_existing` step is skipped