[Security Solution] Implement single execution result per rule execution logging#257203
Conversation
f4366b5 to
e8935a6
Compare
e845904 to
be7fcf4
Compare
pmuellr
left a comment
There was a problem hiding this comment.
Looks pretty good, but have some questions on existing fields like total_search_duration_ms ...
...k/platform/plugins/shared/alerting/server/lib/alerting_event_logger/alerting_event_logger.ts
Outdated
Show resolved
Hide resolved
x-pack/platform/plugins/shared/alerting/server/monitoring/rule_monitoring_service.ts
Outdated
Show resolved
Hide resolved
5eaeac5 to
a3f2b2f
Compare
There was a problem hiding this comment.
This file has a lot of diff in the processRunResults() function. But mostly it's caused by the formatting change as I touched the result.
|
Pinging @elastic/security-detections-response (Team:Detections and Resp) |
|
Pinging @elastic/security-solution (Team: SecuritySolution) |
|
Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management) |
1d9fc21 to
6047b96
Compare
|
Hi @nikitaindik, Please check out 6047b96. I fixed sticky gap duration by getting rid of restoring the previous metrics from the rule's SO. |
nikitaindik
left a comment
There was a problem hiding this comment.
Thanks for addressing my feedback and fixing the issues with zero metrics and gap cleanup. Changes look clean to me.
I did also manually test the changes:
- Checked out
main, created a security rule and an observability rule (to test that we don't mess anything up with other sorts of AF rules) - Ran rules, grabbed their "execute" events from event log
- Checkout out this branch, waited for the rules to run again, grabbed new "execute" docs from the event log
Observed:
- ✅ Both the security rule and the observability rule seem to still run normally (don't throw mid-execution).
- ✅ Observability "execute" event didn't change (except "revision" was added, which is expected). Our metrics didn't end up in observability.
- ✅ First security "execute" event did correctly contain
execution_gap_duration_sandgap_range. - ✅ Second security "execute" event (next scheduled run) correctly did NOT have
execution_gap_duration_sandgap_range. - ✅ All the required metrics for v9.4 except
matched_indices_countandalerts_candidate_count(these should be added in a follow-up). "execute-backfill" events have "backfill" set correctly.
I believe we're good to go with this one! 🚀
dhurley14
left a comment
There was a problem hiding this comment.
Thanks for these changes!
6722583 to
38d0e6d
Compare
⏳ Build in-progress, with failures
Failed CI Steps
Test Failures
History
cc @maximpn |
d8d12aa to
afe0f75
Compare
…ion logging (elastic#257203) **Addresses:** elastic#255513 **[Design document](https://docs.google.com/document/d/1njxu_2MDwR2NGEHJWmI_iFg34OidO8nF0QFOS13x8ck)** (internal) ## Summary This PR adds functionality to log the rule execution metrics and context data into the Alerting Framework's `execute` event log event. ## Details Detection Engine Observability initiative requires collecting various metrics related to the security rule execution. We already collect a number of metrics and write them to the **Event Log** and would like to add more later on. On top of that we'd like to log dynamic customer defined metrics. This PR focuses on implementing extensible functionality where we could add more metrics later on. The desired list of metrics is under discussion. Above that make sure we log a single rule result event per execution. Historically there is functionality on the Security Solution's side that logs multiple events per each rule execution like `status-change`, `execution-metrics` and `message`. Weirdly it may log multiple `status-change` events per single rule execution where rule status flips between error and success. Additionally Alerting Framework logs rule execution related events to the Event Log as well. In particular it logs `execute-start` and `execute` event where the latter has almost all the metrics Security Solution logs to the Event Log already. **This PR adds functionality** to log security rule metrics and the rule context data like correlation ids to the Alerting Framework's `execute` event. Existing events produced on Security Solution's side `status-change` and `execution-metrics` will be written as is. This decision allow to reduce this PR's scope. Getting rid of `status-change` and `execution-metrics` requires a follow up PR tackling rule execution UI, snapshot based telemetry and functional tests relying on the Event Log. ### Changes At the high level changes could be described as follows - Refactor [RuleMonitoring](https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/alerting/server/monitoring/rule_monitoring_service.ts) service to replace individual metrics setters with common `setMetric()` and `setMetrics()` - Extend [AlertingEventLogger](https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/alerting/server/lib/alerting_event_logger/alerting_event_logger.ts) to write consumer metrics to the `execute` event - Extend [AlertingEventLogger](https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/alerting/server/lib/alerting_event_logger/alerting_event_logger.ts) to write `rule_id` aka `uuid` to the `execute` event - Refactor [RuleExecutionLogClientForExecutors](https://github.com/elastic/kibana/blob/main/x-pack/solutions/security/plugins/security_solution/server/lib/detection_engine/rule_monitoring/logic/rule_execution_log/client_for_executors/client.ts) to expose `setMetric()` and `setMetrics()` methods and pass the metrics to the [RuleMonitoring](https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/alerting/server/monitoring/rule_monitoring_service.ts) service. Additionally make sure we wire a single result per execution. ### The Logged Event <details> <summary>"execute" event example</summary> ``` { "_index": ".ds-.kibana-event-log-ds-2026.03.17-000001", "_id": "2cc392bd-94c7-4c05-868a-bea57bba2188", "_version": 1, "_source": { "@timestamp": "2026-03-20T11:03:11.235Z", "event": { "provider": "alerting", "action": "execute", "kind": "alert", "category": [ "siem" ], "start": "2026-03-20T11:03:10.467Z", "outcome": "failure", "end": "2026-03-20T11:03:11.235Z", "duration": "768000000", "reason": "unknown" }, "kibana": { "alert": { "rule": { "rule_type_id": "siem.queryRule", "consumer": "siem", "execution": { "uuid": "5d5a25b3-52da-4fc2-9945-06f6e4c0d9c8", "metrics": { "number_of_triggered_actions": 0, "number_of_generated_actions": 0, "alert_counts": { "active": 32, "new": 32, "recovered": 0 }, "number_of_delayed_alerts": 0, "number_of_searches": 10, "es_search_duration_ms": 77, "total_search_duration_ms": 105, "alerts_suppressed_count": 0, "frozen_indices_queried_count": 0, "total_indexing_duration_ms": 345, "total_enrichment_duration_ms": 8, "execution_gap_duration_s": 451, "gap_range": { "gte": "2026-03-20T10:29:39.844Z", "lte": "2026-03-20T10:37:10.427Z" }, "claim_to_start_duration_ms": 41, "total_run_duration_ms": 808, "prepare_rule_duration_ms": 14, "rule_type_run_duration_ms": 719, "process_alerts_duration_ms": 1, "persist_alerts_duration_ms": 0, "trigger_actions_duration_ms": 1, "process_rule_duration_ms": 23, "update_alerts_duration_ms": 0 } }, "revision": 1 } }, "saved_objects": [ { "rel": "primary", "type": "alert", "id": "baca894d-1925-447e-a60d-bff768097a78", "type_id": "siem.queryRule" } ], "space_ids": [ "default" ], "task": { "scheduled": "2026-03-20T11:03:08.054Z", "schedule_delay": 2413000000 }, "alerting": { "outcome": "failure", "status": "error" }, "server_uuid": "5b2de169-2785-441b-ae8c-186a1936b17d", "version": "9.4.0" }, "rule": { "id": "baca894d-1925-447e-a60d-bff768097a78", "license": "basic", "category": "siem.queryRule", "ruleset": "siem", "name": "Test rule 2", "uuid": "e3e0eb5e-ecdc-4f7c-9896-5ff923375575" }, "message": "siem.queryRule:baca894d-1925-447e-a60d-bff768097a78: execution failed", "error": { "message": "8 minutes (450583ms) were not queried between this rule execution and the last execution, so signals may have been missed. Consider increasing your look behind time or adding more Kibana instances" }, "ecs": { "version": "1.8.0" } }, "fields": { "rule.id": [ "baca894d-1925-447e-a60d-bff768097a78" ], "event.category": [ "siem" ], "kibana.alerting.outcome": [ "failure" ], "kibana.task.schedule_delay": [ 2413000000 ], "kibana.task.scheduled": [ "2026-03-20T11:03:08.054Z" ], "kibana.alert.rule.execution.metrics.alert_counts.active": [ 32 ], "kibana.alert.rule.execution.metrics.claim_to_start_duration_ms": [ 41 ], "kibana.alert.rule.execution.metrics.total_search_duration_ms": [ 105 ], "kibana.alert.rule.execution.metrics.alerts_suppressed_count": [ 0 ], "rule.ruleset": [ "siem" ], "event.reason": [ "unknown" ], "event.kind": [ "alert" ], "kibana.server_uuid": [ "5b2de169-2785-441b-ae8c-186a1936b17d" ], "event.outcome": [ "failure" ], "rule.name": [ "Test rule 2" ], "kibana.alerting.status": [ "error" ], "kibana.saved_objects": [ { "type_id": [ "siem.queryRule" ], "rel": [ "primary" ], "id": [ "baca894d-1925-447e-a60d-bff768097a78" ], "type": [ "alert" ] } ], "kibana.alert.rule.execution.metrics.rule_type_run_duration_ms": [ 719 ], "event.provider": [ "alerting" ], "kibana.alert.rule.execution.metrics.alert_counts.recovered": [ 0 ], "kibana.alert.rule.execution.metrics.number_of_searches": [ 10 ], "ecs.version": [ "1.8.0" ], "kibana.alert.rule.execution.metrics.frozen_indices_queried_count": [ 0 ], "event.start": [ "2026-03-20T11:03:10.467Z" ], "kibana.alert.rule.revision": [ 1 ], "kibana.alert.rule.execution.metrics.number_of_triggered_actions": [ 0 ], "kibana.alert.rule.execution.metrics.process_rule_duration_ms": [ 23 ], "kibana.alert.rule.execution.metrics.number_of_delayed_alerts": [ 0 ], "event.end": [ "2026-03-20T11:03:11.235Z" ], "kibana.alert.rule.execution.metrics.prepare_rule_duration_ms": [ 14 ], "rule.license": [ "basic" ], "kibana.alert.rule.execution.metrics.gap_range": [ { "gte": "2026-03-20T10:29:39.844Z", "lte": "2026-03-20T10:37:10.427Z" } ], "kibana.alert.rule.rule_type_id": [ "siem.queryRule" ], "kibana.alert.rule.execution.metrics.number_of_generated_actions": [ 0 ], "kibana.alert.rule.execution.metrics.process_alerts_duration_ms": [ 1 ], "kibana.alert.rule.execution.metrics.trigger_actions_duration_ms": [ 1 ], "kibana.alert.rule.execution.metrics.total_indexing_duration_ms": [ 345 ], "kibana.alert.rule.execution.metrics.alert_counts.new": [ 32 ], "kibana.alert.rule.execution.metrics.persist_alerts_duration_ms": [ 0 ], "kibana.alert.rule.execution.metrics.total_enrichment_duration_ms": [ 8 ], "message": [ "siem.queryRule:baca894d-1925-447e-a60d-bff768097a78: execution failed" ], "kibana.alert.rule.consumer": [ "siem" ], "kibana.alert.rule.execution.metrics.update_alerts_duration_ms": [ 0 ], "event.duration": [ 768000000 ], "rule.uuid": [ "e3e0eb5e-ecdc-4f7c-9896-5ff923375575" ], "event.action": [ "execute" ], "@timestamp": [ "2026-03-20T11:03:11.235Z" ], "error.message": [ "8 minutes (450583ms) were not queried between this rule execution and the last execution, so signals may have been missed. Consider increasing your look behind time or adding more Kibana instances" ], "kibana.alert.rule.execution.metrics.total_run_duration_ms": [ 808 ], "kibana.alert.rule.execution.metrics.execution_gap_duration_s": [ 451 ], "kibana.alert.rule.execution.uuid": [ "5d5a25b3-52da-4fc2-9945-06f6e4c0d9c8" ], "kibana.space_ids": [ "default" ], "kibana.version": [ "9.4.0" ], "kibana.alert.rule.execution.metrics.es_search_duration_ms": [ 77 ], "rule.category": [ "siem.queryRule" ] } } ``` </details> <details> <summary>"execute-backfill" event example</summary> ``` { "_index": ".ds-.kibana-event-log-ds-2026.03.17-000001", "_id": "82159944-e0c8-4e70-a3fa-8604749f6282", "_version": 1, "_source": { "@timestamp": "2026-03-20T11:56:32.166Z", "event": { "provider": "alerting", "action": "execute-backfill", "kind": "alert", "start": "2026-03-20T11:56:32.160Z", "end": "2026-03-20T11:56:32.166Z", "duration": "6000000", "outcome": "failure", "reason": "decrypt" }, "kibana": { "alert": { "rule": { "execution": { "uuid": "f21ccc43-1dbb-4795-910b-f2ddf34f37ed", "metrics": { "claim_to_start_duration_ms": 0, "total_run_duration_ms": 19, "prepare_rule_duration_ms": 0, "rule_type_run_duration_ms": 0, "process_alerts_duration_ms": 0, "persist_alerts_duration_ms": 0, "trigger_actions_duration_ms": 0, "process_rule_duration_ms": 5, "update_alerts_duration_ms": 0 }, "backfill": { "id": "92981a1d-8dfb-4dfd-ab4a-cfa9f4150410" } } } }, "saved_objects": [ { "rel": "primary", "type": "ad_hoc_run_params", "id": "92981a1d-8dfb-4dfd-ab4a-cfa9f4150410" } ], "space_ids": [ "default" ], "task": { "scheduled": "2026-03-20T11:56:29.305Z", "schedule_delay": 2855000000 }, "alerting": { "status": "error", "outcome": "failure" }, "server_uuid": "5b2de169-2785-441b-ae8c-186a1936b17d", "version": "9.4.0" }, "rule": {}, "message": "92981a1d-8dfb-4dfd-ab4a-cfa9f4150410: execution failed", "error": { "message": "Saved object [ad_hoc_run_params/92981a1d-8dfb-4dfd-ab4a-cfa9f4150410] not found" }, "ecs": { "version": "1.8.0" } }, "fields": { "kibana.alert.rule.execution.metrics.process_rule_duration_ms": [ 5 ], "kibana.alerting.outcome": [ "failure" ], "kibana.task.schedule_delay": [ 2855000000 ], "kibana.task.scheduled": [ "2026-03-20T11:56:29.305Z" ], "event.end": [ "2026-03-20T11:56:32.166Z" ], "kibana.alert.rule.execution.metrics.claim_to_start_duration_ms": [ 0 ], "kibana.alert.rule.execution.metrics.prepare_rule_duration_ms": [ 0 ], "kibana.alert.rule.execution.metrics.process_alerts_duration_ms": [ 0 ], "event.reason": [ "decrypt" ], "kibana.alert.rule.execution.metrics.trigger_actions_duration_ms": [ 0 ], "event.kind": [ "alert" ], "kibana.server_uuid": [ "5b2de169-2785-441b-ae8c-186a1936b17d" ], "event.outcome": [ "failure" ], "kibana.alert.rule.execution.metrics.persist_alerts_duration_ms": [ 0 ], "kibana.alerting.status": [ "error" ], "kibana.saved_objects": [ { "rel": [ "primary" ], "id": [ "92981a1d-8dfb-4dfd-ab4a-cfa9f4150410" ], "type": [ "ad_hoc_run_params" ] } ], "message": [ "92981a1d-8dfb-4dfd-ab4a-cfa9f4150410: execution failed" ], "kibana.alert.rule.execution.metrics.rule_type_run_duration_ms": [ 0 ], "kibana.alert.rule.execution.metrics.update_alerts_duration_ms": [ 0 ], "event.duration": [ 6000000 ], "kibana.alert.rule.execution.backfill.id": [ "92981a1d-8dfb-4dfd-ab4a-cfa9f4150410" ], "event.action": [ "execute-backfill" ], "event.provider": [ "alerting" ], "@timestamp": [ "2026-03-20T11:56:32.166Z" ], "ecs.version": [ "1.8.0" ], "error.message": [ "Saved object [ad_hoc_run_params/92981a1d-8dfb-4dfd-ab4a-cfa9f4150410] not found" ], "kibana.alert.rule.execution.metrics.total_run_duration_ms": [ 19 ], "kibana.alert.rule.execution.uuid": [ "f21ccc43-1dbb-4795-910b-f2ddf34f37ed" ], "kibana.space_ids": [ "default" ], "kibana.version": [ "9.4.0" ], "event.start": [ "2026-03-20T11:56:32.160Z" ] } } ``` </details>
#259917) **Addresses:** #255513 **Relates to:** #257203 ## Summary This PR adds logic to calculate `alerts_candidate_count` rule execution metric. The mapping was added in #257203. ## Details `alerts_candidate_count` rule execution metric represents the number of detected alerts before any alerts suppression is applied. Alerts may be suppressed when alerts suppression by some field(s) in a certain time window is configured. Or it's a duplicate alert so Elasticsearch won't index alert documents with the same ids. The calculation logic varies based on the rule type but the logic described in the previous paragraph persists.
elastic#259917) **Addresses:** elastic#255513 **Relates to:** elastic#257203 ## Summary This PR adds logic to calculate `alerts_candidate_count` rule execution metric. The mapping was added in elastic#257203. ## Details `alerts_candidate_count` rule execution metric represents the number of detected alerts before any alerts suppression is applied. Alerts may be suppressed when alerts suppression by some field(s) in a certain time window is configured. Or it's a duplicate alert so Elasticsearch won't index alert documents with the same ids. The calculation logic varies based on the rule type but the logic described in the previous paragraph persists.
…#259938) **Addresses:** #255513 **Relates to:** #257203 ## Summary This PR adds `matched_indices_count` rule executor metric. On top of that it adds `matched_indicator_indices_count` for Indicator Match rules to cover indicator match indices. ## Details `matched_indices_count` rule execution metric represents the number of indices matched during rule execution. While obviously matched indices might be calculated in several ways we stick to the logic used in `runExecutionValidation()` function. It checks whether rule's index patterns has any matching indices to break rule execution when there are 0 matching indices. Under the hood it boils down to a request the [ES field caps API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-field-caps).
elastic#259917) **Addresses:** elastic#255513 **Relates to:** elastic#257203 ## Summary This PR adds logic to calculate `alerts_candidate_count` rule execution metric. The mapping was added in elastic#257203. ## Details `alerts_candidate_count` rule execution metric represents the number of detected alerts before any alerts suppression is applied. Alerts may be suppressed when alerts suppression by some field(s) in a certain time window is configured. Or it's a duplicate alert so Elasticsearch won't index alert documents with the same ids. The calculation logic varies based on the rule type but the logic described in the previous paragraph persists.
…elastic#259938) **Addresses:** elastic#255513 **Relates to:** elastic#257203 ## Summary This PR adds `matched_indices_count` rule executor metric. On top of that it adds `matched_indicator_indices_count` for Indicator Match rules to cover indicator match indices. ## Details `matched_indices_count` rule execution metric represents the number of indices matched during rule execution. While obviously matched indices might be calculated in several ways we stick to the logic used in `runExecutionValidation()` function. It checks whether rule's index patterns has any matching indices to break rule execution when there are 0 matching indices. Under the hood it boils down to a request the [ES field caps API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-field-caps).
Partially addresses: #255513
Resolves: #135209
Design document (internal)
Summary
This PR adds functionality to log the rule execution metrics and context data into the Alerting Framework's
executeevent log event.Details
Detection Engine Observability initiative requires collecting various metrics related to the security rule execution. We already collect a number of metrics and write them to the Event Log and would like to add more later on. On top of that we'd like to log dynamic customer defined metrics. This PR focuses on implementing extensible functionality where we could add more metrics later on. The desired list of metrics is under discussion. Above that make sure we log a single rule result event per execution.
Historically there is functionality on the Security Solution's side that logs multiple events per each rule execution like
status-change,execution-metricsandmessage. Weirdly it may log multiplestatus-changeevents per single rule execution where rule status flips between error and success. Additionally Alerting Framework logs rule execution related events to the Event Log as well. In particular it logsexecute-startandexecuteevent where the latter has almost all the metrics Security Solution logs to the Event Log already.This PR adds functionality to log security rule metrics and the rule context data like correlation ids to the Alerting Framework's
executeevent.Existing events produced on Security Solution's side
status-changeandexecution-metricswill be written as is. This decision allow to reduce this PR's scope. Getting rid ofstatus-changeandexecution-metricsrequires a follow up PR tackling rule execution UI, snapshot based telemetry and functional tests relying on the Event Log.Changes
At the high level changes could be described as follows
setMetric()andsetMetrics()executeeventrule_idakauuidto theexecuteeventsetMetric()andsetMetrics()methods and pass the metrics to the RuleMonitoring service. Additionally make sure we wire a single result per execution.The Logged Event
"execute" event example
"execute-backfill" event example