[ResponseOps][Alerting] Alerting v2: Director #247673

Merged
cnasikas merged 23 commits into elastic:alerting_v2 from cnasikas:alerting_v2_director
Feb 3, 2026

@cnasikas cnasikas commented Dec 31, 2025

Summary

Note

Dear reviewers: this PR is being merged into a feature branch. Only the ResponseOps review is needed at the moment. We will request your review when we open the PR that merges the feature branch into main.

This PR implements the Director component of the alerting v2 core engine. The Director is an asynchronous state engine responsible for deriving alert state transitions (e.g., Pending → Active) from the immutable stream of raw alert events.

Architecture

Strategy Pattern

To keep the Director agnostic of specific business logic, we implemented the Strategy Pattern. The Director facilitates the data flow, while an ITransitionStrategy defines the actual state machine logic. This allows us to support different transition behaviors, based on rule configuration or alert event type, without modifying the core service. It may seem like overengineering at the moment, but I think it will help us in the long run. For now only one strategy is supported, the BasicTransitionStrategy, which moves the state through inactive -> pending -> active -> recovering -> inactive based on (a) the status of the alert event and (b) the latest episode status, if one exists.
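
The separation described above could be sketched roughly as follows. Only ITransitionStrategy is a name taken from this PR; the `getNextState` method, the `Director.run` signature, the type shapes, and the `alwaysActiveOnBreach` strategy are assumptions made for illustration and may not match the actual implementation.

```typescript
// Hypothetical shapes; the real types in the PR may differ.
type AlertEventStatus = 'breached' | 'recovered';
type EpisodeStatus = 'inactive' | 'pending' | 'active' | 'recovering';

interface TransitionContext {
  eventStatus: AlertEventStatus;
  // Undefined when the alert has no previous episode.
  previousEpisodeStatus?: EpisodeStatus;
}

// The strategy owns the state machine logic.
interface ITransitionStrategy {
  name: string;
  getNextState(ctx: TransitionContext): EpisodeStatus;
}

// The director only moves data through whichever strategy it was given.
class Director {
  private readonly strategy: ITransitionStrategy;

  constructor(strategy: ITransitionStrategy) {
    this.strategy = strategy;
  }

  run(contexts: TransitionContext[]): EpisodeStatus[] {
    return contexts.map((ctx) => this.strategy.getNextState(ctx));
  }
}

// A deliberately trivial, made-up strategy: it shows that behavior can be
// swapped without touching the Director class.
const alwaysActiveOnBreach: ITransitionStrategy = {
  name: 'always_active_on_breach',
  getNextState: ({ eventStatus }) => (eventStatus === 'breached' ? 'active' : 'inactive'),
};

const director = new Director(alwaysActiveOnBreach);
const result = director.run([
  { eventStatus: 'breached' },
  { eventStatus: 'recovered', previousEpisodeStatus: 'active' },
]);
```

The point of the sketch is the dependency direction: the Director depends only on the interface, so a strategy selected per rule configuration or per alert event type can be injected without changing the Director itself.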

Episode Lifecycle Management

The state is calculated as:

  • Inactive + Breached → Pending: A new alert has started, but must wait in Pending before becoming Active.
  • Pending + Breached → Active: The alert keeps breaching and becomes Active.
  • Pending + Recovered → Inactive: The condition cleared before it could become Active.
  • Active + Recovered → Recovering: An active alert has stopped breaching and enters the recovery phase.
  • Recovering + Breached → Active: An alert that was recovering has breached again.
  • Recovering + Recovered → Inactive: The alert stayed recovered and the episode ends.

The episode ID is preserved across pending, active, and recovering states. A new episode ID is generated only when transitioning from inactive to a non-inactive state (a new episode starts).
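
As a minimal sketch of these rules, the transitions can be written as a lookup table keyed by the current episode status and the incoming event status. The names `TRANSITIONS`, `nextEpisode`, and `Episode` are hypothetical, and two entries are assumptions not stated in this description: Inactive + Recovered is treated as a no-op, and Active + Breached is assumed to stay Active. A counter stands in for a real UUID generator so the output is deterministic.

```typescript
type AlertEventStatus = 'breached' | 'recovered';
type EpisodeStatus = 'inactive' | 'pending' | 'active' | 'recovering';

interface Episode {
  status: EpisodeStatus;
  id?: string;
}

// Current episode status + event status -> next episode status.
const TRANSITIONS: Record<EpisodeStatus, Record<AlertEventStatus, EpisodeStatus>> = {
  inactive: { breached: 'pending', recovered: 'inactive' }, // recovered: assumed no-op
  pending: { breached: 'active', recovered: 'inactive' },
  active: { breached: 'active', recovered: 'recovering' }, // breached: assumed to stay active
  recovering: { breached: 'active', recovered: 'inactive' },
};

function nextEpisode(
  previous: Episode | undefined,
  eventStatus: AlertEventStatus,
  newId: () => string
): Episode {
  // An alert with no previous episode is treated as inactive.
  const prev: Episode = previous ?? { status: 'inactive' };
  const status = TRANSITIONS[prev.status][eventStatus];
  // A new episode ID is generated only when leaving the inactive state.
  const startsNewEpisode = prev.status === 'inactive' && status !== 'inactive';
  return { status, id: startsNewEpisode ? newId() : prev.id };
}

// Deterministic stand-in for a UUID generator (illustration only).
let counter = 0;
const newId = () => `uuid-${++counter}`;

const events: AlertEventStatus[] = ['breached', 'breached', 'recovered', 'recovered', 'breached'];
const timeline: Episode[] = [];
let current: Episode | undefined;
for (const eventStatus of events) {
  current = nextEpisode(current, eventStatus, newId);
  timeline.push(current);
}
```

Replaying breached, breached, recovered, recovered, breached through this sketch yields the statuses pending, active, recovering, inactive, pending. The first four entries share `uuid-1` and the last one gets `uuid-2`, because the final event leaves the inactive state and therefore starts a new episode.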

Important

Calculating the states based on counts or timeframes will be implemented in the next PR, to keep this PR small and make the fundamentals of the Director easier to review. The same applies to streaming the ES|QL results to the Director and to the data stream.

flowchart LR
    subgraph Lifecycle["Episode Lifecycle"]
        direction LR
        
        INACTIVE((INACTIVE))
        PENDING((PENDING))
        ACTIVE((ACTIVE))
        RECOVERING((RECOVERING))
        
        INACTIVE -->|"breached<br/>New Episode ID"| PENDING
        PENDING -->|breached| ACTIVE
        ACTIVE -->|recovered| RECOVERING
        RECOVERING -->|recovered| INACTIVE
        
        RECOVERING -->|"breached<br/>"| ACTIVE
        PENDING -->|"recovered<br/>Episode Ends"| INACTIVE
    end

    style INACTIVE fill:#9e9e9e,color:#fff
    style PENDING fill:#ffc107,color:#000
    style ACTIVE fill:#f44336,color:#fff
    style RECOVERING fill:#ff9800,color:#000

Example

Alert events

| Row | @timestamp | Status | Episode status | Episode ID |
| --- | --- | --- | --- | --- |
| 1 | 10:00 | breached | pending | uuid-1 |
| 2 | 10:05 | breached | active | uuid-1 |
| 3 | 10:10 | recovered | recovering | uuid-1 |
| 4 | 10:15 | recovered | inactive | uuid-1 |
| 5 | 10:20 | breached | pending | uuid-2 |

Out of scope

  • State transitions based on counts or timeframes.
  • Streaming of ES|QL results.

Testing

  1. Create a rule that fires breach events.

  2. Maturation:

    • Verify that the alert event documents have the correct episode status, alert event status, and the episode ID on each run.

Recovering cannot be tested at the moment, as it requires the rule executor to produce these alert events.

Checklist

Check that the PR satisfies the following conditions.

Reviewers should verify that this PR satisfies this list as well.


Labels

backport:skip This PR does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t//
