[ResponseOps][Alerting] Alerting v2: Director #247673
Merged
Summary
Note
Dear reviewers: this PR is being merged into a feature branch, so only the ResponseOps review is needed at the moment. We will request your review when we open the PR that merges the feature branch into main.
This PR implements the Director component of the alerting v2 core engine. The Director is an asynchronous state engine responsible for deriving alert state transitions (e.g., Pending → Active) from the immutable stream of raw alert events.
Architecture
Strategy Pattern
To keep the Director agnostic of specific business logic, we implemented the Strategy Pattern. The Director facilitates the data flow, while an ITransitionStrategy defines the actual state machine logic. This allows us to support different transition behaviors, based on rule configuration or alert event type, without modifying the core service. It may seem like overengineering at the moment, but I think it will help us in the long run. At the moment, only one strategy is supported, the BasicTransitionStrategy, which moves the state through inactive -> pending -> active -> recovering -> inactive based on the status of the alert event and the latest episode status, if one exists.
Episode Lifecycle Management
The state is calculated as:
Pending: A new alert has started, but must wait in Pending before becoming Active.
Inactive: The condition cleared before it could become Active.
Recovering: An active alert has stopped breaching and enters the recovery phase.
Active: An alert that was recovering has breached again.
The episode ID is preserved across pending, active, and recovering states. A new episode ID is generated only when transitioning from inactive to a non-inactive state (a new episode starts).
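The split between the Director and its strategy, together with the transitions and episode ID rule described above, can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: only the names ITransitionStrategy and BasicTransitionStrategy come from the description, while the method signature, the `deriveEpisodeState` helper, the `EpisodeState` shape, and the ID generator are hypothetical. Two behaviors the description does not spell out are assumed here: a breached event keeps an active episode active, and a recovered event leaves an inactive alert inactive.

```typescript
type EpisodeStatus = 'inactive' | 'pending' | 'active' | 'recovering';
type AlertEventStatus = 'breached' | 'recovered';

// Hypothetical shape of the latest known episode for an alert.
interface EpisodeState {
  status: EpisodeStatus;
  episodeId: string | undefined; // undefined until a first episode starts
}

// Interface name from the PR description; the method signature is an assumption.
interface ITransitionStrategy {
  getNextStatus(current: EpisodeStatus, eventStatus: AlertEventStatus): EpisodeStatus;
}

// Transition table for the basic lifecycle.
// `active + breached` and `inactive + recovered` are assumed to be no-ops.
const BASIC_TRANSITIONS: Record<EpisodeStatus, Record<AlertEventStatus, EpisodeStatus>> = {
  inactive: { breached: 'pending', recovered: 'inactive' },
  pending: { breached: 'active', recovered: 'inactive' },
  active: { breached: 'active', recovered: 'recovering' },
  recovering: { breached: 'active', recovered: 'inactive' },
};

class BasicTransitionStrategy implements ITransitionStrategy {
  getNextStatus(current: EpisodeStatus, eventStatus: AlertEventStatus): EpisodeStatus {
    return BASIC_TRANSITIONS[current][eventStatus];
  }
}

// Hypothetical stand-in for the Director: it owns the data flow and the
// episode ID bookkeeping, and delegates the state machine to the strategy.
function deriveEpisodeState(
  strategy: ITransitionStrategy,
  newEpisodeId: () => string,
  eventStatus: AlertEventStatus,
  latest?: EpisodeState
): EpisodeState {
  // Assumption: an alert with no previous episode is treated as inactive.
  const current: EpisodeStatus = latest?.status ?? 'inactive';
  const nextStatus = strategy.getNextStatus(current, eventStatus);
  // A new episode ID is generated only when leaving the inactive state.
  const startsNewEpisode = current === 'inactive' && nextStatus !== 'inactive';
  return {
    status: nextStatus,
    episodeId: startsNewEpisode ? newEpisodeId() : latest?.episodeId,
  };
}

// Usage: replay a sequence of alert event statuses for a single alert.
let counter = 0;
const newEpisodeId = (): string => `episode-${++counter}`;
const strategy = new BasicTransitionStrategy();

const eventStatuses: AlertEventStatus[] = [
  'breached', 'breached', 'recovered', 'breached', 'recovered', 'recovered', 'breached',
];

let state: EpisodeState | undefined;
const timeline: EpisodeState[] = [];
for (const eventStatus of eventStatuses) {
  state = deriveEpisodeState(strategy, newEpisodeId, eventStatus, state);
  timeline.push(state);
}

// Statuses: pending, active, recovering, active, recovering, inactive, pending.
// The first six entries share "episode-1"; the final pending entry starts "episode-2".
console.log(timeline);
```

Keeping the transition table inside the strategy means a count-based or time-based strategy could later replace it without touching the episode ID handling.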
Important
Calculating the states based on counts or timeframes will be implemented in the next PR, to keep this PR small and make the fundamentals of the Director easier to review. The same applies to streaming the ESQL results to the Director and to the data stream.
flowchart LR
  subgraph Lifecycle["Episode Lifecycle"]
    direction LR
    INACTIVE((INACTIVE))
    PENDING((PENDING))
    ACTIVE((ACTIVE))
    RECOVERING((RECOVERING))
    INACTIVE -->|"breached<br/>New Episode ID"| PENDING
    PENDING -->|breached| ACTIVE
    ACTIVE -->|recovered| RECOVERING
    RECOVERING -->|recovered| INACTIVE
    RECOVERING -->|"breached<br/>"| ACTIVE
    PENDING -->|"recovered<br/>Episode Ends"| INACTIVE
  end
  style INACTIVE fill:#9e9e9e,color:#fff
  style PENDING fill:#ffc107,color:#000
  style ACTIVE fill:#f44336,color:#fff
  style RECOVERING fill:#ff9800,color:#000
Example
Alert events
Out of scope
Testing
Create a rule that fires breach events.
Maturation:
The Recovering state cannot be tested at the moment, because it requires the rule executor to produce the corresponding alert events.
Checklist
Check that the PR satisfies the following conditions.
Reviewers should verify this PR satisfies this list as well.