Add elastic agent alerting rule templates #15572
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Putting this back in draft temporarily to avoid accidental merge. We want to validate these more against running agents -- but still open for config review.
…agent path matching
💚 Build Succeeded
nchaulet left a comment
LGTM 🚀
Package elastic_agent - 2.6.4 containing this change is available at https://epr.elastic.co/package/elastic_agent/2.6.4/
Add alerting rule templates to the Elastic Agent package:
* CPU usage spike
* Excessive memory usage
* High pipeline queue
* Dropped events
* Output errors
* Excessive restarts
* Unhealthy status
Proposed commit message
Extended description
Here is an initial exploration of alerting rule templates for monitoring elastic agent health. This PR can just include the ones we feel the most confident about, and defer others for further refinement and exploration.
Install the rules
How to install the rules:
* `your-local-dir/integrations/packages/elastic_agent`
* `packages/elastic_agent/manifest.yml` from 2.6.4 to 2.6.3
* `elastic-package build --skip-validation`. Run this in the `elastic_agent` package directory
* `build/packages/elastic_agent-2.6.3.zip`
* `Create new integration` CTA at the top right
* `upload it as a .zip` link, and upload the zip you built
* `Elastic Agent` for filtering.

Rule templates:
So that the ESQL is clear, here is a summary of their definitions.
Resource Utilization
Processes matching `*elastic*agent*` are above 80% of total CPU utilization. Calculate the max for 1 minute buckets and check if there are 5 occurrences when looking back 7 minutes. Rows are distinct by agent id and process name.
```
FROM metrics-*, *:metrics-*
| WHERE process.executable RLIKE ".*[Ee]lastic.*[Aa]gent.*" AND agent.name NOT LIKE "*agentless*"
// Max CPU percentage per agent and process in each 1 minute time bucket
| STATS cpu_process_pct = MAX(system.process.cpu.total.pct) * 100
    BY elastic_agent.id, process.name,
       time_bucket = BUCKET(@timestamp, 1 minute)
// Count the 1 minute time buckets that are above 80% by process and agent
| WHERE cpu_process_pct >= 80
| STATS count_above_threshold = COUNT(*)
    BY elastic_agent.id, process.name
// Alert if there are 5 or more occurrences
| WHERE count_above_threshold >= 5
```
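To make the bucket-and-count logic of the CPU rule easier to follow, here is a rough Python sketch of the same steps. It is not part of the PR: the function name, the tuple layout of `samples`, and the sample data are made up for illustration, and it assumes the samples have already been limited to the rule's lookback window (7 minutes in the description above).

```python
from collections import defaultdict
from datetime import datetime, timedelta


def agents_over_cpu_threshold(samples, threshold_pct=80.0, min_buckets=5):
    """Hypothetical helper mirroring the ES|QL stages above.

    samples: iterable of (agent_id, process_name, timestamp, cpu_fraction),
    where cpu_fraction plays the role of system.process.cpu.total.pct.
    """
    # Stage 1: max CPU percent per (agent, process, 1 minute bucket).
    bucket_max = defaultdict(float)
    for agent_id, process_name, ts, cpu_fraction in samples:
        bucket = ts.replace(second=0, microsecond=0)
        key = (agent_id, process_name, bucket)
        bucket_max[key] = max(bucket_max[key], cpu_fraction * 100)

    # Stage 2: count the buckets at or above the threshold per (agent, process).
    counts = defaultdict(int)
    for (agent_id, process_name, _bucket), pct in bucket_max.items():
        if pct >= threshold_pct:
            counts[(agent_id, process_name)] += 1

    # Stage 3: keep only the rows with enough hot buckets to alert.
    return {key: n for key, n in counts.items() if n >= min_buckets}


start = datetime(2025, 1, 1, 12, 0)
samples = [("agent-1", "elastic-agent", start + timedelta(minutes=i), 0.9) for i in range(5)]
samples += [("agent-2", "elastic-agent", start + timedelta(minutes=i), 0.9) for i in range(4)]
print(agents_over_cpu_threshold(samples))  # {('agent-1', 'elastic-agent'): 5}
```

Only `agent-1` is returned, because `agent-2` has four hot buckets and the rule requires five.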
Processes matching `*elastic*agent*` are above 50% of total memory usage. Rows are distinct by agent id.
Beats Pipelines and Queues
`beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinct by agent id and component id.
Agent Stability
`elastic_agent.status_change` datastream
Checklist
`changelog.yml` file.
Author's Checklist
How to test this PR locally
Build and install the Elastic Agent package locally:
Related issues
Screenshots