[ResponseOps] Integrate rule and action monitoring data to the monitoring collection plugin#123416
… rops/rule_monitoring
ymao1
left a comment
Looks good overall! Left some minor comments about code consolidation. It would also be nice to add functional tests for this if possible.
Also wondering if the api/monitoring_collection endpoints should be internal?
Great idea, I'll add some.
These APIs are designed to be publicly accessible: users might want to consume this data in other monitoring solutions, and we want to make that possible.
@ymao1 Thanks for all the great suggestions. I've implemented nearly all of them and this PR is ready for another round!
ymao1
left a comment
LGTM! Just one comment about types
@elasticmachine merge upstream
💚 Build Succeeded
Relates to #123637
Summary
This PR makes use of the new monitoring collection system by collecting metrics for both the alerting and actions plugins. The specific metrics answer these questions; some were readily available, while others were added in this PR.
We are segmenting our metrics (in each plugin) into those that pertain to just this specific Kibana instance and those that pertain to all Kibana instances in a cluster. We call the former a "node" level metric and the latter a "cluster" level metric (and the terminology in this PR should reflect this).
The "node" level metrics added in this PR account for most of the new code. They involve in-memory counters that are incremented each time a rule or action finishes execution, and each time a rule or action's execution results in a failure. These metrics are represented by the types node_actions and node_rules.
The "cluster" level metrics added in this PR are the results of a query to the task manager index where we return the number of delayed tasks. A delayed task is a task for which either runAt() < now and status = Idle, or retryAt() < now and (status = Running || status = Claimed). We include a count of delayed tasks, as well as p50 and p99 data points around how long the delay is (in ms).
As part of the integration with the monitoring collection plugin, the registration of these new collectors means that the following routes will now return the above data:
/api/monitoring_collection/node_rules
/api/monitoring_collection/cluster_rules
/api/monitoring_collection/node_actions
/api/monitoring_collection/cluster_actions
Testing
To test, you'll need to create some rules with some actions and verify that the data from the above endpoints matches what you expect. To simulate delayed tasks, you can configure task manager to run only one rule per minute (or similar) while giving the rules a shorter interval than that, so that tasks back up.
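When comparing endpoint output against what you expect, it can help to have the metric definitions from the Summary written out as code. The following is only a rough sketch under stated assumptions, not the code in this PR: the type and function names (Task, NodeCounters, delayMs, summarizeDelayed) are hypothetical, the delay is assumed to be measured as now minus runAt (or retryAt), and the percentiles use a simple nearest-rank method over an in-memory array rather than whatever the real query against the task manager index computes.

```typescript
// Hypothetical shapes for illustration only; the PR's real types may differ.
type TaskStatus = 'Idle' | 'Running' | 'Claimed';

interface Task {
  status: TaskStatus;
  runAt: number; // epoch milliseconds
  retryAt?: number; // epoch milliseconds, if a retry is scheduled
}

// "Node" level: in-memory counters bumped whenever a rule or action finishes.
class NodeCounters {
  executions = 0;
  failures = 0;

  recordExecution(failed: boolean): void {
    this.executions += 1;
    if (failed) {
      this.failures += 1;
    }
  }
}

// Returns the delay in ms if the task is delayed per the definition in the
// Summary, otherwise null. Measuring delay as (now - runAt/retryAt) is an assumption.
function delayMs(task: Task, now: number): number | null {
  if (task.status === 'Idle' && task.runAt < now) {
    return now - task.runAt;
  }
  if (
    (task.status === 'Running' || task.status === 'Claimed') &&
    task.retryAt !== undefined &&
    task.retryAt < now
  ) {
    return now - task.retryAt;
  }
  return null;
}

// Nearest-rank percentile over an ascending array (illustrative stand-in).
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) {
    return 0;
  }
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  return sortedAsc[Math.max(0, rank - 1)];
}

// "Cluster" level: count of delayed tasks plus p50 and p99 of their delay.
function summarizeDelayed(tasks: Task[], now: number) {
  const delays = tasks
    .map((task) => delayMs(task, now))
    .filter((d): d is number => d !== null)
    .sort((a, b) => a - b);
  return {
    count: delays.length,
    p50: percentile(delays, 50),
    p99: percentile(delays, 99),
  };
}
```

With a handful of hand-built tasks and a fixed value for now, summarizeDelayed gives you numbers to compare against the cluster_rules and cluster_actions responses, and NodeCounters mirrors the executions and failures you would expect the node_rules and node_actions responses to reflect after running a known number of rules and actions.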