
[ResponseOps] Integrate rule and action monitoring data to the monitoring collection plugin #123416

Merged
chrisronline merged 62 commits into elastic:main from chrisronline:rops/rule_monitoring
Mar 24, 2022


Contributor

@chrisronline chrisronline commented Jan 19, 2022

Relates to #123637

Summary

This PR uses the new monitoring collection system to collect metrics for both the alerting and actions plugins. The specific metrics answer these questions; some were readily available, while others were added in this PR.

In each plugin, we segment metrics into two kinds: those that pertain to just this specific Kibana instance, and those that pertain to all Kibana instances in a cluster. We call the former "node" level metrics and the latter "cluster" level metrics (and the terminology in this PR should reflect this).

The "node" level metrics account for most of the new code in this PR. They are in-memory counters that are incremented each time a rule or action finishes execution, and each time a rule or action's execution results in a failure. These metrics are represented by the types node_actions and node_rules.
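
As a rough illustration of the in-memory counter idea described above, here is a minimal sketch. The class name, method names, and metric keys are hypothetical and are not taken from this PR's source; the sketch only shows the pattern of counting every finished execution and, separately, every failure.

```typescript
// Hypothetical sketch: names are illustrative, not the PR's actual API.
type MetricName = 'executions' | 'failures';

class InMemoryCounters {
  private counts: Record<MetricName, number> = { executions: 0, failures: 0 };

  // Record one finished execution; a failed execution also bumps `failures`.
  public recordExecution(succeeded: boolean): void {
    this.counts.executions++;
    if (!succeeded) {
      this.counts.failures++;
    }
  }

  // Return a copy so callers cannot mutate the live counters.
  public snapshot(): Record<MetricName, number> {
    return { ...this.counts };
  }
}

// Usage: one counter set per plugin (for example, one for rules, one for actions).
const ruleCounters = new InMemoryCounters();
ruleCounters.recordExecution(true);
ruleCounters.recordExecution(false);
console.log(ruleCounters.snapshot()); // { executions: 2, failures: 1 }
```

Because the counters live in process memory, they describe only the Kibana instance that holds them, which is why they fit the "node" level.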

The "cluster" level metrics added in this PR are the results of a query to the task manager index that returns the number of delayed tasks. A delayed task is a task where either runAt() < now and status = Idle, or retryAt() < now and (status = Running || status = Claimed). We include a count of delayed tasks, as well as p50 and p99 data points for how long the delay is (in ms).
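
To make the delayed-task definition concrete, the sketch below applies it to an in-memory list of tasks. This is not how the PR computes the values (the PR queries the task manager index); the task shape, the status strings, and the nearest-rank percentile method are all assumptions chosen for illustration.

```typescript
// Hypothetical task shape; the status names follow the wording of the description above.
type TaskStatusLike = 'idle' | 'running' | 'claimed' | 'failed';

interface TaskLike {
  status: TaskStatusLike;
  runAt: number; // epoch ms
  retryAt: number | null; // epoch ms, null when no retry is scheduled
}

// Returns the delay in ms if the task counts as delayed, otherwise null.
function delayOf(task: TaskLike, now: number): number | null {
  if (task.status === 'idle' && task.runAt < now) {
    return now - task.runAt;
  }
  const inFlight = task.status === 'running' || task.status === 'claimed';
  if (inFlight && task.retryAt !== null && task.retryAt < now) {
    return now - task.retryAt;
  }
  return null;
}

// Nearest-rank percentile over an ascending-sorted array (an assumed method).
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function summarizeDelays(tasks: TaskLike[], now: number) {
  const delays = tasks
    .map((task) => delayOf(task, now))
    .filter((delay): delay is number => delay !== null)
    .sort((a, b) => a - b);
  return { count: delays.length, p50: percentile(delays, 50), p99: percentile(delays, 99) };
}
```

A task with status idle and runAt one second in the past contributes a delay of 1000 ms; a running task whose retryAt is still in the future contributes nothing.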

As part of the integration with the monitoring collection plugin, the registration of these new collectors means that the following routes will now return the above data:

/api/monitoring_collection/node_rules
/api/monitoring_collection/cluster_rules
/api/monitoring_collection/node_actions
/api/monitoring_collection/cluster_actions
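
For anyone who wants to poll the four routes above from a script, a small helper might look like the following sketch. The function names are hypothetical, the response bodies are treated as opaque because their shape is not described here, and the HTTP call is injected so that authentication and headers (which depend on the deployment) stay outside the helper.

```typescript
// The four collector types listed above.
const COLLECTOR_TYPES = ['node_rules', 'cluster_rules', 'node_actions', 'cluster_actions'] as const;
type CollectorType = (typeof COLLECTOR_TYPES)[number];

// Build the route for one collector type, tolerating trailing slashes on the base URL.
function collectorUrl(baseUrl: string, type: CollectorType): string {
  return `${baseUrl.replace(/\/+$/, '')}/api/monitoring_collection/${type}`;
}

// `fetchJson` is supplied by the caller (assumption: it performs an authenticated
// GET and resolves to the parsed JSON body).
async function fetchAllCollectors(
  baseUrl: string,
  fetchJson: (url: string) => Promise<unknown>
): Promise<Record<CollectorType, unknown>> {
  const result = {} as Record<CollectorType, unknown>;
  for (const type of COLLECTOR_TYPES) {
    result[type] = await fetchJson(collectorUrl(baseUrl, type));
  }
  return result;
}
```

With a local instance, `collectorUrl('http://localhost:5601', 'node_rules')` would yield `http://localhost:5601/api/monitoring_collection/node_rules`; the host, port, and any base path are assumptions about your setup.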

Testing

To test, create some rules with actions attached and verify that the data from the above endpoints matches what you expect. To simulate delayed tasks, you can configure task manager to run, for example, one rule per minute, and schedule the rules to run more often than that.

Contributor

@ymao1 ymao1 left a comment


Looks good overall! Left some minor comments about code consolidation. It would also be nice to add functional tests for this if possible.

Also wondering if the api/monitoring_collection endpoints should be internal?

Comment thread x-pack/plugins/actions/server/monitoring/register_cluster_collector.ts Outdated
Comment thread x-pack/plugins/alerting/server/monitoring/in_memory_metrics.ts
Comment thread x-pack/plugins/alerting/server/monitoring/register_cluster_collector.ts Outdated
Comment thread x-pack/plugins/alerting/server/task_runner/task_runner.test.ts Outdated
Comment thread x-pack/plugins/alerting/server/task_runner/task_runner.ts
Comment thread x-pack/plugins/actions/server/monitoring/register_cluster_collector.ts Outdated
@chrisronline
Contributor Author

> It would also be nice to add functional tests for this if possible.

Great idea, I'll add some.

> Also wondering if the api/monitoring_collection endpoints should be internal?

These APIs are designed to be publicly accessible: users might want to consume this data in other monitoring solutions, and we want to make that possible.

@chrisronline chrisronline requested a review from ymao1 March 17, 2022 15:18
@chrisronline
Contributor Author

@ymao1 Thanks for all the great suggestions. I've implemented nearly all of them and this PR is ready for another round!

Contributor

@ymao1 ymao1 left a comment


LGTM! Just one comment about types

Comment thread x-pack/plugins/actions/server/monitoring/types.ts
Comment thread x-pack/plugins/alerting/server/monitoring/types.ts
@chrisronline
Contributor Author

@elasticmachine merge upstream

@chrisronline chrisronline removed the request for review from a team March 24, 2022 14:49
@kibana-ci

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| monitoringCollection | 5 | 9 | +4 |
| taskManager | 33 | 39 | +6 |
| total | | | +10 |

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| monitoringCollection | 1 | 0 | -1 |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| monitoringCollection | 5 | 9 | +4 |
| taskManager | 71 | 77 | +6 |
| total | | | +10 |


To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @chrisronline

@chrisronline chrisronline merged commit f981d53 into elastic:main Mar 24, 2022
@chrisronline chrisronline deleted the rops/rule_monitoring branch March 24, 2022 16:23

Labels

release_note:skip, review, Team:ResponseOps, v8.2.0

6 participants