Create an officially maintained Prometheus rules (alerting rules) and Grafana dashboard #21

ksa-real · 2022-08-04T03:11:48Z

So far I haven't found any. The proposal is to create an initial version of both and to dump some initial knowledge there. As the Manticore authors likely have operational experience with Manticore, the ask is to add some basic monitoring rules and dashboards. Users can help evolve them over time.

sanikolaev · 2022-08-04T04:49:05Z

> officially maintained Prometheus rules

> dump there some initial knowledge

What do you think of https://github.com/manticoresoftware/manticoresearch-prometheus? The initial knowledge is already there.

> Grafana dashboard, and Alertmanager alerting rules

Yes. Would you like to take on this project? The Manticore core team can help by providing info on what typically goes wrong and what might be worth monitoring.

ksa-real · 2022-08-04T04:54:25Z

Fixed the issue title. Prometheus rules are actually alerting rules.

OK, I can try. Meanwhile, please suggest here a few things to alert on and what to have in the dashboard. Let's assume the Kube Prometheus Stack Helm charts are used. I think VictoriaMetrics can use the same PrometheusRule CRD. I've seen people use libsonnet mixins as the source of truth, but I'm not very experienced with them, so I'll likely start with the CRD and migrate to a "better" source of truth if necessary.

Also, would you please review the ongoing PRs?

sanikolaev · 2022-08-04T05:52:32Z

> Meanwhile, please suggest here a few things to alert on and what to have in the dashboard

In addition to https://github.com/manticoresoftware/manticoresearch-prometheus/blob/master/src/Exporter.php here's what else may be important to track / alert on:

version:
- daemon version (to see how an upgrade correlates with something else on the graphs)
- columnar library version
- secondary library version
availability:
- instance is running
connectivity:
- can connect at all
- connection time
resource consumption:
- searchd's anon-rss (to catch a memleak)
- searchd's total RSS
- searchd's VIRT
- index files:
  - count
  - size
- binlog files:
  - count
  - size
- file descriptors count
schema:
- number of tables
- checksum/generation of all the schemas (to see how a schema change correlates with something else in Grafana)
- non-served tables count
cluster:
- state
- number of active nodes
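
For the resource consumption items, a minimal sketch of how the memory figures and the file descriptor count could be collected on Linux. It assumes the collector runs on the same host as searchd and can read `/proc/<pid>`; the function names and output keys are made up for illustration and are not part of the existing exporter. `RssAnon` is only reported by Linux 4.5 and newer.

```python
import os
import re

# Fields of /proc/<pid>/status that map to the memory items listed above.
# The output key names are hypothetical, chosen only for this sketch.
STATUS_FIELDS = {"VmSize": "virt_kb", "VmRSS": "rss_kb", "RssAnon": "anon_rss_kb"}


def parse_proc_status(text):
    """Extract memory figures (in kB) from the text of /proc/<pid>/status."""
    result = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)\s+kB", line)
        if match and match.group(1) in STATUS_FIELDS:
            result[STATUS_FIELDS[match.group(1)]] = int(match.group(2))
    return result


def collect_process_stats(pid):
    """Read memory usage and the open file descriptor count of one process."""
    with open(f"/proc/{pid}/status") as status_file:
        stats = parse_proc_status(status_file.read())
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    stats["open_fds"] = len(os.listdir(f"/proc/{pid}/fd"))
    return stats


if __name__ == "__main__":
    # Demo on the current process; in practice the pid would be searchd's.
    if os.path.isdir("/proc/self"):
        print(collect_process_stats(os.getpid()))
```

Reading another user's `/proc/<pid>/fd` usually requires running as the same user or as root, so the collector would have to run with suitable permissions.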

Possible alerts:

- connection time is greater than N
- reference (benchmark) query time is greater than N (I'm afraid it may be difficult to support this kind of alert out of the box)
- number of crashes (from the searchd log) for the last N minutes is greater than M
- number of binlog files is greater than N
- file descriptors count is greater than N
- anon-rss is greater than N
- non-served tables count is greater than 0
- maxed_out_error_count delta for the last N minutes is greater than M
- agent_retry_count delta for the last N minutes is greater than M
- current_connections_count is greater than N
- slowest_thread_seconds is greater than N
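
A rough sketch of how a few of these alerts could be expressed as a `PrometheusRule` object for the Kube Prometheus Stack. It builds the manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts. The `manticore_` metric names, the thresholds, the object name, and the namespace are all placeholders, not what the exporter actually exposes, and `increase()` assumes the underlying metric is a counter.

```python
import json


def alert(name, expr, duration, summary, severity="warning"):
    """Build one alerting rule in the format used inside PrometheusRule groups."""
    return {
        "alert": name,
        "expr": expr,
        "for": duration,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


def build_prometheus_rule(namespace="monitoring"):
    # Metric names and thresholds are hypothetical placeholders; replace them
    # with the names the exporter really exposes and with sensible limits.
    rules = [
        alert("ManticoreNonServedTables",
              "manticore_non_served_tables_count > 0", "5m",
              "Some tables are not being served"),
        alert("ManticoreMaxedOutErrors",
              "increase(manticore_maxed_out_error_count[10m]) > 5", "1m",
              "maxed_out_error_count grew by more than 5 in 10 minutes"),
        alert("ManticoreTooManyConnections",
              "manticore_current_connections_count > 500", "5m",
              "Too many current connections"),
        alert("ManticoreSlowThread",
              "manticore_slowest_thread_seconds > 30", "1m",
              "A thread has been running for more than 30 seconds"),
    ]
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "manticore-alerts", "namespace": namespace},
        "spec": {"groups": [{"name": "manticore.rules", "rules": rules}]},
    }


if __name__ == "__main__":
    print(json.dumps(build_prometheus_rule(), indent=2))
```

Depending on the chart values, Prometheus may only pick up rule objects that carry a matching label (for example a `release` label), so that would need to be added under `metadata.labels`.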

> Also, would you please review the ongoing PRs?

I'll talk to the developer working on that today.

ksa-real changed the title ~~Create an officially maintained Prometheus rules, Grafana dashboard, and Alerting rules~~ Create an officially maintained Prometheus rules, Grafana dashboard, and Alertmanager alerting rules Aug 4, 2022

ksa-real changed the title ~~Create an officially maintained Prometheus rules, Grafana dashboard, and Alertmanager alerting rules~~ Create an officially maintained Prometheus rules (alerting rules) and Grafana dashboard Aug 4, 2022

slopezz mentioned this issue Feb 22, 2024

Feat/Add manticore prometheus-exporter and stable release v0.8.0 3scale-ops/prometheus-exporter-operator#51

Merged

sanikolaev assigned djklim87 Jan 2, 2025
