Create an officially maintained Prometheus rules (alerting rules) and Grafana dashboard #21

Open
ksa-real opened this issue Aug 4, 2022 · 3 comments
@ksa-real
Contributor

ksa-real commented Aug 4, 2022

So far I haven't found one. The proposal is to create an initial version of the above and to dump there some initial knowledge. Since the Manticore authors likely have operational experience with Manticore, the ask is to add some basic monitoring rules and dashboards. Users can help evolve them over time.

ksa-real changed the title from "Create an officially maintained Prometheus rules, Grafana dashboard, and Alerting rules" to "Create an officially maintained Prometheus rules, Grafana dashboard, and Alertmanager alerting rules" on Aug 4, 2022
@sanikolaev
Collaborator

> officially maintained Prometheus rules

> dump there some initial knowledge

What's your opinion on https://github.com/manticoresoftware/manticoresearch-prometheus ? The initial knowledge is actually there.

> Grafana dashboard, and Alertmanager alerting rules

Yes. Would you like to take on this project? The Manticore core team can help by providing info on what typically goes wrong and what might be worth keeping an eye on.

ksa-real changed the title from "Create an officially maintained Prometheus rules, Grafana dashboard, and Alertmanager alerting rules" to "Create an officially maintained Prometheus rules (alerting rules) and Grafana dashboard" on Aug 4, 2022
@ksa-real
Contributor Author

ksa-real commented Aug 4, 2022

Fixed the issue title. Prometheus rules are actually alerting rules.

OK, I can try. Meanwhile, please suggest here a few things what to alert on and what to have in the dashboard. Let's assume the kube-prometheus-stack Helm chart is used. I think VictoriaMetrics can use the same PrometheusRule CRD. I've seen people use libsonnet mixins as a source of truth, but I'm not very experienced with that, so I will likely start with the CRD and migrate to a "better" source of truth if necessary.
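For illustration, a minimal sketch of what a CRD-based starting point could look like, written in Python so the manifest can be generated and checked in. The `apiVersion` and `kind` are those of the Prometheus Operator's PrometheusRule resource; the rule name, the `job="manticore"` label, the namespace, and the `release` label value are assumptions that depend on how the chart and the exporter are deployed.

```python
import json


def alert_rule(name, expr, duration="5m", severity="warning", summary=""):
    """Return one alerting rule in the Prometheus rule-file schema."""
    return {
        "alert": name,
        "expr": expr,
        "for": duration,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


def prometheus_rule(name, namespace, groups, release="kube-prometheus-stack"):
    """Wrap rule groups in a PrometheusRule custom resource."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumption: the operator selects rules by the Helm release label.
            "labels": {"release": release},
        },
        "spec": {"groups": groups},
    }


# Hypothetical job label; adjust to however the exporter is scraped.
instance_down = alert_rule(
    name="ManticoreInstanceDown",
    expr='up{job="manticore"} == 0',
    duration="2m",
    severity="critical",
    summary="Manticore exporter target is down",
)
manifest = prometheus_rule(
    name="manticore-rules",
    namespace="monitoring",
    groups=[{"name": "manticore.availability", "rules": [instance_down]}],
)

# kubectl accepts JSON manifests as well as YAML.
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`. Whether the rule is actually picked up depends on the rule selector configured for the Prometheus instance.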

Also, would you please review the ongoing PRs.

@sanikolaev
Collaborator

sanikolaev commented Aug 4, 2022

> Meanwhile, please suggest here a few things what to alert on and what to have in the dashboard

In addition to https://github.com/manticoresoftware/manticoresearch-prometheus/blob/master/src/Exporter.php here's what else may be important to track / alert on:

  • version:
    • daemon version to see how an upgrade correlates with something else on the graphs
    • columnar library version
    • secondary library version
  • availability:
    • instance is running
  • connectivity:
    • can connect at all
    • connection time
  • resource consumption:
    • searchd's anon-rss (to catch a memleak)
    • searchd's total RSS
    • searchd's VIRT
    • index files:
      • count
      • size
    • binlog files:
      • count
      • size
    • file descriptors count
  • schema:
    • number of tables
    • checksum/generation of all the schemas (to see how changing a schema correlates with something else in Grafana)
    • non-served tables count
  • cluster:
    • state
    • number of active nodes
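Some of the connectivity and resource-consumption items above can be read straight from the operating system on Linux. The following is a rough sketch, not part of the existing exporter: it parses `/proc/<pid>/status` for VIRT (`VmSize`), total RSS (`VmRSS`), and anon-rss (`RssAnon`), counts entries in `/proc/<pid>/fd`, and times a TCP connect. The function names are hypothetical, and the demo at the bottom inspects the current Python process rather than searchd.

```python
import os
import socket
import time

# Fields of /proc/<pid>/status, reported by the kernel in kB.
STATUS_FIELDS = {
    "VmSize": "virt_bytes",
    "VmRSS": "rss_bytes",
    "RssAnon": "anon_rss_bytes",
}


def parse_proc_status(text):
    """Extract VIRT, total RSS, and anonymous RSS (in bytes) from status text."""
    result = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in STATUS_FIELDS:
            result[STATUS_FIELDS[key]] = int(rest.split()[0]) * 1024
    return result


def fd_count(pid):
    """Count open file descriptors of a process (Linux only, needs permission)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def connect_time(host, port, timeout=3.0):
    """Return the TCP connect time in seconds, or None if the connection fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        return None


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as f:
        print(parse_proc_status(f.read()))
    print("open fds:", fd_count(os.getpid()))
    # 9306 is Manticore's default MySQL-protocol port; None means nothing answered.
    print("connect time:", connect_time("127.0.0.1", 9306))
```

A TCP connect only shows that the port accepts connections. A fuller check would also run a trivial query, which is left out here.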

Possible alerts:

  • connection time is greater than N
  • reference (benchmark) query time is greater than N (I'm afraid this kind of alert may be difficult to support out of the box)
  • number of crashes (from the searchd log) in the last N minutes is greater than M
  • number of binlog files is greater than N
  • file descriptors count is greater than N
  • anon-rss is greater than N
  • non-served tables count is greater than 0
  • maxed_out_error_count delta over the last N minutes is greater than M
  • agent_retry_count delta over the last N minutes is greater than M
  • current_connections_count is greater than N
  • slowest_thread_seconds is greater than N
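Several of the alerts above fall into two shapes: "value is greater than N" and "delta over the last N minutes is greater than M". A small sketch of how they might translate into PromQL expressions follows. The `manticore_` metric prefix, the alert names, and every threshold are placeholders rather than values taken from the exporter, and the "delta" alerts assume the underlying metrics are exposed as counters, so `increase()` is used.

```python
import json


def above(metric, limit):
    """PromQL: the current value of a metric exceeds a fixed limit."""
    return f"{metric} > {limit}"


def grew_by(metric, minutes, limit):
    """PromQL: a counter grew by more than `limit` within the last `minutes`."""
    return f"increase({metric}[{minutes}m]) > {limit}"


# Hypothetical metric names and thresholds; tune both per deployment.
ALERTS = {
    "ManticoreNonServedTables": above("manticore_non_served_tables_count", 0),
    "ManticoreTooManyConnections": above("manticore_current_connections_count", 500),
    "ManticoreSlowThread": above("manticore_slowest_thread_seconds", 30),
    "ManticoreMaxedOut": grew_by("manticore_maxed_out_error_count", 5, 10),
    "ManticoreAgentRetries": grew_by("manticore_agent_retry_count", 5, 100),
}

# One rule group in the Prometheus rule-file schema.
group = {
    "name": "manticore.alerts",
    "rules": [
        {"alert": name, "expr": expr, "for": "5m", "labels": {"severity": "warning"}}
        for name, expr in ALERTS.items()
    ],
}

print(json.dumps(group, indent=2))
```

If a metric turns out to be a gauge rather than a counter, `delta()` would be the closer match to the wording in the list.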

> Also, would you please review the ongoing PRs.

I'll talk to the developer working on that today.

3 participants