Add alert and dashboard for config file hashes #146

simonswine · 2020-07-21T16:27:24Z

This makes it possible to monitor the rollout of a new config file version to the various members of a cluster.

The metric itself was added as part of cortexproject/cortex#2874.

Ideally someone double-checks the filters I am using. I am also not 100% sure whether all nodes of a cluster typically have the exact same config, or whether the period of 1 hour is reasonable.

This doesn't alert on runtime config divergence.

cstyan

Do both the alert and the dashboard work as expected for k8s clusters that have multiple Cortex clusters/namespaces?

Others would be able to give you more information about how often we have intentional config drift. It might be useful to include another label, such as namespace, in the message. Depending on the tool you're viewing the alert in, you might not get to see all the labels associated with the alert right away, so more information in the message can be handy.

simonswine · 2020-07-22T11:26:55Z

I think so, as the dashboard can be filtered by cluster and namespace through .addClusterSelectorTemplates() and the alert is using the $._config.alert_aggregation_labels.

I have tested a similar query (for the alert) with cortex_build_info, as the hash metric is not yet deployed anywhere.
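To illustrate how the aggregation labels end up in the alert query, here is a minimal Python sketch of the `%s` substitution. The template string is the expression from the alert in this PR; the label value `cluster, namespace` and the helper name `render_expr` are assumptions for illustration only, and the real value comes from `$._config.alert_aggregation_labels` in the mixin.

```python
# Expression template from the alert in this PR; %s is the placeholder
# that the mixin fills with its aggregation labels.
EXPR_TEMPLATE = (
    "count(count by(%s, job, sha256) (cortex_config_hash)) without(sha256) > 1"
)


def render_expr(aggregation_labels: str) -> str:
    """Fill the placeholder (hypothetical helper, not part of the mixin)."""
    return EXPR_TEMPLATE % aggregation_labels


# Assumption: the deployment aggregates by cluster and namespace.
print(render_expr("cluster, namespace"))
```

With those assumed labels, the inner `count by` keeps cluster and namespace, so each Cortex cluster/namespace pair is evaluated on its own.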

pracucci

👋 @simonswine. I left a few comments. Feel free to reach out directly on Slack if you have specific questions about the feedback left here.

cortex-mixin/alerts/alerts.libsonnet

cortex-mixin/dashboards/config.libsonnet

This makes it possible to monitor the rollout of new config file versions to the various nodes of a cluster. The metric was added as part of cortexproject/cortex#2874.

annanay25

lgtm, thanks @simonswine!

pracucci · 2020-07-29T10:49:59Z

cortex-mixin/alerts/alerts.libsonnet

+        {
+          alert: 'CortexInconsistentConfig',
+          expr: |||
+            count(count by(%s, job, sha256) (cortex_config_hash)) without(sha256) > 1


Isn't this always > 1 when there are multiple jobs?

simonswine · 2020-07-29

I don't think so (though I may be making a mistake here).

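I think the first (inner) count collapses the series into one per distinct hash within each cluster and job, and the second (outer) count then counts those hashes per cluster and job, so series from different clusters or jobs are never compared with each other. This unit test passes (maybe you can break it 🙂):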

    jsonnet -S alerts.jsonnet > alerts.yaml
    cat > unit_test.yaml <<EOF
    rule_files:
      - alerts.yaml
    evaluation_interval: 1m
    tests:
      - interval: 1m
        input_series:
          - series: 'cortex_config_hash{sha256="aa", job="customer-a/cortex", cluster="customer-a", namespace="customer-a", instance="host-a1"}'
            values: '1+0x100'
          - series: 'cortex_config_hash{sha256="bb", job="customer-a/cortex", cluster="customer-a", namespace="customer-a", instance="host-a2"}'
            values: '1+0x100'
          - series: 'cortex_config_hash{sha256="cc", job="customer-b/cortex", cluster="customer-b", namespace="customer-b", instance="host-b1"}'
            values: '1+0x100'
          - series: 'cortex_config_hash{sha256="cc", job="customer-b/cortex", cluster="customer-b", namespace="customer-b", instance="host-b2"}'
            values: '1+0x100'
        alert_rule_test:
          - alertname: CortexInconsistentConfig
            eval_time: 59m
          - alertname: CortexInconsistentConfig
            eval_time: 60m
            exp_alerts:
              - exp_labels:
                  severity: warning
                  job: customer-a/cortex
                  cluster: customer-a
                  namespace: customer-a
                exp_annotations:
                  message: "An inconsistent config file hash is used across cluster customer-a/cortex.\n"
    EOF
    promtool test rules unit_test.yaml

    Unit Testing:
      unit_test.yaml
      SUCCESS
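The same reasoning can be checked outside Prometheus. Below is a minimal Python sketch that mimics the two nested counts on the four input series from the unit test. It assumes that the aggregation labels substituted for `%s` are `cluster` and `namespace`; the function name `inconsistent_groups` is hypothetical, and the sketch models only the grouping logic, not PromQL evaluation or the alert's hold period.

```python
from collections import defaultdict

# Label sets of the four input series used in the unit test.
SERIES = [
    {"sha256": "aa", "job": "customer-a/cortex", "cluster": "customer-a",
     "namespace": "customer-a", "instance": "host-a1"},
    {"sha256": "bb", "job": "customer-a/cortex", "cluster": "customer-a",
     "namespace": "customer-a", "instance": "host-a2"},
    {"sha256": "cc", "job": "customer-b/cortex", "cluster": "customer-b",
     "namespace": "customer-b", "instance": "host-b1"},
    {"sha256": "cc", "job": "customer-b/cortex", "cluster": "customer-b",
     "namespace": "customer-b", "instance": "host-b2"},
]

# Assumption: %s expands to "cluster, namespace"; sha256 is kept last so the
# outer step can drop it by slicing.
GROUP_LABELS = ("cluster", "namespace", "job", "sha256")


def inconsistent_groups(series):
    """Return {(cluster, namespace, job): distinct_hash_count} where count > 1."""
    # Inner count by(cluster, namespace, job, sha256): instances collapse,
    # leaving one entry per distinct hash within each group.
    inner = defaultdict(int)
    for labels in series:
        inner[tuple(labels[name] for name in GROUP_LABELS)] += 1

    # Outer count without(sha256): number of distinct hashes per group.
    outer = defaultdict(int)
    for key in inner:
        outer[key[:-1]] += 1

    # The "> 1" comparison keeps only groups that run more than one hash.
    return {group: n for group, n in outer.items() if n > 1}


print(inconsistent_groups(SERIES))
# {('customer-a', 'customer-a', 'customer-a/cortex'): 2}
```

Because `job` is part of both groupings, two jobs that each run a single hash produce two groups with a count of 1, so having multiple jobs does not by itself satisfy `> 1`.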

jtlisi · 2020-07-30

I ran the following query against our local environment to verify the logic of this query:

count(count by(cluster, job, sha256) (cortex_runtime_config_hash)) without (sha256)

It came back looking sane to me.

jtlisi

LGTM


simonswine requested a review from a team as a code owner July 21, 2020 16:27

cstyan reviewed Jul 21, 2020


pracucci reviewed Jul 22, 2020


Add alert and dashboard using config file hashes

6c39083


simonswine force-pushed the add-config-hash-alert-and-dashboard branch from 0596cac to 6c39083 Compare July 28, 2020 13:13

annanay25 approved these changes Jul 29, 2020


pracucci reviewed Jul 29, 2020


jtlisi approved these changes Jul 30, 2020


simonswine merged commit e15a730 into grafana:master Jul 31, 2020

simonswine added a commit to grafana/mimir that referenced this pull request Oct 18, 2021

Merge pull request grafana/cortex-jsonnet#146 from simonswine/add-con…

8ab4632

…fig-hash-alert-and-dashboard Add alert and dashboard for config file hashes
