Add alert and dashboard for config file hashes #146
Do both the alert and dashboard work as expected for k8s clusters that have multiple Cortex cluster/namespaces?
Others can tell you more about how often we have intentional config drift. It might be useful to include another label, such as namespace, in the message. Depending on the tool you view the alert in, you might not see all of the alert's labels right away, so more information in the message can be handy.
I think so, as the dashboard can be filtered by cluster and namespace through I have tested a similar query (for the alert) with |
👋 @simonswine. I left a few comments. Feel free to reach out directly on Slack if you have specific questions about the feedback left here.
This allows monitoring the rollout of new config file versions to the various nodes of a cluster. The metric was added as part of cortexproject/cortex#2874.
0596cac to 6c39083
lgtm, thanks @simonswine!
    {
      alert: 'CortexInconsistentConfig',
      expr: |||
        count(count by(%s, job, sha256) (cortex_config_hash)) without(sha256) > 1
Isn't this always > 1 when there are multiple jobs?
I don't think so (maybe I am making a mistake here).
I think the first count checks for series with different hashes and the second count filters out the ones from different clusters. This unit test passes (maybe you can break it 🙂):
jsonnet -S alerts.jsonnet > alerts.yaml
cat > unit_test.yaml <<EOF
rule_files:
  - alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'cortex_config_hash{sha256="aa",job="customer-a/cortex", cluster="customer-a", namespace="customer-a", instance="host-a1"}'
        values: '1+0x100'
      - series: 'cortex_config_hash{sha256="bb",job="customer-a/cortex", cluster="customer-a", namespace="customer-a", instance="host-a2"}'
        values: '1+0x100'
      - series: 'cortex_config_hash{sha256="cc",job="customer-b/cortex", cluster="customer-b", namespace="customer-b", instance="host-b1"}'
        values: '1+0x100'
      - series: 'cortex_config_hash{sha256="cc",job="customer-b/cortex", cluster="customer-b", namespace="customer-b", instance="host-b2"}'
        values: '1+0x100'
    alert_rule_test:
      - alertname: CortexInconsistentConfig
        eval_time: 59m
      - alertname: CortexInconsistentConfig
        eval_time: 60m
        exp_alerts:
          - exp_labels:
              severity: warning
              job: customer-a/cortex
              cluster: customer-a
              namespace: customer-a
            exp_annotations:
              message: "An inconsistent config file hash is used across cluster customer-a/cortex.\n"
EOF
promtool test rules unit_test.yaml
Unit Testing: unit_test.yaml
SUCCESS
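For anyone who wants to poke at the nested count without a Prometheus instance, here is a rough Python model of the expression, run against the same four series as the unit test. It is only a sketch: the helper names `distinct_hashes_per_group` and `firing_groups` are made up for illustration, and it assumes the `%s` placeholder expands to `cluster, namespace` (inferred from the expected labels in the test, not shown in the diff). It ignores sample values and the alert's `for` duration.

```python
from collections import defaultdict

# Input series from the unit test, reduced to their label sets.
SERIES = [
    {"sha256": "aa", "job": "customer-a/cortex", "cluster": "customer-a",
     "namespace": "customer-a", "instance": "host-a1"},
    {"sha256": "bb", "job": "customer-a/cortex", "cluster": "customer-a",
     "namespace": "customer-a", "instance": "host-a2"},
    {"sha256": "cc", "job": "customer-b/cortex", "cluster": "customer-b",
     "namespace": "customer-b", "instance": "host-b1"},
    {"sha256": "cc", "job": "customer-b/cortex", "cluster": "customer-b",
     "namespace": "customer-b", "instance": "host-b2"},
]

# Assumption: %s in the rule expands to "cluster, namespace".
GROUP_LABELS = ("cluster", "namespace", "job")


def distinct_hashes_per_group(series, group_labels=GROUP_LABELS):
    """Model count(count by(<group>, sha256)(metric)) without(sha256)."""
    # Inner count: instances collapse, leaving one row per (group, sha256).
    inner = {tuple(s[l] for l in group_labels) + (s["sha256"],) for s in series}
    # Outer count: drop sha256 and count the remaining rows per group.
    outer = defaultdict(int)
    for row in inner:
        outer[row[:-1]] += 1
    return dict(outer)


def firing_groups(series, group_labels=GROUP_LABELS):
    """Groups for which the expression '> 1' would return a sample."""
    counts = distinct_hashes_per_group(series, group_labels)
    return sorted(group for group, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(firing_groups(SERIES))
    # prints [('customer-a', 'customer-a', 'customer-a/cortex')]
```

Because `job` stays in the grouping key of the outer count, two jobs in the same cluster that each use their own (internally consistent) hash end up in separate groups with a count of 1 each, so in this model the expression is not automatically above 1 when there are multiple jobs.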
I ran the following query against our local environment to verify the logic of this query:
count(count by(cluster, job, sha256) (cortex_runtime_config_hash)) without (sha256)
It came back looking sane to me
LGTM
…fig-hash-alert-and-dashboard Add alert and dashboard for config file hashes
This allows monitoring the rollout of new config file versions to the various members of a cluster.
The metric itself was added as part of cortexproject/cortex#2874.
Ideally someone double-checks the filters I am using. I am also not 100% sure whether all nodes of a cluster typically have exactly the same config, or whether the period of 1 hour is reasonable.
This doesn't alert on runtime config divergence