This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Contributor

@simonswine simonswine commented Jul 21, 2020

This allows monitoring the roll-out of new config file versions to the various members of a cluster.

The metric itself was added as part of cortexproject/cortex#2874.

Ideally someone double-checks the filters I am using. I am also not 100% sure whether all nodes of a cluster typically have the exact same config, or whether the period of 1 hour is reasonable.

This doesn't alert on runtime config divergence.

[Screenshots: scrn-2020-07-21-17-11-55, scrn-2020-07-21-17-11-45]

@simonswine simonswine requested a review from a team as a code owner July 21, 2020 16:27
Contributor

@cstyan cstyan left a comment

Do both the alert and dashboard work as expected for k8s clusters that have multiple Cortex clusters/namespaces?

Others would be able to give you more info about how often we have intentional config drift. It might be useful to include another label, such as namespace, in the message. Depending on the tool you're viewing the alert in, you might not see all the labels associated with the alert right away, so more info in the message can be handy.

@simonswine
Contributor Author

I think so, as the dashboard can be filtered by cluster and namespace through .addClusterSelectorTemplates() and the alert is using the $._config.alert_aggregation_labels.

I have tested a similar query (for the alert) with cortex_build_info, as the hash metric is not yet deployed anywhere.
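As a rough illustration of how the aggregation labels end up in the alert expression, here is a minimal Python sketch of the `%s` substitution. The value `"cluster, namespace"` is an assumption standing in for `$._config.alert_aggregation_labels`; the real value depends on the mixin configuration.

```python
# Assumed stand-in for $._config.alert_aggregation_labels (hypothetical value).
alert_aggregation_labels = "cluster, namespace"

# Expression template as quoted in the alert under review.
expr_template = (
    "count(count by(%s, job, sha256) (cortex_config_hash)) without(sha256) > 1"
)

# Substitute the labels into the template with the % format operator.
expr = expr_template % alert_aggregation_labels
print(expr)
```

With these assumed labels, every group the alert evaluates is scoped to one cluster, namespace, and job.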

Collaborator

@pracucci pracucci left a comment

👋 @simonswine. I left a few comments. Feel free to reach out directly on Slack if you have specific questions about the feedback left here.

This allows monitoring the roll-out of new config file versions to the
various nodes of a cluster. The metric was added as part of
cortexproject/cortex#2874.
@simonswine simonswine force-pushed the add-config-hash-alert-and-dashboard branch from 0596cac to 6c39083 Compare July 28, 2020 13:13
Contributor

@annanay25 annanay25 left a comment

lgtm, thanks @simonswine!

{
alert: 'CortexInconsistentConfig',
expr: |||
count(count by(%s, job, sha256) (cortex_config_hash)) without(sha256) > 1
Collaborator

Isn't this always > 1 when there are multiple jobs?

Contributor Author

I don't think so (maybe I am making a mistake here).

I think the first count checks for series with different hashes and the second count filters out the ones from different clusters. This unit test passes (maybe you can break it 🙂):

jsonnet -S alerts.jsonnet > alerts.yaml

cat > unit_test.yaml <<EOF
rule_files:
  - alerts.yaml
evaluation_interval: 1m
tests:
 - interval: 1m
   input_series:
    - series: 'cortex_config_hash{sha256="aa",job="customer-a/cortex", cluster="customer-a", namespace="customer-a", instance="host-a1"}'
      values: '1+0x100'
    - series: 'cortex_config_hash{sha256="bb",job="customer-a/cortex", cluster="customer-a", namespace="customer-a", instance="host-a2"}'
      values: '1+0x100'
    - series: 'cortex_config_hash{sha256="cc",job="customer-b/cortex", cluster="customer-b", namespace="customer-b", instance="host-b1"}'
      values: '1+0x100'
    - series: 'cortex_config_hash{sha256="cc",job="customer-b/cortex", cluster="customer-b", namespace="customer-b", instance="host-b2"}'
      values: '1+0x100'
   alert_rule_test:
    - alertname: CortexInconsistentConfig
      eval_time: 59m
    - alertname: CortexInconsistentConfig
      eval_time: 60m
      exp_alerts:
       - exp_labels:
           severity: warning
           job: customer-a/cortex
           cluster: customer-a
           namespace: customer-a
         exp_annotations:
           message: "An inconsistent config file hash is used across cluster customer-a/cortex.\n"
EOF

promtool test rules unit_test.yaml
Unit Testing:  unit_test.yaml
  SUCCESS
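
To make the two-level count easier to follow, here is a minimal Python sketch that mimics the nested aggregation on the label sets from the unit test above. It is not how Prometheus evaluates the query. `AGGREGATION_LABELS` is an assumed stand-in for `$._config.alert_aggregation_labels`, and the `multi_job_series` data is hypothetical, added only to probe the "multiple jobs" question.

```python
from collections import Counter

# Assumed stand-in for $._config.alert_aggregation_labels.
AGGREGATION_LABELS = ("cluster", "namespace")


def count_by(series, labels):
    """Mimic `count by(labels)`: one output series per distinct label combination."""
    counts = Counter(tuple((name, s[name]) for name in labels) for s in series)
    return [dict(key) for key in counts]


def count_without(series, dropped):
    """Mimic `count without(dropped)`: group by every label that is not dropped."""
    return Counter(
        tuple(sorted((k, v) for k, v in s.items() if k not in dropped))
        for s in series
    )


def firing_groups(series):
    """Return the label sets for which more than one distinct sha256 exists."""
    inner = count_by(series, AGGREGATION_LABELS + ("job", "sha256"))
    outer = count_without(inner, {"sha256"})
    return [dict(key) for key, n in outer.items() if n > 1]


def make(sha, job, cluster, instance):
    # In the unit test, namespace carries the same value as cluster.
    return {"sha256": sha, "job": job, "cluster": cluster,
            "namespace": cluster, "instance": instance}


# Label sets copied from the unit test's input_series.
test_series = [
    make("aa", "customer-a/cortex", "customer-a", "host-a1"),
    make("bb", "customer-a/cortex", "customer-a", "host-a2"),
    make("cc", "customer-b/cortex", "customer-b", "host-b1"),
    make("cc", "customer-b/cortex", "customer-b", "host-b2"),
]

# Hypothetical data: two jobs in one cluster, each internally consistent.
multi_job_series = [
    make("dd", "ns/ingester", "ns", "i1"),
    make("dd", "ns/ingester", "ns", "i2"),
    make("ee", "ns/querier", "ns", "q1"),
]

print(firing_groups(test_series))
print(firing_groups(multi_job_series))
```

In this sketch only the customer-a group is returned for the unit test data, and nothing is returned for the hypothetical multi-job data, because `job` is part of the inner grouping key and different jobs are therefore never compared with each other.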

Contributor

I ran the following query against our local environment to verify the logic of this query:

count(count by(cluster, job, sha256) (cortex_runtime_config_hash)) without (sha256)

It came back looking sane to me.

Contributor

@jtlisi jtlisi left a comment

LGTM

@simonswine simonswine merged commit e15a730 into grafana:master Jul 31, 2020
simonswine added a commit to grafana/mimir that referenced this pull request Oct 18, 2021
…fig-hash-alert-and-dashboard

Add alert and dashboard for config file hashes