Update alert to be generic to KV stores

pracucci · pracucci · commit df3256e6cc2d · 2021-10-14T09:45:09.000+02:00
Signed-off-by: Marco Pracucci <marco@pracucci.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -64,7 +64,7 @@
 * [ENHANCEMENT] Add support for running Alertmanager in sharding mode. #394
 * [ENHANCEMENT] Allow to customize PromQL engine settings via `queryEngineConfig`. #399
 * [ENHANCEMENT] Add recording rules to improve responsiveness of Alertmanager dashboard. #387
-* [ENHANCEMENT] Added `CortexFailingToTalkToConsul` alert. #406
+* [ENHANCEMENT] Added `CortexKVStoreFailure` alert. #406
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335
diff --git a/cortex-mixin/alerts/alerts.libsonnet b/cortex-mixin/alerts/alerts.libsonnet
@@ -235,6 +235,27 @@
             |||,
           },
         },
+        {
+          alert: 'CortexKVStoreFailure',
+          expr: |||
+            (
+              sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
+              /
+              sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
+            )
+            # We want to get alerted only in case there's a constant failure.
+            == 1
+          ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+          'for': '5m',
+          labels: {
+            severity: 'warning',
+          },
+          annotations: {
+            message: |||
+              Cortex {{ $labels.pod }} in  %(alert_aggregation_variables)s is failing to talk to the KV store {{ $labels.kv_name }}.
+            ||| % $._config,
+          },
+        },
         {
           alert: 'CortexMemoryMapAreasTooHigh',
           expr: |||
@@ -654,31 +675,5 @@
         },
       ],
     },
-    {
-      name: 'cortex-consul-alerts',
-      rules: [
-        {
-          alert: 'CortexFailingToTalkToConsul',
-          expr: |||
-            (
-              sum by(%s, pod, status_code, kv_name) (rate(cortex_consul_request_duration_seconds_count{status_code!~"2.+"}[1m]))
-              /
-              sum by(%s, pod, status_code, kv_name) (rate(cortex_consul_request_duration_seconds_count[1m]))
-            )
-            # We want to get alerted only in case there's a constant failure.
-            == 1
-          ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
-          'for': '5m',
-          labels: {
-            severity: 'warning',
-          },
-          annotations: {
-            message: |||
-              Cortex {{ $labels.pod }} in  %(alert_aggregation_variables)s is failing to talk to Consul store {{ $labels.kv_name }}.
-            ||| % $._config,
-          },
-        },
-      ],
-    },
   ],
 }
diff --git a/cortex-mixin/docs/playbooks.md b/cortex-mixin/docs/playbooks.md
@@ -724,17 +724,19 @@ When an alertmanager cannot read the state for a tenant from storage it gets log
 - The state could not be merged because it might be invalid and could not be decoded. This could indicate data corruption and therefore a bug in the reading or writing of the state, and would need further investigation.
 - The state could not be read from storage. This could be due to a networking issue such as a timeout or an authentication and authorization issue with the remote object store.
 
-### CortexFailingToTalkToConsul
+### CortexKVStoreFailure
 
-This alert fires if a Cortex instance is failing to run any operation on Consul.
+This alert fires if a Cortex instance is failing to run any operation on a KV store (e.g. Consul or Etcd).
 
 How it **works**:
 - Consul is typically used to store the hash ring state.
-- If an instance is failing to talk to Consul, either the instance can't update the heartbeat in the ring or is failing to receive ring updates.
+- Etcd is typically used by the HA tracker (distributor) to deduplicate samples.
+- If an instance is failing operations on the **hash ring**, either the instance can't update the heartbeat in the ring or is failing to receive ring updates.
+- If an instance is failing operations on the **HA tracker** backend, either the instance can't update the authoritative replica or is failing to receive updates.
 
 How to **investigate**:
-- Ensure Consul is up and running.
-- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul.
+- Ensure Consul/Etcd is up and running.
+- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul/Etcd.
 
 ## Cortex routes by path