This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit df3256e

Update alert to be generic to KV stores
Signed-off-by: Marco Pracucci <[email protected]>
1 parent f415acc commit df3256e

3 files changed, +29 -32 lines changed

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@
 * [ENHANCEMENT] Add support for running Alertmanager in sharding mode. #394
 * [ENHANCEMENT] Allow to customize PromQL engine settings via `queryEngineConfig`. #399
 * [ENHANCEMENT] Add recording rules to improve responsiveness of Alertmanager dashboard. #387
-* [ENHANCEMENT] Added `CortexFailingToTalkToConsul` alert. #406
+* [ENHANCEMENT] Added `CortexKVStoreFailure` alert. #406
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 21 additions & 26 deletions
@@ -235,6 +235,27 @@
             |||,
           },
         },
+        {
+          alert: 'CortexKVStoreFailure',
+          expr: |||
+            (
+              sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
+              /
+              sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
+            )
+            # We want to get alerted only in case there's a constant failure.
+            == 1
+          ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+          'for': '5m',
+          labels: {
+            severity: 'warning',
+          },
+          annotations: {
+            message: |||
+              Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s is failing to talk to the KV store {{ $labels.kv_name }}.
+            ||| % $._config,
+          },
+        },
         {
           alert: 'CortexMemoryMapAreasTooHigh',
           expr: |||
@@ -654,31 +675,5 @@
         },
       ],
     },
-    {
-      name: 'cortex-consul-alerts',
-      rules: [
-        {
-          alert: 'CortexFailingToTalkToConsul',
-          expr: |||
-            (
-              sum by(%s, pod, status_code, kv_name) (rate(cortex_consul_request_duration_seconds_count{status_code!~"2.+"}[1m]))
-              /
-              sum by(%s, pod, status_code, kv_name) (rate(cortex_consul_request_duration_seconds_count[1m]))
-            )
-            # We want to get alerted only in case there's a constant failure.
-            == 1
-          ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
-          'for': '5m',
-          labels: {
-            severity: 'warning',
-          },
-          annotations: {
-            message: |||
-              Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s is failing to talk to Consul store {{ $labels.kv_name }}.
-            ||| % $._config,
-          },
-        },
-      ],
-    },
   ],
 }
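
Reviewer note on the `CortexKVStoreFailure` expression: the following is a minimal Python sketch of how the division and the `== 1` filter appear to evaluate, assuming PromQL's default one-to-one vector matching on identical label sets. The function name `firing_series`, the label tuple order and the sample rates are hypothetical and only illustrate the matching. Because `status_code` is a grouping label on both sides, each non-2xx series is divided by the series with the same `status_code`, so the ratio is 1 whenever that status code has a non-zero rate. The `for: 5m` clause then requires this to hold on every evaluation for five minutes.

```python
import math
import re


def firing_series(rates):
    """Return the label sets that survive the `== 1` filter.

    `rates` maps (cluster, namespace, pod, status_code, kv_name) to the
    already aggregated per-second request rate for that label set.
    """
    # Left-hand side: only series whose status_code does NOT match "2.+".
    # PromQL regex matchers are fully anchored, hence fullmatch.
    numerator = {
        labels: rate
        for labels, rate in rates.items()
        if not re.fullmatch(r"2.+", labels[3])
    }
    # Right-hand side: every series, still split by status_code.
    denominator = dict(rates)

    firing = []
    for labels, failed in numerator.items():
        total = denominator.get(labels)
        if total is None:
            continue  # no series with identical labels on the right-hand side
        ratio = failed / total if total != 0 else math.nan  # 0/0 is NaN in PromQL
        if ratio == 1:
            firing.append(labels)
    return firing


# Hypothetical sample: ingester-0 fails 10% of its KV requests,
# ingester-1 has a 500 series that is currently idle.
sample = {
    ("c", "ns", "ingester-0", "200", "ring"): 9.0,
    ("c", "ns", "ingester-0", "500", "ring"): 1.0,
    ("c", "ns", "ingester-1", "200", "ring"): 5.0,
    ("c", "ns", "ingester-1", "500", "ring"): 0.0,
}
result = firing_series(sample)
```

In this sample only the `ingester-0` series with `status_code="500"` passes the filter, even though most of that pod's requests succeed. Under this reading, "constant failure" means failures of a given status code being continuously present, not every request failing.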

cortex-mixin/docs/playbooks.md

Lines changed: 7 additions & 5 deletions
@@ -724,17 +724,19 @@ When an alertmanager cannot read the state for a tenant from storage it gets log
 - The state could not be merged because it might be invalid and could not be decoded. This could indicate data corruption and therefore a bug in the reading or writing of the state, and would need further investigation.
 - The state could not be read from storage. This could be due to a networking issue such as a timeout or an authentication and authorization issue with the remote object store.
 
-### CortexFailingToTalkToConsul
+### CortexKVStoreFailure
 
-This alert fires if a Cortex instance is failing to run any operation on Consul.
+This alert fires if a Cortex instance is failing to run any operation on a KV store (eg. consul or etcd).
 
 How it **works**:
 - Consul is typically used to store the hash ring state.
-- If an instance is failing to talk to Consul, either the instance can't update the heartbeat in the ring or is failing to receive ring updates.
+- Etcd is typically used to store by the HA tracker (distributor) to deduplicate samples.
+- If an instance is failing operations on the **hash ring**, either the instance can't update the heartbeat in the ring or is failing to receive ring updates.
+- If an instance is failing operations on the **HA tracker** backend, either the instance can't update the authoritative replica or is failing to receive updates.
 
 How to **investigate**:
-- Ensure Consul is up and running.
-- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul.
+- Ensure Consul/Etcd is up and running.
+- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul/Etcd.
 
 ## Cortex routes by path
