This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit 8817fc8

Merge pull request #334 from grafana/playbooks-for-compactor-alerts
Improve compactor alerts and playbooks
2 parents 313b3ec + 700dae2 commit 8817fc8

3 files changed: +27 -29 lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@
 * [CHANGE] Dashboards: added overridable `job_labels` and `cluster_labels` to the configuration object as label lists to uniquely identify jobs and clusters in the metric names and group-by lists in dashboards. #319
 * [CHANGE] Dashboards: `alert_aggregation_labels` has been removed from the configuration and overriding this value has been deprecated. Instead the labels are now defined by the `cluster_labels` list, and should be overridden accordingly through that list. #319
 * [CHANGE] Ingester/Ruler: set `-server.grpc-max-send-msg-size-bytes` and `-server.grpc-max-send-msg-size-bytes` to sensible default values (10MB). #326
+* [CHANGE] Renamed `CortexCompactorHasNotUploadedBlocksSinceStart` to `CortexCompactorHasNotUploadedBlocks`. #334
+* [CHANGE] Renamed `CortexCompactorRunFailed` to `CortexCompactorHasNotSuccessfullyRunCompaction`. #334
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329

cortex-mixin/alerts/compactor.libsonnet

Lines changed: 14 additions & 16 deletions
@@ -47,6 +47,19 @@
             message: 'Cortex Compactor {{ $labels.namespace }}/{{ $labels.instance }} has not run compaction in the last 24 hours.',
           },
         },
+        {
+          // Alert if compactor failed to run 2 consecutive compactions.
+          alert: 'CortexCompactorHasNotSuccessfullyRunCompaction',
+          expr: |||
+            increase(cortex_compactor_runs_failed_total[2h]) >= 2
+          |||,
+          labels: {
+            severity: 'critical',
+          },
+          annotations: {
+            message: 'Cortex Compactor {{ $labels.namespace }}/{{ $labels.instance }} failed to run 2 consecutive compactions.',
+          },
+        },
         {
           // Alert if the compactor has not uploaded anything in the last 24h.
           alert: 'CortexCompactorHasNotUploadedBlocks',
@@ -65,7 +78,7 @@
         },
         {
           // Alert if the compactor has not uploaded anything since its start.
-          alert: 'CortexCompactorHasNotUploadedBlocksSinceStart',
+          alert: 'CortexCompactorHasNotUploadedBlocks',
           'for': '24h',
           expr: |||
             thanos_objstore_bucket_last_successful_upload_time{job=~".+/%(compactor)s"} == 0
@@ -77,21 +90,6 @@
             message: 'Cortex Compactor {{ $labels.namespace }}/{{ $labels.instance }} has not uploaded any block in the last 24 hours.',
           },
         },
-        {
-          // Alert if compactor fails.
-          alert: 'CortexCompactorRunFailed',
-          expr: |||
-            increase(cortex_compactor_runs_failed_total[2h]) >= 2
-          |||,
-          labels: {
-            severity: 'critical',
-          },
-          annotations: {
-            message: |||
-              {{ $labels.job }}/{{ $labels.instance }} failed to run compaction.
-            |||,
-          },
-        },
       ],
     },
   ],
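
For readers unfamiliar with the rule expression in this file, `increase(cortex_compactor_runs_failed_total[2h]) >= 2`, the following is a minimal Python sketch of the condition it approximates. It is a simplification and not how Prometheus computes the value: the real `increase()` also extrapolates to the window boundaries, so it can return non-integer results. The function names `simple_increase` and `should_fire` are hypothetical and exist only for illustration.

```python
from typing import Sequence


def simple_increase(samples: Sequence[float]) -> float:
    """Approximate the growth of a counter across time-ordered samples.

    Assumption: no extrapolation to the window edges, unlike PromQL increase().
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            # Normal case: the counter grew (or stayed flat).
            total += cur - prev
        else:
            # The counter dropped, so treat it as a reset (e.g. a restart)
            # and count the whole current value as new growth.
            total += cur
    return total


def should_fire(samples_last_2h: Sequence[float], threshold: float = 2.0) -> bool:
    """Mirror the shape of `increase(...[2h]) >= 2` on raw samples."""
    return simple_increase(samples_last_2h) >= threshold
```

With this sketch, samples such as `[5, 5, 6, 7]` (two failed runs within the window) satisfy the condition, while `[5, 6, 6]` (one failed run) do not.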

cortex-mixin/docs/playbooks.md

Lines changed: 11 additions & 13 deletions
@@ -272,11 +272,21 @@ Same as [`CortexCompactorHasNotSuccessfullyCleanedUpBlocks`](#CortexCompactorHas
 This alert fires when a Cortex compactor is not uploading any compacted blocks to the storage since a long time.
 
 How to **investigate**:
-- If the alert `CortexCompactorHasNotSuccessfullyRun` or `CortexCompactorHasNotSuccessfullyRunSinceStart` have fired as well, then investigate that issue first
+- If the alert `CortexCompactorHasNotSuccessfullyRunCompaction` has fired as well, then investigate that issue first
 - If the alert `CortexIngesterHasNotShippedBlocks` or `CortexIngesterHasNotShippedBlocksSinceStart` have fired as well, then investigate that issue first
 - Ensure ingesters are successfully shipping blocks to the storage
 - Look for any error in the compactor logs
 
+### CortexCompactorHasNotSuccessfullyRunCompaction
+
+This alert fires if the compactor is not able to successfully compact all discovered compactable blocks (across all tenants).
+
+When this alert fires, the compactor may still have successfully compacted some blocks but, for some reason, other blocks compaction is consistently failing. A common case is when the compactor is trying to compact a corrupted block for a single tenant: in this case the compaction of blocks for other tenants is still working, but compaction for the affected tenant is blocked by the corrupted block.
+
+How to **investigate**:
+- Look for any error in the compactor logs
+  - Corruption: [`not healthy index found`](#compactor-is-failing-because-of-not-healthy-index-found)
+
 #### Compactor is failing because of `not healthy index found`
 
 The compactor may fail to compact blocks due a corrupted block index found in one of the source blocks:
@@ -301,18 +311,6 @@ To rename a block stored on GCS you can use the `gsutil` CLI:
 gsutil mv gs://BUCKET/TENANT/BLOCK gs://BUCKET/TENANT/corrupted-BLOCK
 ```
 
-### CortexCompactorHasNotUploadedBlocksSinceStart
-
-Same as [`CortexCompactorHasNotUploadedBlocks`](#CortexCompactorHasNotUploadedBlocks).
-
-### CortexCompactorHasNotSuccessfullyRunCompaction
-
-_TODO: this playbook has not been written yet._
-
-### CortexCompactorRunFailed
-
-_TODO: this playbook has not been written yet._
-
 ### CortexBucketIndexNotUpdated
 
 This alert fires when the bucket index, for a given tenant, is not updated since a long time. The bucket index is expected to be periodically updated by the compactor and is used by queriers and store-gateways to get an almost-updated view over the bucket store.
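
The playbook context in this diff renames a corrupted block with `gsutil mv gs://BUCKET/TENANT/BLOCK gs://BUCKET/TENANT/corrupted-BLOCK`. As a rough sketch of how that command could be assembled for a given block, the Python helper below builds the argument list without running anything. The function name `corrupted_block_rename_cmd` and the example values are hypothetical. Only the `gsutil mv` form and the `corrupted-` prefix come from the playbook.

```python
from typing import List


def corrupted_block_rename_cmd(bucket: str, tenant: str, block: str) -> List[str]:
    """Build (but do not run) the gsutil command that renames a corrupted block.

    Follows the playbook's pattern:
      gsutil mv gs://BUCKET/TENANT/BLOCK gs://BUCKET/TENANT/corrupted-BLOCK
    """
    for name, value in (("bucket", bucket), ("tenant", tenant), ("block", block)):
        # Reject empty parts and embedded slashes, which would silently
        # point the command at a different object path.
        if not value or "/" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    src = f"gs://{bucket}/{tenant}/{block}"
    dst = f"gs://{bucket}/{tenant}/corrupted-{block}"
    return ["gsutil", "mv", src, dst]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(" ".join(corrupted_block_rename_cmd("my-bucket", "tenant-1", "BLOCK_ID")))
```

Returning the argument list rather than executing it lets an operator review the exact source and destination paths before running the rename.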
