### CortexIngesterHasNotShippedBlocks

This alert fires when a Cortex ingester is not uploading any block to the long-term storage. An ingester is expected to upload a block to the storage every block range period (defaults to 2h) and if a longer time has elapsed since the last successful upload, it means something is not working correctly.

How to **investigate**:
- Ensure the ingester is receiving write-path traffic (samples to ingest)
- Look for any upload error in the ingester logs (i.e. networking or authentication issues)
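A minimal sketch of these checks, assuming Cortex runs on Kubernetes with ingester pods labelled `name=ingester` in a `cortex` namespace and that the `cortex_ingester_ingested_samples_total` metric is scraped (all names are assumptions, adjust them to your environment):

```bash
# Check the ingesters are receiving write-path traffic. PromQL, to run against
# the Prometheus that scrapes your Cortex cluster:
#   sum by (instance) (rate(cortex_ingester_ingested_samples_total[5m]))

# Look for upload or authentication errors in the ingester logs:
kubectl --namespace cortex logs --tail=10000 --selector name=ingester \
  | grep -iE "upload|auth|error"
```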
How to **investigate**:
- Look for details in the ingester logs

### CortexIngesterTSDBHeadTruncationFailed

This alert fires when a Cortex ingester fails to truncate the TSDB head.

The TSDB head is the in-memory store used to keep series and samples that have not yet been compacted into a block. If head truncation fails for a long time, the ingester memory will keep increasing until the ingester is OOMKilled, and the subsequent ingester restart may take a long time or even go into an OOMKilled crash loop because of the huge WAL to replay. For this reason, it's important to investigate and address the issue as soon as it happens.

How to **investigate**:
- Look for details in the ingester logs
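Since failed head truncations eventually lead to OOMKilled ingesters, it is also worth checking memory usage and past restarts. A minimal sketch, assuming a Kubernetes deployment with ingester pods labelled `name=ingester` in a `cortex` namespace (names are assumptions):

```bash
# Current memory usage of the ingester pods (requires metrics-server):
kubectl --namespace cortex top pods --selector name=ingester

# Check whether an ingester was already restarted or OOMKilled
# (ingester-0 is an example pod name):
kubectl --namespace cortex describe pod ingester-0 | grep -A 5 "Last State"
```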
### CortexIngesterTSDBCheckpointCreationFailed

This alert fires when a Cortex ingester fails to create a TSDB checkpoint.

How to **investigate**:
- Look for details in the ingester logs
- If the checkpoint fails because of a `corruption in segment`, you can restart the ingester, because at the next startup TSDB will try to "repair" it. After the restart, if the issue is repaired and the ingester is running, you should also get paged by `CortexIngesterTSDBWALCorrupted`, signalling that the WAL was corrupted and manual investigation is required.
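For the `corruption in segment` case, a possible sequence could look like the following sketch (pod and namespace names are assumptions, and it assumes ingesters run as a StatefulSet so a deleted pod is recreated automatically):

```bash
# Confirm the checkpoint failure is caused by a corrupted segment:
kubectl --namespace cortex logs ingester-3 | grep -i "corruption in segment"

# Restart the affected ingester, so that TSDB can attempt the repair at startup:
kubectl --namespace cortex delete pod ingester-3

# If the repair ran, expect CortexIngesterTSDBWALCorrupted to fire afterwards.
```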
### CortexIngesterTSDBCheckpointDeletionFailed

This alert fires when a Cortex ingester fails to delete a TSDB checkpoint.

Generally, this is not an urgent issue, but manual investigation is required to find the root cause of the issue and fix it.

How to **investigate**:
- Look for details in the ingester logs

### CortexIngesterTSDBWALTruncationFailed

This alert fires when a Cortex ingester fails to truncate the TSDB WAL.

How to **investigate**:
- Look for details in the ingester logs
### CortexIngesterTSDBWALCorrupted

This alert fires when a Cortex ingester finds a corrupted TSDB WAL (stored on disk) while replaying it at ingester startup.

When this alert fires, the WAL should have been auto-repaired, but manual investigation is required. The WAL repair mechanism causes data loss, because all WAL records after the corrupted segment are discarded and their samples are lost while replaying the WAL. If this issue happens on only 1 ingester, Cortex doesn't suffer any data loss thanks to the replication factor, while if it happens on multiple ingesters some data loss is possible.
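To estimate whether data loss is possible, check how many ingesters report WAL corruption. A minimal sketch, assuming ingester pods labelled `name=ingester` in a `cortex` namespace (label and namespace are assumptions, and the grep pattern is only a heuristic):

```bash
# Count WAL corruption related log lines per ingester: more than one affected
# ingester means some data loss is possible despite the replication factor.
for pod in $(kubectl --namespace cortex get pods --selector name=ingester -o name); do
  echo "== ${pod}"
  kubectl --namespace cortex logs "${pod}" | grep -ci "corrupt" || true
done
```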
### CortexIngesterTSDBWALWritesFailed

This alert fires when a Cortex ingester is failing to log records to the TSDB WAL on disk.

How to **investigate**:
- Look for details in the ingester logs
### CortexQuerierHasNotScanTheBucket

This alert fires when a Cortex querier is not successfully scanning blocks in the storage (bucket). A querier is expected to periodically iterate the bucket to find new and deleted blocks (defaults to every 5m) and if it hasn't successfully synced the bucket for a long time, it may end up querying only a subset of blocks, thus leading to potentially partial results.

How to **investigate**:
- Look for any scan error in the querier logs (i.e. networking or rate limiting issues)
### CortexQuerierHighRefetchRate

This alert fires when there's a high number of queries for which series have been refetched from a different store-gateway because of missing blocks. This could happen for a short time whenever a store-gateway ring resharding occurs (e.g. during/after an outage or while rolling out store-gateways), but store-gateways should reconcile in a short time. This alert fires if the issue persists for an unexpectedly long time, and thus it should be investigated.

How to **investigate**:
- Ensure there are no errors related to blocks scan or sync in the queriers and store-gateways
- Check store-gateway logs to see if all store-gateways have successfully completed a blocks sync
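A minimal sketch for these checks, assuming querier and store-gateway pods labelled `name=querier` and `name=store-gateway` in a `cortex` namespace (names are assumptions, and the grep patterns are only heuristics):

```bash
# Look for blocks scan/sync errors in the queriers and store-gateways:
kubectl --namespace cortex logs --tail=10000 --selector name=querier \
  | grep -iE "scan|sync" | grep -i "error"
kubectl --namespace cortex logs --tail=10000 --selector name=store-gateway \
  | grep -iE "scan|sync" | grep -i "error"

# Check whether every store-gateway logged a recent successful blocks sync:
kubectl --namespace cortex logs --tail=10000 --selector name=store-gateway \
  | grep -i "sync"
```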
### CortexStoreGatewayHasNotSyncTheBucket

This alert fires when a Cortex store-gateway is not successfully scanning blocks in the storage (bucket). A store-gateway is expected to periodically iterate the bucket to find new and deleted blocks (defaults to every 5m) and if it hasn't successfully synced the bucket for a long time, it may end up querying only a subset of blocks, thus leading to potentially partial results.

How to **investigate**:
- Look for any scan error in the store-gateway logs (i.e. networking or rate limiting issues)
### CortexCompactorHasNotSuccessfullyCleanedUpBlocks

This alert fires when a Cortex compactor is not successfully deleting blocks marked for deletion for a long time.

How to **investigate**:
- Ensure the compactor is not crashing during compaction (i.e. `OOMKilled`)
- Look for any error in the compactor logs (i.e. bucket Delete API errors)
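A minimal sketch of these checks, assuming compactor pods labelled `name=compactor` in a `cortex` namespace (names are assumptions):

```bash
# Check for compactor crashes/OOMKills (look at the RESTARTS column):
kubectl --namespace cortex get pods --selector name=compactor

# Look for errors (e.g. failed bucket Delete API calls) in the compactor logs:
kubectl --namespace cortex logs --tail=10000 --selector name=compactor \
  | grep -iE "error|delete"
```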
### CortexCompactorHasNotUploadedBlocks

This alert fires when a Cortex compactor is not uploading any compacted blocks to the storage for a long time.

How to **investigate**:
- If the alert `CortexCompactorHasNotSuccessfullyRun` or `CortexCompactorHasNotSuccessfullyRunSinceStart` has fired as well, then investigate that issue first
- If the alert `CortexIngesterHasNotShippedBlocks` or `CortexIngesterHasNotShippedBlocksSinceStart` has fired as well, then investigate that issue first
- Ensure ingesters are successfully shipping blocks to the storage