This repository was archived by the owner on Apr 28, 2025. It is now read-only.
Merged
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -17,10 +17,12 @@
 * [CHANGE] Renamed `CortexInconsistentConfig` alert to `CortexInconsistentRuntimeConfig` and increased severity to `critical`. #335
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboards show better detail. #340
+* [CHANGE] Removed `CortexCacheRequestErrors` alert. This alert was not working because the legacy Cortex cache client instrumentation doesn't track errors. #346
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
 * [ENHANCEMENT] cortex-mixin: Added `alert_excluded_routes` config to exclude specific routes from alerts. #338
+* [ENHANCEMENT] Added `CortexMemcachedRequestErrors` alert. #346
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335
14 changes: 7 additions & 7 deletions cortex-mixin/alerts/alerts.libsonnet
@@ -180,20 +180,20 @@
     },
   },
   {
-    alert: 'CortexCacheRequestErrors',
+    alert: 'CortexMemcachedRequestErrors',
     expr: |||
-      100 * sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count{status_code=~"5.."}[1m]))
-        /
-      sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count[1m]))
-        > 1
+      (
+        sum by(%s, name, operation) (rate(thanos_memcached_operation_failures_total[1m])) /
+        sum by(%s, name, operation) (rate(thanos_memcached_operations_total[1m]))
+      ) * 100 > 5
     ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
-    'for': '15m',
+    'for': '5m',
     labels: {
       severity: 'warning',
     },
     annotations: {
       message: |||
-        Cache {{ $labels.method }} is experiencing {{ printf "%.2f" $value }}% errors.
+        Memcached {{ $labels.name }} used by Cortex in {{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
       |||,
     },
   },
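For readers less familiar with Jsonnet string formatting, the sketch below shows how the `%` operator fills both `%s` placeholders in the new expression. The local `config` object is a hypothetical stand-in for `$._config`, and the value `'cluster, namespace'` is an assumption chosen for illustration; a given deployment may use different aggregation labels.

```jsonnet
// Hypothetical stand-in for $._config; the label list is an assumed value.
local config = {
  alert_aggregation_labels: 'cluster, namespace',
};

{
  // Each %s is replaced, in order, by one element of the array on the
  // right-hand side of the % operator, so the same label list is used
  // in the numerator and the denominator.
  expr: |||
    (
      sum by(%s, name, operation) (rate(thanos_memcached_operation_failures_total[1m])) /
      sum by(%s, name, operation) (rate(thanos_memcached_operations_total[1m]))
    ) * 100 > 5
  ||| % [config.alert_aggregation_labels, config.alert_aggregation_labels],
}
```

Evaluating this with the `jsonnet` command-line tool should produce a JSON object whose `expr` field groups both sums by `cluster, namespace, name, operation`. Because the ratio is multiplied by 100, the condition is an error percentage above 5, which must hold for the 5 minutes set in `'for'` before the alert fires.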
28 changes: 26 additions & 2 deletions cortex-mixin/docs/playbooks.md
@@ -414,9 +414,33 @@ _TODO: this playbook has not been written yet._

 _TODO: this playbook has not been written yet._

-### CortexCacheRequestErrors
+### CortexMemcachedRequestErrors

-_TODO: this playbook has not been written yet._
+This alert fires if the Cortex memcached client is experiencing a high error rate for a specific cache and operation.
+
+How to **investigate**:
+- The alert reports which cache is experiencing issues
+  - `metadata-cache`: object store metadata cache
+  - `index-cache`: TSDB index cache
+  - `chunks-cache`: TSDB chunks cache
+- Check which specific error is occurring
+  - Run the following query to find out the reason (replace `<namespace>` with the actual Cortex cluster namespace)
+    ```
+    sum by(name, operation, reason) (rate(thanos_memcached_operation_failures_total{namespace="<namespace>"}[1m])) > 0
+    ```
+- Based on the **`reason`**:
+  - `timeout`
+    - Scale up the memcached replicas
+  - `server-error`
+    - Check both Cortex and memcached logs to find more details
+  - `network-error`
+    - Check Cortex logs to find more details
+  - `malformed-key`
+    - The key is too long or contains invalid characters
+    - Check Cortex logs to find the offending key
+    - Fixing this will require changes to the application code
+  - `other`
+    - Check both Cortex and memcached logs to find more details

 ### CortexOldChunkInMemory
