
Commit 025285e

simonswine, pracucci, gouthamve, replay, and tomwilkie authored
Import cortex mixin from upstream (#373)
* Increased CortexAllocatingTooMuchMemory alert threshold Signed-off-by: Marco Pracucci <[email protected]> * Add alert for etcd memory limits close Signed-off-by: Goutham Veeramachaneni <[email protected]> * the distributor now supports push via GRPC (grafana/cortex-jsonnet#266) Signed-off-by: Mauro Stettler <[email protected]> * Fixed CortexQuerierHighRefetchRate alert Signed-off-by: Marco Pracucci <[email protected]> * Fixed label matcher Signed-off-by: Marco Pracucci <[email protected]> * Sort legend descending in the CPU/memory panels Signed-off-by: Marco Pracucci <[email protected]> * Add slow queries dashboard Signed-off-by: Marco Pracucci <[email protected]> * Added tenant ID field to the table Signed-off-by: Marco Pracucci <[email protected]> * Add recording rules to calculate Cortex scaling - Update dashboard so it only shows under provisioned services and why - Add sizing rules based on limits. - Add some docs to the dashboard. Signed-off-by: Tom Wilkie <[email protected]> * Increased CortexRequestErrors alert severity Signed-off-by: Marco Pracucci <[email protected]> * Fixed "Disk Writes" and "Disk Reads" panels Signed-off-by: Marco Pracucci <[email protected]> * Pre-compute aggregations to optimize scaling recording rules Signed-off-by: Marco Pracucci <[email protected]> * Removed 5m step from subquery Signed-off-by: Marco Pracucci <[email protected]> * Add function to customize compactor statefulset Signed-off-by: Marco Pracucci <[email protected]> * Use the job name in compactor alerts too Signed-off-by: Marco Pracucci <[email protected]> * Fixed CortexCompactorRunFailed threshold Signed-off-by: Marco Pracucci <[email protected]> * Added Cortex Rollout progress dashboard Signed-off-by: Marco Pracucci <[email protected]> * Fix 'Unhealthy pods' in Cortex Rollout dashboard Signed-off-by: Marco Pracucci <[email protected]> * Simplify compactor alerts We should simply alert on things not having run since X. 
Signed-off-by: Goutham Veeramachaneni <[email protected]> * Use the right metric Signed-off-by: Goutham Veeramachaneni <[email protected]> * Apply suggestions from code review Co-authored-by: Marco Pracucci <[email protected]> Signed-off-by: Goutham Veeramachaneni <[email protected]> * Fix CortexCompactorHasNotSuccessfullyRunCompaction to avoid false positives Signed-off-by: Marco Pracucci <[email protected]> * Introduce ingester instance limits to configuration, and add alerts. (grafana/cortex-jsonnet#296) * Introduce ingester instance limits to configuration, and add alerts. * CHANGELOG.md * Address (internal) review feedback. * Improve CortexRulerFailedRingCheck alert Signed-off-by: Marco Pracucci <[email protected]> * Added example Loki query to CortexTenantHasPartialBlocks playbook Signed-off-by: Marco Pracucci <[email protected]> * Default dashboards to Cortex blocks storage only Signed-off-by: Marco Pracucci <[email protected]> * Add missing memberlist components to alerts This adds the admin-api, compactor and store-gateway components to the memberlist alert. Signed-off-by: Christian Simon <[email protected]> * mixin: Add gateway to valid job names (for GEM) * Only show namespaces from selected cluster. "All" works thanks to using regex matcher. (grafana/cortex-jsonnet#311) * Only show namespaces from selected cluster. "All" works thanks to using regex matcher. * CHANGELOG.md * Fixed CortexIngesterHasNotShippedBlocks alert false positive Signed-off-by: Marco Pracucci <[email protected]> * Fixed mixin linter Signed-off-by: Marco Pracucci <[email protected]> * Add placeholders to make the linter pass Signed-off-by: Marco Pracucci <[email protected]> * cortex-mixin: Use kube_pod_container_resource_{requests,limits} metrics This updates the recording rules to make them compatible with kube-state-metrics v2.0.0 which introduces some breaking changes in some metric names. 
With kube-state-metrics v2.0.0: - `kube_pod_container_resource_requests_cpu_cores` becomes `kube_pod_container_resource_requests{resource="cpu"}` - `kube_pod_container_resource_requests_memory_bytes` becomes `kube_pod_container_resource_requests{resource="memory"}` * cortex-mixin: Make the recording rules backwards compatible * refactor: functions to reduce code duplication - improve overrideability - making more use of `per_instance_label` from _config - added containerNetworkPanel functions for dashboards to use * fix: lint * refactor: config for job aggregation strings - to make it easier to override, define "cluster_namespace_job" in $._config as `job_aggregation_prefix`. - added some `job_aggregation_labels_*` as well The resulting output does not change (unless config is overridden). * lint * Update cortex-mixin/dashboards/writes.libsonnet simplify mapping by extending $._config Co-authored-by: Marco Pracucci <[email protected]> * fix: syntax * refactor: added a group_config defines group-related strings based off of array-based parameters in _config. deprecated _config.alert_aggregation_labels with a std.trace warning, while maintaining (temporary?) backward compatibility. * refactor: added a group_config defines group-related strings based off of array-based parameters in _config. deprecated _config.alert_aggregation_labels with a std.trace warning, while maintaining (temporary?) backward compatibility. * refactor: added a group_config defines group-related strings based off of array-based parameters in _config. deprecated _config.alert_aggregation_labels with a std.trace warning, while maintaining (temporary?) backward compatibility. * Lower CortexIngesterRestarts severity Signed-off-by: Marco Pracucci <[email protected]> * feature: add some text boxes and descriptions Focussing on the reads and writes dashboards, added some info panels and hover-over descriptions for some of the panels. 
Some common code used by the compactor also received additional text content. New functions: - addRows - addRowsIf ...to add a list of rows to a dashboard. The `thanosMemcachedCache` function has had some of its query text sprawled out for easier reading and comparison with similar dashboard queries. * fix: text replacements, repair addRows * Changing copy to add 'latency' as well. * Cut down on text from initial PR. Tucked existing text from the compactor dashboard under tooltips, rather than making them text boxes. * Getting rid of a few space/comma errors. * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * fix: formatting - limit to 4 panels per row * fmt * fix: remove accidental line * Update cortex-mixin/dashboards/dashboard-utils.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/reads.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/reads.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update 
cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/reads.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * fix: Requests per second * fix: text * Apply suggestions from code review as per @osg-grafana Co-authored-by: Ursula Kallio <[email protected]> * fix: clarity * Apply suggestions from code review as per @osg-grafana Co-authored-by: Ursula Kallio <[email protected]> * Add a simple playbook for ingester series limit alert. Signed-off-by: Callum Styan <[email protected]> * Add cortex-gw-internal to watched gateway metrics (grafana/cortex-jsonnet#328) * Add cortex-gw-internal to watched gateway metrics * Update CHANGELOG.md Co-authored-by: Marco Pracucci <[email protected]> * fix: query formatting to aid in merge * fix: query formatting to aid in merge * fix: consistent labelling * fix: ensure panel titles are consistent - Most existing "per second" panel titles in `main` are written "/ sec", corrected recent commits to match. 
* Improved CortexIngesterReachingSeriesLimit playbook and added CortexIngesterReachingTenantsLimit playbook Signed-off-by: Marco Pracucci <[email protected]> * Better formatting for ingester_instance_limits+ example Signed-off-by: Marco Pracucci <[email protected]> * Clarify which alerts apply to chunks storage only Signed-off-by: Marco Pracucci <[email protected]> * Improve compactor alerts and playbooks Signed-off-by: Marco Pracucci <[email protected]> * Addressed review comments Signed-off-by: Marco Pracucci <[email protected]> * Update cortex-mixin/docs/playbooks.md Signed-off-by: Marco Pracucci <[email protected]> Co-authored-by: Peter Štibraný <[email protected]> * Fixed and improved runtime config alerts and playbooks Signed-off-by: Marco Pracucci <[email protected]> * fix: resolve review feedback * Update cortex-mixin/docs/playbooks.md Signed-off-by: Marco Pracucci <[email protected]> Co-authored-by: Peter Štibraný <[email protected]> * Update cortex-mixin/docs/playbooks.md Signed-off-by: Marco Pracucci <[email protected]> Co-authored-by: Peter Štibraný <[email protected]> * MarkCortexTableSyncFailure and CortexOldChunkInMemory alerts as chunks storage only Signed-off-by: Marco Pracucci <[email protected]> * Fixed whitespace noise Signed-off-by: Marco Pracucci <[email protected]> * refactor: resources dashboard comtainer functions added: - containerDiskWritesPanel - containerDiskReadsPanel - containerDiskSpaceUtilization * revert: matching spacing format of main * lint: white noise * Add playbook for CortexRequestErrors and config option to exclude specific routes Signed-off-by: Marco Pracucci <[email protected]> * Change min-step to 15s to show better detail. $__rate_interval will be floored at 4x this quantity, so 15s lets us see faster transients than the previous value of 1m. 
Signed-off-by: Bryan Boreham <[email protected]> * Added playbook for CortexFrontendQueriesStuck and CortexSchedulerQueriesStuck Signed-off-by: Marco Pracucci <[email protected]> * Remove CortexQuerierCapacityFull alert Signed-off-by: Marco Pracucci <[email protected]> * Added playbook for CortexProvisioningTooManyWrites Signed-off-by: Marco Pracucci <[email protected]> * Added playbook for CortexAllocatingTooMuchMemory Signed-off-by: Marco Pracucci <[email protected]> * Address review feedback Signed-off-by: Marco Pracucci <[email protected]> * Replaced CortexCacheRequestErrors with CortexMemcachedRequestErrors Signed-off-by: Marco Pracucci <[email protected]> * Replace ruler alerts, and add playbooks. * Addressed review comments Signed-off-by: Marco Pracucci <[email protected]> * Fix white space. * Better alert messages. * Improve CortexIngesterReachingSeriesLimit playbook Signed-off-by: Marco Pracucci <[email protected]> * Add playbook for CortexProvisioningTooManyActiveSeries Signed-off-by: Marco Pracucci <[email protected]> * Improve messaging. * Fixed formatting Signed-off-by: Marco Pracucci <[email protected]> * Improved alert messages with Cortex cluster Signed-off-by: Marco Pracucci <[email protected]> * Improved CortexRequestLatency playbook Signed-off-by: Marco Pracucci <[email protected]> * Added 'Per route p99 latency' to ruler configuration API Signed-off-by: Marco Pracucci <[email protected]> * Addressed review comments Signed-off-by: Marco Pracucci <[email protected]> * Aded object storage metrics for Ruler and Alertmanager Signed-off-by: Marco Pracucci <[email protected]> * Add playbook entry for CortexGossipMembersMismatch. * Clarify data loss related to 'not healthy index found' issue Signed-off-by: Marco Pracucci <[email protected]> * Review comments. 
* Improve CortexIngesterReachingSeriesLimit playbook Signed-off-by: Marco Pracucci <[email protected]> * Increased CortexIngesterReachingSeriesLimit critical alert threshold from 80% to 85% Signed-off-by: Marco Pracucci <[email protected]> * Increase CortexIngesterReachingSeriesLimit warning `for` duration As it turns out, during normal shuffle-sharding operation, the 70% mark is often exceeded, but not by much. Rather than increasing the threshold to 75%, this commit increases the `for` duration to 3h, following the thought that we want this alert to fire if ingesters are constantly above the threshold even after stale series are flushed (which occurs every 2h, when the TSDB head is compacted). We flush series with a timestamp between [-3h, -1h] after the last compaction, so the worst case scenario is that it takes 3h to flush a stale series. Signed-off-by: beorn7 <[email protected]> * Fix scaling dashboard to work on multi-zone ingesters Signed-off-by: Marco Pracucci <[email protected]> * Simplified cluster_namespace_deployment:actual_replicas:count recording rule Signed-off-by: Marco Pracucci <[email protected]> * Added a comment to explain '.*?' Signed-off-by: Marco Pracucci <[email protected]> * Fix rollout dashboard to work with multi-zone deployments Signed-off-by: Marco Pracucci <[email protected]> * Fixed legends Signed-off-by: Marco Pracucci <[email protected]> * Extend Alertmanager dashboard with currently unused metrics. 
Metrics for general operation:
- Added "Tenants" stat panel using: `cortex_alertmanager_tenants_discovered`
- Added "Tenant Configuration Sync" row using: `cortex_alertmanager_sync_configs_failed_total`, `cortex_alertmanager_sync_configs_total`, `cortex_alertmanager_ring_check_errors_total`
Metrics specific to sharding operation:
- Added "Sharding Initial State Sync" row using: `cortex_alertmanager_state_initial_sync_completed_total`, `cortex_alertmanager_state_initial_sync_completed_total`, `cortex_alertmanager_state_initial_sync_duration_seconds`
- Added "Sharding State Operations" row using: `cortex_alertmanager_state_fetch_replica_state_total`, `cortex_alertmanager_state_fetch_replica_state_failed_total`, `cortex_alertmanager_state_replication_total`, `cortex_alertmanager_state_replication_failed_total`, `cortex_alertmanager_partial_state_merges_total`, `cortex_alertmanager_partial_state_merges_failed_total`, `cortex_alertmanager_state_persist_total`, `cortex_alertmanager_state_persist_failed_total`
* Review comments + fix latency panel.
* Review comments.
* Clarify the gsutil mv command for moving corrupted blocks Signed-off-by: Tyler Reid <[email protected]> * Modify log message to fit example command Signed-off-by: Tyler Reid <[email protected]> * Update grafana-builder from Mar 2019 to Feb 2021 Brings in the following changes: - Use default as a picker value for datasource variable grafana/jsonnet-libshttps://github.com/grafana/cortex-jsonnet/pull/204 - allow table link in new tab grafana/jsonnet-libshttps://github.com/grafana/cortex-jsonnet/pull/238 - allow setting a default datasource grafana/jsonnet-libshttps://github.com/grafana/cortex-jsonnet/pull/301 - Add textPanel grafana/jsonnet-libshttps://github.com/grafana/cortex-jsonnet/pull/341 - make status code label name overrideable in qpsPanel grafana/jsonnet-libshttps://github.com/grafana/cortex-jsonnet/pull/397 - use $__rate_interval over $__interval grafana/jsonnet-libshttps://github.com/grafana/cortex-jsonnet/pull/401 - Set shared tooltip to false by default grafana/jsonnet-libshttps://github.com/grafana/cortex-jsonnet/pull/458 - Use custom 'all' value to avoid massive regexes in queries. grafana/jsonnet-libshttps://github.com/grafana/cortex-jsonnet/pull/469 https://github.com/grafana/jsonnet-libs/commits/master/grafana-builder/ * Match query-frontend/query-scheduler/querier custom deployments by default Signed-off-by: Marco Pracucci <[email protected]> * Create playbooks for sharded alertmanager * Add new alerts for alertmanager sharding mode of operation. * fix(rules): upstream recording rule switched to sum_irate ref: kubernetes-monitoring/kubernetes-mixin#619 * Fix CortexIngesterReachingSeriesLimit playbook Signed-off-by: Arve Knudsen <[email protected]> * feat: Allow configuration of ring members in gossip alerts Signed-off-by: Jack Baldry <[email protected]> * fix: Add store-gateway and compactor ring_members Also re-order names for readability. 
Signed-off-by: Jack Baldry <[email protected]> * fix: Match all ingester workloads and avoid matching the cortex-gateway Signed-off-by: Jack Baldry <[email protected]> * feat: Optionally allow use of array or string to configure ring members Signed-off-by: Jack Baldry <[email protected]> * address review feedback Signed-off-by: Jack Baldry <[email protected]> * fix: Correct ingester and querier regexps Signed-off-by: Jack Baldry <[email protected]> * Fixes to initial state sync panels on alertmanager dashboard. 1) Change minimal interval to 1m for sync duration and fetch state panels. This is in order to show infrequent events at smaller time windows. 2) Change syncs/sec panel to reflect absolute value of metric not rate. The initial sync only occurs once per-tenant so the counter value is essentially 0 or 1. Due to how per-tenant metrics are aggregated, the external facing metric really acts more like a gauge reflecting the number of tenants which achieved each outcome. Also, stack this panel as it becomes easier to visually see when the initial syncs have completed for all tenants (e.g. during a rollout). * Add rate back to Alertmanager dashboard initial syncs panel. The metric in fact does act like a counter due to soft deletion of the per-user registry when the user is unconfigured (e.g. moved to another instance or configuration deleted). * Make the overrides metric name configurable. We (Grafana Labs) are about to put in a new system to control and export data about limits and we'll need to use a different name. This shouldn't affect our OSS users. Signed-off-by: Goutham Veeramachaneni <[email protected]> * Improve Cortex / Queries dashboard Signed-off-by: Marco Pracucci <[email protected]> * Add recording rules for speeding up Alertmanager dashboard. With large numbers of tenants the queries for some panels on thos dashboard can become quite slow as the metrics exposed are per-tenant. * Fixes from testing. * Move rules to their own group. 
* Split `cortex_api` recording rule group into three groups. This is a workaround for large clusters where this group can become slow to evaluate. * Update gsutil installation playbook Signed-off-by: Marco Pracucci <[email protected]> * Use `$._config.job_names.gateway` in resources dashboards. This fixes panels where `cortex-gw` was hardcoded. * Fine tune CortexIngesterReachingSeriesLimit alert Signed-off-by: Marco Pracucci <[email protected]> * Add CortexRolloutStuck alert Signed-off-by: Marco Pracucci <[email protected]> * Fixed playbook Signed-off-by: Marco Pracucci <[email protected]> * Added CortexFailingToTalkToConsul alert Signed-off-by: Marco Pracucci <[email protected]> * Fixed alert message Signed-off-by: Marco Pracucci <[email protected]> * Update alert to be generic to KV stores Signed-off-by: Marco Pracucci <[email protected]> * Add README * Add mimir-mixin CI checks * Update build image * Move to operations folder * Add missing zip to build-image * Run prettifier on playbooks.md * Update build-image Co-authored-by: Marco Pracucci <[email protected]> Co-authored-by: Goutham Veeramachaneni <[email protected]> Co-authored-by: Mauro Stettler <[email protected]> Co-authored-by: Tom Wilkie <[email protected]> Co-authored-by: Tom Wilkie <[email protected]> Co-authored-by: Goutham Veeramachaneni <[email protected]> Co-authored-by: Peter Štibraný <[email protected]> Co-authored-by: Alex Martin <[email protected]> Co-authored-by: Javier Palomo <[email protected]> Co-authored-by: Darren Janeczek <[email protected]> Co-authored-by: Darren Janeczek <[email protected]> Co-authored-by: Jennifer Villa <[email protected]> Co-authored-by: Ursula Kallio <[email protected]> Co-authored-by: Callum Styan <[email protected]> Co-authored-by: Johanna Ratliff <[email protected]> Co-authored-by: Bryan Boreham <[email protected]> Co-authored-by: Steve Simpson <[email protected]> Co-authored-by: beorn7 <[email protected]> Co-authored-by: Tyler Reid <[email protected]> 
Co-authored-by: George Robinson <[email protected]> Co-authored-by: Duologic <[email protected]> Co-authored-by: Arve Knudsen <[email protected]> Co-authored-by: Jack Baldry <[email protected]>
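The min-step change described in the commit message rests on simple arithmetic: it states that `$__rate_interval` is floored at 4x the panel's min-step. A minimal sketch of that calculation, taking the commit message's factor of 4 at face value (the function name `rate_interval_floor` is illustrative and not part of Grafana or the mixin):

```sh
# Smallest $__rate_interval, in seconds, for a given min-step in seconds,
# assuming the 4x floor stated in the commit message.
rate_interval_floor() {
  echo $(( 4 * $1 ))
}

echo "min-step 15s -> floor $(rate_interval_floor 15)s"
echo "min-step 1m  -> floor $(rate_interval_floor 60)s"
```

Under that assumption, a 15s min-step lets the rate window shrink to 60s, compared with 240s at the previous 1m min-step, which is why shorter transients become visible.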
1 parent 442ef6b commit 025285e

37 files changed: +6302 -15 lines changed

.github/workflows/test-build-deploy.yml

Lines changed: 5 additions & 3 deletions
@@ -10,7 +10,7 @@ jobs:
   lint:
     runs-on: ubuntu-20.04
     container:
-      image: us.gcr.io/kubernetes-dev/mimir-build-image:add-prettier-08d2e2a61
+      image: us.gcr.io/kubernetes-dev/mimir-build-image:20211018_import-cortex-mixin-e7b4eab3c
       credentials:
         username: _json_key
         password: ${{ secrets.gcr_json_key }}
@@ -36,6 +36,8 @@ jobs:
         run: make BUILD_IN_CONTAINER=false check-protos
       - name: Check Generated Documentation
         run: make BUILD_IN_CONTAINER=false check-doc
+      - name: Check Mixin
+        run: make BUILD_IN_CONTAINER=false check-mixin
       - name: Check White Noise.
         run: make BUILD_IN_CONTAINER=false check-white-noise
       - name: Check License Header
@@ -44,7 +46,7 @@ jobs:
   test:
     runs-on: ubuntu-20.04
     container:
-      image: us.gcr.io/kubernetes-dev/mimir-build-image:add-prettier-08d2e2a61
+      image: us.gcr.io/kubernetes-dev/mimir-build-image:20211018_import-cortex-mixin-e7b4eab3c
       credentials:
         username: _json_key
         password: ${{ secrets.gcr_json_key }}
@@ -68,7 +70,7 @@ jobs:
   build:
     runs-on: ubuntu-20.04
     container:
-      image: us.gcr.io/kubernetes-dev/mimir-build-image:add-prettier-08d2e2a61
+      image: us.gcr.io/kubernetes-dev/mimir-build-image:20211018_import-cortex-mixin-e7b4eab3c
       credentials:
         username: _json_key
         password: ${{ secrets.gcr_json_key }}

Makefile

Lines changed: 35 additions & 2 deletions
@@ -2,7 +2,7 @@
 # WARNING: do not commit to a repository!
 -include Makefile.local
 
-.PHONY: all test integration-tests cover clean images protos exes dist doc clean-doc check-doc push-multiarch-build-image license check-license format
+.PHONY: all test integration-tests cover clean images protos exes dist doc clean-doc check-doc push-multiarch-build-image license check-license format check-mixin check-mixin-jb check-mixin-mixtool check-mixin-playbook build-mixin format-mixin
 .DEFAULT_GOAL := all
 
 # Version number
@@ -25,6 +25,12 @@ GIT_REVISION := $(shell git rev-parse --short HEAD)
 GIT_BRANCH := $(shell git rev-parse --abbrev-ref HEAD)
 UPTODATE := .uptodate
 
+# path to jsonnetfmt
+JSONNET_FMT := jsonnetfmt
+
+# path to the mimir/mixin
+MIXIN_PATH := operations/mimir-mixin
+
 .PHONY: image-tag
 image-tag:
 	@echo $(IMAGE_TAG)
@@ -120,7 +126,7 @@ mimir-build-image/$(UPTODATE): mimir-build-image/*
 # All the boiler plate for building golang follows:
 SUDO := $(shell docker info >/dev/null 2>&1 || echo "sudo -E")
 BUILD_IN_CONTAINER := true
-LATEST_BUILD_IMAGE_TAG ?= add-prettier-08d2e2a61
+LATEST_BUILD_IMAGE_TAG ?= 20211018_import-cortex-mixin-e7b4eab3c
 
 # TTY is parameterized to allow Google Cloud Builder to run builds,
 # as it currently disallows TTY devices. This value needs to be overridden
@@ -314,6 +320,33 @@ clean-white-noise:
 check-white-noise: clean-white-noise
 	@git diff --exit-code --quiet -- '*.md' || (echo "Please remove trailing whitespaces running 'make clean-white-noise'" && false)
 
+check-mixin: format-mixin check-mixin-jb check-mixin-mixtool check-mixin-playbook
+	@git diff --exit-code --quiet -- $(MIXIN_PATH) || (echo "Please format mixin by running 'make format-mixin'" && false)
+
+	@cd $(MIXIN_PATH) && \
+	jb install && \
+	mixtool lint mixin.libsonnet
+
+check-mixin-jb:
+	@cd $(MIXIN_PATH) && \
+	jb install
+
+check-mixin-mixtool: check-mixin-jb
+	@cd $(MIXIN_PATH) && \
+	mixtool lint mixin.libsonnet
+
+check-mixin-playbook: build-mixin
+	@$(MIXIN_PATH)/scripts/lint-playbooks.sh
+
+build-mixin: check-mixin-jb
+	@rm -rf $(MIXIN_PATH)/out && mkdir $(MIXIN_PATH)/out
+	@cd $(MIXIN_PATH) && \
+	mixtool generate all --output-alerts out/alerts.yaml --output-rules out/rules.yaml --directory out/dashboards mixin.libsonnet && \
+	zip -q -r mimir-mixin.zip out
+
+format-mixin:
+	@find $(MIXIN_PATH) -type f -name '*.libsonnet' -print -o -name '*.jsonnet' -print | xargs jsonnetfmt -i
+
 web-serve:
 	cd website && hugo --config config.toml --minify -v server
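The `format-mixin` target selects files with `find ... -type f -name '*.libsonnet' -print -o -name '*.jsonnet' -print`. In `find`, `-o` binds more loosely than the implicit AND, so `-type f` guards only the first name pattern. The sketch below shows the difference in a throwaway directory and a grouped alternative. The file and directory names are made up for illustration, and a mixin tree is unlikely to contain a directory named this way, so treat it as an observation about the expression rather than a reported bug.

```sh
# Throwaway sandbox; all names below are illustrative only.
tmp=$(mktemp -d)
mkdir "$tmp/dir.jsonnet"                     # a directory whose name ends in .jsonnet
touch "$tmp/a.libsonnet" "$tmp/b.jsonnet"    # two regular files

# Same expression shape as the Makefile: -type f applies to the first branch only,
# so the directory is printed by the second branch.
ungrouped=$(cd "$tmp" && find . -type f -name '*.libsonnet' -print -o -name '*.jsonnet' -print | sort)

# Grouped variant: -type f applies to both name patterns.
grouped=$(cd "$tmp" && find . -type f \( -name '*.libsonnet' -o -name '*.jsonnet' \) -print | sort)

echo "ungrouped:"; echo "$ungrouped"
echo "grouped:";   echo "$grouped"
rm -rf "$tmp"
```

The ungrouped form lists the directory alongside the two files, while the grouped form lists only the regular files.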

mimir-build-image/Dockerfile

Lines changed: 14 additions & 10 deletions
@@ -6,7 +6,7 @@
 FROM golang:1.16.6-buster
 ARG goproxyValue
 ENV GOPROXY=${goproxyValue}
-RUN apt-get update && apt-get install -y curl python-requests python-yaml file jq unzip protobuf-compiler libprotobuf-dev shellcheck && \
+RUN apt-get update && apt-get install -y curl python-requests python-yaml file jq zip unzip protobuf-compiler libprotobuf-dev shellcheck && \
     rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
 RUN go get -u golang.org/x/tools/cmd/goimports@3fce476f0a782aeb5034d592c189e63be4ba6c9e
 RUN curl -sL https://deb.nodesource.com/setup_14.x | bash -
@@ -36,15 +36,19 @@ RUN GOARCH=$(go env GOARCH) && \
 
 RUN curl -sfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh| sh -s -- -b /usr/bin v1.27.0
 
-RUN GO111MODULE=on go get \
-    github.com/client9/misspell/cmd/[email protected] \
-    github.com/golang/protobuf/[email protected] \
-    github.com/gogo/protobuf/[email protected] \
-    github.com/gogo/protobuf/[email protected] \
-    github.com/weaveworks/tools/cover@bdd647e92546027e12cdde3ae0714bb495e43013 \
-    github.com/fatih/[email protected] \
-    github.com/campoy/[email protected] \
-    && rm -rf /go/pkg /go/src /root/.cache
+RUN GO111MODULE=on \
+    go get github.com/client9/misspell/cmd/[email protected] && \
+    go get github.com/golang/protobuf/[email protected] && \
+    go get github.com/gogo/protobuf/[email protected] && \
+    go get github.com/gogo/protobuf/[email protected] && \
+    go get github.com/weaveworks/tools/cover@bdd647e92546027e12cdde3ae0714bb495e43013 && \
+    go get github.com/fatih/[email protected] && \
+    go get github.com/campoy/[email protected] && \
+    go get github.com/jsonnet-bundler/jsonnet-bundler/cmd/[email protected] && \
+    go get github.com/monitoring-mixins/mixtool/cmd/mixtool@bca3066 && \
+    go get github.com/mikefarah/yq/[email protected] && \
+    go get github.com/google/go-jsonnet/cmd/[email protected] && \
+    rm -rf /go/pkg /go/src /root/.cache
 
 ENV NODE_PATH=/usr/lib/node_modules
 COPY build.sh /

operations/mimir-mixin/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+/out/
+/vendor/
+/mimir-mixin.zip

operations/mimir-mixin/README.md

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
# Monitoring for Mimir
2+
3+
To generate the Grafana dashboards and Prometheus alerts for Mimir:
4+
5+
## Usage
6+
7+
```console
8+
$ GO111MODULE=on go get github.com/monitoring-mixins/mixtool/cmd/mixtool
9+
$ GO111MODULE=on go get github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb
10+
$ git clone https://github.com/grafana/mimir.git
11+
$ make build-mixin
12+
```
13+
14+
This will leave all the alerts and dashboards in jsonnet/mimir-mixin/mimir-mixin.zip (or jsonnet/mimir-mixin/out).
15+
16+
## Known Problems
17+
18+
If you get an error like `cannot use cli.StringSliceFlag literal (type cli.StringSliceFlag) as type cli.Flag in slice literal` when installing [mixtool](https://github.com/monitoring-mixins/mixtool/issues/27), make sure you set `GO111MODULE=on` before `go get`.
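The usage steps above assume that `jb`, `mixtool` and `zip` are on the `PATH`, since the `build-mixin` target in this commit's Makefile calls all three; `format-mixin` additionally calls `jsonnetfmt`. A minimal pre-flight sketch follows. The helper name `missing_tools` is hypothetical and not part of the repository.

```sh
# Print each named tool that is not found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tools invoked by the build-mixin and format-mixin targets.
missing=$(missing_tools jb mixtool jsonnetfmt zip)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "missing tools:" $missing
else
  echo "all mixin tools found; run 'make build-mixin' from the repository root"
fi
```

Running the check from inside the cloned repository before `make build-mixin` surfaces a missing binary early, instead of partway through the build.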
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+{
+  prometheusAlerts+::
+    (import 'alerts/alerts.libsonnet') +
+    (import 'alerts/alertmanager.libsonnet') +
+
+    (if std.member($._config.storage_engine, 'blocks')
+     then
+       (import 'alerts/blocks.libsonnet') +
+       (import 'alerts/compactor.libsonnet')
+     else {}) +
+
+    { _config:: $._config + $._group_config },
+}
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
+{
+  groups+: [
+    {
+      name: 'alertmanager_alerts',
+      rules: [
+        {
+          alert: 'CortexAlertmanagerSyncConfigsFailing',
+          expr: |||
+            rate(cortex_alertmanager_sync_configs_failed_total[5m]) > 0
+          |||,
+          'for': '30m',
+          labels: {
+            severity: 'critical',
+          },
+          annotations: {
+            message: |||
+              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is failing to read tenant configurations from storage.
+            |||,
+          },
+        },
+        {
+          alert: 'CortexAlertmanagerRingCheckFailing',
+          expr: |||
+            rate(cortex_alertmanager_ring_check_errors_total[2m]) > 0
+          |||,
+          'for': '10m',
+          labels: {
+            severity: 'critical',
+          },
+          annotations: {
+            message: |||
+              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is unable to check tenants ownership via the ring.
+            |||,
+          },
+        },
+        {
+          alert: 'CortexAlertmanagerPartialStateMergeFailing',
+          expr: |||
+            rate(cortex_alertmanager_partial_state_merges_failed_total[2m]) > 0
+          |||,
+          'for': '10m',
+          labels: {
+            severity: 'critical',
+          },
+          annotations: {
+            message: |||
+              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is failing to merge partial state changes received from a replica.
+            |||,
+          },
+        },
+        {
+          alert: 'CortexAlertmanagerReplicationFailing',
+          expr: |||
+            rate(cortex_alertmanager_state_replication_failed_total[2m]) > 0
+          |||,
+          'for': '10m',
+          labels: {
+            severity: 'critical',
+          },
+          annotations: {
+            message: |||
+              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is failing to replicate partial state to its replicas.
+            |||,
+          },
+        },
+        {
+          alert: 'CortexAlertmanagerPersistStateFailing',
+          expr: |||
+            rate(cortex_alertmanager_state_persist_failed_total[15m]) > 0
+          |||,
+          'for': '1h',
+          labels: {
+            severity: 'critical',
+          },
+          annotations: {
+            message: |||
+              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} is unable to persist full state snapshots to remote storage.
+            |||,
+          },
+        },
+        {
+          alert: 'CortexAlertmanagerInitialSyncFailed',
+          expr: |||
+            increase(cortex_alertmanager_state_initial_sync_completed_total{outcome="failed"}[1m]) > 0
+          |||,
+          labels: {
+            severity: 'critical',
+          },
+          annotations: {
+            message: |||
+              Cortex Alertmanager {{ $labels.job }}/{{ $labels.instance }} was unable to obtain some initial state when starting up.
+            |||,
+          },
+        },
+      ],
+    },
+  ],
+}
