Conversation

@alexweav (Contributor) commented on May 4, 2023:

What this PR does

The upstream prometheus/alertmanager has recently been enhanced with new metrics:

  • New reason label on alertmanager_notifications_failed_total
  • alertmanager_silences_maintenance_total
  • alertmanager_silences_maintenance_errors_total
  • alertmanager_nflog_maintenance_total
  • alertmanager_nflog_maintenance_errors_total

Since we recently upgraded alertmanager, these metrics are now available for us to consume.
This PR exposes these metrics in Mimir.
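For illustration, a minimal sketch (not the actual Mimir or Alertmanager code) of how a failure counter carrying the new reason label can be declared and incremented with client_golang; the reason values follow the three classes discussed further down (clientError, serverError, other):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// failureReason maps an HTTP status code onto the coarse reason classes.
func failureReason(statusCode int) string {
	switch {
	case statusCode/100 == 4:
		return "clientError"
	case statusCode/100 == 5:
		return "serverError"
	default:
		return "other"
	}
}

func main() {
	reg := prometheus.NewRegistry()

	// Failed notifications, partitioned by integration and the new reason label.
	numFailedNotifications := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "alertmanager_notifications_failed_total",
		Help: "The total number of failed notifications.",
	}, []string{"integration", "reason"})
	reg.MustRegister(numFailedNotifications)

	// Record a webhook delivery that failed with a 503.
	numFailedNotifications.WithLabelValues("webhook", failureReason(http.StatusServiceUnavailable)).Inc()
}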

Which issue(s) this PR fixes or relates to

Fixes #4895

Checklist

  • Tests updated
  • Documentation added
    • I found no documentation listing each metric by name, but happy to update this if there is one! 😄
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

cortex_alertmanager_notification_latency_seconds_bucket{le="+Inf"} 27
cortex_alertmanager_notification_latency_seconds_sum 99.9
cortex_alertmanager_notification_latency_seconds_count 27
cortex_alertmanager_notification_latency_seconds_bucket{le="1"} 16
@alexweav (Contributor, Author) commented on May 4, 2023:

These tests had drifted a bit from the actual metrics, and they also did not cover the Telegram notifier. I fixed the drift while I was in the area, so there will be a few extra changes in the expected metrics as a result.
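To illustrate why this kind of copied expected-metrics text drifts (a sketch using client_golang's testutil, not the actual Mimir test harness): the expected exposition block has to match exactly, so any metric or label added upstream, such as the new reason label, requires updating it.

package notify_test

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestFailedNotificationMetrics(t *testing.T) {
	reg := prometheus.NewRegistry()
	failed := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "alertmanager_notifications_failed_total",
		Help: "The total number of failed notifications.",
	}, []string{"integration", "reason"})
	reg.MustRegister(failed)
	failed.WithLabelValues("telegram", "clientError").Inc()

	// The expected block mirrors the exposition format literally, so it
	// drifts whenever upstream adds a metric or a label.
	expected := `
# HELP alertmanager_notifications_failed_total The total number of failed notifications.
# TYPE alertmanager_notifications_failed_total counter
alertmanager_notifications_failed_total{integration="telegram",reason="clientError"} 1
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(expected), "alertmanager_notifications_failed_total"); err != nil {
		t.Fatal(err)
	}
}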

queryErrorsTotal prometheus.Counter
queryDuration prometheus.Histogram
propagatedMessagesTotal prometheus.Counter
maintenanceTotal prometheus.Counter
@alexweav (Contributor, Author) commented:

Changes in the rest of the file are copied from their respective prometheus files. There are comments pointing to the source above each relevant section.
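For context, a rough sketch (assumed shape and approximate help text, not the code copied from the prometheus files) of how a maintenance counter pair like the maintenanceTotal field above is declared and registered:

package silence

import "github.com/prometheus/client_golang/prometheus"

type metrics struct {
	maintenanceTotal  prometheus.Counter
	maintenanceErrors prometheus.Counter
}

func newMetrics(r prometheus.Registerer) *metrics {
	m := &metrics{
		maintenanceTotal: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "alertmanager_silences_maintenance_total",
			Help: "How many maintenances were executed for silences.",
		}),
		maintenanceErrors: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "alertmanager_silences_maintenance_errors_total",
			Help: "How many maintenances were executed for silences that failed.",
		}),
	}
	if r != nil {
		r.MustRegister(m.maintenanceTotal, m.maintenanceErrors)
	}
	return m
}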

}, []string{"integration"}),
}
for _, integration := range integrations {
for _, integration := range []string{
@alexweav (Contributor, Author) commented:

Intentionally duplicated the list here, because that's what's in prometheus's notify.go. I didn't want to disturb the "purity" of the code that is copied from prometheus.
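Roughly, the duplicated pattern looks like the sketch below (the integration names are illustrative, and this is not the vendored notify.go itself): the hard-coded list is walked once so each per-integration series is created up front.

package notify

import "github.com/prometheus/client_golang/prometheus"

// newFailedCounter walks a hard-coded integration list, mirroring the
// pattern in prometheus's notify.go, so each series exists up front.
func newFailedCounter(r prometheus.Registerer) *prometheus.CounterVec {
	c := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "alertmanager_notifications_failed_total",
		Help: "The total number of failed notifications.",
	}, []string{"integration", "reason"})
	for _, integration := range []string{
		"email", "pagerduty", "slack", "sns", "telegram", "webhook",
	} {
		// Touching the label values creates the series for this integration.
		c.WithLabelValues(integration, "other")
	}
	if r != nil {
		r.MustRegister(c)
	}
	return c
}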

@alexweav (Contributor, Author) commented:

These tests currently work against copy/pasted code from prometheus, which is subject to drift if prometheus changes (as it did here).

@gotjosh, WDYT about exporting some of these objects, like the metrics structs? Then we could test against the actual vendored prometheus code and avoid this kind of test drift in the future. Would alertmanager accept such a change, or is it too separated from prometheus?

A contributor replied:

I see no harm in that if it's a straightforward change 👍

@alexweav alexweav marked this pull request as ready for review May 4, 2023 02:28
@alexweav alexweav requested review from a team as code owners May 4, 2023 02:28
"cortex_alertmanager_notifications_failed_total",
"The total number of failed notifications.",
[]string{"user", "integration"}, nil),
[]string{"user", "integration", "reason"}, nil),
@stevesg (Contributor) commented on May 4, 2023:

This metric can already be very high cardinality with many tenants, so I'm wondering whether we should intentionally aggregate away reason. However, it can only take one of three values (other, clientError, or serverError), so we could potentially accept it, and because this metric is only exposed for non-zero values, the increase in series would normally be low. (A sketch of aggregating the label away follows this list.)

I'm trying to think how we would use it:

  • For investigating failures:
    The extra detail is probably too little; I would just look at logs.
  • For alerting on failures as an operator:
    There could be value if we only wanted to alert on a particular class of error. However, I'm not sure I trust all integrations and backend services to correctly return 4xx vs 5xx (think: arbitrary webhooks). It appears the original intention of the change was to alert on 5xx class errors specifically from the SNS integration, so the people doing that presumably have sufficient trust in the error codes returned from it.
  • For alerting on failures as a tenant/user:
    I care about all errors, regardless of 4xx vs 5xx.
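If cardinality ever did become a concern, the label can simply be aggregated away in dashboards or recording rules; a sketch in PromQL against the Mimir-exposed metric:

sum without (reason) (
  rate(cortex_alertmanager_notifications_failed_total[5m])
)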

A contributor commented:

I've checked some of our internal environments; even with the high number of tenants, the number of series for this metric is very small, so I don't think adding reason will be a problem.

@alexweav (Contributor, Author) commented:

There is a lot of discussion of this (including points mirroring yours about cardinality) in prometheus/alertmanager#2927.

Overall, the main benefit is for monitoring specific notifiers. While I agree that many systems have messy 400/500 semantics, if a specific notifier is known not to have that problem, the label could be useful for alerting. SNS was the specific example given in the thread.

Another potential use case is an outage in a third-party system for which we happen to have an integration. The metrics can give operators some extra confidence (which they might be able to pass on to users) that "our deployment is working fine, but the third-party service is down."

@stevesg (Contributor) left a review:

Looks great, thanks! One nit in the changelog, and please check my comments on the reason label to see if you agree/disagree.

As a follow-up, it would be nice to experiment with an alert for 5xx errors coming from notifications, though I'm not sure how noisy it would be. Perhaps it would be fine for a subset of integrations.
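As a sketch of what that follow-up alert could look like (the rule name, threshold, and integration filter are entirely hypothetical, and it leans on the reason label added in this PR):

groups:
  - name: alertmanager-notifications
    rules:
      - alert: AlertmanagerNotificationServerErrors   # hypothetical name
        # Restrict to integrations whose status codes we trust, e.g. SNS.
        expr: |
          sum by (integration) (
            rate(cortex_alertmanager_notifications_failed_total{reason="serverError", integration=~"sns"}[5m])
          ) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Notifications to {{ $labels.integration }} are failing with server errors.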

@alexweav alexweav force-pushed the alexweav/alertmanager-metrics branch from bb28959 to a60a5e4 on May 8, 2023 22:21
@stevesg (Contributor) left a review:

Looks ready to merge; we need one of @grafana/mimir-maintainers to approve the CHANGELOG.

@stevesg stevesg merged commit 7593d2f into main May 9, 2023
@stevesg stevesg deleted the alexweav/alertmanager-metrics branch May 9, 2023 09:26
This pull request closed issue #4895: New Alertmanager metrics following upgrade.