Alertmanager: Introduce new metrics from upstream #4918
Conversation
cortex_alertmanager_notification_latency_seconds_bucket{le="+Inf"} 27
cortex_alertmanager_notification_latency_seconds_sum 99.9
cortex_alertmanager_notification_latency_seconds_count 27
cortex_alertmanager_notification_latency_seconds_bucket{le="1"} 16
These tests had drifted from the actual metrics a bit, and they also did not cover the Telegram notifier. I fixed the drift in the tests while I was in the area, so there will be a few extra changes in the expected metrics.
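For context on how these expected-metrics tests work, here is a minimal sketch using client_golang's testutil; the metric, values, and package name are illustrative, not taken from this PR. Drift happens because the literal expected block has to be updated by hand whenever the metric definitions change:

```go
package alertmanager_test

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

// Minimal sketch of an expected-metrics test; metric name and values are
// illustrative, not the actual test fixture from this PR.
func TestNotificationMetricsSketch(t *testing.T) {
	reg := prometheus.NewPedanticRegistry()

	failed := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cortex_alertmanager_notifications_failed_total",
		Help: "The total number of failed notifications.",
	}, []string{"integration", "reason"})
	reg.MustRegister(failed)

	failed.WithLabelValues("webhook", "clientError").Add(2)

	// The expected exposition below has to be kept in sync with the metric
	// definitions by hand; adding a label (e.g. "reason") changes every
	// expected line, which is the drift mentioned above.
	expected := `
		# HELP cortex_alertmanager_notifications_failed_total The total number of failed notifications.
		# TYPE cortex_alertmanager_notifications_failed_total counter
		cortex_alertmanager_notifications_failed_total{integration="webhook",reason="clientError"} 2
	`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(expected),
		"cortex_alertmanager_notifications_failed_total"); err != nil {
		t.Fatal(err)
	}
}
```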
queryErrorsTotal prometheus.Counter
queryDuration prometheus.Histogram
propagatedMessagesTotal prometheus.Counter
maintenanceTotal prometheus.Counter
Changes in the rest of the file are copied from their respective prometheus files. There are comments pointing to the source above each relevant section.
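As a rough illustration of the pattern being mirrored, here is a sketch styled after upstream Alertmanager's silence package; the field set matches the diff above, while the constructor, help strings, and registration are illustrative rather than copied:

```go
package silence

import "github.com/prometheus/client_golang/prometheus"

// metrics mirrors the struct shape shown in the diff above; the field set
// matches the upstream Alertmanager silence package, while the constructor
// below is an illustrative sketch rather than a verbatim copy.
type metrics struct {
	queriesTotal            prometheus.Counter
	queryErrorsTotal        prometheus.Counter
	queryDuration           prometheus.Histogram
	propagatedMessagesTotal prometheus.Counter
	maintenanceTotal        prometheus.Counter
	maintenanceErrorsTotal  prometheus.Counter
}

func newMetrics(r prometheus.Registerer) *metrics {
	m := &metrics{
		maintenanceTotal: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "alertmanager_silences_maintenance_total",
			Help: "How many maintenances were executed for silences (help text approximate).",
		}),
		maintenanceErrorsTotal: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "alertmanager_silences_maintenance_errors_total",
			Help: "How many silence maintenances failed (help text approximate).",
		}),
		// Remaining fields omitted for brevity.
	}
	if r != nil {
		r.MustRegister(m.maintenanceTotal, m.maintenanceErrorsTotal)
	}
	return m
}
```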
}, []string{"integration"}),
}
for _, integration := range integrations {
for _, integration := range []string{
Intentionally duplicated the list here, because that's what's in prometheus's notify.go. I didn't want to disturb the "purity" of the code that is copied from prometheus.
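The upstream pattern being preserved looks roughly like the sketch below: the integration names are written out literally and used to pre-populate the per-integration counters with zero values, so the series exist before any notification is sent. The exact list and function name here are illustrative:

```go
package notify

import "github.com/prometheus/client_golang/prometheus"

// initCounters is an illustrative sketch of the duplicated-list pattern:
// iterating over a literal list of integration names (mirroring upstream
// notify.go) so that every per-integration series is created with a zero
// value up front. The list below is approximate, not the exact one in the PR,
// and numNotifications is assumed to have a single "integration" label.
func initCounters(numNotifications *prometheus.CounterVec) {
	for _, integration := range []string{
		"email",
		"pagerduty",
		"pushover",
		"slack",
		"opsgenie",
		"webhook",
		"victorops",
		"sns",
		"telegram",
	} {
		// Touching the label value instantiates the child counter at 0.
		numNotifications.WithLabelValues(integration)
	}
}
```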
These tests currently work against copy/pasted code from prometheus, which is subject to drift if prometheus changes (as it did here).
@gotjosh, WDYT about exporting some of these objects like metrics structs? Then, we could test against the actual vendored prometheus code, and avoid this kind of test drift in the future. Would alertmanager accept such a change, or is it too separated from prometheus?
I see no harm in that if it's a straightforward change 👍
| "cortex_alertmanager_notifications_failed_total", | ||
| "The total number of failed notifications.", | ||
| []string{"user", "integration"}, nil), | ||
| []string{"user", "integration", "reason"}, nil), |
This metric can already be very high cardinality with many tenants, so I'm wondering whether we should intentionally aggregate away reason. However, it can only take one of three values (other, clientError, or serverError; see the sketch after the list below), so we could potentially accept it, and because this metric is only exposed for non-zero values, the increase in series would normally be low.
I'm trying to think how we would use it:
- For investigating failures: the extra detail is probably too little; I would just look at logs.
- For alerting on failures as an operator: there could be value if we only wanted to alert on a particular class of error. However, I'm not sure I trust all integrations and backend services to correctly return 4xx vs 5xx (think: arbitrary webhooks). It appears the original intention of the change was to alert on 5xx-class errors specifically from the SNS integration, so the people doing that presumably have sufficient trust in the error codes it returns.
- For alerting on failures as a tenant/user: I care about all errors, regardless of 4xx vs 5xx.
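To make the three values concrete, here is a sketch of how a reason label can be derived from an HTTP status code; the constant and function names are placeholders, though upstream Alertmanager implements an equivalent classification in its notify package:

```go
package notify

// Illustrative classification for the "reason" label discussed above:
// 4xx maps to clientError, 5xx to serverError, everything else to other.
// Names are placeholders, not the upstream identifiers.
const (
	reasonOther       = "other"
	reasonClientError = "clientError"
	reasonServerError = "serverError"
)

func failureReasonFromStatusCode(statusCode int) string {
	switch {
	case statusCode >= 400 && statusCode <= 499:
		return reasonClientError
	case statusCode >= 500 && statusCode <= 599:
		return reasonServerError
	default:
		return reasonOther
	}
}
```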
I've checked some of our internal environments; even with the high number of tenants, the number of series for this metric is very small, so I don't think adding reason will be a problem.
There is much discussion here (including mirroring your point about cardinality): prometheus/alertmanager#2927
Overall, the main benefit is for monitoring specific notifiers. While I agree that many systems have messy 400/500 semantics, if a specific notifier was known to not do this, it could be useful for alerting. SNS was the specific example given in the thread.
Another potential use case is if there's an outage in a third-party system where we happen to have an integration. The metrics can give operators some extra confidence (that they might be able to pass on to users) that "our deployment is working fine, but the third-party service is down."
stevesg left a comment:
Looks great, thanks! One nit in the changelog and please check my comments on the reason label to see if you agree/disagree.
As a follow-up, it would be nice to experiment with an alert for 5xx errors coming from notifications; however, I'm not sure how noisy it would be. Perhaps it would be fine for a subset of integrations.
bb28959 to a60a5e4
stevesg left a comment:
Looks ready to merge, need one of @grafana/mimir-maintainers to approve the CHANGELOG.
What this PR does
The upstream prometheus/alertmanager has recently been enhanced with new metrics:

- reason label on alertmanager_notifications_failed_total
- alertmanager_silences_maintenance_total
- alertmanager_silences_maintenance_errors_total
- alertmanager_nflog_maintenance_total
- alertmanager_nflog_maintenance_errors_total

Since we recently upgraded alertmanager, these metrics are now available for us to consume.
This PR exposes these metrics in Mimir.
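For a sense of what "exposing" means here, the per-tenant descriptors follow the pattern visible in the diff earlier in this conversation: the upstream name gains a cortex_ prefix and a leading user label. A minimal sketch follows (descriptor definitions only; the collector that aggregates per-tenant registries into them is omitted, and the second descriptor's help text is an assumption):

```go
package alertmanager

import "github.com/prometheus/client_golang/prometheus"

// Sketch of per-tenant metric descriptors following the cortex_ prefix +
// "user" label pattern shown in the diff. The aggregation logic that fills
// these in from each tenant's registry is not shown here.
var (
	notificationsFailed = prometheus.NewDesc(
		"cortex_alertmanager_notifications_failed_total",
		"The total number of failed notifications.",
		[]string{"user", "integration", "reason"}, nil)

	// Help text below is approximate.
	silencesMaintenanceTotal = prometheus.NewDesc(
		"cortex_alertmanager_silences_maintenance_total",
		"How many maintenances were executed for silences.",
		[]string{"user"}, nil)
)
```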
Which issue(s) this PR fixes or relates to
Fixes #4895
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]