Conversation

@alexweav (Contributor) commented on May 4, 2023:

What this PR does

The upstream prometheus/alertmanager has recently been enhanced with new metrics:

  • New reason label on alertmanager_notifications_failed_total
  • alertmanager_silences_maintenance_total
  • alertmanager_silences_maintenance_errors_total
  • alertmanager_nflog_maintenance_total
  • alertmanager_nflog_maintenance_errors_total

Since we recently upgraded alertmanager, these metrics are now available for us to consume.
This PR exposes these metrics in Mimir.
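For illustration, a minimal sketch (not the actual Mimir or Alertmanager code) of how a failure counter carrying the new reason label can be declared and incremented with client_golang; the reason values follow the three classes discussed further down (clientError, serverError, other):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// failureReason maps an HTTP status code onto the coarse reason classes.
func failureReason(statusCode int) string {
	switch {
	case statusCode/100 == 4:
		return "clientError"
	case statusCode/100 == 5:
		return "serverError"
	default:
		return "other"
	}
}

func main() {
	reg := prometheus.NewRegistry()

	// Failed notifications, partitioned by integration and the new reason label.
	numFailedNotifications := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "alertmanager_notifications_failed_total",
		Help: "The total number of failed notifications.",
	}, []string{"integration", "reason"})
	reg.MustRegister(numFailedNotifications)

	// Record a webhook delivery that failed with a 503.
	numFailedNotifications.WithLabelValues("webhook", failureReason(http.StatusServiceUnavailable)).Inc()
}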

Which issue(s) this PR fixes or relates to

Fixes #4895

Checklist

  • Tests updated
  • Documentation added
    • I found no documentation listing each metric by name, but happy to update this if there is one! 😄
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

cortex_alertmanager_notification_latency_seconds_bucket{le="+Inf"} 27
cortex_alertmanager_notification_latency_seconds_sum 99.9
cortex_alertmanager_notification_latency_seconds_count 27
cortex_alertmanager_notification_latency_seconds_bucket{le="1"} 16
@alexweav (Contributor, Author) commented on May 4, 2023:

These tests had drifted a bit from the actual metrics, and they also did not cover the Telegram notifier. I fixed the drift while I was in the area, so there will be a few extra changes in the expected metrics as a result.
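To illustrate why this kind of copied expected-metrics text drifts (a sketch using client_golang's testutil, not the actual Mimir test harness): the expected exposition block has to match exactly, so any metric or label added upstream, such as the new reason label, requires updating it.

package notify_test

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestFailedNotificationMetrics(t *testing.T) {
	reg := prometheus.NewRegistry()
	failed := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "alertmanager_notifications_failed_total",
		Help: "The total number of failed notifications.",
	}, []string{"integration", "reason"})
	reg.MustRegister(failed)
	failed.WithLabelValues("telegram", "clientError").Inc()

	// The expected block mirrors the exposition format literally, so it
	// drifts whenever upstream adds a metric or a label.
	expected := `
# HELP alertmanager_notifications_failed_total The total number of failed notifications.
# TYPE alertmanager_notifications_failed_total counter
alertmanager_notifications_failed_total{integration="telegram",reason="clientError"} 1
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(expected), "alertmanager_notifications_failed_total"); err != nil {
		t.Fatal(err)
	}
}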

queryErrorsTotal prometheus.Counter
queryDuration prometheus.Histogram
propagatedMessagesTotal prometheus.Counter
maintenanceTotal prometheus.Counter
@alexweav (Contributor, Author) commented:

Changes in the rest of the file are copied from their respective prometheus files. There are comments pointing to the source above each relevant section.
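For context, a rough sketch (assumed shape and approximate help text, not the code copied from the prometheus files) of how a maintenance counter pair like the maintenanceTotal field above is declared and registered:

package silence

import "github.com/prometheus/client_golang/prometheus"

type metrics struct {
	maintenanceTotal  prometheus.Counter
	maintenanceErrors prometheus.Counter
}

func newMetrics(r prometheus.Registerer) *metrics {
	m := &metrics{
		maintenanceTotal: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "alertmanager_silences_maintenance_total",
			Help: "How many maintenances were executed for silences.",
		}),
		maintenanceErrors: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "alertmanager_silences_maintenance_errors_total",
			Help: "How many maintenances were executed for silences that failed.",
		}),
	}
	if r != nil {
		r.MustRegister(m.maintenanceTotal, m.maintenanceErrors)
	}
	return m
}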

}, []string{"integration"}),
}
for _, integration := range integrations {
for _, integration := range []string{
@alexweav (Contributor, Author) commented:

Intentionally duplicated the list here, because that's what's in prometheus's notify.go. I didn't want to disturb the "purity" of the code that is copied from prometheus.
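Roughly, the duplicated pattern looks like the sketch below (the integration names are illustrative, and this is not the vendored notify.go itself): the hard-coded list is walked once so each per-integration series is created up front.

package notify

import "github.com/prometheus/client_golang/prometheus"

// newFailedCounter walks a hard-coded integration list, mirroring the
// pattern in prometheus's notify.go, so each series exists up front.
func newFailedCounter(r prometheus.Registerer) *prometheus.CounterVec {
	c := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "alertmanager_notifications_failed_total",
		Help: "The total number of failed notifications.",
	}, []string{"integration", "reason"})
	for _, integration := range []string{
		"email", "pagerduty", "slack", "sns", "telegram", "webhook",
	} {
		// Touching the label values creates the series for this integration.
		c.WithLabelValues(integration, "other")
	}
	if r != nil {
		r.MustRegister(c)
	}
	return c
}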

@alexweav (Contributor, Author) commented:

These tests currently work against copy/pasted code from prometheus, which is subject to drift if prometheus changes (as it did here).

@gotjosh, WDYT about exporting some of these objects, like the metrics structs? Then we could test against the actual vendored prometheus code and avoid this kind of test drift in the future. Would alertmanager accept such a change, or is it too separated from prometheus?

A contributor replied:

I see no harm in that if it's a straightforward change 👍

@alexweav alexweav marked this pull request as ready for review May 4, 2023 02:28
@alexweav alexweav requested review from a team as code owners May 4, 2023 02:28
"cortex_alertmanager_notifications_failed_total",
"The total number of failed notifications.",
[]string{"user", "integration"}, nil),
[]string{"user", "integration", "reason"}, nil),
@stevesg (Contributor) commented on May 4, 2023:

This metric can already be very high cardinality with many tenants, so I'm wondering whether we should intentionally aggregate away reason. However, it can only take one of three values (other, clientError, or serverError), so we could potentially accept it, and because this metric is only exposed for non-zero values, the increase in series would normally be low. (A sketch of aggregating the label away follows this list.)

I'm trying to think how we would use it:

  • For investigating failures:
    The extra detail is probably too little; I would just look at logs.
  • For alerting on failures as an operator:
    There could be value if we only wanted to alert on a particular class of error. However, I'm not sure I trust all integrations and backend services to correctly return 4xx vs 5xx (think: arbitrary webhooks). It appears the original intention of the change was to alert on 5xx class errors specifically from the SNS integration, so the people doing that presumably have sufficient trust in the error codes returned from it.
  • For alerting on failures as a tenant/user:
    I care about all errors, regardless of 4xx vs 5xx.
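If cardinality ever did become a concern, the label can simply be aggregated away in dashboards or recording rules; a sketch in PromQL against the Mimir-exposed metric:

sum without (reason) (
  rate(cortex_alertmanager_notifications_failed_total[5m])
)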

A contributor commented:

I've checked some of our internal environments; even with the high number of tenants, the number of series for this metric is very small, so I don't think adding reason will be a problem.

@alexweav (Contributor, Author) commented:

There is a lot of discussion of this (including points mirroring yours about cardinality) in prometheus/alertmanager#2927.

Overall, the main benefit is for monitoring specific notifiers. While I agree that many systems have messy 400/500 semantics, if a specific notifier is known not to have that problem, the label could be useful for alerting. SNS was the specific example given in the thread.

Another potential use case is an outage in a third-party system for which we happen to have an integration. The metrics can give operators some extra confidence (which they might be able to pass on to users) that "our deployment is working fine, but the third-party service is down."

@stevesg (Contributor) left a review:

Looks great, thanks! One nit in the changelog, and please check my comments on the reason label to see if you agree/disagree.

As a follow-up, it would be nice to experiment with an alert for 5xx errors coming from notifications, though I'm not sure how noisy it would be. Perhaps it would be fine for a subset of integrations.
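As a sketch of what that follow-up alert could look like (the rule name, threshold, and integration filter are entirely hypothetical, and it leans on the reason label added in this PR):

groups:
  - name: alertmanager-notifications
    rules:
      - alert: AlertmanagerNotificationServerErrors   # hypothetical name
        # Restrict to integrations whose status codes we trust, e.g. SNS.
        expr: |
          sum by (integration) (
            rate(cortex_alertmanager_notifications_failed_total{reason="serverError", integration=~"sns"}[5m])
          ) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Notifications to {{ $labels.integration }} are failing with server errors.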

@alexweav alexweav force-pushed the alexweav/alertmanager-metrics branch from bb28959 to a60a5e4 on May 8, 2023 22:21
@stevesg (Contributor) left a review:

Looks ready to merge; we need one of @grafana/mimir-maintainers to approve the CHANGELOG.

@stevesg stevesg merged commit 7593d2f into main May 9, 2023
@stevesg stevesg deleted the alexweav/alertmanager-metrics branch May 9, 2023 09:26
This pull request closed issue #4895: New Alertmanager metrics following upgrade.