Netcup Prometheus alerting for unreachable Alertmanagers #219

Open
jchristgit opened this issue Apr 14, 2024 · 2 comments
Labels
group: ansible Issues and pull requests related to the Ansible setup

Comments

@jchristgit
Member

We need to configure our Alertmanager to send alerts to Discord so that we
are informed whenever something is wrong in the monitoring setup on
lovelace.

@jchristgit jchristgit added the group: ansible Issues and pull requests related to the Ansible setup label Apr 14, 2024
@jb3
Member

jb3 commented Apr 14, 2024

As discussed in the dev-ops channel, I think we can reach a configuration here that utilizes our existing High-Availability AlertManager setup.

We can set up token access for Prometheus on Ansible machines to push alerts through to the Kubernetes HA AlertManager.

Some notes:

  • This does not mean we cannot route alerts differently or push to different areas based on severity. We still maintain full granular control over alert routing, even more so since we centralise it.
  • We can write a small healthcheck (systemd timer, cronjob, etc.) to check whether the AlertManager server is healthy and responding to requests. If it is not, we can trigger a rudimentary alert from netcup reporting that the AlertManager is down. I don't think this needs to be a separate instance of AlertManager, as that feels overcomplex and leads to duplication of routing configuration.
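
For illustration, the healthcheck idea could look roughly like the sketch below. It assumes the AlertManager is probed via its `/-/healthy` endpoint and that the rudimentary alert is a plain Discord webhook message. The function names, the optional bearer token, and the message wording are hypothetical and not taken from this thread.

```python
import http.client
import json
import urllib.request
from typing import Optional


def health_url(base_url: str) -> str:
    """Build the URL of Alertmanager's /-/healthy endpoint."""
    return base_url.rstrip("/") + "/-/healthy"


def alertmanager_is_healthy(
    base_url: str, token: Optional[str] = None, timeout: float = 5.0
) -> bool:
    """Return True only if the health endpoint answers with HTTP 200."""
    # Assumption: if token access is in place, it is a bearer token.
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = urllib.request.Request(health_url(base_url), headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, DNS failures and HTTP error codes
        # all count as "not healthy".
        return False


def build_discord_payload(base_url: str) -> bytes:
    """Encode a minimal Discord webhook message (hypothetical wording)."""
    message = f"AlertManager at {base_url} is not responding to health checks."
    return json.dumps({"content": message}).encode("utf-8")


def notify_discord(webhook_url: str, payload: bytes, timeout: float = 5.0) -> None:
    """POST the payload to a Discord webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: an explicit User-Agent avoids rejections that the
            # default urllib one can trigger on some endpoints.
            "User-Agent": "alertmanager-healthcheck",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def run_check(alertmanager_url: str, webhook_url: str) -> int:
    """Return 0 if healthy; otherwise notify Discord and return 1."""
    if alertmanager_is_healthy(alertmanager_url):
        return 0
    notify_discord(webhook_url, build_discord_payload(alertmanager_url))
    return 1
```

A systemd timer or cronjob could call `run_check` with the two URLs, for example read from environment variables. Note that this sketch sends a message on every failed run, so it would repeat until the AlertManager recovers.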

@jchristgit
Member Author

To clarify, following a discussion on Discord: this is about adding a "dead
man's switch" alert that will route to Discord in case the Netcup Prometheus
instance can't contact the Alertmanager in Kubernetes properly. To cover this
case we want to:

  • add alerts in Prometheus in case it cannot talk to Alertmanager properly
    (there are built-in metrics for this exported by Prometheus)
  • add a local alertmanager configured to send alerts to Discord
  • configure the local alertmanager & Prometheus instances such that only the
    newly added alerts from above are routed to the local alertmanager and
    everything else is still routed to the Kubernetes alertmanager.
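
As a rough sketch of the three steps above, the configuration fragments could be generated like this. It assumes a Prometheus version that supports per-Alertmanager `alert_relabel_configs` and an Alertmanager version that ships the `discord_configs` receiver. The alert names, thresholds, durations and target addresses are hypothetical, and authentication towards the Kubernetes alertmanager is left out.

```python
import json

# Hypothetical alert names for the dead man's switch rules.
LOCAL_ALERTS = ["AlertmanagerSendErrors", "AlertmanagerNotificationsDropped"]


def build_rule_group() -> dict:
    """Alerting rules based on Prometheus' own notification metrics."""
    return {
        "groups": [
            {
                "name": "alertmanager-connectivity",
                "rules": [
                    {
                        "alert": LOCAL_ALERTS[0],
                        # Placeholder threshold: any send error in 5 minutes.
                        "expr": "rate(prometheus_notifications_errors_total[5m]) > 0",
                        "for": "10m",
                    },
                    {
                        "alert": LOCAL_ALERTS[1],
                        "expr": "rate(prometheus_notifications_dropped_total[5m]) > 0",
                        "for": "10m",
                    },
                ],
            }
        ]
    }


def build_alerting_section(local_target: str, kubernetes_target: str) -> dict:
    """Send only LOCAL_ALERTS to the local instance, the rest to Kubernetes."""
    # Relabel regexes are fully anchored, so this matches whole alert names.
    pattern = "|".join(LOCAL_ALERTS)

    def relabel(action: str) -> list:
        return [{"source_labels": ["alertname"], "regex": pattern, "action": action}]

    return {
        "alerting": {
            "alertmanagers": [
                {
                    "static_configs": [{"targets": [local_target]}],
                    "alert_relabel_configs": relabel("keep"),
                },
                {
                    "static_configs": [{"targets": [kubernetes_target]}],
                    "alert_relabel_configs": relabel("drop"),
                },
            ]
        }
    }


def build_local_alertmanager_config(webhook_url: str) -> dict:
    """Minimal local alertmanager config with a single Discord receiver."""
    return {
        "route": {"receiver": "discord"},
        "receivers": [
            {"name": "discord", "discord_configs": [{"webhook_url": webhook_url}]}
        ],
    }


def render(section: dict) -> str:
    """Render as JSON, which YAML parsers also accept."""
    return json.dumps(section, indent=2)
```

Calling `render(build_alerting_section("localhost:9093", "alertmanager.example:443"))` prints a fragment that could be merged into `prometheus.yml`. In an Ansible role the same structure would more likely live directly in a YAML template.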

@jchristgit jchristgit changed the title Set up Prometheus alerting on Discord in Ansible Netcup Prometheus alerting for unreachable Alertmanagers May 1, 2024
@jchristgit jchristgit moved this from Up next to Backlog in Infrastructure Jul 24, 2024