Bug 1861455: Add basic HAProxy alert rules for HAProxy status and Reload failures #397
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| apiVersion: monitoring.coreos.com/v1 | ||
| kind: PrometheusRule | ||
| metadata: | ||
| name: ingress-operator | ||
| namespace: openshift-ingress-operator | ||
| labels: | ||
| role: alert-rules | ||
| spec: | ||
| groups: | ||
| - name: openshift-ingress.rules | ||
| rules: | ||
| - alert: HAProxyReloadFail | ||
| expr: increase(template_router_reload_fails[5m]) > 0 | ||
| for: 5m | ||
| labels: | ||
| severity: critical | ||
| annotations: | ||
| message: "HAProxy reloads have failed on {{ $labels.pod }}. Router is not respecting recently created or modified routes" | ||
| - alert: HAProxyDown | ||
| expr: haproxy_up == 0 | ||
| for: 5m | ||
| labels: | ||
| severity: critical | ||
Maybe let's add a warning if more are up than down

@RiRa12621 the default ingress controller runs 2 HAProxy instances. With that in mind, is it not a good idea to alert when any HAProxy instances report
| annotations: | ||
| message: "HAProxy metrics are reporting that the router is down" | ||
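The inline thread on `HAProxyDown` weighs a warning when most instances are still up against an alert as soon as any instance is down. As a rough sketch only, and not part of this PR, the two ideas could be combined as a pair of rules. The alert names `HAProxyPartiallyDown` and `HAProxyMostlyDown` are hypothetical, and the sketch assumes `haproxy_up` is a gauge that only takes the values 0 and 1 and that aggregating over every router pod in the cluster is acceptable.

```yaml
# Hypothetical sketch, not the rules from this PR.
# sum(haproxy_up) is the number of instances reporting up;
# count(haproxy_up) is the total number of instances being scraped.
- alert: HAProxyPartiallyDown
  # At least one instance is down, but strictly more than half are still up.
  expr: sum(haproxy_up) < count(haproxy_up) and sum(haproxy_up) > count(haproxy_up) / 2
  for: 5m
  labels:
    severity: warning
  annotations:
    message: "Some HAProxy instances are reporting down, but most are still up"
- alert: HAProxyMostlyDown
  # Half or more of the instances are down.
  expr: sum(haproxy_up) <= count(haproxy_up) / 2
  for: 5m
  labels:
    severity: critical
  annotations:
    message: "Half or more of the HAProxy instances are reporting down"
```

With the default of 2 instances mentioned in the reply, one instance down gives `sum(haproxy_up) = 1` and `count(haproxy_up) / 2 = 1`, so only the critical rule matches. The warning tier can only fire when 3 or more instances are running.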
How often does it reload? Some failures every 5 minutes still seem OK to me.
If we make it alert on the first failure, I'm afraid it will be noisy.
HAProxy reloads often. How can we change this expression to not alert on the first failure?
The problem is, if HAProxy fails to reload, no successive reloads will succeed and the router will become "wedged". So I think we want to alert on the first reload failure. Thoughts?
openshift/router#190 needs to be back-ported to 4.5 to reduce the reload alert noise.
Oh, so if it fails once, every subsequent reload is going to fail too?
Ideally yes.
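The trade-off discussed in this thread can be sketched as two alternative rules. Both assume, as the use of `increase()` in the diff implies, that `template_router_reload_fails` is a counter. The alert name `HAProxyReloadFailRepeated`, its threshold, and its severity are hypothetical and only illustrate the "do not alert on the first failure" option.

```yaml
# Sketch only: pick one variant, not both.

# Variant A (the expression proposed in this PR):
# any reload failure inside the 5 minute window satisfies the expression.
- alert: HAProxyReloadFail
  expr: increase(template_router_reload_fails[5m]) > 0
  for: 5m
  labels:
    severity: critical

# Variant B (hypothetical): tolerate a single failure inside the window.
# increase() extrapolates between samples, so its result can be fractional;
# "> 1" therefore means roughly "more than one failure", not exactly two.
- alert: HAProxyReloadFailRepeated
  expr: increase(template_router_reload_fails[5m]) > 1
  for: 5m
  labels:
    severity: warning
```

If a single failed reload really does wedge the router as described above, variant A matches that behaviour most directly. Variant B would mainly be of interest if reloads could recover on their own.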