
Conversation

@sgreene570
Contributor

@sgreene570 sgreene570 commented May 13, 2020

Adds 2 alerts for existing HAProxy metrics; a rough sketch of both rules follows the list below.

  • HAProxyReloadFail
    Fire an alert when the new template_router_reload_fail counter metric increments
  • HAProxyDown
    Fire an alert when haproxy_up is reporting 0
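A minimal sketch of the two rules as described, assuming the metric names quoted in the review below (the window, duration, and the warning severity for the reload alert are illustrative, not final values from this PR):

groups:
- name: openshift-ingress.rules
  rules:
  - alert: HAProxyReloadFail
    # fires when the reload-failure counter has incremented recently
    expr: increase(template_router_reload_fails[5m]) > 0
    labels:
      severity: warning
  - alert: HAProxyDown
    # fires when an HAProxy instance reports itself as down
    expr: haproxy_up == 0
    for: 5m
    labels:
      severity: critical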

@sgreene570 sgreene570 changed the title Add HAProxy alert rules NE-168: Add HAProxy alert rules May 13, 2020
@Miciah
Contributor

Miciah commented May 13, 2020

The verify job is failing; looks like make generate is needed.

@sgreene570 sgreene570 force-pushed the add-alerts branch 2 times, most recently from 0c5ec14 to ef4ad2c Compare May 14, 2020 17:07
@sgreene570
Contributor Author

Dropped HAProxyServerConnectionErrors alert in favor of HAProxyServerChecksFailing alert.
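The exact expression is not shown in this thread; as a rough sketch, a server-check alert along these lines might build on the exporter's per-server check-failure counter (the metric name, window, and threshold here are assumptions):

- alert: HAProxyServerChecksFailing
  # assumed metric; fires when any backend server health check has failed recently
  expr: increase(haproxy_server_check_failures_total[5m]) > 0
  for: 5m
  labels:
    severity: warning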

@sgreene570
Contributor Author

Feedback from Miciah:

I think alerts should focus on infrastructure components that are broken (that is, things in namespaces that start with openshift-), or things that could break applications, but not applications themselves. Some cluster administrators will be very annoyed if they are alerted because some user's throwaway test application is down, and we don't have a way to determine which routes belong to production workloads that the cluster administrator cares about and which do not, so I think we need to leave it to the user to configure alerts for non-infrastructure components. Let's get the rest of the team's (and possibly some other teams') thoughts on this.

@sgreene570 sgreene570 changed the title NE-168: Add HAProxy alert rules [WIP] NE-168: Add HAProxy alert rules Jul 22, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 22, 2020
groups:
- name: openshift-ingress.rules
  rules:
  - alert: HAProxyHighRoute5xxResp


I would expect every OCP component that creates a route out of the box to create its own alerting rules like that.
There are some reasons for this:

  • that rule does not tell us anything about the health of the router
  • components can have alerts defined based on their individual SLOs, and those rules can be more sophisticated because they can be more specific (a rough sketch of such a component-owned rule follows)
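As a rough sketch of what a component-owned rule could look like (the alert name, route label value, metric name, and threshold are assumptions for illustration only):

- alert: MyComponentRoute5xxHigh
  # hypothetical rule owned by the component behind the "my-component" route
  expr: |
    sum(rate(haproxy_server_http_responses_total{route="my-component",code="5xx"}[5m]))
      /
    sum(rate(haproxy_server_http_responses_total{route="my-component"}[5m]))
      > 0.05
  for: 10m
  labels:
    severity: warning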

Contributor Author


Right, now that I think about it, this would make much more sense.

@RiRa12621

we probably want something like

- name: router_slos
  rules:
  - record: router:error_slo:percent
    labels:
      job: router-internal-default
    expr: 0.1

- name: router_multiwindow_recording_rules
  rules:
  - record: router:haproxy_csv_fetch_errors:ratio_rate5m
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[5m])
      /
        rate(haproxy_exporter_total_scrapes[5m])
  - record: router:haproxy_csv_fetch_errors:ratio_rate30m
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[30m])
      /
        rate(haproxy_exporter_total_scrapes[30m])
  - record: router:haproxy_csv_fetch_errors:ratio_rate1h
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[1h])
      /
        rate(haproxy_exporter_total_scrapes[1h])
  - record: router:haproxy_csv_fetch_errors:ratio_rate2h
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[2h])
      /
        rate(haproxy_exporter_total_scrapes[2h])
  - record: router:haproxy_csv_fetch_errors:ratio_rate6h
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[6h])
      /
        rate(haproxy_exporter_total_scrapes[6h])
  - record: router:haproxy_csv_fetch_errors:ratio_rate1d
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[1d])
      /
        rate(haproxy_exporter_total_scrapes[1d])
  - record: router:haproxy_csv_fetch_errors:ratio_rate3d
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[3d])
      /
        rate(haproxy_exporter_total_scrapes[3d])

- name: router_multiwindow_alerts
  rules:
  - alert: ErrorBudgetBurn
    expr: |2
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate1h
        > on (job)
          14.4 * router:error_slo:percent
        )
      and
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate5m
        > on (job)
          14.4 * router:error_slo:percent
        )
    for: 2m
    labels:
      system: "{{$labels.job}}"
      severity: "critical"
      long_window: "1h"
    annotations:
      summary: "a router burns its error budget very fast"
      description: "Router {{$labels.job}} has returned {{ $value | printf `%.2f` }}% errors over the last hour."
  - alert: ErrorBudgetBurn
    expr: |2
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate6h
        > on (job)
          6 * router:error_slo:percent
        )
      and
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate30m
        > on (job)
          6 * router:error_slo:percent
        )
    for: 15m
    labels:
      system: "{{$labels.job}}"
      severity: "critical"
      long_window: "6h"
    annotations:
      summary: "a router burns its error budget very fast"
      description: "Router {{$labels.job}} has returned {{ $value | printf `%.2f` }}% errors over the 6 hours."
  - alert: ErrorBudgetBurn
    expr: |2
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate1d
        > on (job)
          3 * router:error_slo:percent
        )
      and
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate2h
        > on (job)
          3 * router:error_slo:percent
        )
    for: 1h
    labels:
      system: "{{$labels.job}}"
      severity: "warning"
      long_window: "1d"
    annotations:
      summary: "a router burns its error budget very fast"
      description: "Router {{$labels.job}} has returned {{ $value | printf `%.2f` }}% errors over the last day."
  - alert: ErrorBudgetBurn
    expr: |2
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate3d
        > on (job)
          1 * router:error_slo:percent
        )
      and
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate6h
        > on (job)
          1 * router:error_slo:percent
        )
    for: 1h
    labels:
      system: "{{$labels.job}}"
      severity: "warning"
      long_window: "3d"
    annotations:
      summary: "a router burns its error budget very fast"
      description: "Router {{$labels.job}} has returned {{ $value | printf `%.2f` }}% errors over the last 3 days."

once the metrics from https://bugzilla.redhat.com/show_bug.cgi?id=1861455 are fixed
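These thresholds follow the usual multiwindow, multi-burn-rate pattern; assuming a 30-day SLO window (not stated above), each factor is roughly the fraction of error budget consumed divided by the fraction of the window elapsed:

# assuming a 30-day SLO window:
# 14.4 ≈ 2%  of budget in 1 hour  -> 0.02 * (30 * 24) / 1
#  6   ≈ 5%  of budget in 6 hours -> 0.05 * (30 * 24) / 6
#  3   ≈ 10% of budget in 1 day   -> 0.10 * 30 / 1
#  1   ≈ 10% of budget in 3 days  -> 0.10 * 30 / 3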

@sgreene570
Contributor Author

once the metrics from https://bugzilla.redhat.com/show_bug.cgi?id=1861455 are fixed

@RiRa12621 sounds good. Let's reconvene after that BZ is fixed. If we later find that this PR doesn't have any useful alerts, we can scrap it and open a new one then.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 10, 2020
@sgreene570 sgreene570 changed the title [WIP] NE-168: Add HAProxy alert rules Bug 1861455: Add basic HAProxy alert rules for HAProxy status and Reload failures Aug 10, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Aug 10, 2020
@openshift-ci-robot
Contributor

@sgreene570: This pull request references Bugzilla bug 1861455, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, ON_DEV, POST, POST, but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Bug 1861455: Add basic HAProxy alert rules for HAProxy status and Reload failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment

@sgreene570
Contributor Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 10, 2020
@openshift-ci-robot
Contributor

@sgreene570: This pull request references Bugzilla bug 1861455, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
Details

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 10, 2020
@sgreene570
Contributor Author

@RiRa12621 I updated this PR to add 2 basic alerts in response to BZ 1861455. Could I get your feedback on the new alerts when you get the chance?

@Miciah
Contributor

Miciah commented Aug 11, 2020

👍
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 11, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah, sgreene570

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sgreene570
Contributor Author

/retest

@Miciah
Contributor

Miciah commented Aug 11, 2020

TestInternalLoadBalancer timed out in the last e2e-aws-operator test run, but the ingress operator's logs are empty, kcm's logs appear to be incomplete, and I don't see any events that could explain the failure.
/test e2e-aws-operator

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 8aa1ce2 into openshift:master Aug 11, 2020
@openshift-ci-robot
Contributor

@sgreene570: All pull requests linked via external trackers have merged: openshift/router#165, openshift/cluster-ingress-operator#397. Bugzilla bug 1861455 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1861455: Add basic HAProxy alert rules for HAProxy status and Reload failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sgreene570
Contributor Author

/cherry-pick release-4.5

@openshift-cherrypick-robot

@sgreene570: new pull request created: #440

Details

In response to this:

/cherry-pick release-4.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@RiRa12621

Turns out I forgot to hit "publish" on my comments, @sgreene570.

- name: openshift-ingress.rules
  rules:
  - alert: HAProxyReloadFail
    expr: increase(template_router_reload_fails[5m]) > 0


How often does it reload? An occasional failure every 5 minutes still seems OK to me.
If we make it alert on the first failure, I'm afraid it will be noisy.

Contributor Author


HAProxy reloads often. How can we change this expression to not alert on the first failure?
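For illustration only (neither variant is from this PR), two common ways to make an increase()-based alert less sensitive are raising the threshold or requiring the condition to persist; the values below are assumptions:

# Option A: require more than one failed reload in the window
- alert: HAProxyReloadFail
  expr: increase(template_router_reload_fails[5m]) > 1

# Option B: keep the original expression but require it to hold for a while
- alert: HAProxyReloadFail
  expr: increase(template_router_reload_fails[5m]) > 0
  for: 10m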

Contributor Author


The problem is, if HAProxy fails to reload, no subsequent reloads will succeed and the router will become "wedged". So I think we want to alert on the first reload failure. Thoughts?

Contributor Author


openshift/router#190 needs to be back-ported to 4.5 to reduce the reload alert noise.


Oh, so if it fails once, it's going to keep failing?

Contributor Author


Ideally yes.

    expr: haproxy_up == 0
    for: 5m
    labels:
      severity: critical


Maybe let's add a warning if more are up than down
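A rough sketch of what such a warning could look like, assuming haproxy_up exposes one series per HAProxy instance (the alert name, grouping, and 5m duration are assumptions):

- alert: HAProxyPartiallyDown
  # some instances report down, but more report up than down
  expr: (count(haproxy_up == 0) > 0) and (count(haproxy_up == 1) > count(haproxy_up == 0))
  for: 5m
  labels:
    severity: warning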

Contributor Author


@RiRa12621 The default ingress controller runs 2 HAProxy instances. With that in mind, isn't it a good idea to alert when any HAProxy instance reports haproxy_up == 0?
