
Conversation

@sgreene570
Contributor

@sgreene570 sgreene570 commented May 13, 2020

Adds 2 alerts for existing HAProxy metrics; a rough sketch of both rules follows the list below.

  • HAProxyReloadFail
    Fire an alert when the new template_router_reload_fail counter metric increments
  • HAProxyDown
    Fire an alert when haproxy_up is reporting 0
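A minimal sketch of the two rules as described, assuming the metric names quoted in the review below (the window, duration, and the warning severity for the reload alert are illustrative, not final values from this PR):

groups:
- name: openshift-ingress.rules
  rules:
  - alert: HAProxyReloadFail
    # fires when the reload-failure counter has incremented recently
    expr: increase(template_router_reload_fails[5m]) > 0
    labels:
      severity: warning
  - alert: HAProxyDown
    # fires when an HAProxy instance reports itself as down
    expr: haproxy_up == 0
    for: 5m
    labels:
      severity: critical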

@sgreene570 sgreene570 changed the title Add HAProxy alert rules NE-168: Add HAProxy alert rules May 13, 2020
@Miciah
Contributor

Miciah commented May 13, 2020

The verify job is failing; looks like make generate is needed.

@sgreene570 sgreene570 force-pushed the add-alerts branch 2 times, most recently from 0c5ec14 to ef4ad2c Compare May 14, 2020 17:07
@sgreene570
Contributor Author

Dropped HAProxyServerConnectionErrors alert in favor of HAProxyServerChecksFailing alert.
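The exact expression is not shown in this thread; as a rough sketch, a server-check alert along these lines might build on the exporter's per-server check-failure counter (the metric name, window, and threshold here are assumptions):

- alert: HAProxyServerChecksFailing
  # assumed metric; fires when any backend server health check has failed recently
  expr: increase(haproxy_server_check_failures_total[5m]) > 0
  for: 5m
  labels:
    severity: warning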

@sgreene570
Contributor Author

Feedback from Miciah:

I think alerts should focus on infrastructure components that are broken (that is, things in namespaces that start with openshift-), or things that could break applications, but not applications themselves. Some cluster administrators will be very annoyed if they are alerted because some user's throwaway test application is down, and we don't have a way to determine which routes belong to production workloads that the cluster administrator cares about and which do not, so I think we need to leave it to the user to configure alerts for non-infrastructure components. Let's get the rest of the team's (and possibly some other teams') thoughts on this.

@sgreene570 sgreene570 changed the title NE-168: Add HAProxy alert rules [WIP] NE-168: Add HAProxy alert rules Jul 22, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 22, 2020
groups:
- name: openshift-ingress.rules
  rules:
  - alert: HAProxyHighRoute5xxResp


I would expect every OCP component that creates a route out of the box to create its own alerting rules like that.
There are some reasons for this:

  • that rule does not tell us anything about the health of the router
  • components can have alerts defined based on their individual SLOs, and those rules can be more sophisticated because they can be more specific (a rough sketch of such a component-owned rule follows)
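As a rough sketch of what a component-owned rule could look like (the alert name, route label value, metric name, and threshold are assumptions for illustration only):

- alert: MyComponentRoute5xxHigh
  # hypothetical rule owned by the component behind the "my-component" route
  expr: |
    sum(rate(haproxy_server_http_responses_total{route="my-component",code="5xx"}[5m]))
      /
    sum(rate(haproxy_server_http_responses_total{route="my-component"}[5m]))
      > 0.05
  for: 10m
  labels:
    severity: warning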

Contributor Author


Right, now that I think about it, this would make much more sense.

@RiRa12621

we probably want something like

- name: router_slos
  rules:
  - record: router:error_slo:percent
    labels:
      job: router-internal-default
    expr: 0.1

- name: router_multiwindow_recording_rules
  rules:
  - record: router:haproxy_csv_fetch_errors:ratio_rate5m
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[5m])
      /
        rate(haproxy_exporter_total_scrapes[5m])
  - record: router:haproxy_csv_fetch_errors:ratio_rate30m
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[30m])
      /
        rate(haproxy_exporter_total_scrapes[30m])
  - record: router:haproxy_csv_fetch_errors:ratio_rate1h
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[1h])
      /
        rate(haproxy_exporter_total_scrapes[1h])
  - record: router:haproxy_csv_fetch_errors:ratio_rate2h
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[2h])
      /
        rate(haproxy_exporter_total_scrapes[2h])
  - record: router:haproxy_csv_fetch_errors:ratio_rate6h
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[6h])
      /
        rate(haproxy_exporter_total_scrapes[6h])
  - record: router:haproxy_csv_fetch_errors:ratio_rate1d
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[1d])
      /
        rate(haproxy_exporter_total_scrapes[1d])
  - record: router:haproxy_csv_fetch_errors:ratio_rate3d
    expr: |2
        rate(haproxy_exporter_csv_parse_failure[3d])
      /
        rate(haproxy_exporter_total_scrapes[3d])

- name: router_multiwindow_alerts
  rules:
  - alert: ErrorBudgetBurn
    expr: |2
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate1h
        > on (job)
          14.4 * router:error_slo:percent
        )
      and
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate5m
        > on (job)
          14.4 * router:error_slo:percent
        )
    for: 2m
    labels:
      system: "{{$labels.job}}"
      severity: "critical"
      long_window: "1h"
    annotations:
      summary: "a router burns its error budget very fast"
      description: "Router {{$labels.job}} has returned {{ $value | printf `%.2f` }}% errors over the last hour."
  - alert: ErrorBudgetBurn
    expr: |2
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate6h
        > on (job)
          6 * router:error_slo:percent
        )
      and
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate30m
        > on (job)
          6 * router:error_slo:percent
        )
    for: 15m
    labels:
      system: "{{$labels.job}}"
      severity: "critical"
      long_window: "6h"
    annotations:
      summary: "a router burns its error budget very fast"
      description: "Router {{$labels.job}} has returned {{ $value | printf `%.2f` }}% errors over the 6 hours."
  - alert: ErrorBudgetBurn
    expr: |2
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate1d
        > on (job)
          3 * router:error_slo:percent
        )
      and
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate2h
        > on (job)
          3 * router:error_slo:percent
        )
    for: 1h
    labels:
      system: "{{$labels.job}}"
      severity: "warning"
      long_window: "1d"
    annotations:
      summary: "a router burns its error budget very fast"
      description: "Router {{$labels.job}} has returned {{ $value | printf `%.2f` }}% errors over the last day."
  - alert: ErrorBudgetBurn
    expr: |2
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate3d
        > on (job)
          1 * router:error_slo:percent
        )
      and
        (
          100 * router:haproxy_csv_fetch_errors:ratio_rate6h
        > on (job)
          1 * router:error_slo:percent
        )
    for: 1h
    labels:
      system: "{{$labels.job}}"
      severity: "warning"
      long_window: "3d"
    annotations:
      summary: "a router burns its error budget very fast"
      description: "Router {{$labels.job}} has returned {{ $value | printf `%.2f` }}% errors over the last 3 days."

once the metrics from https://bugzilla.redhat.com/show_bug.cgi?id=1861455 are fixed
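These thresholds follow the usual multiwindow, multi-burn-rate pattern; assuming a 30-day SLO window (not stated above), each factor is roughly the fraction of error budget consumed divided by the fraction of the window elapsed:

# assuming a 30-day SLO window:
# 14.4 ≈ 2%  of budget in 1 hour  -> 0.02 * (30 * 24) / 1
#  6   ≈ 5%  of budget in 6 hours -> 0.05 * (30 * 24) / 6
#  3   ≈ 10% of budget in 1 day   -> 0.10 * 30 / 1
#  1   ≈ 10% of budget in 3 days  -> 0.10 * 30 / 3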

@sgreene570
Contributor Author

once the metrics from https://bugzilla.redhat.com/show_bug.cgi?id=1861455 are fixed

@RiRa12621 sounds good. Let's reconvene after that BZ is fixed. If we later find that this PR doesn't have any useful alerts, we can scrap it and open a new one then.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 10, 2020
@sgreene570 sgreene570 changed the title [WIP] NE-168: Add HAProxy alert rules Bug 1861455: Add basic HAProxy alert rules for HAProxy status and Reload failures Aug 10, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Aug 10, 2020
@openshift-ci-robot
Contributor

@sgreene570: This pull request references Bugzilla bug 1861455, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, ON_DEV, POST, POST, but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Bug 1861455: Add basic HAProxy alert rules for HAProxy status and Reload failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment

@sgreene570
Contributor Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 10, 2020
@openshift-ci-robot
Contributor

@sgreene570: This pull request references Bugzilla bug 1861455, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
Details

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 10, 2020
@sgreene570
Contributor Author

@RiRa12621 I updated this PR to add 2 basic alerts in response to BZ 1861455. Could I get your feedback on the new alerts when you get the chance?

@Miciah
Contributor

Miciah commented Aug 11, 2020

👍
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 11, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah, sgreene570

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sgreene570
Contributor Author

/retest

@Miciah
Contributor

Miciah commented Aug 11, 2020

TestInternalLoadBalancer timed out in the last e2e-aws-operator test run, but the ingress operator's logs are empty, kcm's logs appear to be incomplete, and I don't see any events that could explain the failure.
/test e2e-aws-operator

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 8aa1ce2 into openshift:master Aug 11, 2020
@openshift-ci-robot
Contributor

@sgreene570: All pull requests linked via external trackers have merged: openshift/router#165, openshift/cluster-ingress-operator#397. Bugzilla bug 1861455 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1861455: Add basic HAProxy alert rules for HAProxy status and Reload failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sgreene570
Contributor Author

/cherry-pick release-4.5

@openshift-cherrypick-robot

@sgreene570: new pull request created: #440

Details

In response to this:

/cherry-pick release-4.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@RiRa12621

Turns out I forgot to hit "publish" on my comments, @sgreene570.

- name: openshift-ingress.rules
  rules:
  - alert: HAProxyReloadFail
    expr: increase(template_router_reload_fails[5m]) > 0


How often does it reload? An occasional failure every 5 minutes still seems OK to me.
If we make it alert on the first failure, I'm afraid it will be noisy.

Contributor Author


HAProxy reloads often. How can we change this expression to not alert on the first failure?
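For illustration only (neither variant is from this PR), two common ways to make an increase()-based alert less sensitive are raising the threshold or requiring the condition to persist; the values below are assumptions:

# Option A: require more than one failed reload in the window
- alert: HAProxyReloadFail
  expr: increase(template_router_reload_fails[5m]) > 1

# Option B: keep the original expression but require it to hold for a while
- alert: HAProxyReloadFail
  expr: increase(template_router_reload_fails[5m]) > 0
  for: 10m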

Contributor Author


The problem is, if HAProxy fails to reload, no subsequent reloads will succeed and the router will become "wedged". So I think we want to alert on the first reload failure. Thoughts?

Contributor Author


openshift/router#190 needs to be back-ported to 4.5 to reduce the reload alert noise.


Oh, so if it fails once, it's going to keep failing?

Contributor Author


Ideally yes.

    expr: haproxy_up == 0
    for: 5m
    labels:
      severity: critical


Maybe let's add a warning if more are up than down
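A rough sketch of what such a warning could look like, assuming haproxy_up exposes one series per HAProxy instance (the alert name, grouping, and 5m duration are assumptions):

- alert: HAProxyPartiallyDown
  # some instances report down, but more report up than down
  expr: (count(haproxy_up == 0) > 0) and (count(haproxy_up == 1) > count(haproxy_up == 0))
  for: 5m
  labels:
    severity: warning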

Contributor Author


@RiRa12621 The default ingress controller runs 2 HAProxy instances. With that in mind, isn't it a good idea to alert when any HAProxy instance reports haproxy_up == 0?
