Skip to content

Bug 1953798: Add apiserver burn rate SLO alerts#1126

Merged
openshift-merge-robot merged 2 commits intoopenshift:masterfrom
ihcsim:apiserver-errorbudgetburn-alert
Jun 1, 2021
Merged

Bug 1953798: Add apiserver burn rate SLO alerts#1126
openshift-merge-robot merged 2 commits intoopenshift:masterfrom
ihcsim:apiserver-errorbudgetburn-alert

Conversation

@ihcsim
Copy link
Copy Markdown

@ihcsim ihcsim commented May 4, 2021

This PR copies the upstream KubeAPIErrorBudgetBurn alerts and its recording
rules to CKAO. It updates the rules to:

  1. Exclude long-running requests to subresources likeproxy, logs, exec etc.,
    from the error budget calculation. As described in
    https://bugzilla.redhat.com/show_bug.cgi?id=1953798#c1, these long-running
    GET, LIST requests are causing flakiness in the e2e parallel and upgrade
    jobs, due to slow resource reads.
  2. Separate the request latency calculation by the request scope (i.e. resource,
    namespace, cluster).

Subsequent PRs to fix other parts of the e2e test suites will follow, if the changes
in this PR don't fully eliminate the flakiness.

As part of the work to complete 1953798, a separate PR will also be submitted to
CMO to remove (disable, mask etc.) the same alerts.

The PR to remove the same alerts from CMO is at
openshift/cluster-monitoring-operator#1164.

Signed-off-by: Ivan Sim isim@redhat.com

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 4, 2021
@openshift-ci-robot
Copy link
Copy Markdown

@ihcsim: This pull request references Bugzilla bug 1953798, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

Details

In response to this:

Bug 1953798: Add apiserver burn rate SLO alerts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ihcsim
To complete the pull request process, please assign soltysh after the PR has been reviewed.
You can assign the PR to them by writing /assign @soltysh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread bindata/v4.1.0/alerts/kube-apiserver-slos.yaml Outdated
Comment thread bindata/v4.1.0/alerts/kube-apiserver-slos.yaml Outdated
Comment thread bindata/v4.1.0/alerts/kube-apiserver-slos.yaml Outdated
Comment thread bindata/v4.1.0/alerts/kube-apiserver-slos.yaml Outdated
@openshift-ci-robot
Copy link
Copy Markdown

@ihcsim: This pull request references Bugzilla bug 1953798, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

Details

In response to this:

Bug 1953798: Add apiserver burn rate SLO alerts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ihcsim
Copy link
Copy Markdown
Author

ihcsim commented May 5, 2021

@smarterclayton Per slack convo, I replaced the old recording rules with the ones defined in the BZ, and excluded the subresources from the resource-scoped calculation. The results of the new and old rules look comparable when tested on my dev cluster, which makes me think that we can replace the old rules entirely. The new burn-type groupings also make things easier to reason about.

The trend of the aggregation rates looks almost identical at a lower time range. E.g. with [30m]:

new-recording-rules-30m

old-recording-rules-30m

At a higher time range, even though the shape of the charts looks different (due to the groupings), the values still fall closely within the same range. E.g., with [1h]:

new-recording-rules-1d

old-recording-rules-1d

Comment thread bindata/v4.1.0/alerts/kube-apiserver-slos.yaml Outdated
rules:
- expr: |
# error
label_replace(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether I recommended a bad factoring here - it's possible that the rules with / and label matching this doesn't compose like it would in normal math rules.

Copy link
Copy Markdown
Contributor

@smarterclayton smarterclayton May 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I I think I figured out what the confusion was (your code is right). I was looking for the highest rate of slowest request query which roughly looks like:

        sum without (instance) (rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope=~"resource|"}[30m]))

        - (
          sum without (instance,le) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30m]))
          or
          vector(0)
        )

The graph of this shows "biggest contribution to slow resource requests"



{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="apiserver.openshift.io", job="apiserver", namespace="default", resource="apirequestcounts", scope="resource", service="kubernetes", verb="GET", version="v1"} | 3.1326662908397314
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="serviceaccounts", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.9384180790960457
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="pods", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.7807909604519772
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="configmaps", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.7785310734463291
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="secrets", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.507344632768362
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="namespaces", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.4881355932203384
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="pods", scope="resource", service="kubernetes", subresource="log", verb="GET", version="v1"} | 0.42126925907526447
{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="rbac.authorization.k8s.io", job="apiserver", namespace="default", resource="clusterrolebindings", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.40677966101694807
{apiserver="kube-apiserver", endpoint="https", job="apiserver", namespace="default", service="kubernetes", subresource="/readyz", verb="GET"} | 0.3903954802259868
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="services", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.17457627118644137
{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="apps", job="apiserver", namespace="default", resource="deployments", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.17344632768361556
{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="rbac.authorization.k8s.io", job="apiserver", namespace="default", resource="rolebindings", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.167796610169491
{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="apiextensions.k8s.io", job="apiserver", namespace="default", resource="customresourcedefinitions", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.15932203389830524
{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="rbac.authorization.k8s.io", job="apiserver", namespace="default", resource="clusterroles", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.12655367231638337

So it looks like apirequestcounts is the biggest contributor

sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",subresource!~"proxy|log|exec",scope=~"resource|"}[5m]))
-
(sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",subresource!~"proxy|log|exec",scope=~"resource|",le="0.1"}[5m])) or vector(0))
) / scalar(sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",subresource!~"proxy|log|exec"}[5m])))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem here is that if anyone ever adds a new scope this would be off. I think it's ok for now (because scope covers all the possible values), but we'd need to be aware of that.

Comment thread bindata/v4.1.0/alerts/kube-apiserver-slos.yaml Outdated
@openshift-ci openshift-ci Bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 13, 2021
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 17, 2021

@ihcsim: This pull request references Bugzilla bug 1953798, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

Details

In response to this:

Bug 1953798: Add apiserver burn rate SLO alerts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

(
(
# too slow
sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h]))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't exclude imageimport? I guess we can handle this separately later.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I added these write rules as-is from upstream. Do we foresee image resources contribute to significant burns? Don't image imports get auto-pruned anyway?

@smarterclayton
Copy link
Copy Markdown
Contributor

The read metrics look ok. We can defer improving the write ones to later since they look correct today.

@smarterclayton
Copy link
Copy Markdown
Contributor

Can you link the PR to remove the old alerts here?

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 17, 2021

@ihcsim: This pull request references Bugzilla bug 1953798, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

Details

In response to this:

Bug 1953798: Add apiserver burn rate SLO alerts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ihcsim
Copy link
Copy Markdown
Author

ihcsim commented May 17, 2021

The PR to remove the same alerts from CMO is at openshift/cluster-monitoring-operator#1164.

@ihcsim
Copy link
Copy Markdown
Author

ihcsim commented May 26, 2021

/retest

@smarterclayton
Copy link
Copy Markdown
Contributor

/approve
/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 26, 2021
@smarterclayton
Copy link
Copy Markdown
Contributor

/test all

This is looking good - while we've talked about removing the alert, I'm tempted to say the current alert is sufficient to merge and then we can remove prior to 4.8.

@ihcsim
Copy link
Copy Markdown
Author

ihcsim commented May 30, 2021

/retest

Ivan Sim added 2 commits May 31, 2021 09:38
These alerts and recording rules are transferred from the monitoring stack to
CKAO for easier management. See
openshift/cluster-monitoring-operator#1164

Signed-off-by: Ivan Sim <isim@redhat.com>
The new recording rules exclude long-running requests to subresources
from the resource-scoped and namespace-scoped requests. They also break
down slow request distributions by scopes.

Signed-off-by: Ivan Sim <isim@redhat.com>
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 31, 2021
@ihcsim
Copy link
Copy Markdown
Author

ihcsim commented May 31, 2021

@sttts Per slack convo, I separated the PR into two commits, to make future tracking easier. The first commit has the original alerts and rules, and the second has the new rules.

@sttts
Copy link
Copy Markdown
Contributor

sttts commented May 31, 2021

Factoring looks good. With @smarterclayton's lgtm #1126 (comment) for the actual alert expressions.

/lgtm
/approve

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 31, 2021
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 31, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ihcsim, smarterclayton, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 31, 2021
@ihcsim
Copy link
Copy Markdown
Author

ihcsim commented May 31, 2021

/retest

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 31, 2021

@ihcsim: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-operator-encryption-rotation 4f46165 link /test e2e-aws-operator-encryption-rotation
ci/prow/e2e-aws-operator e6727ca link /test e2e-aws-operator

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ihcsim
Copy link
Copy Markdown
Author

ihcsim commented May 31, 2021

/retest

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 1b61ad5 into openshift:master Jun 1, 2021
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 1, 2021

@ihcsim: All pull requests linked via external trackers have merged:

Bugzilla bug 1953798 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1953798: Add apiserver burn rate SLO alerts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants