Bug 1953798: Add apiserver burn rate SLO alerts #1126
openshift-merge-robot merged 2 commits into openshift:master from ihcsim:apiserver-errorbudgetburn-alert
Conversation
|
@ihcsim: This pull request references Bugzilla bug 1953798, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ihcsim. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment. |
|
@smarterclayton Per slack convo, I replaced the old recording rules with the ones defined in the BZ, and excluded the subresources from the resource-scoped calculation. The results of the new and old rules look comparable when tested on my dev cluster, which makes me think we can replace the old rules entirely. The new burn-type groupings also make things easier to reason about. The trend of the aggregation rates looks almost identical at a lower time range (screenshot omitted). At a higher time range, even though the shape of the charts looks different (due to the groupings), the values still fall closely within the same range (screenshot omitted). |
rules:
- expr: |
    # error
    label_replace(
I'm wondering whether I recommended a bad factoring here - it's possible that rules with / and label matching like this don't compose the way normal math rules would.
Ok, I think I figured out what the confusion was (your code is right). I was looking for the highest rate of slowest requests query, which roughly looks like:
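# slow reads: total LIST/GET rate minus the rate of requests completing
# within 100ms (the le="0.1" bucket), per resource/verb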
sum without (instance) (rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope=~"resource|"}[30m]))
- (
sum without (instance,le) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30m]))
or
vector(0)
)
The graph of this shows the "biggest contribution to slow resource requests":
{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="apiserver.openshift.io", job="apiserver", namespace="default", resource="apirequestcounts", scope="resource", service="kubernetes", verb="GET", version="v1"} | 3.1326662908397314
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="serviceaccounts", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.9384180790960457
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="pods", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.7807909604519772
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="configmaps", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.7785310734463291
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="secrets", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.507344632768362
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="namespaces", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.4881355932203384
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="pods", scope="resource", service="kubernetes", subresource="log", verb="GET", version="v1"} | 0.42126925907526447
{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="rbac.authorization.k8s.io", job="apiserver", namespace="default", resource="clusterrolebindings", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.40677966101694807
{apiserver="kube-apiserver", endpoint="https", job="apiserver", namespace="default", service="kubernetes", subresource="/readyz", verb="GET"} | 0.3903954802259868
{apiserver="kube-apiserver", component="apiserver", endpoint="https", job="apiserver", namespace="default", resource="services", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.17457627118644137
{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="apps", job="apiserver", namespace="default", resource="deployments", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.17344632768361556
{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="rbac.authorization.k8s.io", job="apiserver", namespace="default", resource="rolebindings", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.167796610169491
{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="apiextensions.k8s.io", job="apiserver", namespace="default", resource="customresourcedefinitions", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.15932203389830524
{apiserver="kube-apiserver", component="apiserver", endpoint="https", group="rbac.authorization.k8s.io", job="apiserver", namespace="default", resource="clusterroles", scope="resource", service="kubernetes", verb="GET", version="v1"} | 0.12655367231638337
So it looks like apirequestcounts is the biggest contributor.
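A handy variant (just a sketch, not something in this PR) wraps the same expression in topk to surface only the worst offenders:
# hypothetical: top 5 contributors to slow (>100ms) resource-scoped reads
topk(5,
  sum without (instance) (rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",scope=~"resource|"}[30m]))
  - (
    sum without (instance,le) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30m]))
    or
    vector(0)
  )
)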
sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET",subresource!~"proxy|log|exec",scope=~"resource|"}[5m]))
-
(sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",subresource!~"proxy|log|exec",scope=~"resource|",le="0.1"}[5m])) or vector(0))
) / scalar(sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",subresource!~"proxy|log|exec"}[5m])))
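# the ratio above: slow (>100ms) resource-scoped LIST/GET rate as a fraction
# of all (non-proxy/log/exec) LIST/GET traffic -- the read SLO error ratio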
One problem here is that if anyone ever adds a new scope, this would be off. I think it's ok for now (because scope covers all the possible values), but we'd need to be aware of that.
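One way to notice that drift early (a hypothetical spot-check, not part of this PR) is to enumerate the scope values the apiserver currently reports:
# hypothetical: list all scope label values currently being reported
count by (scope) (apiserver_request_duration_seconds_count{job="apiserver"})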
|
(
  (
    # too slow
    sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h]))
This doesn't exclude imageimport? I guess we can handle this separately later.
Yeah, I added these write rules as-is from upstream. Do we foresee image resources contributing to significant burns? Don't image imports get auto-pruned anyway?
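If we do decide to exclude it later, it should just be an extra matcher (a sketch, not in this PR):
# hypothetical: drop imageimport from the write-request count
sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!="imageimport"}[6h]))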
|
The read metrics look ok. We can defer improving the write ones until later, since they look correct today. |
|
Can you link the PR to remove the old alerts here? |
|
The PR to remove the same alerts from CMO is at openshift/cluster-monitoring-operator#1164. |
|
/retest |
|
/approve |
|
/test all
This is looking good - while we've talked about removing the alert, I'm tempted to say the current alert is sufficient to merge, and then we can remove it prior to 4.8. |
|
/retest |
These alerts and recording rules are transferred from the monitoring stack to CKAO for easier management. See openshift/cluster-monitoring-operator#1164

Signed-off-by: Ivan Sim <isim@redhat.com>
The new recording rules exclude long-running requests to subresources from the resource-scoped and namespace-scoped requests. They also break down slow request distributions by scope.

Signed-off-by: Ivan Sim <isim@redhat.com>
|
@sttts Per slack convo, I separated the PR into two commits to make future tracking easier. The first commit has the original alerts and rules, and the second has the new rules. |
|
Factoring looks good, with @smarterclayton's lgtm #1126 (comment) for the actual alert expressions.
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ihcsim, smarterclayton, sttts. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment. |
|
/retest |
|
/retest
Please review the full test history for this PR and help us cut down flakes.
|
@ihcsim: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/retest |
|
/retest
Please review the full test history for this PR and help us cut down flakes.
|
@ihcsim: All pull requests linked via external trackers have merged:
Bugzilla bug 1953798 has been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |




This PR copies the upstream `KubeAPIErrorBudgetBurn` alerts and its recording rules to CKAO. It updates the rules to:

- Exclude long-running requests to subresources (`proxy`, `logs`, `exec` etc.) from the error budget calculation. As described in https://bugzilla.redhat.com/show_bug.cgi?id=1953798#c1, these long-running `GET`, `LIST` requests are causing flakiness in the e2e parallel and upgrade jobs, due to slow resource reads.
- Break down the slow request distributions by `scope` (i.e. `resource`, `namespace`, `cluster`).

Subsequent PRs to fix other parts of the e2e test suites will follow, if the changes in this PR don't fully eliminate the flakiness.

As part of the work to complete 1953798, a separate PR will also be submitted to CMO to remove (disable, mask etc.) the same alerts. Update: the PR to remove the same alerts from CMO is at openshift/cluster-monitoring-operator#1164.

Signed-off-by: Ivan Sim <isim@redhat.com>
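For reference, the upstream alerts being copied follow the multi-window, multi-burn-rate pattern; the fastest window pair looks roughly like this (a sketch based on the kubernetes-mixin defaults for a 99% availability target -- the exact expressions and thresholds are in this PR's diff):

# sketch of the upstream-style alert; thresholds are kubernetes-mixin defaults
- alert: KubeAPIErrorBudgetBurn
  annotations:
    description: The API server is burning too much error budget.
  expr: |
    sum(apiserver_request:burnrate1h) > (14.40 * 0.01000)
    and
    sum(apiserver_request:burnrate5m) > (14.40 * 0.01000)
  for: 2m
  labels:
    long: 1h
    severity: critical
    short: 5m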