Revert commit "Add a namespace label to admission metrics and expand histogram range to 0-10s" #104033
/triage accepted
var (
	// Use buckets ranging from 5 ms to 10 seconds (admission webhooks timeout at 30 seconds by default).
	latencyBuckets = []float64{0.005, 0.025, 0.1, 0.5, 2.5, 5.0, 10.0}
Was the issue the namespace label addition or the extra buckets?
We didn't identify which one it was, but the additional buckets certainly amplified the cardinality issue. The namespace label is the biggest cardinality contributor.
How many series in total did you observe this adding?
It was the churn from e2e tests. They basically create namespaces on a per-test basis.
We even have a test which creates 100 namespaces and deletes them.
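To make the cardinality argument in this thread concrete, here is a rough back-of-the-envelope sketch. It assumes the usual Prometheus histogram exposition (one series per explicit bucket, one for the implicit `+Inf` bucket, plus `_sum` and `_count`). The `baseCombinations` figure and the helper names are hypothetical and only for illustration; they are not measurements or code from this PR.

```go
package main

import "fmt"

// latencyBuckets mirrors the bucket boundaries quoted in the diff above.
var latencyBuckets = []float64{0.005, 0.025, 0.1, 0.5, 2.5, 5.0, 10.0}

// seriesPerCombination returns how many time series one histogram exposes for
// a single combination of label values: one series per explicit bucket, one
// for the implicit +Inf bucket, plus the _sum and _count series.
func seriesPerCombination(buckets []float64) int {
	return len(buckets) + 3
}

// totalSeries estimates the series created when every existing label
// combination is additionally split by namespace.
func totalSeries(baseCombinations, namespaces int, buckets []float64) int {
	return baseCombinations * namespaces * seriesPerCombination(buckets)
}

func main() {
	// Hypothetical number of label combinations (name, operation, type, ...)
	// before the namespace label is added. Not a measured value.
	const baseCombinations = 20

	fmt.Println("series per label combination:", seriesPerCombination(latencyBuckets))
	fmt.Println("without namespace label:", totalSeries(baseCombinations, 1, latencyBuckets))
	fmt.Println("with 100 namespaces:", totalSeries(baseCombinations, 100, latencyBuckets))
}
```

Under these assumptions the extra buckets only add a few series per label combination, while the namespace label multiplies the whole total by the number of namespaces seen, which matches the observation that the label is the dominant contributor and that tests creating namespaces per test keep adding new sets of series.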
cc @kubernetes/sig-instrumentation-approvers
/approve
/approve
Please cherry-pick this to the release-1.22 branch as well and give @kubernetes/release-managers a heads-up that it is incoming.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, liggitt, logicalhan, s-urbaniak, soltysh
I pinged the release managers on Slack and CCed them on the PR.
…04033-upstream-release-1.22 Automated cherry pick of #104033: Revert "Add a namespace label to admission metrics and expand
/kind bug
What this PR does / why we need it:
After a namespace label was added to admission metrics, we found that Prometheus is overwhelmed by out-of-memory errors within seconds because of the amplified cardinality. This caused OOMs and raised Prometheus memory usage from a steady ~1.5 GiB to ~8 GiB of RAM (note: this was observed on OpenShift).
Which issue(s) this PR fixes:
Fixes #104008
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Reverted the addition of a namespace label to admission metrics and the expansion of the histogram range to 0-10s.