Use prometheus conventions for workqueue metrics #71300
Conversation
/assign @mortent
QueueLatencyKey = "queue_latency_microseconds"
WorkDurationKey = "work_duration_microseconds"
UnfinishedWorkKey = "unfinished_work_seconds"
LongestRunningProcessorKey = "longest_running_processor_microseconds"
I think we should stick with one common unit for all the workqueue metrics and not mix seconds and microseconds. Seconds is one of the base units suggested in the prometheus docs (https://prometheus.io/docs/practices/naming/), so I think we should use that unless we have a good reason to use microseconds.
Agree. Changed unit to seconds. PTAL
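For illustration, a minimal sketch of what the key constants could look like once every duration metric uses seconds (the exact metric names below are assumptions for illustration, not necessarily what this PR merged):

// Sketch only: all duration metrics expressed in the Prometheus base unit (seconds).
const (
	QueueLatencyKey            = "queue_duration_seconds"
	WorkDurationKey            = "work_duration_seconds"
	UnfinishedWorkKey          = "unfinished_work_seconds"
	LongestRunningProcessorKey = "longest_running_processor_seconds"
)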
/lgtm
/cc @logicalhan
@jennybuckley: GitHub didn't allow me to request PR reviews from the following users: logicalhan. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @smarterclayton
I realize you did not create the files, but since you are touching rate_limitting_queue_test.go, would you mind renaming rate_limitting_queue.go and rate_limitting_queue_test.go? "Limitting" is a typo.
@logicalhan good catch, let's discuss it in #71683 or #71684.
/remove-sig api-machinery
	return adds
}

func (prometheusMetricsProvider) NewLatencyMetric(name string) workqueue.SummaryMetric {
We should stop using Summary metrics; please use Histogram instead. Summary metrics can't be aggregated.
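For illustration, a rough sketch of how the provider could back this metric with a histogram instead of a summary. Since both metric types expose Observe(float64), the change is mostly in the constructor; the metric name, help text, and bucket choice below are assumptions (buckets are discussed just below):

func (prometheusMetricsProvider) NewLatencyMetric(name string) workqueue.SummaryMetric {
	// A prometheus.Histogram also implements Observe(float64), so it satisfies
	// the same workqueue metric interface, and unlike a Summary it can be
	// aggregated across instances.
	latency := prometheus.NewHistogram(prometheus.HistogramOpts{
		Subsystem: name,
		Name:      "queue_duration_seconds",
		Help:      "How long in seconds an item stays in workqueue " + name + " before being requested.",
		Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 10),
	})
	prometheus.MustRegister(latency)
	return latency
}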
@loburm what buckets do you prefer, or should we just ignore it for now?
I'm not familiar with the queues here. I remember that the default, for example, is almost useless for kube-apiserver request latency, because most of the samples fall into the first few buckets and don't give enough information for measuring it.
Usually I prefer around 20 reasonable buckets. Let's ask someone from sig-instrumentation for advice.
As this is on internal queues, the latencies should be rather small, so I'd suggest something along the lines of:
prometheus.ExponentialBuckets(10e-9, 10, 10)
That gives us exponential buckets from 1 nanosecond to 10 seconds.
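For reference, a quick standalone check (not part of the PR) of the boundaries that call produces; the first upper bound is 10ns and the last is 10s, one order of magnitude apart:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// ExponentialBuckets(start, factor, count) returns `count` upper bounds,
	// starting at `start` and multiplying by `factor` at each step.
	for _, b := range prometheus.ExponentialBuckets(10e-9, 10, 10) {
		fmt.Printf("%.2g ", b)
	}
	// Prints approximately: 1e-08 1e-07 1e-06 1e-05 0.0001 0.001 0.01 0.1 1 10
}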
Not sure that's the best approach. I would check current values from a few kube-apiservers and select the range based on that.
If there are histograms for queues in the apiserver, then yes, we should be consistent. Latency histograms for API requests (as in a service that performs network requests) are very different from queues, though. Queues should be substantially faster.
@loburm do we have a conclusion about the buckets?
Do you have any data about the current distribution of those samples? But if you are happy with these buckets:
1ns - 10ns
10ns - 100ns
...
1s - 10s
then go ahead with that proposal.
/lgtm
/lgtm
/lgtm Could you also do a PR to add this to the metrics overhaul KEP? I just want to make sure we keep everything documented in one place.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: brancz, danielqsj, smarterclayton. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test pull-kubernetes-godeps
@danielqsj can you make sure to create a follow-up for this one to add the deprecation notice to the help text of the metrics deprecated in this PR as well? Thanks!
YAY on the change from Summary to Histogram. (Although I would have liked a little finer granularity in the buckets, or control over the buckets.)
What type of PR is this?
/kind feature
/sig api-machinery
What this PR does / why we need it:
Use prometheus conventions for workqueue metrics
Which issue(s) this PR fixes (optional, in "fixes #<issue number>(, fixes #<issue_number>, ...)" format, will close the issue(s) when PR gets merged): Fixes #71165
Special notes for your reviewer:
This patch does not remove the existing metrics but marks them as deprecated.
We need two releases to give users time to convert their monitoring configuration.
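As a rough sketch of that migration path (the names and help strings below are illustrative, and name is assumed to be the workqueue name passed to the metrics provider), the old microseconds metric could stay registered with a deprecation note in its help text while the new seconds-based histogram is added alongside it:

// Old metric, kept for two releases so dashboards and alerts can migrate.
deprecatedLatency := prometheus.NewSummary(prometheus.SummaryOpts{
	Subsystem: name,
	Name:      "queue_latency",
	Help:      "(Deprecated) How long an item stays in workqueue " + name + " before being requested, in microseconds.",
})
// New metric, following Prometheus naming conventions (base unit: seconds).
latency := prometheus.NewHistogram(prometheus.HistogramOpts{
	Subsystem: name,
	Name:      "queue_duration_seconds",
	Help:      "How long in seconds an item stays in workqueue " + name + " before being requested.",
	Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 10),
})
prometheus.MustRegister(deprecatedLatency, latency)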
Does this PR introduce a user-facing change?: