Use prometheus conventions for workqueue metrics #71300
Conversation
/assign @mortent
QueueLatencyKey = "queue_latency_microseconds"
WorkDurationKey = "work_duration_microseconds"
UnfinishedWorkKey = "unfinished_work_seconds"
LongestRunningProcessorKey = "longest_running_processor_microseconds"
I think we should stick with one common unit for all the workqueue metrics and not mix seconds and microseconds. Seconds is one of the base units suggested in the prometheus docs (https://prometheus.io/docs/practices/naming/), so I think we should use that unless we have a good reason to use microseconds.
Agree. Changed unit to seconds. PTAL
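For illustration, a minimal sketch of what the key constants could look like once every duration metric uses seconds (the exact metric names below are assumptions for illustration, not necessarily what this PR merged):

// Sketch only: all duration metrics expressed in the Prometheus base unit (seconds).
const (
	QueueLatencyKey            = "queue_duration_seconds"
	WorkDurationKey            = "work_duration_seconds"
	UnfinishedWorkKey          = "unfinished_work_seconds"
	LongestRunningProcessorKey = "longest_running_processor_seconds"
)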
/lgtm
/cc @logicalhan
@jennybuckley: GitHub didn't allow me to request PR reviews from the following users: logicalhan. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @smarterclayton
I realize you did not create the files, but since you are touching rate_limitting_queue_test.go, would you mind renaming rate_limitting_queue.go and rate_limitting_queue_test.go? "Limitting" is a typo.
@logicalhan good catch, let's discuss it in #71683 or #71684.
/remove-sig api-machinery
	return adds
}

func (prometheusMetricsProvider) NewLatencyMetric(name string) workqueue.SummaryMetric {
We should stop using Summary metrics; please use Histogram instead. Summary metrics can't be aggregated.
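For illustration, a rough sketch of how the provider could back this metric with a histogram instead of a summary. Since both metric types expose Observe(float64), the change is mostly in the constructor; the metric name, help text, and bucket choice below are assumptions (buckets are discussed just below):

func (prometheusMetricsProvider) NewLatencyMetric(name string) workqueue.SummaryMetric {
	// A prometheus.Histogram also implements Observe(float64), so it satisfies
	// the same workqueue metric interface, and unlike a Summary it can be
	// aggregated across instances.
	latency := prometheus.NewHistogram(prometheus.HistogramOpts{
		Subsystem: name,
		Name:      "queue_duration_seconds",
		Help:      "How long in seconds an item stays in workqueue " + name + " before being requested.",
		Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 10),
	})
	prometheus.MustRegister(latency)
	return latency
}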
@loburm what buckets do you prefer, or should we just ignore it for now?
I'm not familiar with the queues here. I remember that the default, for example, is almost useless for kube-apiserver request latency, because most of the samples fall into the first few buckets and don't give enough information for measuring it.
Usually I prefer around 20 reasonable buckets. Let's ask someone from sig-instrumentation for advice.
As this is on internal queues, the latencies should be rather small, so I'd suggest something along the lines of:
prometheus.ExponentialBuckets(10e-9, 10, 10)
That gives us exponential buckets from 1 nanosecond to 10 seconds.
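For reference, a quick standalone check (not part of the PR) of the boundaries that call produces; the first upper bound is 10ns and the last is 10s, one order of magnitude apart:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// ExponentialBuckets(start, factor, count) returns `count` upper bounds,
	// starting at `start` and multiplying by `factor` at each step.
	for _, b := range prometheus.ExponentialBuckets(10e-9, 10, 10) {
		fmt.Printf("%.2g ", b)
	}
	// Prints approximately: 1e-08 1e-07 1e-06 1e-05 0.0001 0.001 0.01 0.1 1 10
}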
Not sure that's the best approach. I would check current values from a few kube-apiservers and select the range based on that.
If there are histograms for queues in the apiserver, then yes, we should be consistent. Latency histograms for API requests (as in a service that performs network requests) are very different from queues, though. Queues should be substantially faster.
@loburm do we have a conclusion about the buckets?
Do you have any data about the current distribution of those samples? But if you are happy with these buckets:
1ns - 10ns
10ns - 100ns
...
1s - 10s
then go ahead with that proposal.
/lgtm
/lgtm
/lgtm Could you also do a PR to add this to the metrics overhaul KEP? I just want to make sure we keep everything documented in one place.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: brancz, danielqsj, smarterclayton. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test pull-kubernetes-godeps
@danielqsj can you make sure to create a follow-up for this one to add the deprecation notice to the help text of the metrics deprecated in this PR as well? Thanks!
YAY on the change from Summary to Histogram. (Although I would have liked a little finer granularity in the buckets, or control over the buckets.)
What type of PR is this?
/kind feature
/sig api-machinery
What this PR does / why we need it:
Use prometheus conventions for workqueue metrics
Which issue(s) this PR fixes (optional, in "fixes #<issue number>(, fixes #<issue_number>, ...)" format, will close the issue(s) when PR gets merged): Fixes #71165
Special notes for your reviewer:
This patch does not remove the existing metrics but marks them as deprecated.
We need two releases to give users time to convert their monitoring configuration.
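As a rough sketch of that migration path (the names and help strings below are illustrative, and name is assumed to be the workqueue name passed to the metrics provider), the old microseconds metric could stay registered with a deprecation note in its help text while the new seconds-based histogram is added alongside it:

// Old metric, kept for two releases so dashboards and alerts can migrate.
deprecatedLatency := prometheus.NewSummary(prometheus.SummaryOpts{
	Subsystem: name,
	Name:      "queue_latency",
	Help:      "(Deprecated) How long an item stays in workqueue " + name + " before being requested, in microseconds.",
})
// New metric, following Prometheus naming conventions (base unit: seconds).
latency := prometheus.NewHistogram(prometheus.HistogramOpts{
	Subsystem: name,
	Name:      "queue_duration_seconds",
	Help:      "How long in seconds an item stays in workqueue " + name + " before being requested.",
	Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 10),
})
prometheus.MustRegister(deprecatedLatency, latency)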
Does this PR introduce a user-facing change?: