Add scheduler throughput metric #64526
Conversation
@misterikkit Yes, this can help with that and I believe is also a much-needed metric even otherwise.
pkg/scheduler/metrics/metrics.go (Outdated)

	Help: "Number of attempts to schedule pods, by whether or not we succeeded",
	}, []string{"result"})
	PodScheduleSuccesses = scheduleAttempts.With(prometheus.Labels{"result": "success"})
	PodScheduleFailures = scheduleAttempts.With(prometheus.Labels{"result": "unschedulable"})
Why is it unschedulable instead of failure?
I don't think that unschedulable is a failure result. In fact, I am not correctly propagating actual errors to this metric. How about using these cells for the metric:
- ok_scheduled
- ok_unschedulable
- error
Or, drop "ok_" and just keep scheduled, unschedulable, and error.
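For illustration, the three-way result label suggested above could be modeled as in the following sketch. This is not the PR's actual code: the plain map stands in for the real Prometheus CounterVec, and only the three proposed label values are taken from this thread.

```go
package main

import "fmt"

// Proposed values for the "result" label on the scheduling-attempts
// counter, as suggested in this review thread.
const (
	ResultScheduled     = "scheduled"     // pod was successfully bound to a node
	ResultUnschedulable = "unschedulable" // no node fit the pod; not an error
	ResultError         = "error"         // internal failure during scheduling
)

// attempts stands in for a Prometheus CounterVec keyed by the result label.
var attempts = map[string]int{}

// recordAttempt increments the counter cell for the given result.
func recordAttempt(result string) {
	attempts[result]++
}

func main() {
	recordAttempt(ResultScheduled)
	recordAttempt(ResultScheduled)
	recordAttempt(ResultUnschedulable)
	fmt.Println(attempts[ResultScheduled], attempts[ResultUnschedulable], attempts[ResultError])
}
```

Keeping "unschedulable" as its own cell (rather than folding it into "error") means error ratio and fit failures can be queried independently.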
pkg/scheduler/scheduler.go (Outdated)

	@@ -459,6 +459,7 @@ func (sched *Scheduler) scheduleOne() {
		metrics.PreemptionAttempts.Inc()
		metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
	}
	metrics.PodScheduleFailures.Inc()
This is not accurate. We should increment PodScheduleFailures if the returned error is a "FitError". Otherwise, we should increment PodScheduleErrors.
pkg/scheduler/scheduler.go (Outdated)

		return
	}

	// assume modifies `assumedPod` by setting NodeName=suggestedHost
	err = sched.assume(assumedPod, suggestedHost)
	if err != nil {
		metrics.PodScheduleFailures.Inc()
Shouldn't we increment PodScheduleErrors here?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
@misterikkit Could you please address the comments here?
Force-pushed from 8ad61a1 to bd7cee6
PTAL - I've fixed the counting errors and rebased. I also added some logs to errors that were otherwise being handled "invisibly".
This adds a counter to the scheduler that can be used to calculate throughput and error ratio. Pods which fail to schedule are not counted as errors, but can still be tracked separately from successes. We already measure scheduler latency, but throughput was missing. This should be considered a key metric for the scheduler.
Force-pushed from bd7cee6 to b0a8dbb
Thanks, @misterikkit!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bsalamat, misterikkit. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
/retest
Review the full test history for this PR.
What this PR does / why we need it:
This adds a counter to the scheduler that can be used to calculate
throughput.
We already measure scheduler latency, but throughput was missing. This
should be considered a key metric for the scheduler.
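As a rough illustration of how a monotonically increasing attempts counter yields throughput, two samples taken some seconds apart give a rate, which is essentially what a Prometheus rate() query computes. The function name and sample values below are made up for the example.

```go
package main

import "fmt"

// throughput returns scheduling attempts per second between two samples
// of a monotonically increasing counter taken `seconds` apart.
func throughput(prev, cur int, seconds float64) float64 {
	return float64(cur-prev) / seconds
}

func main() {
	// Hypothetical samples: 1200 attempts at t0, 1800 attempts 30s later.
	fmt.Printf("%.1f pods/sec\n", throughput(1200, 1800, 30)) // 20.0 pods/sec
}
```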
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #
Special notes for your reviewer:
We have not been following best practices or even consistent naming in scheduler metrics up to this point. We can't reasonably change existing metrics, but I would like to choose good, consistent names going forward.
Release note:
/sig scheduling
/sig instrumentation