Add scheduler throughput metric #64526
Conversation
@misterikkit Yes, this can help with that and I believe is also a much-needed metric even otherwise.
pkg/scheduler/metrics/metrics.go (Outdated)

	Help: "Number of attempts to schedule pods, by whether or not we succeeded",
	}, []string{"result"})
	PodScheduleSuccesses = scheduleAttempts.With(prometheus.Labels{"result": "success"})
	PodScheduleFailures = scheduleAttempts.With(prometheus.Labels{"result": "unschedulable"})
Why is it unschedulable instead of failure?
I don't think that unschedulable is a failure result. In fact, I am not correctly propagating actual errors to this metric. How about using these cells for the metric:
- ok_scheduled
- ok_unschedulable
- error
Or, drop "ok_" and just keep scheduled, unschedulable, and error.
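For illustration, the three-way result label suggested above could be modeled as in the following sketch. This is not the PR's actual code: the plain map stands in for the real Prometheus CounterVec, and only the three proposed label values are taken from this thread.

```go
package main

import "fmt"

// Proposed values for the "result" label on the scheduling-attempts
// counter, as suggested in this review thread.
const (
	ResultScheduled     = "scheduled"     // pod was successfully bound to a node
	ResultUnschedulable = "unschedulable" // no node fit the pod; not an error
	ResultError         = "error"         // internal failure during scheduling
)

// attempts stands in for a Prometheus CounterVec keyed by the result label.
var attempts = map[string]int{}

// recordAttempt increments the counter cell for the given result.
func recordAttempt(result string) {
	attempts[result]++
}

func main() {
	recordAttempt(ResultScheduled)
	recordAttempt(ResultScheduled)
	recordAttempt(ResultUnschedulable)
	fmt.Println(attempts[ResultScheduled], attempts[ResultUnschedulable], attempts[ResultError])
}
```

Keeping "unschedulable" as its own cell (rather than folding it into "error") means error ratio and fit failures can be queried independently.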
pkg/scheduler/scheduler.go (Outdated)

	@@ -459,6 +459,7 @@ func (sched *Scheduler) scheduleOne() {
		metrics.PreemptionAttempts.Inc()
		metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
	}
	metrics.PodScheduleFailures.Inc()
This is not accurate. We should increment PodScheduleFailures if the returned error is a "FitError". Otherwise, we should increment PodScheduleErrors.
pkg/scheduler/scheduler.go (Outdated)

		return
	}

	// assume modifies `assumedPod` by setting NodeName=suggestedHost
	err = sched.assume(assumedPod, suggestedHost)
	if err != nil {
		metrics.PodScheduleFailures.Inc()
Shouldn't we increment PodScheduleErrors here?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
@misterikkit Could you please address the comments here?
Force-pushed from 8ad61a1 to bd7cee6
PTAL - I've fixed the counting errors and rebased. I also added some logs to errors that were otherwise being handled "invisibly".
This adds a counter to the scheduler that can be used to calculate throughput and error ratio. Pods which fail to schedule are not counted as errors, but can still be tracked separately from successes. We already measure scheduler latency, but throughput was missing. This should be considered a key metric for the scheduler.
Force-pushed from bd7cee6 to b0a8dbb
Thanks, @misterikkit!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bsalamat, misterikkit. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
/retest
Review the full test history for this PR.
What this PR does / why we need it:
This adds a counter to the scheduler that can be used to calculate
throughput.
We already measure scheduler latency, but throughput was missing. This
should be considered a key metric for the scheduler.
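As a rough illustration of how a monotonically increasing attempts counter yields throughput, two samples taken some seconds apart give a rate, which is essentially what a Prometheus rate() query computes. The function name and sample values below are made up for the example.

```go
package main

import "fmt"

// throughput returns scheduling attempts per second between two samples
// of a monotonically increasing counter taken `seconds` apart.
func throughput(prev, cur int, seconds float64) float64 {
	return float64(cur-prev) / seconds
}

func main() {
	// Hypothetical samples: 1200 attempts at t0, 1800 attempts 30s later.
	fmt.Printf("%.1f pods/sec\n", throughput(1200, 1800, 30)) // 20.0 pods/sec
}
```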
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #
Special notes for your reviewer:
We have not been following best practices or even consistent naming in scheduler metrics up to this point. We can't reasonably change existing metrics, but I would like to choose good, consistent names going forward.
Release note:
/sig scheduling
/sig instrumentation