Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scheduler throughput metric #64526

Merged
merged 1 commit into from
Sep 25, 2018
Merged

Conversation

misterikkit
Copy link

@misterikkit misterikkit commented May 30, 2018

What this PR does / why we need it:
This adds a counter to the scheduler that can be used to calculate
throughput.

We already measure scheduler latency, but throughput was missing. This
should be considered a key metric for the scheduler.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:
We have not been following best practices or even consistent naming in scheduler metrics up to this point. We can't reasonably change existing metrics, but I would like to choose good, consistent names going forward.

Release note:

Add prometheus metric for scheduling throughput.

/sig scheduling
/sig instrumentation

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 30, 2018
@misterikkit
Copy link
Author

/cc @shyamjvs We need this regardless of testing, but would it be useful for #64266 ?

@shyamjvs
Copy link
Member

@misterikkit Yes, this can help with that and I believe is also a much-needed metric even otherwise.
#64266 was a shorter run attempt to start capturing throughput using our existing test framework.

Help: "Number of attempts to schedule pods, by whether or not we succeeded",
}, []string{"result"})
PodScheduleSuccesses = scheduleAttempts.With(prometheus.Labels{"result": "success"})
PodScheduleFailures = scheduleAttempts.With(prometheus.Labels{"result": "unschedulable"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it is unschedulable instead of failure?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that unschedulable is a failure result. In fact, I am not correctly propagating actual errors to this metric. How about using these cells for the metric:

  • ok_scheduled
  • ok_unschedulable
  • error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, drop "ok_" and just keep scheduled, unschedulable, and error.

@@ -459,6 +459,7 @@ func (sched *Scheduler) scheduleOne() {
metrics.PreemptionAttempts.Inc()
metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
}
metrics.PodScheduleFailures.Inc()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not accurate. We should increment PodScheduleFailures if the returned error is a "FitError". Otherwise, we should increment PodScheduleErrors.

return
}

// assume modifies `assumedPod` by setting NodeName=suggestedHost
err = sched.assume(assumedPod, suggestedHost)
if err != nil {
metrics.PodScheduleFailures.Inc()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we increment PodScheduleErrors here?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2018
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2018
@bsalamat
Copy link
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2018
@bsalamat
Copy link
Member

@misterikkit Could you please address the comments here?

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Sep 24, 2018
@misterikkit
Copy link
Author

PTAL - I've fixed the counting errors and rebased. I also added some logs to errors that were otherwise being handled "invisibly".

This adds a counter to the scheduler that can be used to calculate
throughput and error ratio. Pods which fail to schedule are not counted
as errors, but can still be tracked separately from successes.

We already measure scheduler latency, but throughput was missing. This
should be considered a key metric for the scheduler.
Copy link
Member

@bsalamat bsalamat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @misterikkit!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, misterikkit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2018
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit e1989af into kubernetes:master Sep 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants