
Jittering periods of some kubelet's sync loops: #20726

Merged

Conversation

@ingvagabund (Contributor)

In order to synchronize the current state of Kubernetes objects (e.g. pods, containers, etc.),
periodic sync loops are run. When there are many objects to synchronize,
the loops increase communication traffic, and at some point all the traffic interferes and the CPU usage curve
hits the roof, causing 100% CPU utilization.

To distribute the traffic over time, some sync loops can jitter their period on each iteration
and help flatten the curve (a sketch follows the list below).

kubelet sync loops that are jittered:

  • kubelet.syncLoop responsible for housekeeping
  • prober manager: checking containers for readiness
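
As a minimal sketch of the idea (assuming the wait package from pkg/util/wait; the loop bodies and the 0.5 jitter factor here are just examples, not the PR's constants), two loops with the same base period stop firing in lockstep once jittered:

    package main

    import (
        "fmt"
        "time"

        "k8s.io/kubernetes/pkg/util/wait"
    )

    func main() {
        stop := make(chan struct{})
        // Base period 1s with jitter factor 0.5: each iteration waits a
        // uniformly random duration in [1s, 1.5s], so the two loops drift
        // apart instead of waking up (and talking to the API server) together.
        go wait.JitterUntil(func() { fmt.Println("sync loop A") }, time.Second, 0.5, stop)
        go wait.JitterUntil(func() { fmt.Println("sync loop B") }, time.Second, 0.5, stop)
        time.Sleep(5 * time.Second)
        close(stop)
    }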

@k8s-github-robot

Labelling this PR as size/XL

@k8s-github-robot k8s-github-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 5, 2016
@ingvagabund ingvagabund force-pushed the jitter-sync-loops-in-kubelet branch from a52d0a2 to a8e4099 Compare February 5, 2016 16:44
@k8s-bot commented Feb 5, 2016

GCE e2e build/test failed for commit a52d0a2d3b2c5e31760139673584dc67e8ee4497.

@k8s-bot commented Feb 5, 2016

GCE e2e build/test failed for commit a8e40992f0f4decadbaf5af27d5e7b21d68ffef7.

@yujuhong yujuhong assigned yujuhong and unassigned caesarxuchao Feb 5, 2016
@yujuhong (Contributor) commented Feb 5, 2016

IIUC, this PR is based on #19917, and only the last commit (a8e40992f0f4decadbaf5af27d5e7b21d68ffef7) is new.

The jitters introduced into the kubelet in that commit don't really matter. What should be modified is the individual workers' (pod workers, probers) sync/probing period.
E.g.,

p.workQueue.Enqueue(uid, p.resyncInterval)

// Error occurred during the sync; back off and then retry.

probeTicker := time.NewTicker(time.Duration(w.spec.PeriodSeconds) * time.Second)
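
As a sketch of what that could look like for the first pointer (wait.Jitter adds a uniformly random delay of up to factor×duration on top of the base duration; the constant name below is hypothetical, for illustration only):

    // Jitter the per-pod resync interval so pod workers don't all wake at once.
    // workerResyncIntervalJitterFactor is a hypothetical constant for this sketch.
    p.workQueue.Enqueue(uid, wait.Jitter(p.resyncInterval, workerResyncIntervalJitterFactor))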

In general, I think introducing jitters is good. I did some tests after #19850 was merged, and the jittered period did not make a significant difference in the CPU resource usage data. Perhaps with more pods (100 per node), this would make a difference. I would like to get some data before the PR is merged, though.

@ingvagabund (Contributor, Author)

That is correct.

Thanks for the tips, I will take a look.

@timothysc (Member)

Could we separate out the PRs? It seems this should be two different PRs.

  1. Conversion to wait.*
  2. Update some kubernetes-defaults

In general, I think introducing jitters is good. I did some tests after #19850 was merged and jittering period did not have significant difference in the cpu resource usage data.

Our profiles simply show a flattening of the spikes, but the ~90th percentile usually remains steady. You would need something like pbench to see the spikes, b/c most profiling would average them away.

@yujuhong (Contributor) commented Feb 5, 2016

Could we separate out the PRs? It seems this should be two different PRs.

  1. Conversion to wait.*
  2. Update some kubernetes-defaults

@timothysc, only the last commit is new. The rest is just #19917.

Our profiles simply show a flattening of the spikes, but the ~90th percentile usually remains steady. You would need something like pbench to see the spikes, b/c most profiling would average them away.

Could you share some numbers? That'd be very useful.

@ingvagabund ingvagabund changed the title Jittering periods of some kubelet's sync loops: WIP: Jittering periods of some kubelet's sync loops: Feb 5, 2016
@pmorie (Member) commented Feb 5, 2016

@yujuhong @timothysc this is meant to be a distinct PR which depends on #19917

@k8s-github-robot

PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 6, 2016
@timothysc (Member)

Looks like the other commit landed so we can probably rebase this one.

@ingvagabund ingvagabund force-pushed the jitter-sync-loops-in-kubelet branch 2 times, most recently from c0cf0d1 to 0f98e73 Compare February 8, 2016 16:13
@k8s-bot commented Feb 8, 2016

GCE e2e build/test failed for commit c0cf0d1578408387045fdafa80241c46cf2460ed.

@k8s-github-robot

Labelling this PR as size/M

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 8, 2016
@ingvagabund (Contributor, Author)

probeTicker := time.NewTicker(time.Duration(w.spec.PeriodSeconds) * time.Second)
...
probeLoop:
    for w.doProbe() {
        // Wait for next probe tick.
        select {
        case <-w.stop:
            break probeLoop
        case <-probeTicker.C:
            // continue
        }
    }

got replaced with

wait.JitterUntil(func() { w.doProbe() }, time.Duration(w.spec.PeriodSeconds)*time.Second, workerProbeJitterFactor, w.stop)

Why? Reasoning:

The for loop can be replaced equivalently with:

wait.Until(func() { w.doProbe() }, time.Duration(w.spec.PeriodSeconds)*time.Second, w.stop)

Once w.stop is closed, the for loop ends (break probeLoop). The same holds for Until: whenever w.stop is closed, Until returns. The for loop calls w.doProbe periodically as probeTicker.C ticks, and Until calls w.doProbe periodically with the same period as the ticker. So in the end, both the for loop and Until provide the same functionality.

The second step is replacing Until with JitterUntil and introducing a jitter factor.
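
For reference, the jitter computation itself is tiny; this standalone sketch mirrors the behavior of wait.Jitter (the jitter helper name is local to the example):

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // jitter mirrors wait.Jitter: the effective period is the base period
    // plus a uniformly random fraction of it, up to maxFactor.
    func jitter(period time.Duration, maxFactor float64) time.Duration {
        if maxFactor <= 0.0 {
            maxFactor = 1.0
        }
        return period + time.Duration(rand.Float64()*maxFactor*float64(period))
    }

    func main() {
        // With a 10s probe period and factor 0.5, each wait falls in [10s, 15s].
        for i := 0; i < 3; i++ {
            fmt.Println(jitter(10*time.Second, 0.5))
        }
    }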

@ingvagabund (Contributor, Author)

If doProbe returns false, Until will not stop :(. Grrr
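
One possible workaround (a sketch only, not necessarily what the PR ends up doing): since JitterUntil cannot observe doProbe's return value, funnel both stop conditions into one channel that JitterUntil watches:

    // Sketch; assumes "sync" is imported. sync.Once guards against closing
    // stopCh twice if w.stop closes while doProbe is also winding down.
    stopCh := make(chan struct{})
    var once sync.Once
    stop := func() { once.Do(func() { close(stopCh) }) }
    go func() {
        <-w.stop
        stop()
    }()
    wait.JitterUntil(func() {
        if !w.doProbe() {
            stop() // end the loop once the probe worker says it is done
        }
    }, time.Duration(w.spec.PeriodSeconds)*time.Second, workerProbeJitterFactor, stopCh)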

@k8s-bot commented Feb 8, 2016

GCE e2e test build/test passed for commit 0f98e735fce2d3bc7335ca6ede6fba9d1a499ce6.

@@ -39,6 +40,14 @@ type PodWorkers interface {

type syncPodFnType func(*api.Pod, *api.Pod, *kubecontainer.PodStatus, kubetypes.SyncPodType) error

const (
Member:

So I think @yujuhong is the expert on which timers to mod here.

Contributor:

Can we use a smaller factor such as 0.5? I think that should be enough to distribute the sync times.

Contributor (Author):

Both factors decreased to 0.5

@k8s-github-robot

The author of this PR is not in the whitelist for merge, can one of the admins add the 'ok-to-merge' label?

@timothysc (Member)

@ingvagabund thanks for your help in seeing this one through.

@ingvagabund (Contributor, Author)

Np. Thank you all for spending your time reviewing the PR.

@pmorie pmorie added ok-to-merge lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 10, 2016
@pmorie (Member) commented Feb 10, 2016

Tagging for @yujuhong

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@pmorie pmorie removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2016
@pmorie pmorie changed the title WIP: Jittering periods of some kubelet's sync loops: Jittering periods of some kubelet's sync loops: Feb 10, 2016
@pmorie pmorie added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2016
@bgrant0607 (Member)

We shouldn't LGTM PRs with WIP in the title.

@pmorie (Member) commented Feb 10, 2016

@bgrant0607 my bad, I had missed that.

@bgrant0607 (Member)

Ah, I see @pmorie changed the title. Never mind. The change hasn't been reflected in the submit queue yet.

@k8s-bot commented Feb 10, 2016

GCE e2e build/test failed for commit 392fc66.

@wojtek-t (Member)

The cluster didn't start correctly.

@k8s-bot e2e test this please github issue: #IGNORE

@k8s-bot commented Feb 10, 2016

GCE e2e test build/test passed for commit 392fc66.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot commented Feb 10, 2016

GCE e2e test build/test passed for commit 392fc66.

@k8s-github-robot

Automatic merge from submit-queue

k8s-github-robot pushed a commit that referenced this pull request Feb 10, 2016
@k8s-github-robot k8s-github-robot merged commit c382943 into kubernetes:master Feb 10, 2016
@ingvagabund ingvagabund deleted the jitter-sync-loops-in-kubelet branch February 10, 2016 17:44
@ixdy (Member) commented Feb 10, 2016

@wojtek-t why didn't you file an issue for the cluster failing to start?

@jeremyeder

@pweil- will our rebase catch this?

@pweil- (Contributor) commented Feb 12, 2016

@jeremyeder this should be included

@timothysc (Member)

@pweil- We determined this is a "nice to have" but not a blocker, as you will drag in transitive deps on this one. The larger issue of cAdvisor jitter should be in the upcoming rebase with the latest cAdvisor updates.

So you can make the call here.

/cc @ncdc

Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.