Jittering periods of some kubelet's sync loops: #20726
Conversation
Labelling this PR as size/XL
Force-pushed from a52d0a2 to a8e4099
GCE e2e build/test failed for commit a52d0a2d3b2c5e31760139673584dc67e8ee4497.
GCE e2e build/test failed for commit a8e40992f0f4decadbaf5af27d5e7b21d68ffef7.
IIUC, this PR is based on #19917, and only the last commit (a8e40992f0f4decadbaf5af27d5e7b21d68ffef7) is new. The jitters introduced into kubelet in that commit don't really matter. What should be modified is the individual workers' (pod workers, probers) sync/probing period.
kubernetes/pkg/kubelet/pod_workers.go Line 212 in ff04de4
kubernetes/pkg/kubelet/pod_workers.go Line 214 in ff04de4
kubernetes/pkg/kubelet/prober/worker.go Line 96 in ff04de4
In general, I think introducing jitters is good. I did some tests after #19850 was merged and jittering the period did not make a significant difference in the CPU resource usage data. Perhaps with more pods (100 per node), this would make a difference. I would like to get some data before the PR is merged though.
That is correct. Thanks for the tips, I will take a look.
Could we separate out the PRs? It seems this should be 2 different PRs.
Our profiles simply show a flattening of the spikes, but the ~90th percentile usually remains steady. You would need something like pbench to see the spikes b/c most profiling would average it away.
@timothysc, only the last commit is new. The rest is just #19917.
Could you share some numbers? That'd be very useful.
@yujuhong @timothysc this is meant to be a distinct PR which depends on #19917
PR needs rebase
Looks like the other commit landed so we can probably rebase this one. |
Force-pushed from c0cf0d1 to 0f98e73
GCE e2e build/test failed for commit c0cf0d1578408387045fdafa80241c46cf2460ed.
Labelling this PR as size/M
```go
probeTicker := time.NewTicker(time.Duration(w.spec.PeriodSeconds) * time.Second)
...
probeLoop:
	for w.doProbe() {
		// Wait for next probe tick.
		select {
		case <-w.stop:
			break probeLoop
		case <-probeTicker.C:
			// continue
		}
	}
```

got replaced.
Why? The for loop can be replaced equivalently with the Until helper; the second step is replacing the ticker-based wait.
If doProbe returns false, the Until will not stop :(. Grrr
GCE e2e test build/test passed for commit 0f98e735fce2d3bc7335ca6ede6fba9d1a499ce6.
@@ -39,6 +40,14 @@ type PodWorkers interface {

type syncPodFnType func(*api.Pod, *api.Pod, *kubecontainer.PodStatus, kubetypes.SyncPodType) error

const (
So I think @yujuhong is the expert on which timers to mod here.
Can we use a smaller factor such as 0.5? I think that should be enough to distribute the sync times.
Both factors were decreased to 0.5.
The author of this PR is not in the whitelist for merge, can one of the admins add the 'ok-to-merge' label?
@ingvagabund thanks for your help in seeing this one through.
Np. Thank you all for spending your time reviewing the PR.
Tagging for @yujuhong
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
We shouldn't LGTM PRs with WIP in the title.
@bgrant0607 my bad, I had missed that.
Ah, I see @pmorie changed the title. Nevermind. The change hasn't been reflected in the submit queue yet.
GCE e2e build/test failed for commit 392fc66.
The cluster didn't start correctly. @k8s-bot e2e test this please. github issue: #IGNORE
GCE e2e test build/test passed for commit 392fc66.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e test build/test passed for commit 392fc66.
Automatic merge from submit-queue
Auto commit by PR queue bot
@wojtek-t why didn't you file an issue for the cluster failing to start?
@pweil- will our rebase catch this?
@jeremyeder this should be included
In order to synchronize the current state of Kubernetes objects (e.g. pods, containers, etc.),
periodic sync loops are run. When there are many objects to synchronize, these
loops increase communication traffic. At some point all the traffic interferes and the CPU usage curve
hits the roof, causing 100% CPU utilization.
To distribute the traffic in time, some sync loops can jitter their period on each iteration
and help flatten the curve.
Kubelet sync loops that are jittered: