runtime: latency in sweep assists when there's no garbage #18155

rhysh · 2016-12-02T05:29:28Z

go version devel +ae9712c766 Thu Dec 1 22:20:31 2016 +0000 darwin/amd64, and linux/amd64

I wrote a test case for the GC latency described by @WillSewell in https://blog.pusher.com/golangs-real-time-gc-in-theory-and-practice/ . I recorded the timestamps of when pauses (during allocation) were observed, and matched them against GODEBUG=gctrace=1 output.

The pauses come after the GC cycle, during the concurrent sweep phase. Also, they happen as the heap grows and do not happen once the heap size is stable. And the pauses are often longer than the clock time of any single gctrace line.

A goroutine that needs to allocate memory right after a GC cycle is required to help with the sweep work. This means looking through the heap until suitable memory is found. If the heap is growing rapidly with little or no garbage created, this mutator goroutine will need to search for a long time: it can end up sweeping half of the heap (shared with the background sweep worker).

The GC is pretty good at controlling the mark assist burden on any single goroutine—and starting the GC cycle before the allowable heap growth is exhausted, allowing tradeoffs between assist burden and heap growth during the cycle. These protections don't seem to be in place for sweep assists.

The execution tracer's view of this is confusing since it emits events for each call to runtime.(*mspan).sweep, which is called in a loop from runtime.(*mcentral).cacheSpan. This makes it look like the user goroutine is making a lot of progress on sweep work, since a single call to runtime.mallocgc can lead to thousands of tiny "SWEEP" blocks.

Frustrated assists like this are perhaps more likely to happen in microbenchmarks than in production applications, but the way sweep work is shown in the execution tracer makes it hard to say for sure. This could be a contributing factor to #14812 — one of the goroutines in the recent execution trace spent over 17ms in sweep assists.

The test is below, followed by some example runs.

package gc_test

import (
	"flag"
	"testing"
	"time"
)

var (
	start       = time.Now()
	allocSize   = flag.Int("alloc-size", 1000, "Bytes per allocation")
	retainCount = flag.Int("retain-count", 1000000, "Number of allocations to retain")
	loopCount   = flag.Int("loop-count", 10, "Number of times to execute loop")
	preallocate = flag.Bool("preallocate", false, "Preallocate retention slice")
)

func TestHeapGrowthStalls(t *testing.T) {
	var mem [][]byte
	if *preallocate {
		mem = make([][]byte, *retainCount)
	}

	var slowDelta time.Duration

	for i := 0; i < (*loopCount)*(*retainCount); i++ {
		// Allow the program's heap to start small, instead of pre-allocating
		// the entire slice.
		if len(mem) < *retainCount {
			mem = append(mem, nil)
		}

		j := i % len(mem)
		before := time.Now()
		mem[j] = make([]byte, *allocSize)
		delta := time.Since(before)
		if delta > slowDelta || (delta > time.Millisecond && delta > slowDelta/2) {
			slowDelta = delta
			at := time.Since(start)
			t.Logf("i=%-8d at=%.3fs   delta=%.3fms",
				i,
				at.Seconds(),
				delta.Seconds()*float64(time.Second/time.Millisecond))
		}
	}
}

On a darwin/amd64 laptop:

$ GODEBUG=gctrace=1 /tmp/gc.test -test.v
=== RUN   TestHeapGrowthStalls
gc 1 @0.002s 1%: 0.035+0.14+0.019 ms clock, 0.10+0.047/0.12/0.12+0.059 ms cpu, 4->4->3 MB, 5 MB goal, 8 P
gc 2 @0.003s 2%: 0.003+0.63+0.028 ms clock, 0.024+0.054/0.17/0.14+0.22 ms cpu, 6->7->6 MB, 7 MB goal, 8 P
gc 3 @0.004s 3%: 0.003+0.29+0.034 ms clock, 0.026+0.16/0.16/0.095+0.27 ms cpu, 12->12->12 MB, 13 MB goal, 8 P
gc 4 @0.007s 4%: 0.004+1.2+0.048 ms clock, 0.033+0.94/0.059/0.29+0.39 ms cpu, 23->23->22 MB, 24 MB goal, 8 P
gc 5 @0.012s 3%: 0.005+0.59+0.024 ms clock, 0.044+0.054/0.59/0.58+0.19 ms cpu, 44->44->42 MB, 45 MB goal, 8 P
gc 6 @0.020s 2%: 0.005+5.9+0.033 ms clock, 0.040+0.093/1.1/5.7+0.26 ms cpu, 84->84->80 MB, 85 MB goal, 8 P
gc 7 @0.045s 2%: 0.006+1.7+0.036 ms clock, 0.051+0.25/2.2/2.3+0.29 ms cpu, 156->157->152 MB, 160 MB goal, 8 P
gc 8 @0.077s 1%: 0.006+2.1+0.027 ms clock, 0.055+1.0/2.9/7.3+0.21 ms cpu, 297->300->286 MB, 305 MB goal, 8 P
gc 9 @0.154s 1%: 0.008+4.4+0.044 ms clock, 0.070+2.3/4.6/18+0.35 ms cpu, 558->563->537 MB, 573 MB goal, 8 P
gc 10 @0.268s 1%: 0.008+7.5+0.044 ms clock, 0.067+3.7/12/32+0.35 ms cpu, 1047->1056->1004 MB, 1074 MB goal, 8 P
gc 11 @0.428s 1%: 0.008+7.8+0.043 ms clock, 0.070+1.9/13/36+0.34 ms cpu, 1959->1974->1018 MB, 2009 MB goal, 8 P
gc 12 @0.924s 0%: 0.008+8.6+0.17 ms clock, 0.071+1.8/15/38+1.3 ms cpu, 1985->2004->1022 MB, 2036 MB goal, 8 P
gc 13 @1.426s 0%: 0.006+9.0+0.034 ms clock, 0.053+1.8/17/41+0.27 ms cpu, 1993->2000->1010 MB, 2044 MB goal, 8 P
gc 14 @1.658s 0%: 0.008+9.4+0.026 ms clock, 0.070+1.8/15/45+0.21 ms cpu, 1969->1982->1015 MB, 2020 MB goal, 8 P
gc 15 @1.905s 0%: 0.015+9.5+0.075 ms clock, 0.12+1.9/18/41+0.60 ms cpu, 1980->1994->1017 MB, 2031 MB goal, 8 P
gc 16 @2.143s 0%: 0.016+11+0.045 ms clock, 0.12+1.8/21/54+0.36 ms cpu, 1983->1999->1019 MB, 2034 MB goal, 8 P
gc 17 @2.385s 0%: 0.015+10+0.066 ms clock, 0.12+2.4/20/49+0.53 ms cpu, 1987->2002->1018 MB, 2038 MB goal, 8 P
gc 18 @2.616s 0%: 0.015+11+0.077 ms clock, 0.12+2.4/15/59+0.61 ms cpu, 1985->2004->1021 MB, 2036 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (2.85s)
	gc_test.go:42: i=0        at=0.000s   delta=0.000ms
	gc_test.go:42: i=2        at=0.000s   delta=0.001ms
	gc_test.go:42: i=18       at=0.000s   delta=0.002ms
	gc_test.go:42: i=26       at=0.001s   delta=0.002ms
	gc_test.go:42: i=114      at=0.001s   delta=0.003ms
	gc_test.go:42: i=170      at=0.001s   delta=0.004ms
	gc_test.go:42: i=218      at=0.001s   delta=0.014ms
	gc_test.go:42: i=3226     at=0.002s   delta=0.157ms
	gc_test.go:42: i=12016    at=0.005s   delta=0.385ms
	gc_test.go:42: i=22534    at=0.009s   delta=0.408ms
	gc_test.go:42: i=41997    at=0.014s   delta=0.797ms
	gc_test.go:42: i=79541    at=0.028s   delta=1.451ms
	gc_test.go:42: i=152493   at=0.051s   delta=3.489ms
	gc_test.go:42: i=286213   at=0.084s   delta=5.107ms
	gc_test.go:42: i=286229   at=0.087s   delta=2.922ms
	gc_test.go:42: i=536053   at=0.169s   delta=10.281ms
	gc_test.go:42: i=1001709  at=0.295s   delta=20.094ms
PASS

The at=0.169s delta=10.281ms line matches up with gc 9 @0.154s 1%: 0.008+4.4+0.044 ms clock. The GC cycle began at 0.154s and completed around 0.158s. The record-breaking allocation delay finished at 0.169s and began at 0.159s.

On a linux/amd64 server:

$ GOMAXPROCS=8 GODEBUG=gctrace=1 ./gc.test.linux_amd64 -test.v -retain-count=20000000 -preallocate=true -loop-count=3
=== RUN   TestHeapGrowthStalls
gc 1 @0.045s 11%: 0.11+21+0.072 ms clock, 0.22+20/43/106+0.14 ms cpu, 458->459->458 MB, 459 MB goal, 8 P
gc 2 @0.068s 14%: 0.004+27+0.10 ms clock, 0.036+0.063/46/93+0.87 ms cpu, 459->513->513 MB, 917 MB goal, 8 P
gc 3 @0.172s 10%: 0.019+30+0.084 ms clock, 0.15+0.20/54/135+0.67 ms cpu, 739->785->784 MB, 1027 MB goal, 8 P
gc 4 @0.389s 6%: 0.014+33+0.039 ms clock, 0.11+0.71/64/161+0.31 ms cpu, 1315->1388->1388 MB, 1569 MB goal, 8 P
gc 5 @0.795s 4%: 0.026+33+0.044 ms clock, 0.21+1.6/62/164+0.35 ms cpu, 2486->2557->2557 MB, 2777 MB goal, 8 P
gc 6 @1.589s 3%: 0.025+84+0.046 ms clock, 0.20+2.3/159/406+0.37 ms cpu, 4780->4915->4915 MB, 5115 MB goal, 8 P
gc 7 @3.202s 2%: 0.015+63+0.041 ms clock, 0.12+4.3/126/313+0.32 ms cpu, 9377->9499->9499 MB, 9830 MB goal, 8 P
gc 8 @6.350s 1%: 0.018+204+0.091 ms clock, 0.14+55/395/998+0.72 ms cpu, 18438->18685->18685 MB, 18999 MB goal, 8 P
gc 9 @13.058s 1%: 0.019+300+0.066 ms clock, 0.15+112/565/1496+0.52 ms cpu, 36436->36706->20259 MB, 37370 MB goal, 8 P
gc 10 @34.001s 0%: 0.017+265+0.11 ms clock, 0.13+83/507/1313+0.91 ms cpu, 39506->39732->20215 MB, 40519 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (53.37s)
        gc_test.go:42: i=0        at=0.045s   delta=0.581ms
        gc_test.go:42: i=57055    at=0.097s   delta=2.220ms
        gc_test.go:42: i=331266   at=0.200s   delta=2.323ms
        gc_test.go:42: i=334871   at=0.221s   delta=19.590ms
        gc_test.go:42: i=952975   at=0.454s   delta=32.232ms
        gc_test.go:42: i=2150327  at=0.889s   delta=61.852ms
        gc_test.go:42: i=4564455  at=1.799s   delta=126.545ms
        gc_test.go:42: i=9258703  at=3.533s   delta=268.414ms
        gc_test.go:42: i=18664799 at=7.354s   delta=799.231ms
PASS

/cc @aclements @RLH

The text was updated successfully, but these errors were encountered:

aclements · 2016-12-02T21:16:05Z

Very interesting. I knew this problem existed for large allocations, but hadn't really thought about it with small allocations.

Assuming I understand what's going on, this isn't really a "sweep assist" problem, since it would happen even if the sweep ratio was 0. Sweep assists don't care if they found more space or not; you get credit for sweeping a span regardless of the outcome. The problem is that allocation ultimately has to find a span with some free space on it and currently it will keep sweeping until it does.

I might have a simple fix: put a threshold on how many spans allocation is willing to sweep before it gives up and just allocates a fresh span from the heap. The heap overhead of this is bounded at 1/threshold.

aclements · 2016-12-06T03:14:29Z

There are actually a few distinct causes of this.

One is when sweep has to process a lot of full spans from an mcache before finding one that has space to allocate from. My proposed fix targets this, but doesn't actually work because it just shifts the time into the next problem. (All + timestamps below are in ns.)

long cacheSpan @0.286 10ms, 83577 swept, 0 reclaimed, 1 grow
+27 (p 0) lock mcentral
+71 (p 0) locked; scanning lists
+10539832 (p 0) grow()
+10539993 (p 0) mheap.alloc()
+10540090 (p 0) mheap.alloc_m()
+10540149 (p 0) got lock
+10540220 (p 0) reclaim
+10542490 (p 0) reclaim done
+10542536 (p 0) allocSpanLocked
+10542977 (p 0) initialize span
+10543065 (p 0) unlock
+10543110 (p 0) mheap.alloc_m done
+10543166 (p 0) mheap.alloc done
+10543204 (p 0) initSpan
+10543524 (p 0) grow done
+10543588 (p 0) have span

When we go to allocate a fresh span, either because of the fallback in my fix or because we're genuinely out of spans of that size class, the heap also tries to sweep. This is not to find individual free objects, but to free up whole spans before allocating a new one and potentially growing the RSS. This is, in a sense, even more aggressive since it's trying to find completely empty spans. I've been aware of this one for a while and have various schemes for how to fix it, but none that are simple. This doesn't happen very often, but when it does it's really expensive.

long cacheSpan @2.890 22ms, 0 swept, 1 reclaimed, 1 grow
+25 (p 0) lock mcentral
+70 (p 0) locked; scanning lists
+121 (p 0) grow()
+263 (p 0) mheap.alloc()
+380 (p 0) mheap.alloc_m()
+480 (p 0) got lock
+525 (p 0) reclaim
+22056552 (p 0) reclaim done
+22056982 (p 0) allocSpanLocked
+22057164 (p 0) initialize span
+22057377 (p 0) unlock
+22057418 (p 0) mheap.alloc_m done
+22057638 (p 0) memclr
+22058327 (p 0) mheap.alloc done
+22058448 (p 0) initSpan
+22058975 (p 0) grow done
+22059177 (p 0) have span

With my proposed fix, we fall into this case any time we would otherwise have fallen into the previous case. For this particular application, the reclaim sweep has just as much trouble finding free space as the mcentral sweep. This may be less true in a more heterogeneous application.

long cacheSpan @0.161 5ms, 50+0 swept, 0 skipped, 1 grow
+27 (p 0) lock mcentral
+66 (p 0) locked; scanning lists
+5336 (p 0) retry failed
+5375 (p 0) grow()
+5425 (p 0) mheap.alloc()
+5485 (p 0) mheap.alloc_m()
+5525 (p 0) got lock
+5556 (p 0) reclaim
+5092799 (p 0) reclaim done
+5093019 (p 0) allocSpanLocked
+5093269 (p 0) initialize span
+5093403 (p 0) unlock
+5093441 (p 0) mheap.alloc_m done
+5093631 (p 0) memclr
+5098431 (p 0) mheap.alloc done
+5098462 (p 0) initSpan
+5100228 (p 0) grow done
+5100424 (p 0) have span

Every once in a while when I run the benchmark, I get dozens of ~10ms stalls. When this happens, it's because memclr in mheap.alloc_m takes 5--14ms. This almost certainly has something to do with page faults and kernel state, though I don't know what. I have a hunch it has to do with transparent huge page scavenging (not clearing; clearing a huge page takes 40 us), but that's purely a hunch.

long cacheSpan @1.643 9ms, 0 swept, 0 reclaimed, 1 grow
+28 (p 0) lock mcentral
+74 (p 0) locked; scanning lists
+194 (p 0) grow()
+247 (p 0) mheap.alloc()
+307 (p 0) mheap.alloc_m()
+350 (p 0) got lock
+391 (p 0) allocSpanLocked
+701 (p 0) initialize span
+761 (p 0) unlock
+804 (p 0) mheap.alloc_m done
+843 (p 0) memclr
+9365603 (p 0) mheap.alloc done
+9366045 (p 0) initSpan
+9366973 (p 0) grow done
+9367295 (p 0) have span

gopherbot · 2016-12-12T20:00:29Z

CL https://golang.org/cl/34291 mentions this issue.

aclements · 2016-12-12T20:59:24Z

@rhysh, care to give CL 34291 a shot? I don't think that's entirely the right approach, but it definitely makes things better.

rhysh · 2017-01-20T22:20:46Z

Hi @aclements , yes CL 34291 PS 1 makes things better. It's particularly apparent when the heap is large, and the stalls of over 500ms are reduced to less than 100ms.

Here's the output I got from go1.8rc2 with CL 34291 PS 1 cherry-picked on linux/amd64:

$ GOMAXPROCS=8 GODEBUG=gctrace=1 ./gc.test.go18rc2.cl34291ps1.linux_amd64 -test.v -retain-count=20000000 -preallocate=true -loop-count=3
=== RUN   TestHeapGrowthStalls
gc 1 @0.043s 13%: 0.17+25+0.070 ms clock, 0.35+23/49/122+0.14 ms cpu, 458->459->458 MB, 459 MB goal, 8 P
gc 2 @0.069s 15%: 0.007+34+0.13 ms clock, 0.057+0.12/57/119+1.1 ms cpu, 459->524->524 MB, 917 MB goal, 8 P
gc 3 @0.182s 11%: 0.008+29+0.054 ms clock, 0.070+0.29/57/144+0.43 ms cpu, 748->800->800 MB, 1048 MB goal, 8 P
gc 4 @0.397s 7%: 0.016+29+0.043 ms clock, 0.13+0.75/56/142+0.34 ms cpu, 1332->1375->1375 MB, 1601 MB goal, 8 P
gc 5 @0.801s 4%: 0.013+33+0.066 ms clock, 0.10+1.5/64/162+0.53 ms cpu, 2482->2555->2555 MB, 2750 MB goal, 8 P
gc 6 @1.604s 3%: 0.013+41+0.033 ms clock, 0.11+2.3/79/202+0.26 ms cpu, 4792->4876->4876 MB, 5111 MB goal, 8 P
gc 7 @3.199s 2%: 0.020+85+0.063 ms clock, 0.16+4.3/158/368+0.50 ms cpu, 9367->9524->9524 MB, 9752 MB goal, 8 P
gc 8 @6.374s 1%: 0.019+149+0.10 ms clock, 0.15+19/280/682+0.80 ms cpu, 18515->18767->18767 MB, 19049 MB goal, 8 P
gc 9 @12.683s 1%: 0.018+169+0.054 ms clock, 0.14+31/334/834+0.43 ms cpu, 36597->36799->20191 MB, 37535 MB goal, 8 P
gc 10 @37.305s 0%: 0.026+190+0.040 ms clock, 0.21+39/363/912+0.32 ms cpu, 39373->39596->20212 MB, 40382 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (55.54s)
	gc_test.go:42: i=0        at=0.043s   delta=0.710ms
	gc_test.go:42: i=68087    at=0.107s   delta=0.870ms
	gc_test.go:42: i=331280   at=0.202s   delta=2.343ms
	gc_test.go:42: i=351240   at=0.230s   delta=1.670ms
	gc_test.go:42: i=940256   at=0.456s   delta=4.094ms
	gc_test.go:42: i=2150448  at=0.898s   delta=7.048ms
	gc_test.go:42: i=4529512  at=1.789s   delta=22.685ms
	gc_test.go:42: i=9294784  at=3.549s   delta=39.208ms
	gc_test.go:42: i=18770456 at=7.122s   delta=88.855ms
PASS

The longest stall, 89ms, started around (7.122s-0.089s = 7.033s), which is not immediately after the previous GC, which ended around (6.374s+0.149s = 6.523s). It's likely still during the concurrent sweep phase (which takes around 560ms, based on the go1.8rc2 run below), but it doesn't start at the beginning of the sweep phase.

Compared with the output from vanilla go1.8rc2:

$ GOMAXPROCS=8 GODEBUG=gctrace=1 ./gc.test.go18rc2.linux_amd64 -test.v -retain-count=20000000 -preallocate=true -loop-count=3
=== RUN   TestHeapGrowthStalls
gc 1 @0.038s 13%: 0.17+21+0.080 ms clock, 0.35+20/42/105+0.16 ms cpu, 458->459->458 MB, 459 MB goal, 8 P
gc 2 @0.061s 15%: 0.006+29+0.076 ms clock, 0.055+0.21/49/99+0.61 ms cpu, 459->502->502 MB, 917 MB goal, 8 P
gc 3 @0.171s 10%: 0.009+26+0.051 ms clock, 0.073+0.25/48/121+0.41 ms cpu, 729->785->785 MB, 1005 MB goal, 8 P
gc 4 @0.369s 6%: 0.013+29+0.10 ms clock, 0.10+0.69/57/143+0.83 ms cpu, 1311->1347->1347 MB, 1570 MB goal, 8 P
gc 5 @0.785s 4%: 0.015+48+0.062 ms clock, 0.12+1.6/82/175+0.49 ms cpu, 2441->2506->2505 MB, 2694 MB goal, 8 P
gc 6 @1.594s 2%: 0.011+43+0.060 ms clock, 0.090+2.5/82/211+0.48 ms cpu, 4716->4805->4805 MB, 5011 MB goal, 8 P
gc 7 @3.143s 2%: 0.016+73+0.030 ms clock, 0.13+4.8/145/364+0.24 ms cpu, 9238->9366->9365 MB, 9610 MB goal, 8 P
gc 8 @6.224s 1%: 0.019+139+0.037 ms clock, 0.15+20/267/697+0.30 ms cpu, 18241->18472->18472 MB, 18731 MB goal, 8 P
gc 9 @12.331s 1%: 0.020+152+0.038 ms clock, 0.16+30/302/755+0.30 ms cpu, 36020->36228->20196 MB, 36944 MB goal, 8 P
gc 10 @32.834s 0%: 0.013+209+0.058 ms clock, 0.10+36/397/1053+0.46 ms cpu, 39383->39624->20229 MB, 40393 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (55.93s)
	gc_test.go:42: i=0        at=0.039s   delta=1.323ms
	gc_test.go:42: i=45710    at=0.091s   delta=1.879ms
	gc_test.go:42: i=331286   at=0.195s   delta=2.284ms
	gc_test.go:42: i=335222   at=0.206s   delta=9.855ms
	gc_test.go:42: i=910622   at=0.446s   delta=48.938ms
	gc_test.go:42: i=2097248  at=0.904s   delta=70.692ms
	gc_test.go:42: i=4451669  at=1.777s   delta=140.077ms
	gc_test.go:42: i=9121877  at=3.487s   delta=271.325ms
	gc_test.go:42: i=18446653 at=6.923s   delta=560.603ms
PASS

I'm not aware of any production applications that this issue affects: observing the issue (if I understand it correctly) seems to require the unusual combination of 1) a growing heap, 2) very little garbage to reclaim, and 3) caring about latency in individual goroutines.

WillSewell · 2017-02-14T12:09:07Z

I'm trying to understand this problem, and there's a few things that are confusing me. In my original benchmark, when I visualise a run of the program with go tool trace, the pauses look like this:

i.e. all processors are running a combination of runtime.gcBgMarkWorker and MARK and GC.

When I visualise the benchmark listed at the top of this issue, the pauses look like this:

This also has all processors running runtime.gcBgMarkWorker at the start, and processors running MARK at the end, but for a very short time (<0.05ms). But one processor runs runtime.gcBgMarkWorker and one runs GC for relatively much longer.

A couple of things about this confuses me:

Why is the work done during the pauses not quite the same? I appreciate this could just be due to a larger heap size.
From what I can tell (just from the names of the events) that all the work is to do with the mark phase. What about this indicates it's pausing doing sweep assist work? Is it to do with the mysteriously named GC event?
I came to the conclusion that the problem we are seeing in the visualisation is this: runtime: idle mark workers run when there's work to do #16528. Is this not correct? Or am I seeing runtime: idle mark workers run when there's work to do #16528 in the above screenshots, and I'm missing the issue discussed here in the trace visualisations?

Happy to take this offline if someone would be willing to explain this to me.

gopherbot · 2017-08-31T22:19:49Z

Change https://golang.org/cl/60790 mentions this issue: cmd/trace: add minimum mutator utilization (MMU) plot

cespare · 2018-02-24T02:34:51Z

I have what I think is an acute case of this problem. (If that's not the case -- apologies; please ignore.)

Over on #23044 I described my current workaround for the "small heap problem" -- where a program generates a lot of garbage but keeps a very small amount of live data, leading to extremely frequent GCs.

As I mentioned in #23044 I'm currently using heap ballast as a horrible workaround to force the garbage collector to set a much higher size target.

The latest system where I added ballast is described in #23044 (comment):

I ran into this again today with another service that was doing 15 collections/sec. This one is a fairly simple webserver that receives 10-15k requests per second and turns them into protobuf messages that are sent along the network to other systems for processing.

So my program does a bunch of small allocations that become garbage right away (its work) and a few giant allocations every once in a while which tend to stay live for a long time (ballast). It seems like sometimes, after more ballast is added, my program runs into the situation that @rhysh described at the top, and all the mutators can spend all their time in sweep assists.

See this unpleasant trace:

In this case no goroutine did productive work for about 600ms.

This particular case isn't a big deal for us -- I can mitigate this in a few different ways -- but I thought it might be interesting as an extreme example.

aclements · 2018-02-24T22:08:07Z

That certainly is an unfortunate trace. Quick question: could you sample a few of those sweeps and report the metadata associated with them? I'm particularly interested in their duration, how many pages were swept, and how many bytes were recovered. Also, is there anything between the obvious, long sweeps that are visible in the screenshot? (E.g., if you select one and hit the right arrow key, is the next event just the next big sweep or something else?)

cespare · 2018-02-28T17:28:56Z

Quick question: could you sample a
few of those sweeps and report the metadata associated with them? I'm
particularly interested in their duration, how many pages were swept, and
how many bytes were recovered.

Here are 6 sequential sweeps for one of the procs:

Start 4,116,135,261 ns
Wall Duration 13,317,631 ns
Swept bytes 5201920
Reclaimed bytes	5104656

Start 4,129,453,818 ns
Wall Duration 5,827 ns
Swept bytes 8192
Reclaimed bytes 8064

Start 4,129,479,153 ns
Wall Duration 13,572,150 ns
Swept bytes 6684672
Reclaimed bytes 6553744

Start 4,143,126,160 ns
Wall Duration 13,116,497 ns
Swept bytes 6479872
Reclaimed bytes 6353504

Start 4,156,243,473 ns
Wall Duration 6,180 ns
Swept bytes 8192
Reclaimed bytes 8096

Start 4,156,321,795 ns
Wall Duration 14,072,471 ns
Swept bytes 7069696
Reclaimed bytes 6929616

Also, is there anything between the obvious, long sweeps that are visible
in the screenshot? (E.g., if you select one and hit the right arrow key, is
the next event just the next big sweep or something else?)

Yeah -- as shown above, between the big sweeps there are often interspersed some small sweeps. (But not always -- sometimes there are two big sweeps next to each other.)

I also mailed you the trace (this is private code).

danieltahara · 2018-08-27T18:49:05Z

We're writing a MySQL wire-protocol compatible proxy at Dropbox, and it looks like we're hitting this same issue in production on go 1.10.3.

Some "oddities" about the service -- being effectively a MySQL server, there is no connection multiplexing and there is a ton of per-connection buffers and state (see vitessio/vitess#4138). We're working on reducing them, but something like 90% of the in use heap memory at any given time is tied up is unreclaimable, effectively constant memory, and 50+% of the # of in use objects are from the same.

Our traces look like this:

Zooming in, the concurrent SWEEP happens on the first call to MakeString(...) (copying cgo memory into a go string). The gap between GC finishing and the SWEEPs is due to the fact that most of the goroutines (one per connection) don't have any data waiting on their sockets, so they don't require any allocations:

GC happens every few seconds (GOGC=75, since we don't have much memory to spare), and concurrent mark and scan is itself expensive (and born mostly by mutators), but it still pales in comparison to the concurrent sweep stalls, which seem to tie up every core for about 100-200ms.

gc 62645 @227962.290s 4%: 0.28+223+0.23 ms clock, 5.6+2319/1090/7.2+4.6 ms cpu, 6995->7052->4089 MB, 7146 MB goal, 20 P
gc 62646 @227965.313s 4%: 0.30+243+0.26 ms clock, 6.1+2406/1188/0.97+5.2 ms cpu, 7004->7061->4083 MB, 7156 MB goal, 20 P
gc 62647 @227968.497s 4%: 0.14+239+0.25 ms clock, 2.8+2290/1170/0.024+5.1 ms cpu, 6993->7051->4083 MB, 7145 MB goal, 20 P
gc 62648 @227971.571s 4%: 0.26+251+0.26 ms clock, 5.3+2557/1233/0.29+5.2 ms cpu, 6994->7050->4085 MB, 7146 MB goal, 20 P
gc 62649 @227974.337s 4%: 0.16+242+0.24 ms clock, 3.2+2376/1175/1.8+4.8 ms cpu, 6997->7056->4086 MB, 7148 MB goal, 20 P
gc 62650 @227977.309s 4%: 0.35+291+0.31 ms clock, 7.1+2466/1422/1.3+6.3 ms cpu, 6999->7055->4084 MB, 7151 MB goal, 20 P

RLH · 2018-08-29T16:42:29Z

Can you provide the approximate size of the allocation that is causing the long sweep? Go uses a different allocation scheme if the object is > 32KBytes (or < 8 bytes).

…

On Mon, Aug 27, 2018 at 2:50 PM, Daniel Tahara ***@***.***> wrote: We're writing a MySQL wire-protocol compatible proxy at Dropbox, and it looks like we're hitting this same issue in production on go 1.10.3. Some "oddities" about the service -- being effectively a MySQL server, there is no connection multiplexing and there is a ton of per-connection buffers and state (see vitessio/vitess#4138 <vitessio/vitess#4138>). We're working on reducing them, but something like 70% of the in use heap memory at any given time is tied up in unreclaimable, effectively constant memory, and 50+% of the # of in use objects are from the same. Our traces look like this: [image: image 4] <https://user-images.githubusercontent.com/1313465/44678919-6c5aca00-a9ee-11e8-9970-795ef36b9487.png> Zooming in, the concurrent SWEEP happens on the first call to MakeString(...) (copying cgo memory into a go string). The gap between GC finishing and the SWEEPs is due to the fact that most of the goroutines (one per connection) don't have any data waiting on their sockets, so they don't require any allocations: [image: image] <https://user-images.githubusercontent.com/1313465/44679159-0589e080-a9ef-11e8-8382-ab8aa04667f9.png> GC happens every few seconds (GOGC=75, since we don't have much memory to spare), and concurrent mark and scan is itself expensive, but it still pales in comparison to the concurrent sweep stalls, which seem to tie up every core for about 100-200ms. gc 62645 @227962.290s 4%: 0.28+223+0.23 ms clock, 5.6+2319/1090/7.2+4.6 ms cpu, 6995->7052->4089 MB, 7146 MB goal, 20 P gc 62646 @227965.313s 4%: 0.30+243+0.26 ms clock, 6.1+2406/1188/0.97+5.2 ms cpu, 7004->7061->4083 MB, 7156 MB goal, 20 P gc 62647 @227968.497s 4%: 0.14+239+0.25 ms clock, 2.8+2290/1170/0.024+5.1 ms cpu, 6993->7051->4083 MB, 7145 MB goal, 20 P gc 62648 @227971.571s 4%: 0.26+251+0.26 ms clock, 5.3+2557/1233/0.29+5.2 ms cpu, 6994->7050->4085 MB, 7146 MB goal, 20 P gc 62649 @227974.337s 4%: 0.16+242+0.24 ms clock, 3.2+2376/1175/1.8+4.8 ms cpu, 6997->7056->4086 MB, 7148 MB goal, 20 P gc 62650 @227977.309s 4%: 0.35+291+0.31 ms clock, 7.1+2466/1422/1.3+6.3 ms cpu, 6999->7055->4084 MB, 7151 MB goal, 20 P — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#18155 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA7Wnwn6uROCKSeeAz7r4RPtYzVQ23l6ks5uVD9ygaJpZM4LCNrC> .

danieltahara · 2018-08-29T18:26:47Z

The objects are in the range of a few hundred bytes to single digit KB. The best specific data I can get is for row size (as opposed to column size). That distribution (in bytes) is here:

p99	p90	p75	p50
84511.8414031	6485.345	1126.004	379.099280115

RLH · 2018-08-29T20:51:03Z

If I read this correctly p99 object size if > 84K bytes and if I assume a normal distribution and interpolate p95 or so is > 32 KBytes.

…

On Wed, Aug 29, 2018 at 2:27 PM, Daniel Tahara ***@***.***> wrote: The objects are in the range of a few hundred bytes to single digit KB. The best specific data I can get is for row size (as opposed to column size). That distribution (in bytes) is here: p99 p90 p75 p50 84511.8414031 6485.345 1126.004 379.099280115 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#18155 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA7Wn75RwwYiuMcDLFQlFk3XI3o-NZMVks5uVtz0gaJpZM4LCNrC> .

danieltahara · 2018-08-29T21:31:53Z

Queries in last hour, in bytes:

p99	p95	p90	p75	p50	p25	p10
72700.6023324	10662.0823733	6609.87482699	1130.83317386	379.689311459	101.527534113	46.6143729713

To reiterate -- these are stats for row size, not column/value size.

RLH · 2018-08-29T22:05:23Z

I'm not sure what you are saying, I am looking for the size of the allocation. Is an entire row allocated with a single allocation?

…

On Wed, Aug 29, 2018 at 5:39 PM, Daniel Tahara ***@***.***> wrote: Queries in last hour, in bytes: p99 p95 p90 p75 p50 p25 p10 72700.6023324 10662.0823733 6609.87482699 1130.83317386 379.689311459 101.527534113 46.6143729713 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#18155 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA7Wn671oG5bhpNTRp5aIOk3sX3JfpDXks5uVwoJgaJpZM4LCNrC> .

danieltahara · 2018-08-29T22:08:59Z

The column values are allocated one by one. Unfortunately, I don't have stats at the column level, so I am providing the row size as an upper bound.

LK4D4 · 2018-08-29T23:21:27Z

@RLH Actually, we took another look and looks like MakeString doesn't allocate that much:
https://github.com/dropbox/godropbox/blob/master/database/sqltypes/sqltypes.go#L69
Just structure with interface filed where interface implementation is

type String struct {
	data   []byte
	isUtf8 bool
}

This adds a bitmap indexed by page number that marks the starts of in-use spans. This will be used to quickly find in-use spans with no marked objects for sweeping. For #18155. Change-Id: Icee56f029cde502447193e136fa54a74c74326dd Reviewed-on: https://go-review.googlesource.com/c/138957 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Rick Hudson <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>

This adds a mark bit for each span that is set if any objects on the span are marked. This will be used for sweeping. For #18155. The impact of this is negligible for most benchmarks, and < 1% for GC-heavy benchmarks. name old time/op new time/op delta Garbage/benchmem-MB=64-12 2.18ms ± 0% 2.20ms ± 1% +0.88% (p=0.000 n=16+18) (https://perf.golang.org/search?q=upload:20180928.1) name old time/op new time/op delta BinaryTree17-12 2.68s ± 1% 2.68s ± 1% ~ (p=0.707 n=17+19) Fannkuch11-12 2.28s ± 0% 2.39s ± 0% +4.95% (p=0.000 n=19+18) FmtFprintfEmpty-12 40.3ns ± 4% 39.4ns ± 2% -2.27% (p=0.000 n=17+18) FmtFprintfString-12 67.9ns ± 1% 68.3ns ± 1% +0.55% (p=0.000 n=18+19) FmtFprintfInt-12 75.7ns ± 1% 76.1ns ± 1% +0.44% (p=0.005 n=18+19) FmtFprintfIntInt-12 123ns ± 1% 121ns ± 1% -1.00% (p=0.000 n=18+18) FmtFprintfPrefixedInt-12 150ns ± 0% 148ns ± 0% -1.33% (p=0.000 n=16+13) FmtFprintfFloat-12 208ns ± 0% 204ns ± 0% -1.92% (p=0.000 n=13+17) FmtManyArgs-12 501ns ± 1% 498ns ± 0% -0.55% (p=0.000 n=19+17) GobDecode-12 6.24ms ± 0% 6.25ms ± 1% ~ (p=0.113 n=20+19) GobEncode-12 5.33ms ± 0% 5.29ms ± 1% -0.72% (p=0.000 n=20+18) Gzip-12 220ms ± 1% 218ms ± 1% -1.02% (p=0.000 n=19+19) Gunzip-12 35.5ms ± 0% 35.7ms ± 0% +0.45% (p=0.000 n=16+18) HTTPClientServer-12 77.9µs ± 1% 77.7µs ± 1% -0.30% (p=0.047 n=20+19) JSONEncode-12 8.82ms ± 0% 8.93ms ± 0% +1.20% (p=0.000 n=18+17) JSONDecode-12 47.3ms ± 0% 47.0ms ± 0% -0.49% (p=0.000 n=17+18) Mandelbrot200-12 3.69ms ± 0% 3.68ms ± 0% -0.25% (p=0.000 n=19+18) GoParse-12 3.13ms ± 1% 3.13ms ± 1% ~ (p=0.640 n=20+20) RegexpMatchEasy0_32-12 76.2ns ± 1% 76.2ns ± 1% ~ (p=0.818 n=20+19) RegexpMatchEasy0_1K-12 226ns ± 0% 226ns ± 0% -0.22% (p=0.001 n=17+18) RegexpMatchEasy1_32-12 71.9ns ± 1% 72.0ns ± 1% ~ (p=0.653 n=18+18) RegexpMatchEasy1_1K-12 355ns ± 1% 356ns ± 1% ~ (p=0.160 n=18+19) RegexpMatchMedium_32-12 106ns ± 1% 106ns ± 1% ~ (p=0.325 n=17+20) RegexpMatchMedium_1K-12 31.1µs ± 2% 31.2µs ± 0% +0.59% (p=0.007 n=19+15) RegexpMatchHard_32-12 1.54µs ± 2% 1.53µs ± 2% -0.78% (p=0.021 n=17+18) RegexpMatchHard_1K-12 46.0µs ± 1% 45.9µs ± 1% -0.31% (p=0.025 n=17+19) Revcomp-12 391ms ± 1% 394ms ± 2% +0.80% (p=0.000 n=17+19) Template-12 59.9ms ± 1% 59.9ms ± 1% ~ (p=0.428 n=20+19) TimeParse-12 304ns ± 1% 312ns ± 0% +2.88% (p=0.000 n=20+17) TimeFormat-12 318ns ± 0% 326ns ± 0% +2.64% (p=0.000 n=20+17) (https://perf.golang.org/search?q=upload:20180928.2) Change-Id: I336b9bf054113580a24103192904c8c76593e90e Reviewed-on: https://go-review.googlesource.com/c/138958 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Rick Hudson <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>

When we attempt to allocate an N page span (either for a large allocation or when an mcentral runs dry), we first try to sweep spans to release N pages. Currently, this can be extremely expensive: sweeping a span to emptiness is the hardest thing to ask for and the sweeper generally doesn't know where to even look for potentially fruitful results. Since this is on the critical path of many allocations, this is unfortunate. This CL changes how we reclaim empty spans. Instead of trying lots of spans and hoping for the best, it uses the newly introduced span marks to efficiently find empty spans. The span marks (and in-use bits) are in a dense bitmap, so these spans can be found with an efficient sequential memory scan. This approach can scan for unmarked spans at about 300 GB/ms and can free unmarked spans at about 32 MB/ms. We could probably significantly improve the rate at which is can free unmarked spans, but that's a separate issue. Like the current reclaimer, this is still linear in the number of spans that are swept, but the constant factor is now so vanishingly small that it doesn't matter. The benchmark in #18155 demonstrates both significant page reclaiming delays, and object reclaiming delays. With "-retain-count=20000000 -preallocate=true -loop-count=3", the benchmark demonstrates several page reclaiming delays on the order of 40ms. After this change, the page reclaims are insignificant. The longest sweeps are still ~150ms, but are object reclaiming delays. We'll address those in the next several CLs. Updates #18155. Fixes #21378 by completely replacing the logic that had that bug. Change-Id: Iad80eec11d7fc262d02c8f0761ac6998425c4064 Reviewed-on: https://go-review.googlesource.com/c/138959 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Rick Hudson <[email protected]>

Currently, there's no efficient way to iterate over the Go heap. We're going to need this for fast free page sweeping, so this CL adds a slice of all allocated heap arenas. This will also be useful for generational GC. For #18155. Change-Id: I58d126cfb9c3f61b3125d80b74ccb1b2169efbcc Reviewed-on: https://go-review.googlesource.com/c/138076 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Rick Hudson <[email protected]>

This adds a bitmap indexed by page number that marks the starts of in-use spans. This will be used to quickly find in-use spans with no marked objects for sweeping. For #18155. Change-Id: Icee56f029cde502447193e136fa54a74c74326dd Reviewed-on: https://go-review.googlesource.com/c/138957 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Rick Hudson <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>

This adds a mark bit for each span that is set if any objects on the span are marked. This will be used for sweeping. For #18155. The impact of this is negligible for most benchmarks, and < 1% for GC-heavy benchmarks. name old time/op new time/op delta Garbage/benchmem-MB=64-12 2.18ms ± 0% 2.20ms ± 1% +0.88% (p=0.000 n=16+18) (https://perf.golang.org/search?q=upload:20180928.1) name old time/op new time/op delta BinaryTree17-12 2.68s ± 1% 2.68s ± 1% ~ (p=0.707 n=17+19) Fannkuch11-12 2.28s ± 0% 2.39s ± 0% +4.95% (p=0.000 n=19+18) FmtFprintfEmpty-12 40.3ns ± 4% 39.4ns ± 2% -2.27% (p=0.000 n=17+18) FmtFprintfString-12 67.9ns ± 1% 68.3ns ± 1% +0.55% (p=0.000 n=18+19) FmtFprintfInt-12 75.7ns ± 1% 76.1ns ± 1% +0.44% (p=0.005 n=18+19) FmtFprintfIntInt-12 123ns ± 1% 121ns ± 1% -1.00% (p=0.000 n=18+18) FmtFprintfPrefixedInt-12 150ns ± 0% 148ns ± 0% -1.33% (p=0.000 n=16+13) FmtFprintfFloat-12 208ns ± 0% 204ns ± 0% -1.92% (p=0.000 n=13+17) FmtManyArgs-12 501ns ± 1% 498ns ± 0% -0.55% (p=0.000 n=19+17) GobDecode-12 6.24ms ± 0% 6.25ms ± 1% ~ (p=0.113 n=20+19) GobEncode-12 5.33ms ± 0% 5.29ms ± 1% -0.72% (p=0.000 n=20+18) Gzip-12 220ms ± 1% 218ms ± 1% -1.02% (p=0.000 n=19+19) Gunzip-12 35.5ms ± 0% 35.7ms ± 0% +0.45% (p=0.000 n=16+18) HTTPClientServer-12 77.9µs ± 1% 77.7µs ± 1% -0.30% (p=0.047 n=20+19) JSONEncode-12 8.82ms ± 0% 8.93ms ± 0% +1.20% (p=0.000 n=18+17) JSONDecode-12 47.3ms ± 0% 47.0ms ± 0% -0.49% (p=0.000 n=17+18) Mandelbrot200-12 3.69ms ± 0% 3.68ms ± 0% -0.25% (p=0.000 n=19+18) GoParse-12 3.13ms ± 1% 3.13ms ± 1% ~ (p=0.640 n=20+20) RegexpMatchEasy0_32-12 76.2ns ± 1% 76.2ns ± 1% ~ (p=0.818 n=20+19) RegexpMatchEasy0_1K-12 226ns ± 0% 226ns ± 0% -0.22% (p=0.001 n=17+18) RegexpMatchEasy1_32-12 71.9ns ± 1% 72.0ns ± 1% ~ (p=0.653 n=18+18) RegexpMatchEasy1_1K-12 355ns ± 1% 356ns ± 1% ~ (p=0.160 n=18+19) RegexpMatchMedium_32-12 106ns ± 1% 106ns ± 1% ~ (p=0.325 n=17+20) RegexpMatchMedium_1K-12 31.1µs ± 2% 31.2µs ± 0% +0.59% (p=0.007 n=19+15) RegexpMatchHard_32-12 1.54µs ± 2% 1.53µs ± 2% -0.78% (p=0.021 n=17+18) RegexpMatchHard_1K-12 46.0µs ± 1% 45.9µs ± 1% -0.31% (p=0.025 n=17+19) Revcomp-12 391ms ± 1% 394ms ± 2% +0.80% (p=0.000 n=17+19) Template-12 59.9ms ± 1% 59.9ms ± 1% ~ (p=0.428 n=20+19) TimeParse-12 304ns ± 1% 312ns ± 0% +2.88% (p=0.000 n=20+17) TimeFormat-12 318ns ± 0% 326ns ± 0% +2.64% (p=0.000 n=20+17) (https://perf.golang.org/search?q=upload:20180928.2) Change-Id: I336b9bf054113580a24103192904c8c76593e90e Reviewed-on: https://go-review.googlesource.com/c/138958 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Rick Hudson <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>

When we attempt to allocate an N page span (either for a large allocation or when an mcentral runs dry), we first try to sweep spans to release N pages. Currently, this can be extremely expensive: sweeping a span to emptiness is the hardest thing to ask for and the sweeper generally doesn't know where to even look for potentially fruitful results. Since this is on the critical path of many allocations, this is unfortunate. This CL changes how we reclaim empty spans. Instead of trying lots of spans and hoping for the best, it uses the newly introduced span marks to efficiently find empty spans. The span marks (and in-use bits) are in a dense bitmap, so these spans can be found with an efficient sequential memory scan. This approach can scan for unmarked spans at about 300 GB/ms and can free unmarked spans at about 32 MB/ms. We could probably significantly improve the rate at which is can free unmarked spans, but that's a separate issue. Like the current reclaimer, this is still linear in the number of spans that are swept, but the constant factor is now so vanishingly small that it doesn't matter. The benchmark in #18155 demonstrates both significant page reclaiming delays, and object reclaiming delays. With "-retain-count=20000000 -preallocate=true -loop-count=3", the benchmark demonstrates several page reclaiming delays on the order of 40ms. After this change, the page reclaims are insignificant. The longest sweeps are still ~150ms, but are object reclaiming delays. We'll address those in the next several CLs. Updates #18155. Fixes #21378 by completely replacing the logic that had that bug. Change-Id: Iad80eec11d7fc262d02c8f0761ac6998425c4064 Reviewed-on: https://go-review.googlesource.com/c/138959 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Rick Hudson <[email protected]>

aclements · 2019-04-02T16:09:17Z

Go 1.12 fixed a major source of this latency. While there are still allocation patterns that can in theory lead to long sweep latency, we haven't seen these in our internal tests. Does anybody on this issue still see long sweep latency with Go 1.12? If not, I'll go ahead and call this resolved.

un000 · 2019-04-02T17:53:13Z

@aclements Some pauses still happen when one goroutine allocates a huge block of memory. Is this relevant to the latest changes?

We have a cache inside an application. The cache consists of structs:

type Rates struct {
	a float64
	b float64
	c float64
	d float64
	e float64
}

type Cache struct {
	index []int
	// like map[positionFromindex][geo(static 255 items)]*Rates
        // size: 3M rows with 255 items
	rates [][]*Rates
}

It's ok, but have a lot of pointers(3M). To reduce them we made the slice of slices single-dimensioned because geo is statically defined with 255 items:

type Rates struct {
	Initialized bool
	a float64
	b float64
	c float64
	d float64
	e float64
}
type Cache struct {
	index []int
	rates []Rates
}

And when the application allocates all the slice (3M*255) in the background *the latency inside all response handlers increases *

	c.rates = make([]FixedRates, size)

Nothing changed between 1.11 and 1.12

un000 · 2019-04-02T21:11:32Z

Got an example:

package main

import (
	"fmt"
	"math/rand"
	"os"
	"runtime/trace"
	"time"
)

type Rates struct {
	a, b, c, d float64
}

var maxWait time.Duration

func main() {
	const size = 100e6

	out, err := os.OpenFile("prof", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, os.ModePerm)
	if err != nil {
		panic(err)
	}
	defer func() {
		_ = out.Sync()
		_ = out.Close()
	}()
	err = trace.Start(out)
	if err != nil {
		panic(err)
	}

	go loop()

	time.Sleep(3 * time.Second)
	go func() {
		_ = make([]Rates, size)		// << huge alloc
	}()
	time.Sleep(3 * time.Second)

	trace.Stop()

	fmt.Println("maxWait =", maxWait.String())
}

func loop() {
	for {
		now := time.Now()
		r := make([]Rates, 0)
		for i := 0; i < 100; i++ {
			r = append(r, Rates{
				rand.Float64(), rand.Float64(), rand.Float64(), rand.Float64(),
			})
		}
		d := time.Since(now)
		if maxWait < d {
			maxWait = d
		}
	}
}

~ ᐅ go version
go version go1.12.1 darwin/amd64

So in production we have a bigger pauses

danieltahara · 2019-04-02T21:23:52Z

Our upgrade cycle takes a few months. So feel free to close if no one else surfaces problems, and we can revisit this thread if our 1.12 upgrade reveals anything further.

aclements · 2019-04-03T00:03:36Z

@un000, thanks for the reproducer. That's actually a different issue, since it's not sweep assists causing the delay but rather sweep termination. Could you open a new issue? I started digging into it and found that it's taking a long time to actually stop the world (rather than anything that's happening during sweep termination), but it wasn't obvious what was delaying it by so much.

un000 · 2019-04-03T09:29:05Z

@aclements got it #31222

WillSewell · 2019-04-03T22:28:48Z

When I run the original benchmark originally mentioned by @rhysh, I consistently get pause times of ~15ms with go 1.12.

gopherbot · 2019-07-27T18:36:42Z

Change https://golang.org/cl/187817 mentions this issue: runtime: bound small object sweeping to 100 spans

gopherbot · 2020-04-24T20:33:57Z

Change https://golang.org/cl/229998 mentions this issue: runtime: bound small object sweeping to 100 spans when allocating

Currently, the small object sweeper will sweep until it finds a free slot or there are no more spans of that size class to sweep. In dense heaps, this can cause sweeping for a given size class to take unbounded time, and gets worse with larger heaps. This CL limits the small object sweeper to try at most 100 spans before giving up and allocating a fresh span. Since it's already shown that 100 spans are completely full at that point, the space overhead of this fresh span is at most 1%. This CL is based on an experimental CL by Austin Clements (CL 187817) and is updated to be part of the mcentral implementation, gated by go115NewMCentralImpl. Updates #18155. Change-Id: I37a72c2dcc61dd6f802d1d0eac3683e6642b6ef8 Reviewed-on: https://go-review.googlesource.com/c/go/+/229998 Run-TryBot: Michael Knyszek <[email protected]> Reviewed-by: Austin Clements <[email protected]>

Currently, the small object sweeper will sweep until it finds a free slot or there are no more spans of that size class to sweep. In dense heaps, this can cause sweeping for a given size class to take unbounded time, and gets worse with larger heaps. This CL limits the small object sweeper to try at most 100 spans before giving up and allocating a fresh span. Since it's already shown that 100 spans are completely full at that point, the space overhead of this fresh span is at most 1%. This CL is based on an experimental CL by Austin Clements (CL 187817) and is updated to be part of the mcentral implementation, gated by go115NewMCentralImpl. Updates golang#18155. Change-Id: I37a72c2dcc61dd6f802d1d0eac3683e6642b6ef8 Reviewed-on: https://go-review.googlesource.com/c/go/+/229998 Run-TryBot: Michael Knyszek <[email protected]> Reviewed-by: Austin Clements <[email protected]>

rhysh mentioned this issue Dec 2, 2016

runtime: GC pause time 20% increase in 1.8beta1 compared to 1.7.3 #18161

Closed

aclements added the NeedsFix The path to resolution is known, but the work has not been done. label Dec 6, 2016

aclements added this to the Go1.8Maybe milestone Dec 6, 2016

aclements modified the milestones: Go1.9, Go1.8Maybe Dec 13, 2016

aclements modified the milestones: Go1.10, Go1.9 Jul 10, 2017

aclements mentioned this issue Aug 3, 2017

runtime: assisting mark and sweep make GC uncontrollable in the critical path #21107

Closed

rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

gopherbot modified the milestones: Go1.11, Unplanned May 23, 2018

dr2chase mentioned this issue Jan 14, 2020

runtime: GC causes latency spikes #14812

Closed

thepudds mentioned this issue Feb 7, 2020

runtime: 10ms-26ms latency from GC in go1.14rc1, possibly due to 'GC (idle)' work #37116

Closed

bobotu mentioned this issue May 31, 2020

statistics: avoid large CMSketch affecting the latency of normal query pingcap/tidb#17542

Merged

bobotu mentioned this issue Jun 1, 2020

The large CMSketch may affecting the latency of normal query pingcap/tidb#17546

Closed

loadlj mentioned this issue Jul 3, 2020

Golang 性能分析及优化 loadlj/blog#32

Open

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022

mknyszek added this to Go Compiler / Runtime Jul 13, 2022

mknyszek removed this from Go Compiler / Runtime Jul 13, 2022

gabyhelp mentioned this issue Sep 6, 2024

runtime: non-preemptible zeroing in clear, append, bytes.growSlice, etc #69327

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: latency in sweep assists when there's no garbage #18155

runtime: latency in sweep assists when there's no garbage #18155

rhysh commented Dec 2, 2016

aclements commented Dec 2, 2016

aclements commented Dec 6, 2016

gopherbot commented Dec 12, 2016

aclements commented Dec 12, 2016

rhysh commented Jan 20, 2017

WillSewell commented Feb 14, 2017 •

edited

Loading

gopherbot commented Aug 31, 2017

cespare commented Feb 24, 2018

aclements commented Feb 24, 2018 via email

cespare commented Feb 28, 2018

danieltahara commented Aug 27, 2018 •

edited

Loading

RLH commented Aug 29, 2018 via email

danieltahara commented Aug 29, 2018

RLH commented Aug 29, 2018 via email

danieltahara commented Aug 29, 2018 •

edited

Loading

RLH commented Aug 29, 2018 via email

danieltahara commented Aug 29, 2018 via email •

edited

Loading

LK4D4 commented Aug 29, 2018

aclements commented Apr 2, 2019

un000 commented Apr 2, 2019 •

edited

Loading

un000 commented Apr 2, 2019 •

edited

Loading

danieltahara commented Apr 2, 2019

aclements commented Apr 3, 2019

un000 commented Apr 3, 2019

WillSewell commented Apr 3, 2019

gopherbot commented Jul 27, 2019

gopherbot commented Apr 24, 2020

runtime: latency in sweep assists when there's no garbage #18155

runtime: latency in sweep assists when there's no garbage #18155

Comments

rhysh commented Dec 2, 2016

aclements commented Dec 2, 2016

aclements commented Dec 6, 2016

gopherbot commented Dec 12, 2016

aclements commented Dec 12, 2016

rhysh commented Jan 20, 2017

WillSewell commented Feb 14, 2017 • edited Loading

gopherbot commented Aug 31, 2017

cespare commented Feb 24, 2018

aclements commented Feb 24, 2018 via email

cespare commented Feb 28, 2018

danieltahara commented Aug 27, 2018 • edited Loading

RLH commented Aug 29, 2018 via email

danieltahara commented Aug 29, 2018

RLH commented Aug 29, 2018 via email

danieltahara commented Aug 29, 2018 • edited Loading

RLH commented Aug 29, 2018 via email

danieltahara commented Aug 29, 2018 via email • edited Loading

LK4D4 commented Aug 29, 2018

aclements commented Apr 2, 2019

un000 commented Apr 2, 2019 • edited Loading

un000 commented Apr 2, 2019 • edited Loading

danieltahara commented Apr 2, 2019

aclements commented Apr 3, 2019

un000 commented Apr 3, 2019

WillSewell commented Apr 3, 2019

gopherbot commented Jul 27, 2019

gopherbot commented Apr 24, 2020

WillSewell commented Feb 14, 2017 •

edited

Loading

danieltahara commented Aug 27, 2018 •

edited

Loading

danieltahara commented Aug 29, 2018 •

edited

Loading

danieltahara commented Aug 29, 2018 via email •

edited

Loading

un000 commented Apr 2, 2019 •

edited

Loading

un000 commented Apr 2, 2019 •

edited

Loading