runtime: GC should wake up idle Ps #14179

aclements · 2016-02-01T19:38:31Z

Currently the GC doesn't always wake up idle Ps, and hence may not take full advantage of idle marking during the concurrent mark phase. This can happen during mark 2 because mark 1 completion preempts all workers; if the Ps running those workers have nothing else to do, they simply park, and there's no mechanism to wake them up after we allow mark workers to start again. It's possible this can happen during mark 1 as well, though it may be that all of the Ps do start running there, since we allow mark workers to run before starting the world for mark 1.

/cc @RLH

rsc · 2016-05-18T01:07:06Z

ping @aclements for 1.7 vs 1.8 triage

aclements · 2016-05-18T02:46:27Z

Not a serious issue, so moving to 1.8.

aclements · 2016-06-01T19:03:24Z

This is in fact an issue during mark 1, but for a different reason: I believe we do always start the idle workers for mark 1, but it looks like they can sometimes run out of work very early and exit, causing the Ps to go idle again. We don't wake those up once more work is enqueued, so they stay idle.

quentinmit · 2016-09-28T22:28:48Z

@aclements Are we still on track to do this for Go1.8?

gopherbot · 2016-10-31T03:00:59Z

CL https://golang.org/cl/32434 mentions this issue.

We have seen one instance of a production job suddenly spinning to 100% CPU and becoming unresponsive. In that one instance, a SIGQUIT was sent after 328 minutes of spinning, and the stacks showed a single goroutine in "IO wait (scan)" state.

Looking for things that might get stuck if a goroutine got stuck in scanning a stack, we found that injectglist does:

	lock(&sched.lock)
	var n int
	for n = 0; glist != nil; n++ {
		gp := glist
		glist = gp.schedlink.ptr()
		casgstatus(gp, _Gwaiting, _Grunnable)
		globrunqput(gp)
	}
	unlock(&sched.lock)

and that casgstatus spins on gp.atomicstatus until the _Gscan bit goes away. Essentially, this code locks sched.lock and then while holding sched.lock, waits to lock gp.atomicstatus.

The code that is doing the scan is:

	if castogscanstatus(gp, s, s|_Gscan) {
		if !gp.gcscandone {
			scanstack(gp, gcw)
			gp.gcscandone = true
		}
		restartg(gp)
		break loop
	}

More analysis showed that scanstack can, in a rare case, end up calling back into code that acquires sched.lock. For example:

	runtime.scanstack at proc.go:866
	calls runtime.gentraceback at mgcmark.go:842
	calls runtime.scanstack$1 at traceback.go:378
	calls runtime.scanframeworker at mgcmark.go:819
	calls runtime.scanblock at mgcmark.go:904
	calls runtime.greyobject at mgcmark.go:1221
	calls (*runtime.gcWork).put at mgcmark.go:1412
	calls (*runtime.gcControllerState).enlistWorker at mgcwork.go:127
	calls runtime.wakep at mgc.go:632
	calls runtime.startm at proc.go:1779
	acquires runtime.sched.lock at proc.go:1675

This path was found with an automated deadlock-detecting tool. There are many such paths but they all go through enlistWorker -> wakep.

The evidence strongly suggests that one of these paths is what caused the deadlock we observed. We're running those jobs with GOTRACEBACK=crash now to try to get more information if it happens again.

Further refinement and analysis shows that if we drop the wakep call from enlistWorker, the remaining few deadlock cycles found by the tool are all false positives caused by not understanding the effect of calls to func variables.

The enlistWorker -> wakep call was intended only as a performance optimization, it rarely executes, and if it does execute at just the wrong time it can (and plausibly did) cause the deadlock we saw.

Comment it out, to avoid the potential deadlock.

Fixes #19112.
Unfixes #14179.

Change-Id: I6f7e10b890b991c11e79fab7aeefaf70b5d5a07b
Reviewed-on: https://go-review.googlesource.com/37093
Run-TryBot: Russ Cox <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
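The cycle described in the commit message can be reproduced in a small model. This is a sketch under stated assumptions, not the runtime's code: `stuck`, `scanBit`, `waiting`, and `runnable` are hypothetical stand-ins, a `sync.Mutex` plays the role of `sched.lock`, and a plain `uint32` plays the role of `gp.atomicstatus`. Instead of actually deadlocking, the model makes a single attempt at each blocking step (using `Mutex.TryLock`, which needs Go 1.18 or later) and reports whether both sides would be waiting on each other.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Hypothetical stand-ins for the runtime's status values and _Gscan bit.
const (
	runnable = 1
	waiting  = 4
	scanBit  = 0x1000
)

// stuck reports whether, in this toy model, the injector and the scanner
// end up each waiting for something the other one holds.
// scannerWakes selects whether the scanner tries to take the scheduler
// lock while it still owns the scan bit (the enlistWorker -> wakep path).
func stuck(scannerWakes bool) bool {
	var schedLock sync.Mutex  // stands in for sched.lock
	status := uint32(waiting) // stands in for gp.atomicstatus

	// Scanner: claim the goroutine by setting the scan bit.
	if !atomic.CompareAndSwapUint32(&status, waiting, waiting|scanBit) {
		return false
	}

	// Injector: take the scheduler lock, then try the status transition.
	// The real casgstatus would spin here until the scan bit clears.
	schedLock.Lock()
	defer schedLock.Unlock()
	injectorBlocked := !atomic.CompareAndSwapUint32(&status, waiting, runnable)

	// Scanner: still holding the scan bit, optionally try to wake a P,
	// which requires the scheduler lock the injector is holding.
	scannerBlocked := false
	if scannerWakes {
		if schedLock.TryLock() {
			schedLock.Unlock()
		} else {
			scannerBlocked = true
		}
	}

	// A cycle exists only if each side waits on the other.
	return injectorBlocked && scannerBlocked
}

func main() {
	fmt.Println("with wake-up from scanner:", stuck(true))
	fmt.Println("without wake-up from scanner:", stuck(false))
}
```

With the wake-up attempt, both parties are blocked and the model reports a cycle; without it, the scanner never needs the scheduler lock, so it can finish and clear the scan bit. That mirrors the reasoning for dropping the wakep call, at the cost of the performance optimization this issue asks for.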

gopherbot · 2017-02-15T23:55:11Z

CL https://golang.org/cl/37022 mentions this issue.

rsc · 2017-02-16T03:05:32Z

For the record, unfixing this in Go 1.8 slowed the garbage benchmark by about 4%, STW time by about 3%. It does seem worthwhile to fix again for Go 1.9.

rsc · 2017-02-16T03:27:53Z

FWIW, the previous claim was based on 40+ runs of each benchmark, but a second run of benchmark results is not showing any difference. So possibly no win at all to fixing this. Easy to run the experiment; hard to interpret the results.


aclements self-assigned this Feb 1, 2016

aclements added this to the Go1.7 milestone Feb 1, 2016

aclements modified the milestones: Go1.8, Go1.7, Go1.8Early May 18, 2016

quentinmit added the NeedsFix label Sep 28, 2016

rsc modified the milestones: Go1.8, Go1.8Early Oct 20, 2016

gopherbot closed this as completed in 0bae74e Nov 20, 2016

rhysh mentioned this issue Nov 28, 2016

runtime: aggressive GC completion is disruptive to co-tenants #17969

Open

rsc reopened this Feb 15, 2017

aclements modified the milestones: Go1.9Early, Go1.8 Feb 15, 2017

bradfitz modified the milestones: Go1.9Maybe, Go1.9Early May 3, 2017

bradfitz modified the milestones: Go1.10, Go1.9Maybe Jul 14, 2017

aclements mentioned this issue Jul 20, 2017

runtime: network blips during concurrent GC #20457

Closed

rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

bradfitz modified the milestones: Go1.11, Go1.12 Jun 20, 2018

aclements modified the milestones: Go1.12, Go1.13 Dec 18, 2018

aclements added the Performance label May 28, 2019

aclements modified the milestones: Go1.13, Go1.14 Jun 25, 2019

rsc modified the milestones: Go1.14, Backlog Oct 9, 2019

thepudds mentioned this issue Feb 7, 2020

runtime: 10ms-26ms latency from GC in go1.14rc1, possibly due to 'GC (idle)' work #37116

Closed

rsc unassigned aclements Jun 23, 2022

gopherbot added the compiler/runtime label Jul 7, 2022

mknyszek added this to Go Compiler / Runtime Jul 7, 2022

mknyszek moved this to Triage Backlog in Go Compiler / Runtime Jul 15, 2022
