runtime: lock cycle between malloc and execution tracer #53979
Change https://go.dev/cl/418716 mentions this issue.
Change https://go.dev/cl/418720 mentions this issue.
Change https://go.dev/cl/418957 mentions this issue.
Change https://go.dev/cl/418955 mentions this issue.
Change https://go.dev/cl/418956 mentions this issue.
We're missing lock edges to trace.lock that happen only rarely. Any trace event can potentially fill up a trace buffer and acquire trace.lock in order to flush the buffer, but this happens relatively rarely, so we simply haven't seen some of these lock edges that could happen.

With this change, we promote "fin, notifyList < traceStackTab" to "fin, notifyList < trace" and now everything that emits trace events with a P enters the tracer lock ranks via "trace", rather than some things entering at "trace" and others at "traceStackTab".

This was found by inspecting the rank graph for things that didn't make sense.

Ideally we would add a mayAcquire annotation that any trace event can potentially acquire trace.lock, but there are actually cases that violate this ranking right now. This is #53979. The chance of a lock cycle is extremely low given the number of conditions that have to happen simultaneously.

For #53789.

Change-Id: Ic65947d27dee88d2daf639b21b2c9d37552f0ac0
Reviewed-on: https://go-review.googlesource.com/c/go/+/418716
Reviewed-by: Michael Pratt <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Change https://go.dev/cl/422955 mentions this issue.
Change https://go.dev/cl/422954 mentions this issue.
Currently, the stack frame of (*traceStackTable).dump is 68KiB. We're about to move (*traceStackTable).dump to the system stack, where we often don't have this much room. 5140 bytes of this is an on-stack temporary buffer for constructing potentially large trace events before copying these out to the actual trace buffer. Reduce the stack frame size by writing these events directly to the trace buffer rather than temporary space.

This introduces a couple complications:

- The trace event starts with a varint encoding the event payload's length in bytes. These events are large and somewhat complicated, so it's hard to know the size ahead of time. That's not a problem with the temporary buffer because we can just construct the event and see how long it is. In order to support writing directly to the trace buffer, we reserve enough bytes for a maximum size varint and add support for populating a reserved space after the fact.

- Emitting a stack event calls traceFrameForPC, which can itself emit string events. If these were emitted in the middle of the stack event, it would corrupt the stream. We already allocate a []Frame to convert the PC slice to frames, and then convert each Frame into a traceFrame with trace string IDs, so we address this by combining these two steps into one so that all trace string events are emitted before we start constructing the stack event.

For #53979.

Change-Id: Ie60704be95199559c426b551f8e119b14e06ddac
Reviewed-on: https://go-review.googlesource.com/c/go/+/422954
Run-TryBot: Austin Clements <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Following up on the previous CL, this CL removes an unnecessary stack copy of a large object in a range loop. This drops another 64 KiB from (*traceStackTable).dump's stack frame so it is now roughly 80 bytes depending on architecture, which will easily fit on the system stack.

For #53979.

Change-Id: I16f642f6f1982d0ed0a62371bf2e19379e5870eb
Reviewed-on: https://go-review.googlesource.com/c/go/+/422955
Reviewed-by: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
We're about to require that all uses of trace.lock be on the system stack. That's mostly easy, except that it's involved in parking the trace reader. Fix this by changing that parking protocol so it instead synchronizes through an atomic.

For #53979.

Change-Id: Icd6db8678dd01094029d7ad1c612029f571b4cbb
Reviewed-on: https://go-review.googlesource.com/c/go/+/418955
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Currently, trace.lock can be acquired while on a user G and stack splits can happen while holding trace.lock. That means every lock used by the stack allocator must be okay to acquire while holding trace.lock, including various locks related to span allocation. In turn, we cannot safely emit trace events while holding any allocation-related locks because this would cause a cycle in the lock rank graph.

To fix this, require that trace.lock only be acquired on the system stack, like mheap.lock. This pushes it into the "bottom half" and eliminates the lock rank relationship between tracing and stack allocation, making it safe to emit trace events in many more places.

One subtlety is that the trace code has race annotations and uses maps, which have race annotations. By default, we can't have race annotations on the system stack, so we borrow the user race context for these situations.

We'll update the lock graph itself in the next CL.

For #53979. This CL technically fixes the problem, but the lock rank checker doesn't know that yet.

Change-Id: I9f5187a9c52a67bee4f7064db124b1ad53e5178f
Reviewed-on: https://go-review.googlesource.com/c/go/+/418956
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Now that we've moved the trace locks to the leaf of the lock graph, we can safely annotate that any trace event may acquire trace.lock even if dynamically it turns out a particular event doesn't need to flush and acquire this lock.

This reveals a new edge where we can trace while holding the mheap lock, so we add this to the lock graph.

For #53789.
Updates #53979.

Change-Id: I13e2f6cd1b621cca4bed0cc13ef12e64d05c89a7
Reviewed-on: https://go-review.googlesource.com/c/go/+/418720
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
What version of Go are you using (`go version`)?

Current HEAD (2aa473c)

Does this issue reproduce with the latest release?

Yes, though via a different code path.

What did you do?

If the execution tracer is enabled, there's a potential though rare deadlock via a rank cycle on `mheap_.lock` and `trace.lock`:

A: `setGCPercent` or `setMemoryLimit` acquire `mheap_.lock` -> `gcControllerCommit` -> `traceHeapGoal` -> `traceEvent` -> `traceEventLocked` -> `traceFlush` -> acquires `trace.lock`

B: `traceFlush` acquires `trace.lock` -> triggers stack growth -> stack allocator calls `mheap.allocManual` -> `mheap.allocSpan` -> acquires `mheap_.lock`

Path "A" violates the current lock ranking. I discovered this when I added a "may acquire" annotation on `traceEvent`. But I think path "B" may be the real problem. Because stack growth can happen while holding `trace.lock`, it's pretty high in the ranking (has a low rank value). But this means that anything that holds any locks further down in the ranking, like the memory allocator, can't safely create trace events.

I wonder if, like `mheap.lock`, we should say that `trace.lock` can only be acquired on the system stack so stack growth can never happen. I think that would push tracing down to the leaves of the rank graph, rather than it being smack in the middle.

/cc @mknyszek @golang/runtime