Commit 0ecc9c7

rhysh authored and gopherbot committed
design/68578-mutex-spinbit.md: describe protocol
Add a design doc describing the general approach of the "spinbit" mutex
protocol, and the details of the two drafts that implement it.

Based on futex, for linux/amd64: https://go.dev/cl/601597
For all GOOS values and four architectures: https://go.dev/cl/620435

For golang/go#68578

Change-Id: Ie9665085c9b8cf1741deeb431acfa12fba550b63
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/617618
Auto-Submit: Rhys Hiltner <[email protected]>
Reviewed-by: Rhys Hiltner <[email protected]>
Commit-Queue: Rhys Hiltner <[email protected]>
1 parent 3d7db7d commit 0ecc9c7

File tree

3 files changed: +803 -0 lines changed

design/68578-mutex-spinbit.md

+258
# Proposal: Improve scalability of runtime.lock2

Author(s): Rhys Hiltner

Last updated: 2024-10-16

Discussion at https://go.dev/issue/68578.

## Abstract

Improve multi-core scalability of the runtime's internal mutex implementation
by minimizing wakeups of waiting threads.

Avoiding wakeups of threads that are waiting for the lock allows those threads
to sleep for longer.
That reduces the number of concurrent threads that are attempting to read the
mutex's state word.
Fewer reads of that cache line mean less cache coherency traffic within the
processor when a thread needs to make an update.
Fast updates (to acquire and release the lock), even when many threads need
the lock, mean better scalability.

This is not an API change, so it is not part of the formal proposal process.

## Background

One of the simplest mutex designs is a single bit that is "0" when unlocked or
"1" when locked.
To acquire the lock, a thread attempts to swap in a "1",
looping until the result it gets is "0".
To unlock, the thread swaps in a "0".

The performance of such a spinlock is poor in at least two ways.
First, threads that are trying to acquire an already-held lock waste their own
on-CPU time.
Second, those software threads execute on hardware resources that need a local
copy of the mutex state word in cache.

Having the state word in cache for read access requires that it not be
writable by any other processor.
Writing to that memory location requires the hardware to invalidate all cached
copies of that memory, one in each processor that had loaded it for reading.
The hardware-internal communication necessary to implement those guarantees
has a cost, which appears as a slowdown when writing to that memory location.

Go's current mutex design is several steps more advanced than the simple
spinlock, but under certain conditions its performance can degrade in a
similar way.
First, when `runtime.lock2` is unable to immediately obtain the mutex it will
pause for a moment before retrying, primarily using hardware-level delay
instructions (such as `PAUSE` on 386 and amd64).
Then, if it's unable to acquire the mutex after several retries it will ask
the OS to put it to sleep until another thread requests a wakeup.
On Linux, we use the `futex` syscall to sleep directly on the mutex address,
implemented in src/runtime/lock_futex.go.
On many other platforms (including Windows and macOS), the waiting threads
form a LIFO stack with the mutex state word as a pointer to the top of the
stack, implemented in src/runtime/lock_sema.go.

When the `futex` syscall is available,
the OS maintains a list of waiting threads and will choose which it wakes.
Otherwise, the Go runtime maintains that list and names a specific thread
when it asks the OS to do a wakeup.
To avoid a `futex` syscall when there's no contention,
we split the "locked" state into two variants:
1 meaning "locked with no contention" and
2 meaning "locked, and a thread may be asleep".
(With the semaphore-based implementation,
the Go runtime can--and must--know for itself whether a thread is asleep.)
Go's mutex implementation has those three logical states
(unlocked, locked, locked-with-sleepers) on all multi-threaded platforms.
For the purposes of the Go runtime, I'm calling this design "tristate".

After releasing the mutex,
`runtime.unlock2` will wake a thread whenever one is sleeping.
It does not consider whether one of the waiting threads is already awake.
If a waiting thread is already awake, it's not necessary to wake another.

Waking additional threads results in higher concurrent demand for the mutex
state word's cache line.
Every thread that is awake and spinning in a loop to reload the state word
leads to more cache coherency traffic within the processor,
and to slower writes to that cache line.

Consider the case where many threads all need to use the same mutex many times
in a row.
Furthermore, consider that the critical section is short relative to the time
it takes a thread to give up on spinning and go (back) to sleep.
At the end of each critical section, the thread that is releasing the mutex
will see that a waiting thread is asleep, and will wake it.
It takes a relatively long time for a thread to decide to go to sleep,
and there's a relatively short time until the next `runtime.unlock2` call will
wake it.
Many threads will be awake, all reloading the state word in a loop,
all slowing down updates to its value.

Without a limit on the number of threads that can spin on the state word,
higher demand for a mutex value degrades its performance.

See also https://go.dev/issue/68578.

## Proposal

Expand the mutex state word to include a new flag, "spinning".
Use the "spinning" bit to communicate whether one of the waiting threads is
awake and looping while trying to acquire the lock.
Threads mutually exclude each other from the "spinning" state,
but they won't block while attempting to acquire the bit.
Only the thread that owns the "spinning" bit is allowed to reload the state
word in a loop.
It releases the "spinning" bit before going to sleep.
The other waiting threads go directly to sleep.
The thread that unlocks a mutex can avoid waking a thread if it sees that one
is already awake and spinning.
For the purposes of the Go runtime, I'm calling this design "spinbit".

### futex-based option, https://go.dev/cl/601597

I've prepared https://go.dev/cl/601597,
which implements the "spinbit" design for GOOS=linux and GOARCH=amd64.
I've prepared a matching [TLA+ model](./68578/spinbit.tla)
to check for lost wakeups.
(When relying on the `futex` syscall to maintain the list of sleeping Ms,
it's easy to write lost-wakeup bugs.)

It uses an atomic `Xchg8` operation on two different bytes of the mutex state
word.
The low byte records whether the mutex is locked,
and whether one or more waiting Ms may be asleep.
The "spinning" flag is in a separate byte and so can be independently
manipulated with atomic `Xchg8` operations.
The two bytes are within a single uintptr field (`runtime.mutex.key`).
When the spinning M attempts to acquire the lock,
it can do a CAS on the entire state word,
setting the "locked" flag and clearing the "spinning" flag
in a single operation.

### Cross-OS option, https://go.dev/cl/620435

I've also prepared https://go.dev/cl/620435, which unifies the lock_sema.go and
lock_futex.go implementations and so supports all GOOS values for which Go
supports multiple threads.
(It uses `futex` to implement the `runtime.sema{create,sleep,wakeup}`
functions for lock_futex.go platforms.)
Go's development branch now includes `Xchg8` support for
GOARCH=amd64,arm64,ppc64,ppc64le,
and so that CL supports all of those architectures.

The fast paths for `runtime.lock2` and `runtime.unlock2` use `Xchg8` operations
to interact with the "locked" flag.
The lowest byte of the state word is dedicated to use with those `Xchg8`
operations.
Most of the upper bytes hold a partial pointer to an M.
(The `runtime.m` data structure is large enough to allow reconstructing the low
bits from the partial pointer,
with special handling for the non-heap-allocated `runtime.m0` value.)
Beyond the 8 bits needed for use with `Xchg8`,
a few more low bits are available for use as flags.
One of those bits holds the "spinning" flag,
which is manipulated with pointer-length `Load` and `CAS` operations.

When Ms go to sleep they form a LIFO stack linked via `runtime.m.nextwaitm`
pointers, as lock_sema.go does today.
The list of waiting Ms is a multi-producer, single-consumer stack.
Each M can add itself,
but inspecting or removing Ms requires exclusive access.
Today, lock_sema.go's `runtime.unlock2` uses the mutex itself to control that
ownership.
That puts any use of the sleeping M list in the critical path of the mutex.

My proposal uses another bit of the state word as a try-lock to control
inspecting and removing Ms from the list.
This allows additional list-management code without slowing the critical path
of a busy mutex, and use of efficient `Xchg8` operations in the fast paths.
We'll need access to the list in order to attribute contention delay to the
right critical section in the [mutex profile](https://go.dev/issue/66999).
Access to the list will also let us periodically wake an M even when it's not
strictly necessary, to combat tail latency that may be introduced by the
reduction in wakeups.


Here's the full layout of the `runtime.mutex.key` state word:

- Bit 0 holds the "locked" flag, the primary purpose of the mutex.
- Bit 1 is the "sleeping" flag, and is set when the upper bits point to an M.
- Bits 2 through 7 are unused, since they're lost with every `Xchg8` operation.
- Bit 8 holds the "spinning" try-lock, allowing the holder to reload the state
  word in a loop.
- Bit 9 holds the "stack" try-lock, allowing the holder to inspect and remove
  sleeping Ms from the LIFO stack.
- Bits 10 and higher of the state word hold bits 10 and higher of a pointer to
  the M at the top of the LIFO stack of sleeping waiters.

## Rationale

The status quo is a `runtime.lock2` implementation that experiences congestion
collapse under high contention on machines with many hardware threads.
Addressing that will require fewer threads loading the same cache line in a
loop.

The literature presents several options for scalable, non-collapsing mutexes.
Some require additional memory footprint for each mutex, in proportion to
the number of threads that may seek to acquire the lock.
Some require threads to store a reference to a value that they will use to
release each lock they hold.
Go includes a `runtime.mutex` as part of every `chan`, and in some
applications those values are the ones with the most contention.
Coupled with `select`, there's no limit to the number of mutexes that an M can
hold.
That means neither of those forms of increased memory footprint is acceptable.

The performance of fully uncontended `runtime.lock2`/`runtime.unlock2` pairs
is also important to the project.
That limits the use of many of the literature's proposed locking algorithms,
if they include FIFO queue handoff behavior.
On my test hardware
(a linux/amd64 machine with an i7-13700H, and a darwin/arm64 M1),
a `runtime.mutex` value with zero or moderate contention can support
50,000,000 uses per second on any thread,
or can move between active threads 10,000,000 times per second,
or can move between inactive threads (with sleep mediated by the OS)
about 40,000 to 500,000 times per second (depending on the OS).
Some amount of capture or barging, rather than queueing, is required to
maintain the level of throughput that Go users have come to expect.

Keeping the size of `runtime.mutex` values as they are today, while allowing
threads to sleep with fewer interruptions, seems like fulfilling the goal of
the original design.
The main disadvantage I know of is the risk of increased tail latency:
a small set of threads may be able to capture a contended mutex,
passing it back and forth among themselves while the other threads sleep
indefinitely.
That's already a risk of the current lock_sema.go implementation,
but the high volume of wakeups means threads are unlikely to sleep for long,
and the list of sleeping threads may regularly dip to zero.

The "cross-OS" option has an edge here:
with it, the Go runtime maintains an explicit list of sleeping Ms and so can do
targeted wakeups or even direct handoffs to reduce starvation.

## Compatibility

There is no change in exported APIs.

## Implementation

I've prepared two options for the Go 1.24 release cycle.
One relies on the `futex` syscall and the `Xchg8` operation, and so initially
supports GOOS=linux and GOARCH=amd64: https://go.dev/cl/601597.
The other relies only on the `Xchg8` operation and works with any GOOS value
that supports threads: https://go.dev/cl/620435.
Both are controlled by `GOEXPERIMENT=spinbitmutex`,
enabled by default on supported platforms.

## Open issues (if applicable)

I appreciate feedback on the balance between simplicity,
performance at zero or low contention,
performance under extreme contention,
the performance and maintenance burden for non-first-class ports,
and the accuracy of contention profiles.

design/68578/spinbit.cfg

+12
SPECIFICATION Spec

INVARIANT TypeInvariant
PROPERTY NoLostWakeups
PROPERTY HaveAcquisitions

CONSTANT
NULL = NULL
NumThreads = 3
NumAcquires = 1
NumSpins = 0
WakeAny = TRUE