# Proposal: Improve scalability of runtime.lock2

Author(s): Rhys Hiltner

Last updated: 2024-10-16

Discussion at https://go.dev/issue/68578.
## Abstract

Improve multi-core scalability of the runtime's internal mutex implementation
by minimizing wakeups of waiting threads.

Avoiding wakeups of threads that are waiting for the lock allows those threads
to sleep for longer.
That reduces the number of concurrent threads that are attempting to read the
mutex's state word.
Fewer reads of that cache line mean less cache coherency traffic within the
processor when a thread needs to make an update.
Fast updates (to acquire and release the lock) even when many threads need the
lock mean better scalability.

This is not an API change, so it is not part of the formal proposal process.
## Background

One of the simplest mutex designs is a single bit that is "0" when unlocked or
"1" when locked.
To acquire the lock, a thread attempts to swap in a "1",
looping until the result it gets is "0".
To unlock, the thread swaps in a "0".

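As an illustration (not the runtime's code), that single-bit design can be
sketched in Go with `sync/atomic`; the `spinlock` type and `run` harness
below are hypothetical names:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spinlock is the single-bit design described above: the state word is
// 1 when held and 0 when free.
type spinlock struct{ state uint32 }

func (l *spinlock) lock() {
	// Swap in a "1", looping until the previous value was "0".
	for atomic.SwapUint32(&l.state, 1) != 0 {
	}
}

func (l *spinlock) unlock() {
	// Swap in a "0" to release.
	atomic.StoreUint32(&l.state, 0)
}

// run exercises the lock from several goroutines; the counter is exact
// only if the lock provides mutual exclusion.
func run() int {
	var (
		l       spinlock
		counter int
		wg      sync.WaitGroup
	)
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				l.lock()
				counter++
				l.unlock()
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(run()) // 8000
}
```
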
The performance of such a spinlock is poor in at least two ways.
First, threads that are trying to acquire an already-held lock waste their own
on-CPU time.
Second, those software threads execute on hardware resources that need a local
copy of the mutex state word in cache.

Having the state word in cache for read access requires that it not be
writable by any other processors.
Writing to that memory location requires the hardware to invalidate all cached
copies of that memory, one in each processor that had loaded it for reading.
The hardware-internal communication necessary to implement those guarantees
has a cost, which appears as a slowdown when writing to that memory location.

Go's current mutex design is several steps more advanced than the simple
spinlock, but under certain conditions its performance can degrade in a
similar way.
First, when `runtime.lock2` is unable to immediately obtain the mutex it will
pause for a moment before retrying, primarily using hardware-level delay
instructions (such as `PAUSE` on 386 and amd64).
Then, if it's unable to acquire the mutex after several retries it will ask
the OS to put it to sleep until another thread requests a wakeup.
On Linux, we use the `futex` syscall to sleep directly on the mutex address,
implemented in src/runtime/lock_futex.go.
On many other platforms (including Windows and macOS), the waiting threads
form a LIFO stack with the mutex state word as a pointer to the top of the
stack, implemented in src/runtime/lock_sema.go.

When the `futex` syscall is available,
the OS maintains a list of waiting threads and will choose which it wakes.
Otherwise, the Go runtime maintains that list and names a specific thread
when it asks the OS to do a wakeup.
To avoid a `futex` syscall when there's no contention,
we split the "locked" state into two variants:
1 meaning "locked with no contention" and
2 meaning "locked, and a thread may be asleep".
(With the semaphore-based implementation,
the Go runtime can, and must, know for itself whether a thread is asleep.)
Go's mutex implementation has those three logical states
(unlocked, locked, locked-with-sleepers) on all multi-threaded platforms.
For the purposes of the Go runtime, I'm calling this design "tristate".

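A behavioral sketch of that tristate state machine follows, with
`runtime.Gosched` standing in for the `futex` sleep and wakeup (the syscall
isn't reachable from portable Go), so this models only the state
transitions, not the actual blocking:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// The three tristate values described above.
const (
	unlocked  = 0 // no thread holds the lock
	locked    = 1 // locked with no contention
	contended = 2 // locked, and a thread may be asleep
)

func lock(state *uint32) {
	if atomic.CompareAndSwapUint32(state, unlocked, locked) {
		return // uncontended fast path: 0 -> 1
	}
	for {
		// Advertise contention (1 -> 2) so the eventual unlocker
		// knows a wakeup may be needed, then "sleep".
		if atomic.LoadUint32(state) == contended ||
			atomic.CompareAndSwapUint32(state, locked, contended) {
			runtime.Gosched() // stand-in for futexsleep on the state word
		}
		// Acquire while keeping the may-have-sleepers marker: 0 -> 2.
		if atomic.CompareAndSwapUint32(state, unlocked, contended) {
			return
		}
	}
}

func unlock(state *uint32) {
	if atomic.SwapUint32(state, unlocked) == contended {
		// A thread may be asleep; the runtime calls futexwakeup here.
		// In this sketch the Gosched-based waiters wake on their own.
	}
}

func run() int {
	var state uint32
	var counter int
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				lock(&state)
				counter++
				unlock(&state)
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(run()) // 4000
}
```
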
After releasing the mutex,
`runtime.unlock2` will wake a thread whenever one is sleeping.
It does not consider whether one of the waiting threads is already awake.
If a waiting thread is already awake, it's not necessary to wake another.

Waking additional threads results in higher concurrent demand for the mutex
state word's cache line.
Every thread that is awake and spinning in a loop to reload the state word
leads to more cache coherency traffic within the processor,
and to slower writes to that cache line.

Consider the case where many threads all need to use the same mutex many times
in a row.
Furthermore, consider that the critical section is short relative to the time
it takes a thread to give up on spinning and go (back) to sleep.
At the end of each critical section, the thread that is releasing the mutex
will see that a waiting thread is asleep, and will wake it.
It takes a relatively long time for a thread to decide to go to sleep,
and there's a relatively short time until the next `runtime.unlock2` call will
wake it.
Many threads will be awake, all reloading the state word in a loop,
all slowing down updates to its value.

Without a limit on the number of threads that can spin on the state word,
higher demand for a mutex value degrades its performance.

See also https://go.dev/issue/68578.

## Proposal

Expand the mutex state word to include a new flag, "spinning".
Use the "spinning" bit to communicate whether one of the waiting threads is
awake and looping while trying to acquire the lock.
Threads mutually exclude each other from the "spinning" state,
but they won't block while attempting to acquire the bit.
Only the thread that owns the "spinning" bit is allowed to reload the state
word in a loop.
It releases the "spinning" bit before going to sleep.
The other waiting threads go directly to sleep.
The thread that unlocks a mutex can avoid waking a thread if it sees that one
is already awake and spinning.
For the purposes of the Go runtime, I'm calling this design "spinbit".

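A portable sketch of that protocol is below, assuming a single 32-bit state
word manipulated with CAS (rather than the `Xchg8`-on-separate-bytes layout
the CLs use) and `runtime.Gosched` in place of both real sleeping and the
hardware delay instruction; the flag names are illustrative:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

const (
	mutexLocked   uint32 = 1 << 0 // the lock itself
	mutexSpinning uint32 = 1 << 1 // one waiter is awake and reloading
)

// lock implements the "spinbit" idea: at most one waiter holds the
// spinning bit and reloads the state word in a loop; other waiters
// go straight to sleep (here: yield).
func lock(state *uint32) {
	weSpin := false
	for {
		old := atomic.LoadUint32(state)
		if old&mutexLocked == 0 {
			next := old | mutexLocked
			if weSpin {
				next &^= mutexSpinning // release the spin token as we acquire
			}
			if atomic.CompareAndSwapUint32(state, old, next) {
				return
			}
			continue
		}
		if !weSpin && old&mutexSpinning == 0 &&
			atomic.CompareAndSwapUint32(state, old, old|mutexSpinning) {
			weSpin = true // we are now the unique spinner
			continue
		}
		if weSpin {
			runtime.Gosched() // stand-in for a hardware-level delay, then reload
			continue
		}
		runtime.Gosched() // not the spinner: stand-in for OS-mediated sleep
	}
}

func unlock(state *uint32) {
	for {
		old := atomic.LoadUint32(state)
		if atomic.CompareAndSwapUint32(state, old, old&^mutexLocked) {
			// If old&mutexSpinning != 0, a waiter is already awake,
			// so no wakeup is needed; otherwise the runtime would
			// wake a sleeping M here.
			return
		}
	}
}

func run() int {
	var state uint32
	var counter int
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				lock(&state)
				counter++
				unlock(&state)
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(run()) // 4000
}
```

Note how acquisition by the spinner sets "locked" and clears "spinning" in a
single CAS, so the spin token is handed back exactly when it stops being
needed.
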
### futex-based option, https://go.dev/cl/601597

I've prepared https://go.dev/cl/601597,
which implements the "spinbit" design for GOOS=linux and GOARCH=amd64.
I've prepared a matching [TLA+ model](./68578/spinbit.tla)
to check for lost wakeups.
(When relying on the `futex` syscall to maintain the list of sleeping Ms,
it's easy to write lost-wakeup bugs.)

It uses an atomic `Xchg8` operation on two different bytes of the mutex state
word.
The low byte records whether the mutex is locked,
and whether one or more waiting Ms may be asleep.
The "spinning" flag is in a separate byte and so can be independently
manipulated with atomic `Xchg8` operations.
The two bytes are within a single uintptr field (`runtime.mutex.key`).
When the spinning M attempts to acquire the lock,
it can do a CAS on the entire state word,
setting the "locked" flag and clearing the "spinning" flag
in a single operation.

### Cross-OS option, https://go.dev/cl/620435

I've also prepared https://go.dev/cl/620435, which unifies the lock_sema.go
and lock_futex.go implementations and so supports all GOOS values for which
Go supports multiple threads.
(It uses `futex` to implement the `runtime.sema{create,sleep,wakeup}`
functions for lock_futex.go platforms.)
Go's development branch now includes `Xchg8` support for
GOARCH=amd64, arm64, ppc64, and ppc64le,
and so that CL supports all of those architectures.

The fast paths for `runtime.lock2` and `runtime.unlock2` use `Xchg8`
operations to interact with the "locked" flag.
The lowest byte of the state word is dedicated to use with those `Xchg8`
operations.
Most of the upper bytes hold a partial pointer to an M.
(The `runtime.m` data structure is large enough to allow reconstructing the
low bits from the partial pointer,
with special handling for the non-heap-allocated `runtime.m0` value.)
Beyond the 8 bits needed for use with `Xchg8`,
a few more low bits are available for use as flags.
One of those bits holds the "spinning" flag,
which is manipulated with pointer-length `Load` and `CAS` operations.

When Ms go to sleep they form a LIFO stack linked via `runtime.m.nextwaitm`
pointers, as lock_sema.go does today.
The list of waiting Ms is a multi-producer, single-consumer stack.
Each M can add itself,
but inspecting or removing Ms requires exclusive access.
Today, lock_sema.go's `runtime.unlock2` uses the mutex itself to control that
ownership.
That puts any use of the sleeping M list in the critical path of the mutex.

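Such a multi-producer, single-consumer LIFO stack can be sketched as
follows; the `waiter` type and the try-lock bool are stand-ins for
`runtime.m`, its `nextwaitm` field, and the in-word "stack" bit:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// waiter stands in for an M; next plays the role of m.nextwaitm.
type waiter struct {
	id   int
	next *waiter
}

// mpscStack is a LIFO stack: any thread may push itself with a CAS
// loop, but inspecting or removing waiters requires exclusive access,
// modeled here as a separate try-lock.
type mpscStack struct {
	top       atomic.Pointer[waiter]
	stackLock atomic.Bool // try-lock guarding inspection/removal
}

func (s *mpscStack) push(w *waiter) {
	for {
		old := s.top.Load()
		w.next = old
		if s.top.CompareAndSwap(old, w) {
			return
		}
	}
}

// popAll detaches the whole list in LIFO order; it returns nil without
// blocking if another thread holds the try-lock, mirroring the
// non-blocking "stack" bit described below.
func (s *mpscStack) popAll() []*waiter {
	if !s.stackLock.CompareAndSwap(false, true) {
		return nil
	}
	defer s.stackLock.Store(false)
	var out []*waiter
	for w := s.top.Swap(nil); w != nil; w = w.next {
		out = append(out, w)
	}
	return out
}

func main() {
	var s mpscStack
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			s.push(&waiter{id: id})
		}(i)
	}
	wg.Wait()
	fmt.Println(len(s.popAll())) // 5
}
```
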
My proposal uses another bit of the state word as a try-lock to control
inspecting and removing Ms from the list.
This allows additional list-management code without slowing the critical path
of a busy mutex, and use of efficient `Xchg8` operations in the fast paths.
We'll need access to the list in order to attribute contention delay to the
right critical section in the [mutex profile](https://go.dev/issue/66999).
Access to the list will also let us periodically wake an M even when it's not
strictly necessary, to combat tail latency that may be introduced by the
reduction in wakeups.

Here's the full layout of the `runtime.mutex.key` state word:
Bit 0 holds the "locked" flag, the primary purpose of the mutex.
Bit 1 is the "sleeping" flag, and is set when the upper bits point to an M.
Bits 2 through 7 are unused, since they're lost with every `Xchg8` operation.
Bit 8 holds the "spinning" try-lock, allowing the holder to reload the state
word in a loop.
Bit 9 holds the "stack" try-lock, allowing the holder to inspect and remove
sleeping Ms from the LIFO stack.
Bits 10 and higher of the state word hold bits 10 and higher of a pointer to
the M at the top of the LIFO stack of sleeping waiters.

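That layout can be summarized with illustrative constants (the names are
mine, not necessarily those used in the CLs):

```go
package main

import "fmt"

const (
	mutexLocked   uintptr = 1 << 0 // the lock itself
	mutexSleeping uintptr = 1 << 1 // set when the upper bits point to an M
	// Bits 2 through 7 are unused: every Xchg8 of the low byte clobbers them.
	mutexSpinning    uintptr = 1 << 8 // try-lock for the right to spin
	mutexStackLocked uintptr = 1 << 9 // try-lock for the sleeping-M stack
	mutexMOffset             = 10     // bits 10+ mirror bits 10+ of the top M's address
)

// mPointer extracts bits 10 and higher of the stored M pointer; the
// runtime reconstructs the low bits from the M's known layout.
func mPointer(key uintptr) uintptr {
	return key &^ (1<<mutexMOffset - 1)
}

func main() {
	// A hypothetical state word: an M "address" in the high bits plus flags.
	key := uintptr(0xabc<<mutexMOffset) | mutexSleeping | mutexLocked
	fmt.Printf("locked=%v sleeping=%v m=%#x\n",
		key&mutexLocked != 0, key&mutexSleeping != 0, mPointer(key)>>mutexMOffset)
	// locked=true sleeping=true m=0xabc
}
```
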
## Rationale

The status quo is a `runtime.lock2` implementation that experiences congestion
collapse under high contention on machines with many hardware threads.
Addressing that will require fewer threads loading the same cache line in a
loop.

The literature presents several options for scalable, non-collapsing mutexes.
Some require additional memory footprint for each mutex in proportion to
the number of threads that may seek to acquire the lock.
Some require threads to store a reference to a value that they will use to
release each lock they hold.
Go includes a `runtime.mutex` as part of every `chan`, and in some
applications those values are the ones with the most contention.
Coupled with `select`, there's no limit to the number of mutexes that an M can
hold.
That means neither of those forms of increased memory footprint is acceptable.

The performance of fully uncontended `runtime.lock2`/`runtime.unlock2` pairs
is also important to the project.
That limits the use of many of the literature's proposed locking algorithms,
if they include FIFO queue handoff behavior.
On my test hardware
(a linux/amd64 machine with an i7-13700H, and a darwin/arm64 M1),
a `runtime.mutex` value with zero or moderate contention can support
50,000,000 uses per second on any threads,
or can move between active threads 10,000,000 times per second,
or can move between inactive threads (with sleep mediated by the OS)
about 40,000 to 500,000 times per second (depending on the OS).
Some amount of capture or barging, rather than queueing, is required to
maintain the level of throughput that Go users have come to expect.

Keeping the size of `runtime.mutex` values as they are today while allowing
threads to sleep with fewer interruptions seems like fulfilling the goal of
the original design.
The main disadvantage I know of is the risk of increased tail latency:
a small set of threads may be able to capture a contended mutex,
passing it back and forth among themselves while the other threads sleep
indefinitely.
That's already a risk of the current lock_sema.go implementation,
but the high volume of wakeups means threads are unlikely to sleep for long,
and the list of sleeping threads may regularly dip to zero.

The "cross-OS" option has an edge here:
with it, the Go runtime maintains an explicit list of sleeping Ms and so can
do targeted wakes or even direct handoffs to reduce starvation.

## Compatibility

There is no change in exported APIs.

## Implementation

I've prepared two options for the Go 1.24 release cycle.
One relies on the `futex` syscall and the `Xchg8` operation, and so initially
supports GOOS=linux and GOARCH=amd64: https://go.dev/cl/601597.
The other relies on only the `Xchg8` operation and works with any GOOS value
that supports threads: https://go.dev/cl/620435.
Both are controlled by `GOEXPERIMENT=spinbitmutex`,
enabled by default on supported platforms.

## Open issues (if applicable)

I appreciate feedback on the balance between simplicity,
performance at zero or low contention,
performance under extreme contention,
both the performance and maintenance burden for non-first-class ports,
and the accuracy of contention profiles.