Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: improve scaling of lock2 #68578

Closed
rhysh opened this issue Jul 24, 2024 · 22 comments
Closed

runtime: improve scaling of lock2 #68578

rhysh opened this issue Jul 24, 2024 · 22 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented Jul 24, 2024

Part of my experience with runtime.mutex values is that process-wide demand for a single mutex can slow the program to a crawl. I don't think that alone will come as a surprise, though the magnitude of slowdown is more than I'd expect. The main surprise is that programs can have a hard time recovering from that performance cliff once they've fallen off.

I've described a case of "larger magnitude than I expected" as #57070. I've described (perhaps indirectly) "performance cliff that defies self-recovery" as #61426, #61427, and #61428. The runtime's ChanContended benchmark shows clear degradation with increasing GOMAXPROCS, which I suspect is the same problem.


Here's what I think is happening:

For some users of runtime.mutex values, the critical section is very fast. When exiting the critical section, unlock2 checks to see if there's another thread that has gone to sleep while waiting for the mutex. If unlock2 finds one, it wakes a single lock2 sleeper. While waking the other thread, unlock2 removes that sleeper's note within the mutex. That looks different in lock_sema.go and lock_futex.go, either popping a single entry off of the linked list or a full state reset to 0.

Now, the previous user of the mutex (recent caller of unlock2) is awake, and the lock2 caller that it woke is awake. If the previous user needs the same mutex again, it enters lock2 and we now have two non-sleeping threads attempting to acquire the mutex. Whichever one wins, it will soon call unlock2. If the caller of unlock2 finds that another thread is sleeping in lock2 waiting for this mutex, it will wake that thread.

Given enough time, a caller of lock2 will give up on the procyield/osyield section and go to sleep. But, how long does it take a thread to give up? And how long does it take a thread to complete another pass through the critical section, call unlock2, and wake an additional thread?

It looks like the time a thread in lock2 takes to go to sleep is over 100 times longer than the runtime's fastest critical sections. Preliminary measurements on an M1 MacBook Air with darwin/arm64 show ~6000 ns of spinning between sleeps and ~14 ns for an uncontended channel operation. That means, I think, that nearly all of the threads that need a runtime.mutex for a quick critical section will be awake the whole time. Being awake means repeatedly loading (about 5 times per wake cycle) and attempting CAS on the mutex value.

As more threads demand access to the mutex's cache line, updating it becomes more expensive. Acquiring the mutex involves updating that cache line. Going to sleep and waking sleepers also require updates to that cache line.


BenchmarkChanContended does 100 sends and 100 receives on a buffered channel with no channel-based waiting. That's 200 lock2/unlock2 pairs per "op". Each additional thread allowed by GOMAXPROCS reduces the whole-process throughput (with minor exceptions near the limit of physical cores, of around 1%).

Below is the behavior I see on an Intel i7-13700H (linux/amd64, 6 P cores with 2 threads each plus 8 E cores). When the process is allowed to use 4 threads, the whole-process throughput is 1/2 of what it was with 1 thread. With 8 threads, throughput is halved again. With 12, it's halved yet again. At GOMAXPROCS=20 the 200 channel operations take an average of 44 µs, so on average there's an unlock2 call every 220 ns, each with a fresh opportunity to wake a sleeping thread.

$ go version
go version go1.23rc2 linux/amd64
$ go test runtime -test.run='^$' -test.bench=ChanContended -test.cpu="$(seq 1 20 | tr '\n' ',')" -test.count=10
goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                 │      -      │
                 │   sec/op    │
ChanContended      3.167µ ± 1%
ChanContended-2    4.723µ ± 6%
ChanContended-3    5.563µ ± 3%
ChanContended-4    6.527µ ± 2%
ChanContended-5    7.848µ ± 4%
ChanContended-6    8.683µ ± 1%
ChanContended-7    11.19µ ± 0%
ChanContended-8    13.96µ ± 1%
ChanContended-9    17.15µ ± 5%
ChanContended-10   20.17µ ± 3%
ChanContended-11   24.14µ ± 0%
ChanContended-12   29.16µ ± 1%
ChanContended-13   32.72µ ± 4%
ChanContended-14   36.23µ ± 0%
ChanContended-15   36.10µ ± 1%
ChanContended-16   37.73µ ± 5%
ChanContended-17   37.59µ ± 4%
ChanContended-18   41.37µ ± 6%
ChanContended-19   42.87µ ± 1%
ChanContended-20   44.36µ ± 2%
geomean            17.43µ

And here's the behavior I see on an M1 MacBook Air (darwin/arm64, 4 P cores plus 4 E cores). With 5 threads, throughput is less than half of what the program can do with one thread.

$ go version
go version go1.23rc2 darwin/arm64
$ go test runtime -test.run='^$' -test.bench=ChanContended -test.cpu="$(seq 1 8 | tr '\n' ',')" -test.count=10
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
                │      -      │
                │   sec/op    │
ChanContended     2.639µ ± 1%
ChanContended-2   3.303µ ± 6%
ChanContended-3   4.548µ ± 1%
ChanContended-4   5.041µ ± 0%
ChanContended-5   5.788µ ± 1%
ChanContended-6   6.171µ ± 0%
ChanContended-7   6.470µ ± 0%
ChanContended-8   6.405µ ± 0%
geomean           4.829µ

Another perspective is to consider how much on-CPU time the process has. The data below show that during 1.78 seconds of wall-clock time, the process's 20 threads were on-CPU for a total of 27.74 seconds within lock2 calls. Those threads were not sleeping!

$ go test runtime -test.run='^$' -test.bench=ChanContended -test.cpu=20 -test.count=1 -test.cpuprofile=/tmp/p
goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
BenchmarkChanContended-20    	   26667	     44404 ns/op
PASS
ok  	runtime	1.785s

$ go tool pprof -peek runtime.lock2 /tmp/p
File: runtime.test
Type: cpu
Time: Jul 24, 2024 at 8:45pm (UTC)
Duration: 1.78s, Total samples = 31.32s (1759.32%)
Showing nodes accounting for 31.32s, 100% of 31.32s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            27.74s   100% |   runtime.lockWithRank
     4.57s 14.59% 14.59%     27.74s 88.57%                | runtime.lock2
                                            19.50s 70.30% |   runtime.procyield
                                             2.74s  9.88% |   runtime.futexsleep
                                             0.84s  3.03% |   runtime.osyield
                                             0.07s  0.25% |   runtime.(*lockTimer).begin
                                             0.02s 0.072% |   runtime.(*lockTimer).end
----------------------------------------------------------+-------------

In short, we have lock2 implementations that appear on the page to let threads sleep but which in practice make them all spin, and spinning threads at least correlate with (and I think also cause) slower lock handoffs.

I think we can do better. I have some ideas on how, but at this point I'd love feedback on whether others agree with my analysis and on the state of the art here today.

CC @golang/runtime

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 24, 2024
@mknyszek
Copy link
Contributor

Nice analysis! It looks sound to me. In theory there's a latency benefit to spinning, but if you have tons of threads spinning then that can create contention that's much worse than if a few of the threads just went to sleep.

I think it's been a looong time since anyone had a crack at the runtime.mutex code. I'd love to hear your ideas for improving the lock path.

I also wonder if there's a sweet spot on spinning, wherein for very short critical sections it's beneficial to have maybe one or two threads spinning (so some small limit), and the unlock path just wakes another spinning thread, expecting that one of the current spinners will win. It certainly wouldn't be a fair lock, but maybe it would be better?

@dmitshur dmitshur added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jul 25, 2024
@dmitshur dmitshur added this to the Backlog milestone Jul 25, 2024
@rhysh
Copy link
Contributor Author

rhysh commented Jul 25, 2024

Thanks for taking a look, @mknyszek .

Yes, I think that limiting the number of spinning threads to one or two is part of the answer. If we can afford the bookkeeping (and we may need to!), I think the unlock path should be able to detect the presence of a spinning thread and should only wake a sleeping waiter if it knows that no threads are spinning.

An analogy on my mind for how to solve this is a baseball lineup: Most of the threads are "in the dugout", sleeping. One thread is "at bat", maybe doing many lock/unlock pairs in a row. And one thread is "on deck", spinning and ready to jump in when the "at bat" thread releases the mutex for a bit too long to recapture. When a thread calls unlock (it must have been "at bat"), it checks to see if there's a thread "on deck". If so, no further action is needed. If not, it does a wakeup. A new thread that becomes interested in the mutex could start off by spinning (as they do today), and then sleep after a few rounds of disappointment (also as they do today, but with "sleep" now being the improved "in the dugout" version).

I don't know if that's a solution, or if it just pushes the problem back one level. But maybe that's close enough to be a solution in practice? Maybe the "dugout" is an MCS or CLH queue. (MCS seems good, except for requiring a spin during unlock to wait for the second half of a new sleeper's two-part registration?) And maybe threads make a point of going to the end of the queue if they haven't done that for any of their last N acquisitions, in the name of fairness.

I'm working from The Art of Multiprocessor Programming. I've got the "revised first edition", which appears to be a 2012 update to a 2008 original. Are MCS and CLH still the best available algorithms for workloads like runtime.mutex has? @aclements , if you know (and you often do)? Or @chabbimilind , I see you've published in this area since 2012. Thanks!

About fairness: In lock_sema.go (GOOS=darwin, windows, and several others), new waiters push themselves onto a stack so the thread at the bottom of the stack will stay there until all of the other threads have had their turn. So maybe unfairness isn't a problem? Except that when the critical section is short, threads wake up very often and so the stack probably gets cleared out on a regular basis.

A thread being able to capture the mutex for short periods of time (rather than always queueing behind other waiting threads) looks important for performance; we wouldn't be able to average ~15 ns per channel operation if the channel's mutex needs to be handed to a different thread each time. But I suspect we also don't want the full amount of unfairness we'd get if the stack in lock_sema.go worked as the code appears to intend.

@chabbimilind
Copy link

The basic MCS lock does not allow you to exist from the wait queue. You may be looking for one of these:
https://www.mcs.anl.gov/~aamer/papers/ppopp17_hmcst.pdf (one level variant)
https://www.cs.rochester.edu/u/scott/papers/2002_PODC_nb_timeout.pdf
https://www.cs.rochester.edu/u/scott/papers/2001_PPoPP_Timeout.pdf

I noticed something in the current https://github.com/golang/go/blob/master/src/runtime/lock_futex.go code in lock2: why is the l.key == mutex_unlocked condition check in a loop in the code below? If the objective is to make it behave as a test-and-test-and-set, I would expect the inner loop to be an if condition, not a loop. May be there a s reason why it is the way it is but I could not understand easily.

		for i := 0; i < spin; i++ {
			for l.key == mutex_unlocked { // loop?
				if atomic.Cas(key32(&l.key), mutex_unlocked, wait) {
					timer.end()
					return
				}
			}
			procyield(active_spin_cnt)
		}

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/601396 mentions this issue: runtime: measure speed of procyield and osyield

@rhysh
Copy link
Contributor Author

rhysh commented Jul 26, 2024

Thanks for the references! It'll take me a little while to digest those. I don't think that lock2 needs to allow timeouts.

It looks like the more advanced mutex implementations also require more than one word of memory (though we can usually borrow some from the M while it's in a lock2 call). There's a mutex in each chan value, so I don't think we can balloon those too much.

Sometimes the limiting factor in an app is a channel's mutex (#57070), but often it's an explicit process-wide singleton like sched.lock or mheap_.lock. I wonder if, for those few special cases, we should consider using an especially scalable / collapse-resistant mutex implementation with a higher memory footprint. Maybe once we have a new single-word lock2 implementation and have used it for a few releases. :)

why is the l.key == mutex_unlocked condition check in a loop in the code below?

It's been like that since at least the 1.0 release: https://github.com/golang/go/blob/go1/src/pkg/runtime/lock_futex.c#L66

It's also unusual that the access of l.key there isn't an explicit atomic load. I think that's due to amd64 being the main target for production workloads at the time. (See also, the netpoller problems from a few years ago due to the same habits.) Another thing I plan to fix / clean up (or to better understand and then document).

@rhysh
Copy link
Contributor Author

rhysh commented Jul 26, 2024

I was curious about the relative speed of procyield and osyield, and the results I got on my machines looked odd. I've shared the test code in https://go.dev/cl/601396 . In both lock_futex.go and lock_sema.go, lock2 does 4 calls of procyield(30), 1 call of osyield(), and then attempts to sleep (waiting for another thread to wake it when the lock might be available).

On amd64, procyield is the PAUSE instruction, https://www.felixcloutier.com/x86/pause.html . On my "13th Gen Intel(R) Core(TM) i7-13700H", I measure procyield(30) as taking 750 ns on the single-threaded efficiency cores, 2300 ns on the two-way SMT performance cores, and 2200 ns across the machine as a whole.

On arm64, procyield is the YIELD instruction, https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-General-Instructions/YIELD . On my "M1 MacBook Air", I measure procyield(30) as taking 12 ns. (Since macOS can also run amd64 programs I measured with GOARCH=amd64 too, finding procyield(30) to take 13 ns.) On my "Raspberry Pi 5", I measure procyield(30) as taking 16 ns.

Maybe the speed of those changes based on how much bus activity there is? But I'm surprised to see more than a 100x difference between Intel and ARM processor performance there, with lock2 not accounting for that at all.

On my i7 Linux machine, osyield takes between 150 ns and 200 ns (performance vs efficiency cores). On my rpi5 Linux machine, it takes 1100 ns. On my M1 macOS machine, osyield takes 2800 ns.

I'm not surprised by a 5–7x difference in Linux performance on different hardware. I'm not surprised by a 19x difference between how Linux and macOS implement a yield to the OS scheduler. What surprises me is that on my i7 Linux machine, procyield(30) is slower than osyield, that it's slower by 15x, and that because there are 4 calls to procyield before the one call to osyield, that the machine spends 60x more time in procyield.

In BenchmarkChanContended, a single channel operation takes around 15 ns. So the lock can be acquired, used, and released over 100 times before an i7 performance core finishes a single procyield(30) call (2300 ns).

At the very least, it's not tuned well for the hardware I happen to have.

Details from benchstat below:

goos: darwin
goarch: amd64
pkg: runtime
cpu: VirtualApple @ 2.50GHz
                 │ yield-m1-amd64 │
                 │     sec/op     │
ProcYield/1-8         3.066n ± 0%
ProcYield/10-8        6.961n ± 1%
ProcYield/30-8        13.23n ± 1%
ProcYield/100-8       35.08n ± 0%
ProcYield/1000-8      328.2n ± 1%
OSYield-8             2.789µ ± 1%
geomean               45.66n

goarch: arm64
cpu: Apple M1
                 │  yield-m1   │
                 │   sec/op    │
ProcYield/1-8      2.234n ± 4%
ProcYield/10-8     5.631n ± 0%
ProcYield/30-8     11.90n ± 1%
ProcYield/100-8    33.80n ± 0%
ProcYield/1000-8   324.6n ± 1%
OSYield-8          2.769µ ± 1%
geomean            40.71n

goos: linux
goarch: amd64
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                  │   yield-i7   │     yield-i7-e     │     yield-i7-p      │
                  │    sec/op    │   sec/op     vs base   │    sec/op     vs base   │
ProcYield/1-20      70.63n ± 17%
ProcYield/10-20     735.7n ±  9%
ProcYield/30-20     2.209µ ± 17%
ProcYield/100-20    6.143µ ± 23%
ProcYield/1000-20   74.11µ ± 17%
OSYield-20          150.7n ± 10%
ProcYield/1-8                      26.95n ± 0%
ProcYield/10-8                     251.5n ± 0%
ProcYield/30-8                     750.5n ± 0%
ProcYield/100-8                    2.506µ ± 0%
ProcYield/1000-8                   24.96µ ± 0%
OSYield-8                          197.4n ± 1%
ProcYield/1-12                                          82.73n ±  8%
ProcYield/10-12                                         638.2n ± 19%
ProcYield/30-12                                         2.262µ ± 19%
ProcYield/100-12                                        6.135µ ± 23%
ProcYield/1000-12                                       75.06µ ± 19%
OSYield-12                                              166.0n ± 18%
geomean             1.410µ         630.5n       ? ¹ ²   1.446µ        ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

goarch: arm64
cpu: 
                 │  yield-pi5  │
                 │   sec/op    │
ProcYield/1-4      3.023n ± 1%
ProcYield/10-4     7.931n ± 0%
ProcYield/30-4     16.28n ± 0%
ProcYield/100-4    45.50n ± 0%
ProcYield/1000-4   434.4n ± 0%
OSYield-4          1.073µ ± 0%
geomean            44.97n

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/601597 mentions this issue: runtime: allow deeper sleep in futex-based mutexes

@rhysh
Copy link
Contributor Author

rhysh commented Jul 29, 2024

I wrote https://go.dev/cl/601597 to implement the idea of "a spinning thread announces itself, which allows unlock2 to skip the wake, which allows lock2 to truly sleep". That sketch (at PS 1) is a bit slower than the current three-state mutex when there's no contention, and it degrades quickly at first.

But then it stops getting slower; "it scales". Its throughput holds steady at 1/4 of the zero-contention speed, even as the number of contending threads increases to 20 (the size of my test machine). In contrast, the performance of the baseline lock_futex.go degrades with each additional contending thread. At 8 threads, the two implementations are equal, and at 20 the performance of the baseline has slowed an additional 3x.

Another benefit that's not shown here: As the number of contending threads increases, nearly all of them are asleep. They're using less electricity and the OS can put them to use running other programs.

The microbenchmark of course has its limits: it would be easy for us to increase the argument to procyield, to slow down the spinning thread and thus makes it harder for it to acquire the mutex, to better allows the previous holder to capture it, to improve throughput at a (hidden) cost of latency. I've held off on any tuning like that. As for macro-benchmarks, perf_vs_parent builders show motion in several of their results, in both directions. I think it has potential.

Below are the benchstat results of CL 601597 PS 1 versus its parent on my test machine, in text and visual form. In the chart, I've converted the number of microseconds per benchmark iteration into the number of channel operations per second (each ChanContended op is 100 sends and 100 receives). The experiment holds steady around 13.5 million channel operations per second, while the baseline continues to slow down with each additional contending thread (to 4.5 million ops per second at 20 threads).

goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                │     old     │                  new                  │
                │   sec/op    │    sec/op     vs base                 │
ChanContended      3.147µ ± 0%    3.703µ ± 0%   +17.65% (p=0.000 n=10)
ChanContended-2    4.511µ ± 2%    5.280µ ± 7%   +17.06% (p=0.000 n=10)
ChanContended-3    5.726µ ± 2%   12.125µ ± 2%  +111.75% (p=0.000 n=10)
ChanContended-4    6.574µ ± 1%   13.356µ ± 4%  +103.16% (p=0.000 n=10)
ChanContended-5    7.706µ ± 1%   13.717µ ± 3%   +78.00% (p=0.000 n=10)
ChanContended-6    8.830µ ± 1%   13.674µ ± 2%   +54.85% (p=0.000 n=10)
ChanContended-7    11.07µ ± 0%    13.59µ ± 2%   +22.77% (p=0.000 n=10)
ChanContended-8    13.99µ ± 1%    14.06µ ± 1%         ~ (p=0.190 n=10)
ChanContended-9    16.93µ ± 2%    14.04µ ± 3%   -17.04% (p=0.000 n=10)
ChanContended-10   20.12µ ± 4%    14.12µ ± 1%   -29.80% (p=0.000 n=10)
ChanContended-11   23.96µ ± 2%    14.44µ ± 3%   -39.74% (p=0.000 n=10)
ChanContended-12   29.65µ ± 6%    14.61µ ± 3%   -50.74% (p=0.000 n=10)
ChanContended-13   33.98µ ± 7%    14.69µ ± 3%   -56.76% (p=0.000 n=10)
ChanContended-14   37.90µ ± 1%    14.69µ ± 3%   -61.23% (p=0.000 n=10)
ChanContended-15   37.94µ ± 4%    14.89µ ± 5%   -60.75% (p=0.000 n=10)
ChanContended-16   39.56µ ± 0%    13.89µ ± 1%   -64.89% (p=0.000 n=10)
ChanContended-17   39.56µ ± 0%    14.45µ ± 4%   -63.47% (p=0.000 n=10)
ChanContended-18   41.24µ ± 2%    13.95µ ± 3%   -66.17% (p=0.000 n=10)
ChanContended-19   42.77µ ± 5%    13.80µ ± 2%   -67.74% (p=0.000 n=10)
ChanContended-20   44.26µ ± 2%    13.74µ ± 1%   -68.96% (p=0.000 n=10)
geomean            17.60µ         12.46µ        -29.22%
Screenshot 2024-07-29 at 1 00 44 PM

I read Rust's implementation, https://github.com/rust-lang/rust/blob/1.80.0/library/std/src/sys/sync/mutex/futex.rs . It looks similar in structure to our current lock_futex.go implementation.

It does not do osyield. @prattmic pointed me to https://go.dev/cl/473656, which links to https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752, which describes why it's best to avoid osyield (which on Linux is a call to sched_yield(2)). Maybe we should remove that (as a separate change).

On arm64, Rust's Mutex implements the busy loop pause as ISB SY, https://github.com/rust-lang/rust/blob/1.80.0/library/core/src/hint.rs#L243 . They used to do YIELD as we do today, but stopped at rust-lang/rust@c064b65 . Maybe that's right for Go too (to be measured, and done as a separate change).

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/602176 mentions this issue: runtime: benchmark mutex handoffs

gopherbot pushed a commit that referenced this issue Aug 1, 2024
These are delay primitives for lock2. If a mutex isn't immediately
available, we can use procyield to tell the processor to wait for a
moment, or osyield to allow the OS to run a different process or thread
if one is waiting. We expect a processor-level yield to be faster than
an os-level yield, and for both of them to be fast relative to entering
a full sleep (via futexsleep or semasleep).

Each architecture has its own way of hinting to the processor that it's
in a spin-wait loop, so procyield presents an architecture-independent
interface for use in lock_futex.go and lock_sema.go.

Measure the (single-threaded) speed of these to confirm.

For #68578

Change-Id: I90cd46ea553f2990395aceb048206285558c877e
Reviewed-on: https://go-review.googlesource.com/c/go/+/601396
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
gopherbot pushed a commit that referenced this issue Aug 2, 2024
The speed of handing off a mutex to a waiting thread is sensitive to the
configuration of the spinning section of lock2. Measure that latency
directly, to complement our existing benchmarks of mutex throughput.

For #68578

Change-Id: I7637684bcff62eb05cc008491f095f653d13af4b
Reviewed-on: https://go-review.googlesource.com/c/go/+/602176
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/604375 mentions this issue: runtime: add direct benchmark of mutex contention

gopherbot pushed a commit that referenced this issue Aug 9, 2024
Measure throughput of a single mutex with all threads contending. Do not
attempt to measure fairness/starvation.

The ChanContended benchmark works somewhat well for this (interacting
with the mutex is a large contributor to its results), but it's better
to be clear about what we're attempting to measure.

For #68578

Change-Id: Ie397b4c363bfcd5afddf796a81cd6c34ebf8551b
Reviewed-on: https://go-review.googlesource.com/c/go/+/604375
Reviewed-by: David Chase <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/606901 mentions this issue: cmd/compile/internal/ssa: add atomic.Xchg8 intrinsic

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/606900 mentions this issue: internal/runtime/atomic: add Xchg8 for amd64

gopherbot pushed a commit that referenced this issue Oct 2, 2024
For #68578

Change-Id: Idecfdbb793f46560dd69287af9170c07cf4ee973
Reviewed-on: https://go-review.googlesource.com/c/go/+/606900
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Mauri de Souza Meneguzzo <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
gopherbot pushed a commit that referenced this issue Oct 2, 2024
For #68578

Change-Id: Ia9580579bfc4709945bfcf6ec3803d5d11812187
Reviewed-on: https://go-review.googlesource.com/c/go/+/606901
Reviewed-by: Mauri de Souza Meneguzzo <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Martin Möhrmann <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/617618 mentions this issue: design/68578-mutex-spinbit.md: share TLA+ model

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/620435 mentions this issue: runtime: unify lock2, allow deeper sleep

gopherbot pushed a commit to golang/proposal that referenced this issue Oct 16, 2024
Add a design doc describing the general approach of the "spinbit" mutex
protocol, and the details of the two drafts that implement it.

Based on futex, for linux/amd64: https://go.dev/cl/601597
For all GOOS values and four architectures: https://go.dev/cl/620435

For golang/go#68578

Change-Id: Ie9665085c9b8cf1741deeb431acfa12fba550b63
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/617618
Auto-Submit: Rhys Hiltner <[email protected]>
Reviewed-by: Rhys Hiltner <[email protected]>
Commit-Queue: Rhys Hiltner <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/622996 mentions this issue: runtime: allow futex OSes to use sema-based mutex

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/622997 mentions this issue: runtime: unify lock2, allow deeper sleep, control tail

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/622995 mentions this issue: runtime: add test for mutex starvation

@rhysh rhysh modified the milestones: Backlog, Go1.24 Nov 14, 2024
@mknyszek mknyszek moved this from Todo to In Progress in Go Compiler / Runtime Nov 14, 2024
gopherbot pushed a commit that referenced this issue Nov 14, 2024
When multiple threads all need to acquire the same runtime.mutex, make
sure that none of them has to wait for too long. Measure how long a
single thread can capture the mutex, and how long individual other
threads must go between having a turn with the mutex.

For #68578

Change-Id: I56ecc551232f9c2730c128a9f8eeb7bd45c2d3b5
Reviewed-on: https://go-review.googlesource.com/c/go/+/622995
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
gopherbot pushed a commit that referenced this issue Nov 15, 2024
Implement sema{create,sleep,wakeup} in terms of the futex syscall when
available. Split the lock2/unlock2 implementations out of lock_sema.go
and lock_futex.go (which they shared with runtime.note) to allow
swapping in new implementations of those.

Let futex-based platforms use the semaphore-based mutex implementation.
Control that via the new "spinbitmutex" GOEXPERMENT value, disabled by
default.

This lays the groundwork for a "spinbit" mutex implementation; it does
not include the new mutex implementation.

For #68578.

Change-Id: I091289c85124212a87abec7079ecbd9e610b4270
Reviewed-on: https://go-review.googlesource.com/c/go/+/622996
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Compiler / Runtime Nov 15, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/628515 mentions this issue: runtime: double-link list of waiting Ms

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/629416 mentions this issue: runtime: use only CAS in lock2 slow path

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/629415 mentions this issue: runtime: clean up new lock2 structure

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/629417 mentions this issue: runtime: spin faster in lock2

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 18, 2024
gopherbot pushed a commit that referenced this issue Nov 20, 2024
Simplify some flow control, as suggested on https://go.dev/cl/620435.

The MutexCapture microbenchmark shows a bit of throughput improvement at
moderate levels of contention, and little change to capture and
starvation. (Note that the capture and starvation figures below are in
terms of power-of-two buckets multiplied by throughput, so they either
follow similar patterns or move by a factor of two.)

For #68578

goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                │     old      │                 new                  │
                │    sec/op    │    sec/op     vs base                │
MutexCapture      18.21n ±  0%   18.35n ±  0%   +0.77% (p=0.000 n=10)
MutexCapture-2    21.46n ±  8%   21.05n ± 12%        ~ (p=0.796 n=10)
MutexCapture-3    22.56n ±  9%   22.59n ± 18%        ~ (p=0.631 n=10)
MutexCapture-4    22.85n ±  5%   22.74n ±  2%        ~ (p=0.565 n=10)
MutexCapture-5    22.84n ±  5%   22.50n ± 14%        ~ (p=0.912 n=10)
MutexCapture-6    23.33n ± 14%   22.22n ±  3%   -4.78% (p=0.004 n=10)
MutexCapture-7    27.04n ± 14%   23.78n ± 15%        ~ (p=0.089 n=10)
MutexCapture-8    25.44n ± 10%   23.03n ±  6%   -9.48% (p=0.004 n=10)
MutexCapture-9    25.56n ±  7%   24.39n ± 11%        ~ (p=0.218 n=10)
MutexCapture-10   26.77n ± 10%   24.00n ±  7%  -10.33% (p=0.023 n=10)
MutexCapture-11   27.02n ±  7%   24.55n ± 15%   -9.18% (p=0.035 n=10)
MutexCapture-12   26.71n ±  8%   24.96n ±  8%        ~ (p=0.148 n=10)
MutexCapture-13   25.58n ±  4%   25.82n ±  5%        ~ (p=0.271 n=10)
MutexCapture-14   26.86n ±  6%   25.91n ±  7%        ~ (p=0.529 n=10)
MutexCapture-15   25.12n ± 13%   26.16n ±  4%        ~ (p=0.353 n=10)
MutexCapture-16   26.18n ±  4%   26.21n ±  9%        ~ (p=0.838 n=10)
MutexCapture-17   26.04n ±  4%   25.85n ±  5%        ~ (p=0.363 n=10)
MutexCapture-18   26.02n ±  7%   25.93n ±  5%        ~ (p=0.853 n=10)
MutexCapture-19   25.67n ±  5%   26.21n ±  4%        ~ (p=0.631 n=10)
MutexCapture-20   25.50n ±  6%   25.99n ±  8%        ~ (p=0.404 n=10)
geomean           24.73n         24.02n         -2.88%

                │      old       │                  new                   │
                │ sec/streak-p90 │ sec/streak-p90  vs base                │
MutexCapture        76.36m ±  0%    76.96m ±   0%   +0.79% (p=0.000 n=10)
MutexCapture-2     10.609µ ± 50%    5.390µ ± 119%        ~ (p=0.579 n=10)
MutexCapture-3      5.936µ ± 93%    5.782µ ±  18%        ~ (p=0.684 n=10)
MutexCapture-4      5.849µ ±  5%    5.820µ ±   2%        ~ (p=0.579 n=10)
MutexCapture-5      5.849µ ±  5%    5.759µ ±  14%        ~ (p=0.912 n=10)
MutexCapture-6      5.975µ ± 14%    5.687µ ±   3%   -4.81% (p=0.004 n=10)
MutexCapture-7      6.921µ ± 14%    6.086µ ±  18%        ~ (p=0.165 n=10)
MutexCapture-8      6.512µ ± 10%    5.894µ ±   6%   -9.50% (p=0.004 n=10)
MutexCapture-9      6.544µ ±  7%    6.245µ ±  11%        ~ (p=0.218 n=10)
MutexCapture-10     6.962µ ± 11%    6.144µ ±   7%  -11.76% (p=0.023 n=10)
MutexCapture-11     6.938µ ±  7%    6.284µ ± 130%        ~ (p=0.190 n=10)
MutexCapture-12     6.838µ ±  8%    6.408µ ±  13%        ~ (p=0.404 n=10)
MutexCapture-13     6.549µ ±  4%    6.608µ ±   5%        ~ (p=0.271 n=10)
MutexCapture-14     6.877µ ±  8%    6.634µ ±   7%        ~ (p=0.436 n=10)
MutexCapture-15     6.433µ ± 13%    6.697µ ±   4%        ~ (p=0.247 n=10)
MutexCapture-16     6.702µ ± 10%    6.711µ ± 116%        ~ (p=0.796 n=10)
MutexCapture-17     6.730µ ±  3%    6.619µ ±   5%        ~ (p=0.225 n=10)
MutexCapture-18     6.663µ ±  7%    6.716µ ±  13%        ~ (p=0.853 n=10)
MutexCapture-19     6.570µ ±  5%    6.710µ ±   4%        ~ (p=0.529 n=10)
MutexCapture-20     6.528µ ±  6%    6.775µ ±  11%        ~ (p=0.247 n=10)
geomean             10.66µ          10.00µ          -6.13%

                │      old       │                  new                   │
                │ sec/starve-p90 │ sec/starve-p90  vs base                │
MutexCapture-2    10.609µ ±  50%    5.390µ ± 119%        ~ (p=0.579 n=10)
MutexCapture-3     184.8µ ±  91%    183.9µ ±  48%        ~ (p=0.436 n=10)
MutexCapture-4     388.8µ ± 270%    375.6µ ± 280%        ~ (p=0.436 n=10)
MutexCapture-5     807.2µ ±  83%   2880.9µ ±  85%        ~ (p=0.105 n=10)
MutexCapture-6     2.272m ±  61%    2.173m ±  34%        ~ (p=0.280 n=10)
MutexCapture-7     1.351m ± 125%    2.990m ±  70%        ~ (p=0.393 n=10)
MutexCapture-8     3.328m ±  97%    3.064m ±  96%        ~ (p=0.739 n=10)
MutexCapture-9     3.526m ±  91%    3.081m ±  47%  -12.62% (p=0.015 n=10)
MutexCapture-10    3.641m ±  86%    3.228m ±  90%  -11.34% (p=0.005 n=10)
MutexCapture-11    3.324m ± 109%    3.190m ±  71%        ~ (p=0.481 n=10)
MutexCapture-12    3.519m ±  77%    3.200m ± 106%        ~ (p=0.393 n=10)
MutexCapture-13    3.353m ±  91%    3.368m ±  99%        ~ (p=0.853 n=10)
MutexCapture-14    3.314m ± 101%    3.396m ± 286%        ~ (p=0.353 n=10)
MutexCapture-15    3.534m ±  83%    3.397m ±  91%        ~ (p=0.739 n=10)
MutexCapture-16    3.485m ±  90%    3.436m ± 116%        ~ (p=0.853 n=10)
MutexCapture-17    6.516m ±  48%    3.452m ±  88%        ~ (p=0.190 n=10)
MutexCapture-18    6.645m ± 105%    3.439m ± 108%        ~ (p=0.218 n=10)
MutexCapture-19    6.521m ±  46%    4.907m ±  42%        ~ (p=0.529 n=10)
MutexCapture-20    6.532m ±  47%    3.516m ±  89%        ~ (p=0.089 n=10)
geomean            1.919m           1.783m          -7.06%

Change-Id: I36106e1baf8afd132f1568748d1b83b797fa260e
Reviewed-on: https://go-review.googlesource.com/c/go/+/629415
Reviewed-by: Michael Knyszek <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
Development

No branches or pull requests

6 participants