Add memory barrier to Mutex#unlock on aarch64 (#14272)

straight-shoota merged 2 commits into crystal-lang:master

Conversation
This solution is the same as the one used in crystal-lang#13050. The following code is expected to output `1000000` preceded by the time it took to perform it:

```crystal
mutex = Mutex.new
numbers = Array(Int32).new(initial_capacity: 1_000_000)
done = Channel(Nil).new
concurrency = 20
iterations = 1_000_000 // concurrency

concurrency.times do
  spawn do
    iterations.times { mutex.synchronize { numbers << 0 } }
  ensure
    done.send nil
  end
end

start = Time.monotonic
concurrency.times { done.receive }
print Time.monotonic - start
print ' '
sleep 100.milliseconds # Wait just a bit longer to be sure the discrepancy isn't due to a *different* race condition
pp numbers.size
```

Before this commit, on an Apple M1 CPU, the array size would be anywhere from 880k-970k, but I never observed it reach 1M. Here is a sample:

```
$ repeat 20 (CRYSTAL_WORKERS=10 ./mutex_check)
00:00:00.119271625 881352
00:00:00.111249083 936709
00:00:00.102355208 946428
00:00:00.116415166 926724
00:00:00.127152583 899899
00:00:00.097160792 964577
00:00:00.120564958 930859
00:00:00.122803000 917583
00:00:00.093986834 954112
00:00:00.079212333 967772
00:00:00.093168208 953491
00:00:00.102553834 962147
00:00:00.091601625 967304
00:00:00.108157208 954855
00:00:00.080879666 944870
00:00:00.114638042 930429
00:00:00.093617083 956496
00:00:00.112108959 940205
00:00:00.092837875 944993
00:00:00.097882625 916220
```

This indicates that some of the mutex locks were getting through when they should not have been.
With this commit, using the exact same parameters (built with `--release -Dpreview_mt` and run with `CRYSTAL_WORKERS=10` to spread out across all 10 cores), these are the results I'm seeing:

```
00:00:00.078898166 1000000
00:00:00.072308084 1000000
00:00:00.047157000 1000000
00:00:00.088043834 1000000
00:00:00.060784625 1000000
00:00:00.067710250 1000000
00:00:00.081070750 1000000
00:00:00.065572208 1000000
00:00:00.065006958 1000000
00:00:00.061041541 1000000
00:00:00.059648291 1000000
00:00:00.078100125 1000000
00:00:00.050676250 1000000
00:00:00.049395875 1000000
00:00:00.069352334 1000000
00:00:00.063897833 1000000
00:00:00.067534333 1000000
00:00:00.070290833 1000000
00:00:00.067361500 1000000
00:00:00.078021833 1000000
```

Note that it's not only correct, but also significantly faster.
Are you sure this resolves #13055 entirely and there are no other places that may need barriers?
What if you replace the lazy set (…)? Here is, for example, what the Linux kernel source code (v4.4) has to say:

I assume this holds for V7 CPUs too.

We use sequential consistency instead of acquire/release, but that should only impact performance, and seq-cst is stronger than acquire/release anyway. My understanding is that the atomic is enough as long as we don't break the contract (without a barrier the CPU may reorder the lazy set before we increment the counter).
@straight-shoota I'm sure that it fixes the issues I've observed with thread-safety on …. If you're referring to the wording in the title of the PR, I can change it to "add memory barriers" as in #13050.

In my tests last night, that did give me the expected values, but it was slower. I don't know how much that matters since correctness > speed (up to a point), but this implementation gave us both.
@jgaskins Nice, at least it proves that it's working. The speed improvement with a barrier is weird 🤔 I'd be interested to see the performance impact of using acquire/release semantics on the atomics (without the barrier) instead of sequential consistency 👀
We might get better performance by using LSE atomics from ARMv8.1 (e.g. …).

EDIT: confirmed, by default LLVM will generate LL/SC atomics, but specifying …
I ran the example code from the PR description on a Neoverse-N1 server 🤩 with 16 worker threads.

With LL/SC atomics (…):

With LSE atomics (…):

Takeaways:

NOTE: we might consider enabling LSE by default for AArch64, and having a …
Weird. With LSE it was slower on my M1 Mac, but ~18% faster than this PR on an Ampere Arm server on Google Cloud (T2A VM, 8 cores), which is fascinating.
The other part of #13055 is …
Would be nice to have some spec coverage.
There's a spec for it, but CI doesn't use …. There is one CI entry that uses …
Fixes #13055