
Atomicity: lock-free guarantees #300

Closed
jfbastien opened this issue Aug 17, 2015 · 41 comments

Comments

@jfbastien
Member

TL;DR: make all accesses of 1, 2 and 4 bytes "always" lock-free, wider access "maybe".

C++11 has a concept of "atomicity" where an operation is either entirely observable or not observable at all. This guarantee is on a per-object basis, during the object's lifetime; for our purposes this is equivalent to per memory location. It holds irrespective of the object's size, but does require proper natural alignment (it's UB if not naturally aligned).

Not all hardware has loads/stores of wide sizes (indeed, atomics can be of any size!), so the language offers an is_lock_free method on the std::atomic class. This is a per-type property, but it's effectively based on size. The method is determined at runtime (i.e. it is not constexpr), but its answer is effectively known at load time because it can't change during program execution.

Using an atomic of non-atomic size relies on the runtime using a global lock (or a shard of global locks) to seamlessly prevent partial reads or writes. Normal loads/stores are unaffected, only atomics of unusual sizes go through this lock.

C has ATOMIC_*_LOCK_FREE macros for BOOL, CHAR, CHAR16_T, CHAR32_T, WCHAR_T, SHORT, INT, LONG, LLONG, and POINTER, which can be used at compile time. These macros are a tribool: never lock-free, sometimes lock-free, always lock-free.

There is a proposal to add constexpr atomic::is_always_lock_free to the language.

Note: things are a bit more complicated because C11 decided to standardize is_lock_free as taking a pointer to the object as well. Implementations ignore that pointer, I suggest we do the same.
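
For concreteness, here is a minimal C++ sketch (illustrative only; the printed values are target-dependent) of the three query forms just described: the runtime is_lock_free member, the C-style tribool macros, and the proposed is_always_lock_free, which later landed in C++17:

#include <atomic>
#include <cstdio>

int main() {
  std::atomic<int> a{0};
  std::atomic<long long> b{0};

  // Per-object runtime query (not constexpr); the answer is fixed at load time.
  std::printf("int is_lock_free:   %d\n", (int)a.is_lock_free());
  std::printf("llong is_lock_free: %d\n", (int)b.is_lock_free());

  // C-style compile-time tribool: 0 = never, 1 = sometimes, 2 = always.
  std::printf("ATOMIC_INT_LOCK_FREE:   %d\n", ATOMIC_INT_LOCK_FREE);
  std::printf("ATOMIC_LLONG_LOCK_FREE: %d\n", ATOMIC_LLONG_LOCK_FREE);

#if __cplusplus >= 201703L
  // The constexpr query proposed above, as later standardized.
  static_assert(std::atomic<int>::is_always_lock_free, "assumed lock-free");
#endif
}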

In all 3 cases we (WebAssembly implementors) need to decide what we want to guarantee. The runtime checks are easy (look at current machine, return that), the shard of global locks is easy (it's an implementation detail), but the compile-time guarantee of "yes this will always be lock free on all platforms WebAssembly will ever support" isn't that easy to make, and it's binding forever.

Having such a guarantee is pretty important: it's silly to write a lock-free algorithm if the data structure is only atomic through a lock (i.e. isn't actually lock-free!). A good example of 64-bit lock-freedom being useful is a doubly-linked list where the 2 pointers together use up 64 bits.
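
As a purely illustrative sketch of that use case (the Links/pack/replace_links names are hypothetical, not from any proposal), two 32-bit links of a wasm32-style doubly-linked node packed into a single 64-bit word so both can be updated with one compare-and-swap; this only pays off if 8-byte atomics are genuinely lock-free:

#include <atomic>
#include <cstdint>

struct Links { uint32_t prev, next; };  // two 32-bit "pointers" (linear-memory offsets)
static_assert(sizeof(Links) == sizeof(uint64_t), "must fit in one 64-bit CAS");

inline uint64_t pack(Links l) { return (uint64_t(l.next) << 32) | l.prev; }

// Atomically swing both links at once; fails if another thread changed either one.
bool replace_links(std::atomic<uint64_t>& node, Links expected, Links desired) {
  uint64_t e = pack(expected);
  return node.compare_exchange_strong(e, pack(desired));
}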

@sunfishcode points out:

  • ppc32, ppc64, x86-32, x86-64, sparcv9, systemz, bpf, mips32, mips64, arm32, and arm64 can all do int32 lock-free.
  • ppc64, x86-64, arm64, sparcv9, systemz, bpf, and mips64 can all do int64 lock-free.

I think we can safely say 1, 2 and 4 byte accesses are always lock-free (even if you can't do a byte load you can always cmpxchg a wider size).

@sunfishcode
Member

I would also like to propose that 8-byte accesses also be guaranteed to be lock-free in 64-bit mode. This would mean that some 32-bit platforms (like ppc32, mips32) wouldn't be able to run 64-bit wasm, though some (like x86-32) can, but as detailed above, all known relevant 64-bit platforms would be fine.

@titzer

titzer commented Aug 17, 2015

Is lock-freedom observable? E.g. would it be possible for an implementation to "violate" lock-freedom in a way that is observable to the program? Or is this more of a performance advisory?

@jfbastien
Member Author

It's mostly a performance advisory (albeit a significant one), but it is potentially possible to observe. A few examples:

  • When accessing the same memory location of non-lock-free size with atomic accesses as well as non-atomic accesses (this is UB).
  • A non-lock-free atomic access straddles a page boundary with different page protections.
  • During signal handling or asynchronous / timer events.
  • Communicating through shared memory (mmap based physical page sharing) with other wasm instances or non-wasm agents (potentially in other processes).

@kg
Contributor

kg commented Aug 17, 2015

How does this interact with unaligned access? Also, does 4-byte lock-free status guarantee that 1-byte and 2-byte IOs are lock-free? I can imagine an architecture where a 1-byte write turns into read-4-into-temp, write-1-into-temp, write-4-from-temp, in which case it would be very much not lock-free.

@lukewagner
Member

Since PostMVP.md#threads already mentions the goal of providing atomic ops and Atomics.isLockFree(size) is already in the SharedArrayBuffer draft, is what you're ultimately proposing that we explicitly mention that we'd add a wasm equivalent (new AST op, is a constant function, so foldable) of Atomics.isLockFree? If so, sgtm.

@jfbastien
Member Author

@kg your two questions:

  • Unaligned accesses of atomics are UB in C++. In wasm we should spec it as potentially tearing (local effect only).
  • 1 and 2 byte lock-free can be implemented as a cmpxchg of 4-byte values, with mask / shift / or of the 1 or 2 byte value into the 4 bytes.

@lukewagner I basically want to figure out what LLVM should do (both for constexpr as well as for the runtime function), because @sunfishcode's clang patch ran into this issue. Can we promise that 1, 2 and 4 byte accesses will always be lock-free, forever, for all architectures wasm will ever support?

@lukewagner
Member

I like the idea of shooting for the best semantics we can get, but I'm not sure mandating is_lock_free(1|2|4)=true wins us anything: either it will be true on all archs forever, or it won't, and the wasm impl on that arch will be forced to lie (have is_lock_free return true but implement with a lock); what choice does it have? Semantically, both cases behave the same. The difference is that in the latter case the engine is lying, so wasm code that could have used the information doesn't have it.

@jfbastien
Member Author

@lukewagner lying isn't an option IMO :-)
We're allowed to say one of three things for each size:

  1. Always lock-free;
  2. Never lock-free;
  3. Sometimes lock-free (and then give is_lock_free a true/false value at load time, based on the actual architecture).

The most conservative is to say "sometimes" in all cases, which is what I asked @sunfishcode to do in clang for now.

@kg
Contributor

kg commented Aug 18, 2015

@jfbastien

1 and 2 byte lock-free can be implemented as a cmpxchg of 4-byte values, with mask / shift / or of the 1 or 2 byte value into the 4 bytes.

Are you proposing that as a part of the '1/2/4-byte accesses are atomic' guarantee, we require conforming implementations to turn all sub-4-byte accesses into cmpxchg? That seems catastrophically non-performant.

@jfbastien
Member Author

@kg I indeed am, and it's lock-free so cheaper than the alternative :-)

@jfbastien
Member Author

@kg to clarify, I'm talking about atomic accesses only, not regular load/store.

@lukewagner
Member

@jfbastien Ok, then it sounds like we agree (to not require is_lock_free(1|2|4)=true) unless you're positing that no future arch will ever drop support for atomic 1/2/4-byte aligned accesses.

@sunfishcode
Member

@lukewagner I am positing that no future arch will ever drop support for lock-free 1/2/4-byte atomic accesses. The arch survey mentioned above includes all major architectures, and beyond that, the C++ memory model encourages architectures to support this, and it is one of the pillars of our work here.

@jfbastien
Member Author

I sure hope architectures support 1-byte atomics forever :-)
2 is questionable, but...
4-bytes seems like the accepted minimum for anything we'd care about, and in a pinch can be used to do 2 with cmpxchg.
8 is still out there for some architectures, but seems to be slowly making its way in.
16 is oh-so-tantalizing, but not everyone has the luxury of lock cmpxchg16b.
And don't get me started on the wonders of transactional memory, where I get an L1's worth of dirty pages!

So I tend to agree with @sunfishcode and count on 4 being 4eva.

@lukewagner
Member

Yeah, I was mostly wondering about 2. If we want to make this assumption, then we should add it to our list of assumptions.

@jfbastien
Member Author

2 is an assumption on efficient execution, but the make-or-break assumption we're potentially making is about 4. Being wrong on 4 is more than just performance breaking!

@lars-t-hansen
Contributor

From people working on MIPS I have heard that the implementation advice for 1-byte and 2-byte atomic RMW is to use a separate spinlock, not the 4-byte atomic LL/SC pair. I don't know where that advice came from originally, and I remain skeptical that it's good advice. Just sayin'. I have not yet investigated what the C++ compilers do on that platform.

I'm not sure how to interpret @jfbastien's comment earlier about 1-byte atomics. Surely all modern platforms are single-copy atomic for a single byte, and new platforms will be too, but MIPS and (last I looked) PPC have no instructions specifically for atomic 1-byte RMW. (Nor 2-byte.)

@kg
Contributor

kg commented Aug 18, 2015

@jfbastien

@kg to clarify, I'm talking about atomic accesses only, not regular load/store.

OK, we don't currently have atomic accesses in AstSemantics, so this was confusing. We should make sure to be clear that this only applies to atomic access opcodes (and introduce them, naturally).

@lars-t-hansen
Contributor

Related discussion happening in the Shared memory repo. I've noted there that some microcontroller systems that can and do run JS are probably not lock-free in our sense for four-byte data. For example, the Model 1 Raspberry Pi is an ARM11 chip with only a SWP instruction, suitable mainly for building a spinlock. I don't want to overemphasize the microcontroller use case, and microcontrollers are getting better fast, but JS is everywhere :)

@sunfishcode
Member

@lars-t-hansen My survey above was based on what's in LLVM, and LLVM seems to interpret "lock-free" for types to just mean that it can do plain atomic loads and stores of those types without locks. For example, it considers 1-byte, 2-byte, and 4-byte integer types to all be "lock-free" for all armv6 platforms. Unfortunately, the C++ standard itself is vague about what a "lock-free type" means.

@sunfishcode
Member

Specifically, consider this C code:

#if __GCC_ATOMIC_SHORT_LOCK_FREE != 2
#error short is not always lock free
#endif
short foo(short *x, short y, short z) {
  return __sync_val_compare_and_swap(x, y, z);
}

clang --target=armv6 b.c -S -O2 -o - gives no error and emits this:

.LBB0_1:
    ldrexh  r1, [r0]
    cmp r1, r12
    bne .LBB0_3
    strexh  r3, r2, [r0]
    cmp r3, #0
    bne .LBB0_1

Which to my understanding is not "lock-free" code. I will seek clarification.

@lars-t-hansen
Contributor

@sunfishcode, we probably need to figure out what we mean by the terminology then. On ARM and MIPS, that property is called "single-copy atomicity", which just boils down to a relaxed atomic load or store being implementable with a normal load or store instruction. I agree that all platforms I've surveyed, post-dating the first Alpha, are single-copy atomic for integer data (only) up to the native pointer size, a prerequisite for implementing Java.

The code above is what I refer to as lock-free: no external lock needed for a RMW operation.

(Incidentally, though not central to this issue, ldrexh is only available in ARMv6K and later, and I'm not sure that would work on the Raspberry Pi.)

@lukewagner
Member

@kg Atomics aren't in AstSemantics.md b/c AstSemantics.md is what's in the MVP; atomics are specifically called out in FutureFeatures.md#threads, where I think they belong. We could add a link from AstSemantics.md#linear-memory-operations to this, though.

@sunfishcode
Member

@lars-t-hansen I was reading the current C++ draft which incorporates the definition of lock-free from n3927 but I'm not confident I interpreted it correctly.

FWIW, I looked up the original Raspberry Pi and it has an ARM1176JZF-S which according to ARM does have ldrexh, strexh, ldrexb, and strexb.

@jfbastien
Member Author

I'm not sure how to interpret @jfbastien's comment earlier about 1-byte atomics, surely all modern platforms are single-copy atomic for a single byte and new platforms will be so, but MIPS and (last I looked) PPC have no instructions specifically for atomic 1-byte RMW. (Nor 2-byte.)

@lars-t-hansen what I meant is: if you have a 4-byte lock-free access then you can synthesize 1-byte and 2-byte accesses using shifts and masks. That's true for load/store as well as RMW.

See this LLVM code:
https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/AtomicExpandPass.cpp#L514
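
As a rough illustration of that shift/mask technique (a hypothetical sketch, not the LLVM pass itself): an 8-bit atomic exchange built from a 32-bit compare-exchange on the aligned word containing the byte, assuming little-endian byte numbering:

#include <atomic>
#include <cstdint>

uint8_t byte_exchange_via_u32(std::atomic<uint32_t>* word,  // aligned word containing the byte
                              unsigned byte_index,          // 0..3 within that word
                              uint8_t desired) {
  const unsigned shift = byte_index * 8;
  const uint32_t mask = 0xFFu << shift;
  uint32_t old_word = word->load(std::memory_order_relaxed);
  uint32_t new_word;
  do {
    // Splice the new byte into the word; retry if any byte changed underneath us.
    new_word = (old_word & ~mask) | (uint32_t(desired) << shift);
  } while (!word->compare_exchange_weak(old_word, new_word,
                                        std::memory_order_seq_cst,
                                        std::memory_order_relaxed));
  return uint8_t((old_word & mask) >> shift);  // previous value of the byte
}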

@pizlonator
Contributor

On Aug 18, 2015, at 7:25 AM, Dan Gohman wrote:

Which to my understanding is not "lock-free" code. I will seek clarification.

This code is lock free according to the C++ spec definition and according to other definitions of lock freedom with which I am familiar, unless you take a hard line on cache line interference, which is probably not a practically useful position to take.

Going point by point with the C++ definition:

  1. If this is the only thread running then this assembly snippet will return after one loop iteration.

  2. If there are many threads running in this process and some thread running this loop has to reloop, it can only be because: a) a context switch, which means that the next loop iteration will likely succeed because context switches don't happen that frequently; b) another thread succeeded in making progress on some lock-free algorithm on this same short and therefore that thread made progress; c) cache line interference.

The C++ spec appears to mumble and fumble on 2c for LL/SC, which is probably as good as it gets. The spirit of lock freedom is all about ensuring global progress rather than the progress of any one thread. Even ignoring 2c, lock freedom allows for some thread to get stuck indefinitely. Even if all of the threads executing this LL/SC loop on some short are stuck, it can only be because some other thread is succeeding in modifying some word on this cache line. Arguably that thread is making progress in that case.



@sunfishcode
Member

Thanks. I'm prepared to call a ldrexh/strexh loop lock-free then.

I looked up some reference points: the original Raspberry Pi and Cortex-M3 both have ldrexh/ldrexb/strexh/strexb. Cortex-M0 does not.

So:

  • Can we guarantee that 4-byte is always lock-free? I don't see any objections to this. 4 bytes is a pointer size on wasm32, so this is a desirable property.
  • Can we guarantee that 1 and 2-byte are always lock-free? Depending on interpretation, this may or may not be implied if we have 4-byte accesses. If not, some armv6 variants can't do this, though some, like the Raspberry Pi, can. Other microcontrollers may have similar limitations.
  • Can we guarantee that 8-byte accesses are always lock-free on wasm64? See my survey at the top for details. This would give wasm64 lock-free pointers too. I believe the main decision here is whether we're comfortable saying that some 32-bit platforms can't run 64-bit wasm with threads, since mips32 and ppc32 for example don't support 8-byte lock-free atomics (though 32-bit x86 does). [edit: I meant ppc32 instead of mips64].

@lars-t-hansen
Contributor

@pizlonator, doesn't that definition of lock-freedom mean that any size access implemented with a spinlock (eg using a test-and-set on a separate one-bit location, even one that is shared among all atomic accesses) is lock-free? I'm not necessarily opposed to that, but it is broader than I would have expected.

In the gcc documentation we have this:

"bool __atomic_always_lock_free (size_t size, void *ptr). This built-in function returns true if objects of size bytes always generate lock-free atomic instructions for the target architecture. size must resolve to a compile-time constant and the result also resolves to a compile-time constant. "

That's more limited than your definition but it also seems more useful - it says something about the specific performance characteristics and implementation of the access: atomic access, not (sharded) (spin)lock.
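
For reference, a small hedged example of how that builtin is typically used on gcc/clang (the pointer argument may be null when only the size and default alignment matter; the result folds to a compile-time constant):

// Illustrative only: gate a build on the compile-time lock-freedom answer.
static_assert(__atomic_always_lock_free(4, nullptr),
              "this code assumes 4-byte atomics are always lock-free");

constexpr bool two_byte_always_lock_free = __atomic_always_lock_free(2, nullptr);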

@sunfishcode, re ldrexh/strexh, serves me right for trusting Wikipedia on such details :)

@jfbastien, clearly a 4-byte op can be used to synthesize a 1-byte op and a 2-byte op, no discussion about that.

@pizlonator
Contributor

On Aug 19, 2015, at 1:16 AM, Lars T Hansen wrote:

@pizlonator, doesn't that definition of lock-freedom mean that any size access implemented with a spinlock (eg using a test-and-set on a separate one-bit location, even one that is shared among all atomic accesses) is lock-free? I'm not necessarily opposed to that, but it is broader than I would have expected.

No. The definition allows the OS scheduler to do anything it wants. It could schedule a thread that grabs the lock, but then preempt it just before it releases the lock and never pick that thread again. In that case, no other thread will make progress. For this reason the spin lock approach fails the lock-freedom test.

You're right that spin locks would be lock free if the OS scheduler had some fairness guarantee and the "spinning" part of the spin lock caused the OS to fairly round-robin through all runnable threads before coming back to the spinning thread.

In the gcc documentation we have this:

"bool __atomic_always_lock_free (size_t size, void *ptr). This built-in function returns true if objects of size bytes always generate lock-free atomic instructions for the target architecture. size must resolve to a compile-time constant and the result also resolves to a compile-time constant. "

That's more limited than your definition but it also seems more useful - it says something about the specific performance characteristics and implementation of the access: atomic access, not (sharded) (spin)lock.

Actually this text contained no definition of lock freedom. "Always generate lock-free atomic instructions" is not a definition of "lock free" nor does it preclude the use of a spin lock, since spin locks are indeed implemented using lock free instructions.

@sunfishcode
Member

@lars-t-hansen As one data point, LLVM's MIPS backend uses LL/SC for i16 and i8.

@jfbastien
Member Author

Using LL/SC or cmpxchg of i32 for i8 and i16 types implies that accessing the unmodified sub-parts of the i32 non-atomically can have unexpected outcomes on some hardware. C++ disallows this by controlling std::atomic's size. WebAssembly should have similar restrictions.

@sunfishcode
Member

@jfbastien Can you help me reconcile your interpretation with the fact that clang and libc++ seem to have no code for making the size or alignment of std::atomic<T> different from T?

@jfbastien
Member Author

I'm more familiar with ARM, where non-atomic stores do reset the global monitor set by ldrex (so the corresponding strex would fail). Note that there's trickiness on ARM when it comes to local monitors, which are spec'd to not necessarily be address-based (the implementation may assume ldrex is always paired with strex at the same address), but that's local and only applies for non-shareable memory (though it does mean that a core can self-deadlock if it performs any memory operation between a ldrex and strex).

x86 is obviously not a problem.

I'm not familiar enough with MIPS to know how their LL/SC is spec'd, and whether that's consistent across all of their SMP cores.

I'm told NVIDIA processors "just work" in these circumstances.

I do think that it's sensible for weak memory systems with multiple sockets to do what I've detailed (effectively TBAA based on atomic access types) but I don't have a public ISA document to point at to back it up. Let me poke a few people.

@pizlonator
Contributor

On Aug 25, 2015, at 10:36 AM, JF Bastien wrote:

I'm more familiar with ARM, where non-atomic stores do reset the global monitor set by ldrex (so the corresponding strex would fail). Note that there's trickiness on ARM when it comes to local monitors, which are spec'd to not necessarily be address-based (the implementation may assume ldrex is always paired with strex at the same address), but that's local and only applies for non-shareable memory (though it does mean that a core can self-deadlock if it performs any memory operation between a ldrex and strex).

Can you clarify if you believe that arm requires atomic accesses to be of the same word size as the corresponding nonatomic accesses?

I do think that it's sensible for weak memory systems with multiple sockets to do what I've detailed (effectively TBAA based on atomic access types) but I don't have a public ISA document to point at to back it up. Let me poke a few people.

I think that this would lead to a bizarre programming model, and I've seen lock-free algorithms that would break if they ran on hardware that did this.

Jikes RVM's thin/biased locks, GC header, and hash code logic relied on doing a mix of 32-bit LL/SC and 8-bit or 16-bit non-atomic accesses. That code was known to work on PPC.

WebKit on ARM uses 32-bit LL/SC to implement 8-bit weak CAS, and those bytes also get accesses by nonatomic 8-bit memory access instructions. WebKit is known to work on a variety of ARMs.

I'm not entirely opposed to specifying the behavior that you speak of if it is necessary to support some major architecture, but it strikes me as a very risky move based on my own experience with lock free algorithms.



@jfbastien
Member Author

Can you clarify if you believe that arm requires atomic accesses to be of the same word size as the corresponding nonatomic accesses?

Size doesn't matter in this context. ARM's global monitor protects wide address ranges, so non-atomics that are "too close" to atomics can cause the strex to fail. In particular, some of ARM's docs state:

The Exclusives Reservation Granule (ERG) is implementation defined, in the range 8-2048 bytes, in multiples of two bytes.

I agree with you that it would lead to a bizarre programming model, and that it would break some lock-free code. That's what developers get for running with sharp scissors ;-)

More seriously, I'm hoping this isn't something we need to spec; I couldn't find hardware that we care about where it's an issue. I'll post if I hear that it is. I'm simply cautious because C++ is pretty specific but obscure about this (the requirement is on std::atomic), and I assume it's this way for a reason.

@jfbastien
Member Author

I poked C++ WG21 SG1 folks who were involved in the current standard's wording. The worry which led to the current restrictions in C++ is mostly around "embedded processors that we don't know about".

Paul McKenney confirms that Alpha (pre-EV5) is of course the chief offender, but I think it's safe to ignore.

Lawrence Crowl points out that one interesting other case is the 80386:

The LOCK# signal is asserted during execution of the instruction following the lock prefix. This signal can be used in a multiprocessor system to ensure exclusive use of shared memory while LOCK# is asserted. The bts instruction is the read-modify-write sequence used to implement test-and-set.

If a different 80386 processor is concurrently executing an instruction that has a characteristic listed here, locked access is not guaranteed. The previous instruction:

  • Does not follow a lock prefix
  • Is not on the previous list of acceptable instructions
  • A memory operand specified has a partial overlap with the destination operand.

@jyasskin and Hans Boehm point out that processors which exhibit this issue with atomic / non-atomic mixing would also exhibit the same issue with pure non-atomic accesses: they'd also have to do e.g. 16-bit accesses using 32-bit cmpxchg or LL/SC.

Hans also says:

This is much less of an issue for uniprocessors. You just need some kernel help to restart the ld; replace bits; st sequence at the beginning if you get preempted in the middle. That's not a piece of cake, but it's easier to hack the kernel than the hardware.

TL;DR: we're probably good if we ignore this issue.

@sunfishcode
Member

@jfbastien does "TL;DR: we're probably good if we ignore this issue" mean that you support guaranteeing that i8, i16, and i32 on WebAssembly are all "lock free types" in the C/C++ sense?

@jfbastien
Member Author

Yes, but I'd phrase it as a platform assumption the same way we expect little-endian. I can create a PR to do this if it sounds good to you.

@sunfishcode
Member

That sounds good to me.

@lars-t-hansen
Contributor

FWIW, I'm not yet convinced that the JS shared memory spec will make the assumption that any particular size is lock free. So this may be another point of infidelity between the wasm spec and its JS polyfill. That's probably not a big deal, just noting it.

jfbastien added a commit that referenced this issue Sep 2, 2015
@jfbastien
Member Author

I'll mark this as closed for now, though @lars-t-hansen's point means we should probably revisit the exact guarantees in the future.
