Atomics.fence() and SIMD fences? #25
So far as I know the SSE loads and stores participate in the ordering model on x86 like normal loads and stores do. LOCKed instructions should therefore order SSE loads and stores before and after as expected. In the absence of a LOCK you need an MFENCE or similar following a store to make sure it is not reordered with a subsequent load. The general workaround for a missing MFENCE instruction is Atomics.add(address, 0), which in Firefox will turn into a single LOCK ADD instruction. Linux uses this trick on older machines; JVM implementations use it even on some newer machines because it is faster than MFENCE (but this is CPU-implementation dependent). (Speaking strictly about Firefox, I'm not going to remove Atomics.fence in the near future, since that just causes unnecessary turbulence.)
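The Atomics.add(address, 0) workaround described above can be sketched directly in JavaScript. This is a minimal illustration under assumed names (the buffer layout and the dedicated fence cell are hypothetical), not Firefox's actual implementation:

```javascript
// Sketch of the fence-emulation trick: a no-op atomic read-modify-write
// (adding 0) acts as a full barrier, because on x86 it compiles down to
// a single LOCK ADD instruction.
const sab = new SharedArrayBuffer(16);
const mem = new Int32Array(sab);
const FENCE_CELL = 0; // hypothetical slot reserved solely for fencing

function fullFence() {
  // Adding 0 leaves memory unchanged but still executes an atomic RMW,
  // which keeps a preceding store from being reordered with a later load.
  Atomics.add(mem, FENCE_CELL, 0);
}

// The store->load pattern that needs a barrier on x86:
mem[1] = 1;           // plain (non-atomic) store
fullFence();          // stands in for MFENCE / Atomics.fence()
const seen = mem[2];  // load that must not be hoisted above the store
```

Doing the dummy add on a cell nothing else touches keeps the cache line uncontended, which is part of why this can beat an actual MFENCE on some CPUs.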
The following discussion with @hboehm and @jyasskin explains how fence was removed. Reproducing it here for posterity.
Right. The underlying problem is that fences may be useful when trying to optimize some code. I don't think we know how useful yet - mostly it was for implementing pthreads, and I haven't looked at that code at all. But IIRC @juj said that the current structure of the code (it's the musl library) did not lend itself well to using atomics.
Without some sort of weakly ordered atomics, fences really have no effect. I looked at the musl code quickly. It's not correct by C11 rules.
I may be about to talk nonsense as a result of jumping into an ongoing discussion, but I was wondering if the barrier semantics of the atomic operations needed to be spelt out. As I understand it, on x86 an instruction like LOCK INC will provide an implicit barrier. This may not be the case on other architectures (see, for example, http://developer.android.com/training/articles/smp.html).
There's also a question of whether stand-alone barriers are useful for lock-free data structures.
@PeteX, the memory model implicitly specifies the fences in its use of sequentially consistent memory order for all atomics. If you want to think of it that way, a fence is required before and after every atomic operation. (In implementation terms, a LOCK prefix is sufficient on x86 when you can use it; an x86 MFENCE is needed in some places if you don't use LOCK; on ARM, a barrier (DMB) is needed before and after atomic operations.)

It's pretty clear that release/acquire atomics, relaxed atomics, and possibly explicit fences will be useful for certain faster data structures. But I've opted not to include them in v1 of the spec, because they would seriously complicate it and (in my judgement) reduce the chances of the spec being adopted at all.
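Concretely, the "fence before and after every atomic" reading means plain message passing already works with only seq_cst atomics. A minimal sketch (the two sides would run in different agents sharing the buffer; all names here are illustrative):

```javascript
// Message passing relying on the spec's sequentially consistent
// atomics instead of any explicit fence.
const sab = new SharedArrayBuffer(8);
const shared = new Int32Array(sab);
const DATA = 0, READY = 1;

// Writer side:
function publish(value) {
  Atomics.store(shared, DATA, value); // seq_cst store of the payload
  Atomics.store(shared, READY, 1);    // seq_cst store; cannot be reordered
                                      // before the DATA store
}

// Reader side:
function tryConsume() {
  if (Atomics.load(shared, READY) === 1) { // seq_cst load
    return Atomics.load(shared, DATA);     // guaranteed to see the payload
  }
  return null; // nothing published yet
}
```

With release/acquire or relaxed atomics the two stores could use weaker orderings, which is exactly the optimization v1 leaves out.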
@lars-t-hansen, I'm not sure which part of the spec your first paragraph refers to. 7.2.1 says that 'Atomic accesses ... are performed in a single total order ... consistent with each agent's program order.' That makes sense, but doesn't there also need to be a discussion of data accesses, and an explicit statement that they must not be reordered in a way that moves them past an atomic access? Otherwise on ARM you could end up doing the fast (no system call) release of a futex before all the updates have been written back to the data object it is protecting. Another thread could then do a fast acquisition of the same futex, and read stale data from the object.

In a way I think it's a shame that the features in your last paragraph won't be in v1, because if they go in v2, people will have to handle browsers that support differing feature sets. However I'm not familiar with the politics, and I doubt the potential efficiency gains are very large. I doubt, for example, that there is much difference between a barrier instruction and an atomic operation on a cache line that is already held exclusively.

(As an aside, 7.2.1 also seems to allow non-aligned atomic accesses. Are you sure these are possible on all architectures without using a separate mutex?)
futex should also provide ordering. Could you point out where the wording doesn't guarantee it? That would be a bug in the spec :-) I'm not sure I understand your point on cachelines. Each cacheline update also requires ordering. Atomic accesses should require natural alignment, anything else would be a spec bug. |
@PeteX, the observability rules should constrain reordering appropriately, though I'm not sure if they do that yet. (The memory model is not complete.) In general it's my preference to keep that part of the spec conceptual - by using constraints - and not to start introducing a concept of fences.

Futexes are included in the rules for the happens-before relationship in the same manner as atomics, so if the rule is right for atomics it should be right for futexes. Modulo muddy prose in the spec, of course.

No non-aligned accesses should be possible, by design of the atomics - you can only use them via typed arrays, which are always aligned. Can you quote the text you think is ambiguous? There is text there to deal with overlapping atomics, which we have to contend with because typed arrays can be aliased, but in the current design if two atomic accesses overlap but are not the same then one of them addresses a subrange of the other (i.e. one is a word and the other is a halfword within that word).
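The overlap case described (one atomic access a word, the other a halfword contained in it) is easy to demonstrate with aliased typed arrays; the concrete values here are illustrative:

```javascript
// Two typed arrays of different element sizes aliasing the same
// SharedArrayBuffer: a 32-bit atomic access can contain a 16-bit one.
const sab = new SharedArrayBuffer(4);
const words = new Int32Array(sab);  // one 32-bit element
const halves = new Int16Array(sab); // two 16-bit elements, same bytes

Atomics.store(words, 0, 0x00010002); // word-sized atomic store
const low = Atomics.load(halves, 0); // halfword atomic inside that word
const high = Atomics.load(halves, 1);
// Which halfword holds 1 and which holds 2 depends on platform
// endianness, but because typed arrays are always element-aligned, the
// halfword access is entirely contained in the word: no misaligned
// overlap is possible.
```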
You're asking me to prove a negative! I read through the spec draft and it wasn't obvious to me that ordering was guaranteed, but perhaps I just didn't read it in the right way. Which part of the spec is supposed to guarantee ordering of non-atomic operations relative to atomics?

If futexes are going to be implemented the same way as on Linux, there will be a fast and a slow path through the logic. If you want to acquire a futex and you don't have to wait, you take the fast path. If you want to free a futex and you don't have to wake anyone, you take the fast path. These fast paths don't involve the futex system call, so in these cases, your data ordering depends exclusively on the semantics of the atomic operations. I agree it would be a strange implementation of futex if data accesses could be reordered in such a way that they moved past the futex system call—but that isn't the only problem. The fast/non-contended path through the logic also has to constrain reordering.
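The fast/slow path split being described can be sketched as a simple futex-style lock. This uses the present-day Atomics.wait/Atomics.notify names rather than the draft's futexWait/futexWake, and is a sketch of the classic three-state futex mutex, not the spec's algorithm:

```javascript
// States: 0 = unlocked, 1 = locked, 2 = locked with waiters.
const sab = new SharedArrayBuffer(4);
const lock = new Int32Array(sab);

function acquire() {
  // Fast path: uncontended CAS, no wait call at all. Ordering of the
  // protected data relies entirely on the atomics' seq_cst semantics.
  if (Atomics.compareExchange(lock, 0, 0, 1) === 0) return;
  // Slow path: mark the lock contended and sleep until woken.
  while (Atomics.exchange(lock, 0, 2) !== 0) {
    Atomics.wait(lock, 0, 2); // blocks only where waiting is permitted
  }
}

function release() {
  // Fast path: if nobody was waiting, the atomic exchange alone
  // releases the lock with no system-call-like operation.
  if (Atomics.exchange(lock, 0, 0) === 2) {
    Atomics.notify(lock, 0, 1); // slow path: wake one waiter
  }
}
```

Note that on both fast paths the only ordering comes from compareExchange/exchange themselves, which is exactly why those operations must constrain reordering of surrounding plain accesses.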
My point was just about efficiency. A fence is generally cheaper than an atomic operation, because it can be carried out entirely on one CPU core. When you do an atomic operation, by contrast, the cache coherency system has to acquire the relevant cache line in exclusive mode. Now imagine you're implementing a lock-free data structure, and you don't have access to explicit fences. All you have is atomic operations which provide a fence as a side-effect. In this case, if you're careful, you can reduce the cost by always doing the atomic operation on the same cache line. If no one else tries to use that memory location, it should remain in cache, in exclusive mode. The cost of the atomic operation is then much lower because there is no interaction with other cores. Ideally the cost wouldn't be much higher than an explicit fence.

The upshot is that, while lock-free data structures might benefit from explicit fences, I think the benefit may not be particularly high. Because of this, if you want to leave explicit fences out of the spec for political reasons, I don't think that's a huge problem. (I suppose it might be worth doing some tests and figuring out what the cost actually is.)

@lars-t-hansen: if the memory model is not yet complete, perhaps I'm commenting too soon! With the current wording, though, I couldn't see how the happens-before relationship works with normal data accesses. I now understand the point about overlapping atomic accesses; thanks for clearing that up. I think the spec is good on that point. (I was imagining an overlapping access where both were operating on a whole word, and at least one access was not word-aligned.)
I see your concern now. I thought you disagreed with the wording, but you're pointing out that it's missing entirely :-) @lars-t-hansen: to address this I suggest re-visiting synchronic (see #12). Specifically, the API was changed significantly (I believe that was your original concern) and it has solid wording. It was discussed at the C++ standards committee meeting's concurrency and parallelism group today. We are moving the paper to the library evolution group, and I think it has a pretty solid chance of making it to C++17, or at worst a TS. The new paper is http://wg21.link/P0126R0
That's handled by the wording in synchronic.
That's highly dependent on which instruction is performed, and which architecture we're executing on.
I believe you want this: http://wg21.link/N4522 Or their descendants (we're still tuning the wording). These may make it to C++17, and could also be in this specification but I think we should tackle a few other things first (acquire/release, relaxed, regular fences).
We're not leaving fences out for political reasons: we're leaving them out because we're only considering seq_cst ordering to start. It makes it easier to present the paper, start implementation work, find bugs in the design (e.g. #16), and get model-checking tools going. That allows us to have a minimally useful product which can be used to make a pretty solid case for adoption. That doesn't mean that other features won't come soon after (perhaps even before standardization).
Now is the perfect time to comment :-)
Sounds as though we're on the same page now—excellent! This is an exciting technology and I'm looking forward to seeing it become available in browsers.
I'm definitely already using Atomics.mfence in some code, with a feature check to use it if available (falling back to Atomics.load to do the fencing). The use case is a single-reader, single-writer FIFO ring buffer.
@taisel I'm not sure I understand your use case: you shouldn't need fencing with the current seq_cst semantics of load/store. Are you trying to use regular loads/stores as relaxed atomics (hence the need for a fence)?
Yes. It's a case where outdated reads aren't a problem, but the buffer pointed to by the possibly outdated indices must be correct up to wherever those indices point. It's pretty much the lock-free queue implementation described on Wikipedia: https://en.wikipedia.org/wiki/Non-blocking_algorithm - "a single-reader single-writer ring buffer FIFO, with a size which evenly divides the overflow of one of the available unsigned integer types, can unconditionally be implemented safely using only a memory barrier"
The usage of which I'm employing it for is audio/video streaming to main thread/UI from a worker. |
@taisel if you're exchanging more than one lock-free value then you need two barriers for this to work (one on the reader side, one on the writer side), and depending on what you're doing you probably just want acquire / release accesses (or the mythical consume). That Wikipedia article doesn't look like it's quite ready for prime-time. |
I'm sending data only in one direction. The staleness of the indices into the ring buffer doesn't matter, as long as they eventually get updated by the system. I only need to make sure the data within the buffer is correct on weakly ordered memory systems. The indices can be out of sync when weakly ordered systems propagate them to the other worker; it just means less of the buffer appears free or readable to the reader/writer than in reality. I just don't see a reason for the fencing on the writer side. That reader fence shouldn't even be necessary on x86, since the "staleness" viewed is in order (the store-before-load exception in x86 reordering doesn't apply).
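The single-producer/single-consumer ring buffer being discussed can be sketched with seq_cst Atomics for the indices, which is the fence-free way to get the ordering the writer needs (the payload store must be visible before the index that publishes it). Capacity and layout here are illustrative assumptions:

```javascript
// SPSC ring buffer: one agent calls push(), a different agent calls
// shift(). Seq_cst atomics on the indices order the plain payload
// accesses, so no explicit fence is required.
const CAPACITY = 1024; // power of two, so masking replaces modulo
const sab = new SharedArrayBuffer((CAPACITY + 2) * 4);
const buf = new Int32Array(sab, 0, CAPACITY);
const idx = new Int32Array(sab, CAPACITY * 4, 2);
const HEAD = 0, TAIL = 1; // HEAD: next slot to read, TAIL: next to write

// Producer side only:
function push(value) {
  const tail = Atomics.load(idx, TAIL);
  const head = Atomics.load(idx, HEAD);
  if (((tail + 1) & (CAPACITY - 1)) === head) return false; // full
  buf[tail] = value;                              // plain payload store
  Atomics.store(idx, TAIL, (tail + 1) & (CAPACITY - 1)); // publishes it
  return true;
}

// Consumer side only:
function shift() {
  const head = Atomics.load(idx, HEAD);
  if (head === Atomics.load(idx, TAIL)) return undefined; // empty
  const value = buf[head]; // safe: the TAIL store was ordered after buf[tail]
  Atomics.store(idx, HEAD, (head + 1) & (CAPACITY - 1));
  return value;
}
```

A stale HEAD on the producer side or a stale TAIL on the consumer side only under-reports free or readable space, exactly as described above; correctness is never at risk.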
Closing this because:
I recall we talked about the removal of the Atomics.fence() instruction from the spec. Current Emscripten codegen still emits it. Also, the Intel SSE1 and SSE2 instruction sets have three fence operations of their own; see https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=fence

Does a "generic" x86 fence also guarantee synchronization for SSE loads and stores? If one does e.g. an Atomics.compareExchange that has SIMD.js operations on both sides of it, are those guaranteed to be synchronized by the CAS?

In the absence of Atomics.fence(), would it be emulatable with Atomics.add(someDummyMemoryAddress, 0); or something like that? (I recall we talked about this as well, but my memory is fading.) See this issue in Emscripten: emscripten-core/emscripten#3051