
Atomics.fence() and SIMD fences? #25

Closed
juj opened this issue Sep 3, 2015 · 19 comments

@juj

juj commented Sep 3, 2015

I recall we talked about removing the Atomics.fence() operation from the spec; current Emscripten codegen still emits it. Also, the Intel SSE1 and SSE2 instruction sets provide three fence instructions of their own (SFENCE, LFENCE, and MFENCE); see https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=fence

Does "generic" x86 fence also guarantee synchronization for SSE loads and stores? If one does e.g. an Atomics.compareExchange that has SIMD.js operations on both sides of it, are those guaranteed to be synchonized by the cas?

In the absence of Atomics.fence(), would it be emulatable with Atomics.add(someDummyMemoryAddress, 0); or something like that? (I recall we talked about this as well, but my memory is fading). See this issue in Emscripten: emscripten-core/emscripten#3051 .

@lars-t-hansen
Collaborator

So far as I know, SSE loads and stores participate in the x86 ordering model just like normal loads and stores do. LOCKed instructions should therefore order SSE loads and stores before and after them as expected. In the absence of a LOCK you need an MFENCE or similar following a store to make sure it is not reordered with a subsequent load.

The general workaround for a missing mfence instruction is Atomics.add(address, 0), which in Firefox will turn into a single LOCK ADD instruction. Linux uses this trick on older machines; JVM implementations use it even on some newer machines because it is faster than MFENCE (though this is CPU-implementation dependent).
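
Concretely, the workaround looks something like this (a sketch only; the shared view and the reserved cell are illustrative, not part of any spec):

```js
// A dummy word reserved for fencing; any cell nobody else contends on
// will do.
const fenceCell = new Int32Array(new SharedArrayBuffer(4));

function emulatedFence() {
  // A seq_cst read-modify-write that changes nothing: a single
  // LOCK ADD on x86, and implies full barriers (DMB) on ARM.
  Atomics.add(fenceCell, 0, 0);
}
```

Keeping the dummy word on a cache line no other agent touches keeps the LOCK ADD cheap, which is also the locality point PeteX raises further down.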

(Speaking strictly about Firefox, I'm not going to remove Atomics.fence in the near future, since that just causes unnecessary turbulence.)

@jfbastien
Contributor

The following discussion with @hboehm and @jyasskin explains how fence was removed. Reproducing it here for posterity.

Hans Boehm Mar 2, 2015
My immediate reaction is to kill this. It's hard to specify correctly. If you provide sequential consistency for race-free programs, as you should, then fences are only observable for racy programs, which you shouldn't write.

The most serious down-side to this seems to be that seqlocks (increment sequence number twice on write, check that it's unchanged on read) become much less useful. Does it matter?

jyasskin Mar 2, 2015
If they add back in non-seq_cst atomics, they'll probably need fences in order to compile C++11 atomic-using code to JavaScript. If they don't do that, I think I agree.

lhansen Mar 3, 2015
All noted.

lhansen Mar 12, 2015
I will kill this for now and note its possible reappearance in the section where I discuss non-seq_cst atomics.
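
For reference, the seqlock Hans mentions looks roughly like this. This is a sketch only, with seq_cst Atomics for the sequence word and plain typed-array accesses for the data; those plain accesses are exactly the races that make a fence-based version hard to specify. All names are illustrative.

```js
// One SharedArrayBuffer: index 0 of `seq` holds the sequence number,
// `data` holds the protected payload.
const sab = new SharedArrayBuffer(64);
const seq = new Int32Array(sab, 0, 1);
const data = new Int32Array(sab, 4);

function publish(values) {
  Atomics.add(seq, 0, 1);                 // odd: write in progress
  for (let i = 0; i < values.length; i++) data[i] = values[i]; // racy
  Atomics.add(seq, 0, 1);                 // even: write complete
}

function snapshot(out) {
  for (;;) {
    const s = Atomics.load(seq, 0);
    if (s & 1) continue;                  // writer active, retry
    for (let i = 0; i < out.length; i++) out[i] = data[i];     // racy
    if (Atomics.load(seq, 0) === s) return; // unchanged: consistent
  }
}
```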

@lars-t-hansen
Collaborator

Right. The underlying problem is that fences may be useful when trying to optimize some code. I don't think we know how useful yet - mostly it was for implementing pthreads, and I haven't looked at that code at all. But IIRC @juj said that the current structure of the code (it's the musl library) did not lend itself well to using atomics.

@hboehm

hboehm commented Sep 3, 2015

Without some sort of weakly ordered atomics, fences really have no effect for correct code. I think if we were to introduce them now, we really couldn't say what they mean.

I looked at the musl code quickly. It's not correct by C11 rules. It seems to be written in roughly a Linux kernel style, using volatile instead of atomic, and including explicit fences. This isn't technically correct, since it's officially full of data races. But it would probably really fail only on an architecture with relaxed alignment, which is exceedingly rare. It would need a bit of an overhaul to make it work with this model, but I think the overhaul would be fairly straightforward. And doing so would probably improve its performance and clarity; the ARM code seems to include substantially more fences than it actually needs, and the i386 implementation of a_and_64 is, well, interesting.


@PeteX

PeteX commented Oct 19, 2015

I may be about to talk nonsense as a result of jumping into an ongoing discussion, but I was wondering if the barrier semantics of the atomic operations needed to be spelt out. As I understand it, on x86 an instruction like LOCK INC will provide an implicit barrier. This may not be the case on other architectures, for example http://developer.android.com/training/articles/smp.html says:

On ARM, atomics have no memory barrier semantics. While a series of atomic RMW [read-modify-write] operations on a single address will be observed in program order by other cores, there are no guarantees when it comes to the ordering of atomic and non-atomic operations.

There's also a question of whether stand-alone barriers are useful for lock-free data structures.

@lars-t-hansen
Collaborator

@PeteX, the memory model implicitly specifies the fences in its use of sequentially consistent memory order for all atomics. If you want to think of it that way, a fence is required before and after every atomic operation.

(In implementation terms, a LOCK prefix is sufficient on x86 when you can use it; an x86 MFENCE is needed in some places if you don't use LOCK; on ARM, a barrier (DMB) is needed before and after atomic operations.)

It's pretty clear that release/acquire atomics, relaxed atomics, and possibly explicit fences, will be useful for certain faster data structures. But I've opted not to include them in v1 of the spec, because they will seriously complicate it and (in my judgement) reduce the chances of the spec being adopted at all.

@PeteX

PeteX commented Oct 20, 2015

@lars-t-hansen—I'm not sure which part of the spec your first paragraph refers to. 7.2.1 says that 'Atomic accesses ... are performed in a single total order ... consistent with each agent's program order.' That makes sense, but doesn't there also need to be a discussion of data accesses, and an explicit statement that they must not be reordered in a way that moves them past an atomic access? Otherwise on ARM you could end up doing the fast (no system call) release of a futex before all the updates have been written back to the data object it is protecting. Another thread could then do a fast acquisition of the same futex and read stale data from the object.

In a way I think it's a shame that the features in your last paragraph won't be in v1, because if they go in v2, people will have to handle browsers that support differing feature sets. However I'm not familiar with the politics and I doubt the potential efficiency gains are very large. I doubt, for example, that there is much difference between a barrier instruction and an atomic operation on a cache line that is already held exclusively.

(As an aside, 7.2.1 also seems to allow non-aligned atomic accesses. Are you sure these are possible on all architectures without using a separate mutex?)

@jfbastien
Contributor

futex should also provide ordering. Could you point out where the wording doesn't guarantee it? That would be a bug in the spec :-)

I'm not sure I understand your point on cachelines. Each cacheline update also requires ordering.

Atomic accesses should require natural alignment, anything else would be a spec bug.

@lars-t-hansen
Collaborator

@PeteX, the observability rules should constrain reordering appropriately, though I'm not sure if they do that yet. (The memory model is not complete.) In general it's my preference to keep that part of the spec conceptual - by using constraints - and not to start introducing a concept of fences.

Futexes are included in the rules for the happens-before relationship in the same manner as atomics so if the rule is right for atomics it should be right for futexes. Modulo muddy prose in the spec, of course.

No non-aligned accesses should be possible, by design of the atomics - you can only use them via typed arrays, which are always aligned. Can you quote the text you think is ambiguous? There is text there to deal with overlapping atomics, which we have to contend with because typed arrays can be aliased, but in the current design if two atomic accesses overlap but are not the same then one of them addresses a subrange of the other (i.e., one is a word and the other is a halfword within that word).
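
To make the overlap case concrete, here is an illustrative sketch (the views and values are hypothetical):

```js
// Two views aliasing the same SharedArrayBuffer. Every aligned Int16
// element lies inside some aligned Int32 element, so two overlapping
// atomic accesses can only nest, never straddle.
const sab = new SharedArrayBuffer(8);
const words = new Int32Array(sab);    // elements at byte offsets 0, 4
const halves = new Int16Array(sab);   // elements at byte offsets 0, 2, 4, 6

Atomics.store(words, 0, 0x7fff0001);  // word access covering bytes 0..3
Atomics.load(halves, 1);              // halfword access covering bytes 2..3,
                                      // a subrange of the word above
```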

@PeteX

PeteX commented Oct 23, 2015

@jfbastien

futex should also provide ordering. Could you point out where the wording doesn't guarantee it?

You're asking me to prove a negative! I read through the spec draft and it wasn't obvious to me that ordering was guaranteed, but perhaps I just didn't read it in the right way. Which part of the spec is supposed to guarantee ordering of non-atomic operations relative to atomics?

If futexes are going to be implemented the same way as on Linux, there will be a fast and a slow path through the logic. If you want to acquire a futex and you don't have to wait, you take the fast path. If you want to free a futex and you don't have to wake anyone, you take the fast path. These fast paths don't involve the futex system call, so in these cases, your data ordering depends exclusively on the semantics of the atomic operations. I agree it would be a strange implementation of futex if data accesses could be reordered in such a way that they moved past the futex system call—but that isn't the only problem. The fast/non-contended path through the logic also has to constrain reordering.
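
To make that concrete, here is a sketch of such a futex-backed lock, using the draft's futexWait/futexWake operations. The three-state encoding is the usual Linux-style one (Drepper's "Futexes Are Tricky"); the names and encoding are mine, not the spec's.

```js
// `m` is an Int32Array cell on a SharedArrayBuffer:
// 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
function lock(m) {
  let c = Atomics.compareExchange(m, 0, 0, 1);
  if (c !== 0) {                            // fast path failed
    do {
      // Mark the lock contended, then sleep until woken.
      if (c === 2 || Atomics.compareExchange(m, 0, 1, 2) !== 0) {
        Atomics.futexWait(m, 0, 2);
      }
    } while ((c = Atomics.compareExchange(m, 0, 0, 2)) !== 0);
  }
}

function unlock(m) {
  if (Atomics.sub(m, 0, 1) !== 1) {         // there may be waiters
    Atomics.store(m, 0, 0);
    Atomics.futexWake(m, 0, 1);             // slow path: wake one waiter
  }
  // Otherwise this is the fast path described above: no futexWake at
  // all, so the ordering of plain data accesses around these atomics
  // is exactly what the memory model has to constrain.
}
```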

I'm not sure I understand your point on cachelines. Each cacheline update also requires ordering.

My point was just about efficiency. A fence is generally cheaper than an atomic operation, because it can be carried out entirely on one CPU core. When you do an atomic operation, by contrast, the cache coherency system has to acquire the relevant cache line in exclusive mode.

Now imagine you're implementing a lock-free data structure, and you don't have access to explicit fences. All you have is atomic operations which provide a fence as a side-effect. In this case, if you're careful, you can reduce the cost by always doing the atomic operation on the same cache line. If no one else tries to use that memory location, it should remain in cache, in exclusive mode. The cost of the atomic operation is then much lower because there is no interaction with other cores. Ideally the cost wouldn't be much higher than an explicit fence.

The upshot is that, while lock-free data structures might benefit from explicit fences, I think the benefit may not be particularly high. Because of this, if you want to leave explicit fences out of the spec for political reasons, I don't think that's a huge problem. (I suppose it might be worth doing some tests and figuring out what the cost actually is.)

@lars-t-hansen — If the memory model is not yet complete, perhaps I'm commenting too soon! With the current wording, though, I couldn't see how the happens-before relationship works with normal data accesses.

I now understand the point about overlapping atomic accesses; thanks for clearing that up. I think the spec is good on that point. (I was imagining an overlapping access where both were operating on a whole word, and at least one access was not word-aligned.)

@jfbastien
Contributor

You're asking me to prove a negative! I read through the spec draft and it wasn't obvious to me that ordering was guaranteed, but perhaps I just didn't read it in the right way. Which part of the spec is supposed to guarantee ordering of non-atomic operations relative to atomics?

I see your concern now. I thought you disagreed with the wording, but you're pointing out that it's missing entirely :-)

@lars-t-hansen: to address this I suggest re-visiting synchronic (see #12). Specifically, the API was changed significantly (I believe that was your original concern) and it has solid wording. It was discussed at the C++ standards committee meeting's concurrency and parallelism group today. We are moving the paper to the library evolution group, and I think it has a pretty solid chance of making it to C++17, or at worst a TS.

The new paper is http://wg21.link/P0126R0

If futexes are going to be implemented the same way as on Linux, there will be a fast and a slow path through the logic. If you want to acquire a futex and you don't have to wait, you take the fast path. If you want to free a futex and you don't have to wake anyone, you take the fast path. These fast paths don't involve the futex system call, so in these cases, your data ordering depends exclusively on the semantics of the atomic operations. I agree it would be a strange implementation of futex if data accesses could be reordered in such a way that they moved past the futex system call—but that isn't the only problem. The fast/non-contended path through the logic also has to constrain reordering.

That's handled by the wording in synchronic.

My point was just about efficiency. A fence is generally cheaper than an atomic operation, because it can be carried out entirely on one CPU core. When you do an atomic operation, by contrast, the cache coherency system has to acquire the relevant cache line in exclusive mode.

That's highly dependent on which instruction is performed, and which architecture we're executing on.

Now imagine you're implementing a lock-free data structure, and you don't have access to explicit fences. All you have is atomic operations which provide a fence as a side-effect. In this case, if you're careful, you can reduce the cost by always doing the atomic operation on the same cache line. If no one else tries to use that memory location, it should remain in cache, in exclusive mode. The cost of the atomic operation is then much lower because there is no interaction with other cores. Ideally the cost wouldn't be much higher than an explicit fence.

I believe you want this: http://wg21.link/N4522
Coupled with this: http://wg21.link/N4523

Or their descendants (we're still tuning the wording). These may make it to C++17, and could also be in this specification but I think we should tackle a few other things first (acquire/release, relaxed, regular fences).

The upshot is that, while lock-free data structures might benefit from explicit fences, I think the benefit may not be particularly high. Because of this, if you want to leave explicit fences out of the spec for political reasons, I don't think that's a huge problem. (I suppose it might be worth doing some tests and figuring out what the cost actually is.)

We're not leaving fences out for political reasons: we're leaving them out because we're only considering seq_cst ordering to start. It makes it easier to present the paper, start implementation work, find bugs in the design (e.g. #16), and get model-checking tools going. That allows us to have a minimally useful product which can be used to make a pretty solid case for adoption. That doesn't mean that other features won't come soon after (perhaps even before standardization).

@lars-t-hansen — If the memory model is not yet complete, perhaps I'm commenting too soon! With the current wording, though, I couldn't see how the happens-before relationship works with normal data accesses.

Now is the perfect time to comment :-)

@PeteX

PeteX commented Oct 23, 2015

Sounds as though we're on the same page now—excellent! This is an exciting technology and I'm looking forward to seeing it become available in browsers.

@taisel

taisel commented Jan 2, 2016

I'm definitely already using Atomics.mfence in some code, with a feature check to use it if available (falling back to Atomics.load to do the fencing). The use case is a single-reader, single-writer FIFO ring buffer.

@jfbastien
Contributor

@taisel I'm not sure I understand your use case: you shouldn't need fencing with the current seq_cst semantics of load/store. Are you trying to use regular load/store as relaxed atomics (hence the need for a fence)?

@taisel

taisel commented Jan 2, 2016

Yes. It's a case where outdated reads aren't a problem, but the buffer pointed to by the possibly outdated reads must be correct up to whatever the possibly outdated offsets say has been written.

It's pretty much the implementation of a lock-free queue described on Wikipedia:

https://en.wikipedia.org/wiki/Non-blocking_algorithm

"a single-reader single-writer ring buffer FIFO, with a size which evenly divides the overflow of one of the available unsigned integer types, can unconditionally be implemented safely using only a memory barrier"

@taisel

taisel commented Jan 2, 2016

I'm using it for audio/video streaming from a worker to the main/UI thread.

@jfbastien
Contributor

@taisel if you're exchanging more than one lock-free value then you need two barriers for this to work (one on the reader side, one on the writer side), and depending on what you're doing you probably just want acquire / release accesses (or the mythical consume). That Wikipedia article doesn't look like it's quite ready for prime-time.

@taisel

taisel commented Jan 2, 2016

I'm sending data only in one direction. The staleness of the indices into the ring buffer doesn't matter, as long as they eventually get updated by the system. I only need to make sure the data within the buffer is correct on weakly ordered memory systems. The indices can be out of sync when weakly ordered systems propagate them to the other worker; it just means less of the buffer appears free or readable to the reader/writer than in reality. I just don't see a reason for the fencing on the writer side. That reader fence shouldn't even be necessary on x86, since the "staleness" viewed is in order (the store-before-load exception to x86 ordering doesn't apply).

@lars-t-hansen
Collaborator

Closing this because:

  • our fences are gone and the world did not end
  • the issue discussed above is now tracked by Formal memory model (#88), but I will open a separate issue to track it directly.
