Design: micro-wait primitives for efficient locks #87
One option is to take the stateless part of the library, just the spin-wait part. Make it clear it's not a viable long-term way to wait for something -- maybe call it try_wait(...) or similar.
Good, that corresponds with my intuition (but I didn't trust it).
Here's an idea. I just thought of it, so I don't yet know what its performance would be.

The spec doesn't export any spinning helpers, but it does provide non-normative encouragement for the user to never spin and instead just call futexWait on the slow path, sort of like glibc's lowlevellock does. futexWait is already allowed to return prematurely (NOTEQUAL and TIMEDOUT). We could get rid of those two return codes and simply return false from futexWait if it is returning due to something other than some thread doing futexWake. When false is returned, the client has no choice but to loop around again.

Then implementations do as follows. Each thread gets an internal thread-local variable called spinCount, initialized to whatever the implementation thinks is the optimal number of times to spin. For example, WebKit would pick 40. Every call to futexWait inspects the spinCount. If it is greater than zero, futexWait does whatever spin magic the implementation believes in (so WebKit would call sched_yield), decrements the spinCount, and returns false. If it is zero, futexWait resets it to 40 (or whatever the implementation's optimal spin setting is) and then actually does the futex wait operation, which means it may return false (on timeout or notequal) or true (some futexWake happened).

If you imagine how this will work, you will see that it causes many locking algorithms that do no spinning of their own to become like WebKit's adaptive 40-spins-with-sched_yield lock, except for an occasional hiccup where a lock does the real futexWait even though it would have been better to spin. I think this will not impact performance much, because for real-world programs that use locks there is a huge plateau of different spinning strategies that are roughly optimal. Some locking algorithms will become suboptimal adaptive locks because they assume futexWait has a high probability of returning true, but even such algorithms will still perform a bit better this way than if they had no spinning.

I looked at the following futex-based algorithms, and all of them would magically turn into adaptive locks if futexWait had this behavior:
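To make the mechanics of that proposal concrete, here is a rough JS-level sketch of the behavior described above. It is only a model: a real engine would implement this natively with true thread-local state, and the constant 40 plus the helpers doSpinMagic and realFutexWait are illustrative assumptions, not spec text.

```js
// Hypothetical model of the proposed futexWait behavior. `spinCount` stands in
// for a thread-local variable; 40 is the implementation's chosen spin limit.
const SPIN_LIMIT = 40;
let spinCount = SPIN_LIMIT;

function futexWaitWithImplicitSpin(ia, index, expected, timeout) {
  if (spinCount > 0) {
    doSpinMagic();        // e.g. a sched_yield-style hint; implementation-defined
    spinCount--;
    return false;         // not woken; the caller must loop around and re-check
  }
  spinCount = SPIN_LIMIT;
  // The real wait: returns true only if some thread performed futexWake.
  // Timeout and value-mismatch (the old TIMEDOUT/NOTEQUAL codes) both
  // collapse into a `false` return.
  return realFutexWait(ia, index, expected, timeout);
}

// A lock slow path that does no spinning of its own still becomes adaptive.
// (The matching unlock would have to futexWake unconditionally in this
// simplified form, since it does not track contention.)
function lockSlowPath(ia, lockIndex) {
  while (Atomics.compareExchange(ia, lockIndex, /*unlocked*/ 0, /*locked*/ 1) !== 0) {
    futexWaitWithImplicitSpin(ia, lockIndex, /*expected*/ 1);
  }
}
```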
I see the appeal of that API: the spinning is an implementation detail and does not show up in the spec at all. The API is a little poorer than the current one because it loses the information in the return value from futexWait, adding complexity to user code that wants to use timeouts, notably. Not sure how much that matters, since timeouts must be recomputed if we're looping around futexWait anyway. On the third hand, futexWait is currently specified so that such looping is generally not necessary (no spurious wakeups allowed), so the need to spin adds complexity there. Unavoidable, though, if we're going to spin.

I'm probably more inclined to decouple the spinning from futexWait and instead introduce something like an Atomics.pause() API that takes an iteration value (>= 0) as its argument; this method would do whatever is appropriate for the platform for that value - relax/yield/whatever. (There's some hand-waving here about the implementation making that decision quickly enough, but maybe, as you say, always yielding is about good enough.) Suppose the method returns true if it can go again for a higher count and false otherwise.

Used in combination with the futexWait we have, programs can then trade complexity (recomputing timeouts, managing the spin count, etc) for performance - just using futexWait without spinning is easy, but potentially slower; existing code like the libc we're using for emscripten is already managing its own spin loops and will benefit from the decoupling, I expect. (It might also make sense to expose Atomics.pause to asm.js, and call out to futexWait only when we can't spin more.)
@lars-t-hansen That mostly makes sense to me. Can you be clear on what you think pause() will do with the iteration value? I'd like to understand if we can get away with a pause() implementation that doesn't take any arguments. I'd also like to know what data anyone has on the profitability of different implementations of pause(). I've heard you and others make the claim that there is a portability issue with how you spin. I'm aware of this, but I'm optimistic that we can find something simpler than requiring the client to pass the iteration value. Here's what I know:
So, the highest-performance implementation of pause on POSIX platforms will be sched_yield, and the highest-performance implementation on Win32 will be the x86 pause instruction. This gets pretty weird, since if you do this then you have to communicate to the client that they must spin a different number of times on different platforms. I think we should avoid such a situation, because it means it will be too easy to write code that is not perf-portable. This is why I was looking for an easy fix that doesn't require exposing too much to the client.

Another option is to say that Atomics.pause takes no arguments and must do the moral equivalent of sched_yield, which on Win32 would mean busy-waiting about 100 times while executing 'pause'. The reason sched_yield is so good is that it's a hint to run another thread that the kernel is free to ignore, and in cases where there aren't other threads to run, it just introduces a tiny delay in execution. That delay is smaller than what you'd get from any of the sleep syscalls but larger than if you just used the x86 'pause' instruction. Win32 has very expensive yield/sleep operations (like SwitchToThread), so the best you can do is a tiny busy-wait loop to emulate the small delay of sched_yield.

To summarize, I'm fine with Atomics.pause() working the way you propose, but I'd like us to have a good justification for it. It should be more than just that someone ran a microbenchmark. While I appreciate that @ogiroux does crazy spinning in synchronic, that's just not how most production locks work. They either do no spinning (Linux) or they have a dead-simple spin loop (RVM, WebKit, etc) that doesn't try to do anything super smart.
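For contrast, here is a sketch (my own, not anything proposed above verbatim) of what a zero-argument Atomics.pause() would look like from the caller's side. The friction point is SPIN_COUNT: a value tuned for a sched_yield-based POSIX implementation would be wrong for a pause-instruction-based Win32 one, which is exactly the perf-portability concern.

```js
// Hypothetical caller of a no-argument Atomics.pause(); SPIN_COUNT is made up
// here, and the "right" value would differ across platforms.
const SPIN_COUNT = 40;

function waitForChange(ia, index, expected) {
  for (let i = 0; i < SPIN_COUNT; i++) {
    if (Atomics.load(ia, index) != expected) return;  // value changed; done
    Atomics.pause();  // moral equivalent of sched_yield (or ~100 x86 pauses)
  }
  // Spinning didn't help; fall back to the real wait.
  Atomics.futexWait(ia, index, expected);
}
```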
@pizlonator, re the system-specific behavior, I've been leaning on @ogiroux's testimony since he has more experience than me. What you're reporting backs it up, to some extent, but I understand your desire to hide it as much as possible / look for platform-independent solutions.

My intent was for Atomics.pause() to use the iteration value in some explicitly system-specific way to determine what to do, and to return false if calling pause() again with a higher value would not do anything useful. The contract would be simple and would ideally require pause() to have no state. On Darwin one would imagine that for an iteration value i < 10 (say), pause() would call sched_yield and return true; for i == 10 it would call sched_yield and return false; and for i > 10 it would just return false. The idea is that the only "communication" you need to give to the client is that it should increment the loop counter and check the return value; it doesn't need to be told what the loop limit is, it only needs to start at 0.

I agree that Atomics.pause() should have good justification behind it before we introduce it. At this point it might be possible to introduce it in Firefox, do a review of our futex implementation to check that it is as good as it should be (it isn't), and then try to write some programs with the API to see if we get some benefit out of it and whether the ergonomics are adequate.
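A minimal sketch of what that contract might look like, modelled in JS for clarity (an engine would do this natively): the limit of 10 and the schedYield helper are just the Darwin-flavored illustration from the comment above, not anything normative.

```js
// Hypothetical Darwin-style semantics for Atomics.pause(i):
// yield while spinning is still believed useful, then start returning false.
const YIELD_LIMIT = 10;  // implementation-chosen, invisible to the caller

function pause(i) {
  if (i < YIELD_LIMIT)   { schedYield(); return true;  }  // keep spinning
  if (i === YIELD_LIMIT) { schedYield(); return false; }  // last useful iteration
  return false;                                           // spinning no longer helps
}
```

The caller never sees YIELD_LIMIT; it only starts i at 0, increments it, and stops when pause returns false, which is what keeps the loop perf-portable.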
I'm happy with this. I'd like to see data on what the perf impact is over the obvious alternatives (not spinning at all or spinning without calling pause()), because that would make for useful informative text. -Filip
Any reason for discrete time values over abstract hint constants if it were to take an argument?
What abstract unit of time would you propose to use? Usually the challenge in a spin loop is that you want to sleep for less time than the amount of time that it takes to check the current time. -Filip
I'm afraid to even specify that. I'm personally against giving "time" to the pauses; the pauses should be defined only as a way to help circumvent wake/wait syscall latencies. If people want time, then add a function with "sleep" in its name.
@taisel, the typical usage pattern here would be that the thread wants to sleep until it receives a signal, but normally that signal is also accompanied by a change in a shared cell (which futexWait checks so that it does not go to sleep if the signal has already been sent). The spin loop before the futexWait is thus used to check for the value change very often for a while, if the signal can be expected "soon". (If the signal is expected to be slow there's no reason to spin.) A "sleep" primitive does not serve that purpose, since we don't know how long to sleep for. (I guess you could have three values MICRO, MINI, MAXI, but then you're back to handling your own backoffs, and that defeats the purpose to a fair extent, I would say.) The iteration count is so abstract that the user code does not have to worry. And again, there's no requirement to use pause(). What is now just a call to futexWait(a, n, v) would instead be:

```js
for (let i = 0;; i++) {
  if (Atomics.load(a, n) != v) return;
  if (!Atomics.pause(i)) break;
}
Atomics.futexWait(a, n, v);
```

Of course, this would have to perform significantly better (and generally no worse) than raw futexWait() in a range of scenarios to be worth the bother.
I'm starting to really like this primitive. This example is very convincing. -Filip
@lars-t-hansen Edit: Merging the comments from above... below:
The desires listed above might be better suited for a separate function call rather than Atomics.pause, to reduce the overhead of an Atomics.pause call itself.
Actually, there is nothing wrong with using futexWait for this purpose. -Filip
@pizlonator It feels awkward to specify a memory location when the use case doesn't call for a conditional.
I just remembered that window.setImmediate exists, so I'm wondering: if Atomics.pause exists on the main thread, how will the two differ in implementation? I understand Atomics.pause is blocking-based while setImmediate returns to the message pump, but I'm wondering if there's fine-print throttling that would differ between the two. After applying main-thread restrictions to Atomics.pause, of course.
Atomics.pause would be synchronous, setImmediate is asynchronous.
I actually think it should not be thought of as "blocking", but as an efficient spin-wait.
And I also don't think the main-thread restriction on Atomics.futexWait should apply to Atomics.pause. One of the intriguing prospects about pause is that it can be used to implement efficient optimistic waiting on the main thread without risking blocking the main thread due to a lost-signal bug. I'm winging the syntax here, but consider:

```js
async function expectUpdate(ia, index, current) {
  for (let i = 0;; i++) {
    if (Atomics.load(ia, index) != current) return;
    if (!Atomics.pause(i)) break;
  }
  await slowExpectUpdate(ia, index, current);
}
```

Here slowExpectUpdate probably needs some additional working storage for coordinating with whoever will signal it, but basically it registers its interest in a change of the value away from "current" and turns into a promise that will be resolved when that change is observed. I'm virtually certain that can be implemented on top of regular message events (since I have code that does that now, without the promises or async/await). This may or may not be an improvement over other alternatives and we'll have to experiment, but in situations with high message frequency it could make communicating with the main thread more affordable.
POC for Firefox: https://bugzil.la/1260432. This does not inline Atomics.pause(), and I'm not sure why it would, either, but it could. I adapted a test program I had lying around to use Atomics.pause() before futexWait, and the performance improved adequately (more than 20x). But (a) my futex implementation is still not realistic enough (it has a single global lock), and (b) the test program is probably not representative. More data needed, for sure. The test program in question is the futex.html case here, see futex-worker.js.
I understand that concurrency primitives can be tricky to understand and futexWait is no exception. I suggest reading: https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf -Filip
@pizlonator In most cases I do have a use for the conditional in futexWait, and I understand futexWait has to be passed a shared array to manage the signals. My comment pertains to the rarer case where a general, non-conditional mutex wait is all that's needed, with no user-space check. Though passing a shared array with no target index to such a function seems awkward to use as well. I also completely understand reducing the API surface to keep things simple. I'm asking whether adding a slightly more specific wait is worth the extra function in the API, and whether the overhead difference justifies having it.
You are wrong about mutex non-conditional wait. It will use the conditional in the futexWait. If you aren't using it then you aren't using futexes correctly. -Filip
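To illustrate the point in the previous comment, here is a minimal sketch (my own, in the style of the classic three-state futex mutex described in the literature linked above) of a plain mutex built on futexWait/futexWake. Even with no application-level condition, the expected-value argument is what closes the race between a waiter going to sleep and the holder unlocking; the state names and layout are illustrative.

```js
// 0 = unlocked, 1 = locked (no waiters), 2 = locked (maybe waiters).
const UNLOCKED = 0, LOCKED = 1, CONTENDED = 2;

function lock(ia, i) {
  let c = Atomics.compareExchange(ia, i, UNLOCKED, LOCKED);
  if (c !== UNLOCKED) {
    if (c !== CONTENDED) c = Atomics.exchange(ia, i, CONTENDED);
    while (c !== UNLOCKED) {
      // Sleep only if the lock is still CONTENDED; if the holder already
      // unlocked, futexWait's value check makes this a no-op and we retry.
      Atomics.futexWait(ia, i, CONTENDED);
      c = Atomics.exchange(ia, i, CONTENDED);
    }
  }
}

function unlock(ia, i) {
  if (Atomics.exchange(ia, i, UNLOCKED) === CONTENDED) {
    Atomics.futexWake(ia, i, 1);  // someone may be sleeping; wake one waiter
  }
}
```

Dropping the expected-value check here would reintroduce the lost-wakeup race that futexes exist to solve.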
I've weighed it and decided against using the conditional only in an extremely rare case where it's known the worker will always finish before the next signal, and where it's allowed to skip one signal (by design). I think the misunderstanding is whether a signal skip is required or merely allowed for in the batching used. I decided against setting/clearing additional variables to track whether the signal was received, to avoid that minuscule overhead. I did have a hard time justifying the elision of the conditional, and almost everywhere else I do indeed use the conditional in futexWait, as I agree it's very necessary.
That's not the common case. The common case is that workers that share an array buffer will want to have many mutexes or condvars. -Filip
@pizlonator I'm just golfing with suggestions here, in the hope that the published API is polished when it goes out the door. I'm hoping that the overhead of the rarer cases is taken into account, that power saving is promoted, and that the API stays developer-friendly. Don't worry, I approve of the API as it is; I'm just trying to find corner cases or possible weirdness/overhead.
I do want to be refuted. The comment golf is to verify we are on the right path.
@pizlonator, @jfbastien, @ogiroux
With the demise of the synchronic proposal (Issue #12) we're back to futexes as the main blocking mechanism. From that come some further issues we should discuss:
(Further examples of platform variations are here.)
For v1 we can perhaps get away with not providing more than futexes and the ability to spin in a regular loop before going into futexWait, i.e., what we already have, but if we know we're going to need more, this would be a good time to have the discussion.
(A related issue is #64, an idea that it is good to allow micro-waiting on the main thread even if we don't allow blocking.)
The first order of business should be to determine whether such a mechanism is required and what its (minimal + nice-to-have) aspects are.