This repository has been archived by the owner on Jan 25, 2022. It is now read-only.

Load elimination via single thread valid only optimization #56

Closed
taisel opened this issue Jan 30, 2016 · 28 comments

Comments

taisel commented Jan 30, 2016

https://github.com/taisel/IodineGBA/blob/0af5eaaaf08cc3a1a024974c354e2ae5554c1f3f/IodineGBA/core/graphics/Worker.js#L98

Already got bitten by load elimination this time, rather than addition. I had to write a specific comment about it on that line, as I originally had the load as a relaxed (plain) read, since only one thread ever writes into it. There are various other futex/atomic calls that should force the producer thread's updated value out to the load that got optimized away; it's just that IonMonkey removed the load, thinking the value would never change. I wish we had lower-level Atomics to avoid this kind of thing, as I try to optimize around blanket full fences.

Basically if the statement was:
gfxLinesCPU = gfxLineCounter[0]

instead of:
gfxLinesCPU = Atomics.load(gfxLineCounter, 0)

IonMonkey never reloads the value, because it doesn't consider another thread as existing in its optimization context.
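
To make the hazard concrete, here is a minimal sketch (hypothetical loop and variable names, not the actual IodineGBA code) of what the optimizer is allowed to do with a plain read when it assumes a single-threaded world:

// The consumer loop as written, with a plain (non-atomic) read:
while ((gfxLinesCPU | 0) < (linesNeeded | 0)) {
    gfxLinesCPU = gfxLineCounter[0] | 0;    // plain read: may legally be hoisted/eliminated
}

// What the compiler can turn it into, since no other writer is assumed to exist:
var cached = gfxLineCounter[0] | 0;
while ((gfxLinesCPU | 0) < (linesNeeded | 0)) {
    gfxLinesCPU = cached;                   // never observes the producer's later stores
}

With Atomics.load the compiler is not allowed to perform that transformation.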

The reason I find this to be a gray area is that it is, in a sense, deceptive to have a programmer explicitly back a variable with a SharedArrayBuffer and then have the optimizer not take that into account. I'd test v8 in Google Chrome as well, but my code errors because of https://code.google.com/p/chromium/issues/detail?id=31666 (Edit: no longer true; there's now a fallback that hoists code from a worker to main and from the subworker to the next worker up)

@taisel taisel changed the title Load elimination via single threaded optimization Load elimination via single thread valid only optimization Feb 1, 2016
@lars-t-hansen
Collaborator

So v1 of the spec will almost certainly not have relaxed atomics; you have to use the full atomics if you want to avoid races (as in your spinloop).

Obviously on some systems there will be some performance cost to using full atomics, but it would be nice to measure what that cost is. After all, the cost of the write might not matter much, only the cost of the atomics in the read loop. If the performance overhead of using Atomics is low then it doesn't matter (in this case) if you have to use them. On x86, an Atomics.load is just a normal load instruction, so there really is no additional cost. On ARM, there's the additional cost of a DMB before and after the load, though in this case we should be able to hoist those out of the loop (IonMonkey does not yet do that).
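
As a concrete point of reference (a sketch, not code from this issue), the read loop in question would look roughly like this with full atomics; the cost of interest is the per-iteration overhead of Atomics.load versus a plain indexed read:

// Spin until the producer publishes the expected value.
// "flag" is assumed to be an Int32Array backed by a SharedArrayBuffer.
function waitForValue(flag, index, expected) {
    while ((Atomics.load(flag, index) | 0) !== (expected | 0)) {
        // On x86 each Atomics.load compiles to an ordinary load;
        // on ARM it is bracketed by DMB barriers.
    }
}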

More generally, the rule that data races cause garbage to be read and written is not likely to change. There are various reasons for that. Current hardware is single-copy atomic up to pointer size for integer data (that is, two writes that race end up writing one datum or the other, not a mixture of them), but is not single-copy atomic for floating point data, for example (ARMv8 actually made the rule even looser than ARMv7 had it). Reads and writes can hit the memory system out of order as seen from the point of view of the CPU that issued them, leading to a situation that is very similar to "garbage" being observed if there is a race. And for programs that actually use shared memory it is more important to run quickly in the absence of races, and thus to allow aggressive optimization, than to always run slowly because there might be a race.

You might be able to make an argument here about which loads may and may not be eliminated along specific control flow paths, eg, if I have code like this I could get away with loading the value once:

let x = ta[n] + ta[n]

but if I have code like this I would have to load at least once through the body:

while (true) {
  if (ta[n]) break;
}

taisel commented Feb 2, 2016

So we're going to let optimizations that are valid only in single-thread contexts apply to explicitly allocated SharedArrayBuffers? I'd understand if it were an ordinary typed array that happened to get shared, but shared typed arrays are, for the time being, something the programmer has to specifically request. This is why the issue is brought up.

taisel commented Feb 2, 2016

Also, this issue is slightly different from a prior one I brought up, in that there's already a futex and a bunch of surrounding Atomics that cause fencing, so the issue here is about the optimization passes rather than the memory model.

@jfbastien
Contributor

Not everything in a SAB is shared, only a subset is. Forbidding valid optimizations when the user has all the mechanics needed to do the right thing is undesirable, though as Lars says we should quantify the impact. At the end of the day it sounds like you want relaxed accesses; I don't think patching over the lack of relaxed is a good approach.

taisel commented Feb 2, 2016

If the optimization passes that cause this issue are to stay, then my concern is with the way the shared backing store is allocated. Using "SharedArrayBuffer" in the language would lead a programmer to assume that the optimizer knows about such a structure and takes it into account.

@lars-t-hansen
Collaborator

@taisel, of course the compiler can know that there is such a thing as a SharedArrayBuffer, and obviously we should discuss what optimizations will be allowed in the presence of shared memory, as in Issue #51 and here.

But it is incorrect to say, as you do, that these optimizations are "valid" only for unshared memory.

First, we decide whether they are going to be valid or not in the memory model that we define, it's not like this is a truth handed down from up on high. For example, at the TC39 meeting last week the committee decided that it did not want the "quantum garbage" resulting from the rematerialization optimization in the presence of races (Issue #51), so now there's work going on to see if that can be avoided (it almost certainly can, at a modest cost).

Second, hoisting is a valid optimization in all race-free programs. It's undesirable to force all programs that run in shared memory to run at the speed of a racy program whose races we are trying to "fix" (this is JF's point).

And finally, unless you're willing to insert fences everywhere in the machine code it's unlikely that even a compiler that performs no optimization will be able to generate code that runs "as you intended it" on all hardware in the presence of data races, because the hardware reorders and interleaves operations (this is why hardware is not sequentially consistent).

If you're going to argue that an optimization should be illegal then please argue what you think the meaning of specific program fragments should be, and then we can at least discuss whether there's a rule that's plausible and see what the impact of it would be. At the moment I think all you're arguing is that the optimization breaks the user's conception of how the program should behave, without taking into account that the program has data races and therefore unpredictable behavior.

taisel commented Feb 2, 2016

The particular thing is that the load in question here sits between memory fences. The value isn't being updated at all unless it is specifically loaded via an atomic load as well. This implies the optimizer doesn't care about the underlying memory model and is optimizing based on the instruction patterns visible to a single worker. This transformation can break SharedArrayBuffer accesses that reside between Atomics and rely on those surrounding Atomics for fencing. My older, resolved issue was about forward progress without fencing; this case has timely fencing via Atomics on other addressed variables, and the optimizer is looking past them.
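
Reduced to a sketch (hypothetical names; cmd and lines are Int32Arrays on separate SharedArrayBuffers), the pattern being relied on is:

// Producer (main thread):
lines[0] = linesPassed | 0;             // plain write, meant to be published ...
Atomics.store(cmd, 1, endValue | 0);    // ... by this later atomic store

// Consumer (worker):
var end = Atomics.load(cmd, 1) | 0;     // atomic load on the adjacent variable
gfxLinesCPU = lines[0] | 0;             // plain read that IonMonkey eliminated

The expectation is that the atomic store/load pair fences the plain access to lines[0]; the elimination looks right past that.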

taisel commented Feb 2, 2016

Sorry I suck at conveying the proper issue experienced.

taisel commented Feb 2, 2016

@jfbastien: I do agree that optimizations valid only in single-thread contexts should still be applied, but only if the programmer doesn't opt out; here I believe I have opted out.

Edit: and which working group should I ask about requestAnimationFrame inside web workers? The spin loop mentioned earlier uses a futexWait+futexWake combo to hitch a ride off the rAF timing of the main thread: futexWake is called on the main thread, and futexWait sleeps until triggered by it.
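
Roughly, the trick looks like this (a sketch with hypothetical names; sync is an Int32Array on a SharedArrayBuffer, using the draft futexWait/futexWake API):

// Main thread: forward each rAF tick to the worker.
function pump() {
    Atomics.store(sync, 0, 1);
    Atomics.futexWake(sync, 0, 1);      // wake the single waiting worker
    requestAnimationFrame(pump);
}
requestAnimationFrame(pump);

// Worker: block until the main thread's next rAF fires.
function waitForFrame() {
    Atomics.store(sync, 0, 0);
    Atomics.futexWait(sync, 0, 0);      // returns once woken (or if sync[0] already changed)
}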

taisel commented Feb 2, 2016

I'm sorry if my wording sounds offensive or attacking. I understand this is an experimental API still under review with browser support still behind either a nightly build or runtime flags. I want to make sure this API is polished by the time regular web devs view it. Keep up the good work.

@lars-t-hansen
Collaborator

@taisel, I can't imagine anyone is offended by what you're writing. We're just trying to work out what the most appropriate semantics are.

@lukewagner

Edit: and which working group should I ask about requestAnimationFrame inside web workers?

I'm not exactly sure, but see bug 1203382 for more details.

taisel commented Feb 3, 2016

@lukewagner does this mean that, for the time being, the closest approach is to use that god-awful futexWake+futexWait trick for vsync?

Also IIRC the moz dev wiki has a typo somewhere saying rAF exists for workers at the bottom of https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers

taisel commented Feb 3, 2016

I'm sorry for reiterating the question constantly; hopefully this time it's phrased a bit better:

@lars-t-hansen Is it or is it not allowed to do atomic operations on adjacent shared-memory variables to fence and bound a timeframe* for updates from another thread (which also does atomic operations and fencing on adjacent variables), so as to eventually update a shared-memory variable that doesn't have atomics applied to it directly? The current IonMonkey doesn't recognize the adjacent fencing as disqualifying load elimination on the not-directly-synchronized array.

The big question is: if this is by design and not a lack of optimizer awareness, are we proceeding with the penalty of forcing all cross-thread synchronization to use atomics for each individual load/store, rather than a minimal number surrounding a block, synchronizing at the block level rather than per access (sketched below)?

*Timeframe meaning the relative timing that the atomic operations provide, which can still involve unbounded delays.
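
The block-level pattern in question, as a sketch (hypothetical names; ready and data are typed arrays on shared memory):

// Writer: fill a block with plain stores, then publish it with one atomic store.
data[0] = a | 0;
data[1] = b | 0;
data[2] = c | 0;
Atomics.store(ready, 0, 1);             // single publish point for the whole block

// Reader: one atomic load on the flag, then plain loads of the block.
if ((Atomics.load(ready, 0) | 0) === 1) {
    var x = data[0] | 0, y = data[1] | 0, z = data[2] | 0;
}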

@lukewagner

@taisel I assume so. OffscreenCanvas has not fully shipped on all platforms, so hopefully we'll have rAF-in-workers by the time it is fully released, since obviously the two go hand in hand.

taisel commented Feb 3, 2016

@lukewagner Now I'm afraid of what the fallback mechanisms will look like in the near future, if vendors implement rAF in workers at different times.

  • 16 ms timer
  • Use the ugly futex hack if sharedmem support exists
  • postMessage from main thread.

taisel commented Feb 3, 2016

Not sure if discouraging the futex trick for rAF should be proposed; maybe it should... I guess that's a question for another issue on shared mem...

"allowance of signals from main thread to be passed via futexWake"

@lukewagner

(Sorry, don't mean to derail this issue.) @taisel I think OffscreenCanvas is a long way away from being in most production browsers on most platforms and it seems quite likely each of those browsers would understand that rAF-in-workers is necessary. I pinged roc and 1203382 is planned, just blocking on a dependent bug (1184283) which is in progress. So I'd just assume rAF-in-workers for production use.

@lars-t-hansen
Collaborator

@taisel, I have no idea what question you are asking. I don't know what the "ugly futex trick" is. I don't know what "adjacent" variables has to do with anything. And I sure don't know what "The current IonMonkey doesn't oracle into recognizing the adjacent fencing as de-opting the load elimination on the not-directly synchronized array" is supposed to mean.

Generally though:

The memory model of the spec defines a happens-before relationship that guarantees that when a thread R atomically reads a variable V that was atomically written by some other thread W, then all other writes performed by W before it wrote to V - whether those writes were atomic or not - are also seen by the thread R. Most of your reads and writes can be non-atomic, it's just the synchronization that must be atomic, and this is all well-defined.

In addition, futexWait acts as a "read" and futexWake acts as a "write", so anything written by a thread before it calls futexWake becomes visible to a thread that was waiting in futexWait and was woken by that futexWake.
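
As a sketch of what that means in practice (hypothetical names; sync and frameData are typed arrays on shared memory):

// Thread W:
frameData[0] = pixel | 0;               // plain write
Atomics.futexWake(sync, 0, 1);          // acts as the "write" side of the synchronization

// Thread R (assuming it was actually asleep in the wait and woken by the call above):
Atomics.futexWait(sync, 0, 0);          // acts as the "read" side of the synchronization
var seen = frameData[0] | 0;            // guaranteed to observe W's plain write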

taisel commented Feb 4, 2016

In addition, futexWait acts as a "read" and futexWake acts as a "write", so anything written by a thread before it calls futexWake becomes visible to a thread that was waiting in futexWait and was woken by that futexWake.

And that's what's not happening when it should here with IonMonkey in this extreme optimizer edge case. I was presuming this was by design, which was driving me slightly insane.

@lars-t-hansen
Collaborator

In addition, futexWait acts as a "read" and futexWake acts as a "write", so anything written by a thread before it calls futexWake becomes visible to a thread that was waiting in futexWait and was woken by that futexWake.

And that's what's not happening when it should here with IonMonkey in this extreme optimizer edge case. I was presuming this was by design, which was driving me slightly insane.

That's extremely surprising given the structure of the implementation and I don't know why that would be, but I can investigate further. Do you have a smallish test case for this? (I'm [email protected] if you want to email something or create a bugzilla bug and cc me on it.)

Note it's important for the thread that's waiting to actually go into a wait and be woken up for the causality to exist; if futexWait returns with Atomics.NOTEQUAL or Atomics.TIMEDOUT there's no causality. (In the case of Atomics.NOTEQUAL there might be a causal write->read relationship though, and that may be sufficient. I think the spec is missing prose about that, I'll file a bug.)
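
In code, that caveat shows up in futexWait's return value (a sketch, assuming the draft's Atomics.OK/NOTEQUAL/TIMEDOUT result constants and a shared Int32Array sync):

var r = Atomics.futexWait(sync, 0, 0, 16);   // wait up to ~16 ms for a wake
if (r === Atomics.OK) {
    // Actually slept and was woken: writes made before the futexWake are visible here.
} else {
    // Atomics.NOTEQUAL or Atomics.TIMEDOUT: no causality from this call by itself.
}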

taisel commented Feb 4, 2016

I have no minimal test case yet, but I was trying to figure out the pattern the optimizer was seeing in earlier comments above.

@cosinusoidally

If I understand rightly you are initialising the SharedArrayBuffers here:

https://github.com/taisel/IodineGBA/blob/0af5eaaaf08cc3a1a024974c354e2ae5554c1f3f/IodineGBA/core/graphics/RendererShim.js#L39

this.gfxCommandCounters = getSharedInt32Array(3);
this.gfxLineCounter = getSharedInt32Array(1);

getSharedInt32Array defined here:

https://github.com/taisel/IodineGBA/blob/0af5eaaaf08cc3a1a024974c354e2ae5554c1f3f/IodineGBA/includes/TypedArrayShim.js#L99

function getSharedInt32Array(size_t) {
    try {
        //Compatibility for older Firefox Nightlies:
        return new SharedInt32Array(size_t);
    }
    catch (error) {
        return new Int32Array(new SharedArrayBuffer(size_t << 2));
    }
}

So we have 2 SharedArrayBuffers (one for gfxCommandCounters and one for gfxLineCounter).

As far as I can tell you atomically set values inside gfxCommandCounters and gfxLineCounter here:

https://github.com/taisel/IodineGBA/blob/0af5eaaaf08cc3a1a024974c354e2ae5554c1f3f/IodineGBA/core/graphics/RendererShim.js#L133

Atomics.store(this.gfxCommandCounters, 1, this.end | 0);
Atomics.store(this.gfxLineCounter, 0, this.linesPassed | 0);

and you atomically load values from each here:

https://github.com/taisel/IodineGBA/blob/0af5eaaaf08cc3a1a024974c354e2ae5554c1f3f/IodineGBA/core/graphics/Worker.js#L94

 var end = Atomics.load(gfxCommandCounters, 1) | 0;  //Written by the other thread.
 gfxLinesCPU = Atomics.load(gfxLineCounter, 0) | 0;  //Keep atomic; IonMonkey thinks this is dead code if it's removed.

Since gfxLineCounter and gfxCommandCounters are different SharedArrayBuffers, surely it would be a race condition if you replaced gfxLinesCPU = Atomics.load(gfxLineCounter, 0) | 0 with gfxLinesCPU = gfxLineCounter[0]|0; (and therefore IonMonkey could choose to eliminate the load)? Wouldn't it essentially be equivalent to:

Thread A:

Atomics.store(this.gfxLineCounter, 0, this.linesPassed | 0);

Thread B:

gfxLinesCPU = gfxLineCounter[0]|0;

regardless of whatever is going on with gfxCommandCounters?

taisel commented Feb 5, 2016

The race (if using the non-atomic load instead of the atomic one) is supposed to be bounded by design, such that it's allowed to go out of synchronization with other variables but gets synchronized past a certain code block; see the surrounding futexWait and the atomic loads and stores that should act as memory fences. The lack of updates spans multiple futexes and atomic loads and stores across various code blocks, and the value lags far enough behind gfxCommandCounters (well, it never gets updated at all) to cause funky line-skip logic.

taisel commented Feb 5, 2016

The issue I see is that the optimizer doesn't see the bounding of the race, if present.

taisel commented Feb 5, 2016

Interesting, I can't reproduce the original optimizer bug anymore in the latest nightly.
Lars, I'm wondering if this was an optimizer issue in IonMonkey itself, not really related to sharedmem.

Edit: And looking at the original racy code again, the faulty behavior observed should have been impossible even with load elimination, unless of course the optimizer substituted something like MAX_INT as a default value, or the load was split into multiple loads.

taisel commented Feb 5, 2016

Closing for now, unless I can re-verify or find spookiness again.

@taisel taisel closed this as completed Feb 5, 2016
@lars-t-hansen
Collaborator

Hrm. Spooky. I'll ask around.
