This repository has been archived by the owner on Jan 25, 2022. It is now read-only.

Load elimination via single thread valid only optimization #56

Closed
taisel opened this issue Jan 30, 2016 · 28 comments

Comments

taisel commented Jan 30, 2016

https://github.com/taisel/IodineGBA/blob/0af5eaaaf08cc3a1a024974c354e2ae5554c1f3f/IodineGBA/core/graphics/Worker.js#L98

Already got bitten by load elimination this time, rather than addition. I had to write a specific comment about it on that line, as I originally had the load as a relaxed (plain) read, since only one thread ever writes into it. There are various other futex/atomic calls that should force the producer thread's updated value out to the load that got optimized away; it's just that IonMonkey removed the load, thinking the value would never change. I wish we had lower-level Atomics to avoid this kind of thing, as I try to optimize around blanket full fences.

Basically if the statement was:
gfxLinesCPU = gfxLineCounter[0]

instead of:
gfxLinesCPU = Atomics.load(gfxLineCounter, 0)

IonMonkey never reloads the value, because it doesn't consider another thread as existing in its optimization context.
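
To make the hazard concrete, here is a minimal sketch (hypothetical loop and variable names, not the actual IodineGBA code) of what the optimizer is allowed to do with a plain read when it assumes a single-threaded world:

// The consumer loop as written, with a plain (non-atomic) read:
while ((gfxLinesCPU | 0) < (linesNeeded | 0)) {
    gfxLinesCPU = gfxLineCounter[0] | 0;    // plain read: may legally be hoisted/eliminated
}

// What the compiler can turn it into, since no other writer is assumed to exist:
var cached = gfxLineCounter[0] | 0;
while ((gfxLinesCPU | 0) < (linesNeeded | 0)) {
    gfxLinesCPU = cached;                   // never observes the producer's later stores
}

With Atomics.load the compiler is not allowed to perform that transformation.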

The reason I find this to be a gray area is that it is, in a sense, deceptive to have a programmer explicitly back a variable with a SharedArrayBuffer and then have the optimizer not take that into account. I'd test v8 in Google Chrome as well, but my code errors because of https://code.google.com/p/chromium/issues/detail?id=31666 (Edit: no longer true; there's now a fallback that hoists code from a worker to main and from the subworker to the next worker up)

@taisel taisel changed the title Load elimination via single threaded optimization Load elimination via single thread valid only optimization Feb 1, 2016
@lars-t-hansen
Collaborator

So v1 of the spec will almost certainly not have relaxed atomics; you have to use the full atomics if you want to avoid races (as in your spinloop).

Obviously on some systems there will be some performance cost to using full atomics, but it would be nice to measure what that cost is. After all, the cost of the write might not matter much, only the cost of the atomics in the read loop. If the performance overhead of using Atomics is low then it doesn't matter (in this case) if you have to use them. On x86, an Atomics.load is just a normal load instruction, so there really is no additional cost. On ARM, there's the additional cost of a DMB before and after the load, though in this case we should be able to hoist those out of the loop (IonMonkey does not yet do that).
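
As a concrete point of reference (a sketch, not code from this issue), the read loop in question would look roughly like this with full atomics; the cost of interest is the per-iteration overhead of Atomics.load versus a plain indexed read:

// Spin until the producer publishes the expected value.
// "flag" is assumed to be an Int32Array backed by a SharedArrayBuffer.
function waitForValue(flag, index, expected) {
    while ((Atomics.load(flag, index) | 0) !== (expected | 0)) {
        // On x86 each Atomics.load compiles to an ordinary load;
        // on ARM it is bracketed by DMB barriers.
    }
}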

More generally, the rule that data races cause garbage to be read and written is not likely to change. There are various reasons for that. Current hardware is single-copy atomic up to pointer size for integer data (that is, two writes that race end up writing one datum or the other, not a mixture of them), but is not single-copy atomic for floating point data, for example (ARMv8 actually made the rule even looser than ARMv7 had it). Reads and writes can hit the memory system out of order as seen from the point of view of the CPU that issued them, leading to a situation that is very similar to "garbage" being observed if there is a race. And for programs that actually use shared memory it is more important to run quickly in the absence of races, and thus to allow aggressive optimization, than to always run slowly because there might be a race.

You might be able to make an argument here about which loads may and may not be eliminated along specific control flow paths, eg, if I have code like this I could get away with loading the value once:

let x = ta[n] + ta[n]

but if I have code like this I would have to load at least once through the body:

while (true) {
  if (ta[n]) break;
}

taisel commented Feb 2, 2016

So we're going to let optimizations that are valid only in single-thread contexts apply to explicitly allocated SharedArrayBuffers? I'd understand if it were an ordinary typed array that happened to get shared, but shared typed arrays are, for the time being, something the programmer has to specifically request. This is why the issue is brought up.

taisel commented Feb 2, 2016

Also, this issue is slightly different from a prior one I brought up, in that there's already a futex and a bunch of surrounding Atomics that cause fencing, so the issue here is about the optimization passes rather than the memory model.

@jfbastien
Contributor

Not everything in a SAB is shared, only a subset is. Forbidding valid optimizations when the user has all the mechanics needed to do the right thing is undesirable, though as Lars says we should quantify the impact. At the end of the day it sounds like you want relaxed accesses; I don't think patching over the lack of relaxed is a good approach.

taisel commented Feb 2, 2016

If the optimization passes that cause this issue are to stay, then my concern is with the way the shared backing store is allocated. Using "SharedArrayBuffer" in the language would lead a programmer to assume that the optimizer knows about such a structure and takes it into account.

@lars-t-hansen
Collaborator

@taisel, of course the compiler can know that there is such a thing as a SharedArrayBuffer, and obviously we should discuss what optimizations will be allowed in the presence of shared memory, as in Issue #51 and here.

But it is incorrect to say, as you do, that these optimizations are "valid" only for unshared memory.

First, we decide whether they are going to be valid or not in the memory model that we define, it's not like this is a truth handed down from up on high. For example, at the TC39 meeting last week the committee decided that it did not want the "quantum garbage" resulting from the rematerialization optimization in the presence of races (Issue #51), so now there's work going on to see if that can be avoided (it almost certainly can, at a modest cost).

Second, hoisting is a valid optimization in all race-free programs. It's undesirable to force all programs that run in shared memory to run at the speed of a racy program whose races we are trying to "fix" (this is JF's point).

And finally, unless you're willing to insert fences everywhere in the machine code it's unlikely that even a compiler that performs no optimization will be able to generate code that runs "as you intended it" on all hardware in the presence of data races, because the hardware reorders and interleaves operations (this is why hardware is not sequentially consistent).

If you're going to argue that an optimization should be illegal then please argue what you think the meaning of specific program fragments should be, and then we can at least discuss whether there's a rule that's plausible and see what the impact of it would be. At the moment I think all you're arguing is that the optimization breaks the user's conception of how the program should behave, without taking into account that the program has data races and therefore unpredictable behavior.

taisel commented Feb 2, 2016

The particular thing is that the load in question here sits between memory fences. The value isn't being updated at all unless it is specifically loaded via an atomic load as well. This implies the optimizer doesn't care about the underlying memory model and is optimizing based on the instruction patterns visible to a single worker. This transformation can break SharedArrayBuffer accesses that reside between Atomics and rely on those surrounding Atomics for fencing. My older, resolved issue was about forward progress without fencing; this case has timely fencing via Atomics on other addressed variables, and the optimizer is looking past them.
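
Reduced to a sketch (hypothetical names; cmd and lines are Int32Arrays on separate SharedArrayBuffers), the pattern being relied on is:

// Producer (main thread):
lines[0] = linesPassed | 0;             // plain write, meant to be published ...
Atomics.store(cmd, 1, endValue | 0);    // ... by this later atomic store

// Consumer (worker):
var end = Atomics.load(cmd, 1) | 0;     // atomic load on the adjacent variable
gfxLinesCPU = lines[0] | 0;             // plain read that IonMonkey eliminated

The expectation is that the atomic store/load pair fences the plain access to lines[0]; the elimination looks right past that.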

taisel commented Feb 2, 2016

Sorry I suck at conveying the proper issue experienced.

taisel commented Feb 2, 2016

@jfbastien: I do agree that optimizations valid only in single-thread contexts should still be applied, but only if the programmer doesn't opt out; here I believe I have opted out.

Edit: and which working group should I ask about requestAnimationFrame inside web workers? The spin loop mentioned earlier uses a futexWait+futexWake combo to hitch a ride off the rAF timing of the main thread: futexWake is called on the main thread, and futexWait sleeps until triggered by it.
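
Roughly, the trick looks like this (a sketch with hypothetical names; sync is an Int32Array on a SharedArrayBuffer, using the draft futexWait/futexWake API):

// Main thread: forward each rAF tick to the worker.
function pump() {
    Atomics.store(sync, 0, 1);
    Atomics.futexWake(sync, 0, 1);      // wake the single waiting worker
    requestAnimationFrame(pump);
}
requestAnimationFrame(pump);

// Worker: block until the main thread's next rAF fires.
function waitForFrame() {
    Atomics.store(sync, 0, 0);
    Atomics.futexWait(sync, 0, 0);      // returns once woken (or if sync[0] already changed)
}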

taisel commented Feb 2, 2016

I'm sorry if my wording sounds offensive or attacking. I understand this is an experimental API still under review with browser support still behind either a nightly build or runtime flags. I want to make sure this API is polished by the time regular web devs view it. Keep up the good work.

@lars-t-hansen
Collaborator

@taisel, I can't imagine anyone is offended by what you're writing. We're just trying to work out what the most appropriate semantics are.

@lukewagner

Edit: and which working group should I ask about requestAnimationFrame inside web workers?

I'm not exactly sure, but see bug 1203382 for more details.

taisel commented Feb 3, 2016

@lukewagner does this mean that, for the time being, the closest approach is to use that god-awful futexWake+futexWait trick for vsync?

Also IIRC the moz dev wiki has a typo somewhere saying rAF exists for workers at the bottom of https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers

taisel commented Feb 3, 2016

I'm sorry for reiterating the question constantly; hopefully this time it's phrased a bit better:

@lars-t-hansen Is it or is it not allowed to do atomic operations on adjacent shared-memory variables to fence and bound a timeframe* for updates from another thread (which also does atomic operations and fencing on adjacent variables), so as to eventually update a shared-memory variable that doesn't have atomics applied to it directly? The current IonMonkey doesn't recognize the adjacent fencing as disqualifying load elimination on the not-directly-synchronized array.

The big question is: if this is by design and not a lack of optimizer awareness, are we proceeding with the penalty of forcing all cross-thread synchronization to use atomics for each individual load/store, rather than a minimal number surrounding a block, synchronizing at the block level rather than per access (sketched below)?

*Timeframe meaning the relative timing that the atomic operations provide, which can still involve unbounded delays.
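
The block-level pattern in question, as a sketch (hypothetical names; ready and data are typed arrays on shared memory):

// Writer: fill a block with plain stores, then publish it with one atomic store.
data[0] = a | 0;
data[1] = b | 0;
data[2] = c | 0;
Atomics.store(ready, 0, 1);             // single publish point for the whole block

// Reader: one atomic load on the flag, then plain loads of the block.
if ((Atomics.load(ready, 0) | 0) === 1) {
    var x = data[0] | 0, y = data[1] | 0, z = data[2] | 0;
}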

@lukewagner

@taisel I assume so. OffscreenCanvas has not fully shipped on all platforms, so hopefully we'll have rAF-in-workers by the time it is fully released, since obviously the two go hand in hand.

taisel commented Feb 3, 2016

@lukewagner Now I'm afraid of what the fallback mechanisms will look like in the near future, if vendors implement rAF in workers at different times.

  • 16 ms timer
  • Use the ugly futex hack if sharedmem support exists
  • postMessage from main thread.

taisel commented Feb 3, 2016

Not sure if discouraging the futex trick for rAF should be proposed; maybe it should... I guess that's a question for another issue on shared mem...

"allowance of signals from main thread to be passed via futexWake"

@lukewagner

(Sorry, don't mean to derail this issue.) @taisel I think OffscreenCanvas is a long way away from being in most production browsers on most platforms and it seems quite likely each of those browsers would understand that rAF-in-workers is necessary. I pinged roc and 1203382 is planned, just blocking on a dependent bug (1184283) which is in progress. So I'd just assume rAF-in-workers for production use.

@lars-t-hansen
Collaborator

@taisel, I have no idea what question you are asking. I don't know what the "ugly futex trick" is. I don't know what "adjacent" variables has to do with anything. And I sure don't know what "The current IonMonkey doesn't oracle into recognizing the adjacent fencing as de-opting the load elimination on the not-directly synchronized array" is supposed to mean.

Generally though:

The memory model of the spec defines a happens-before relationship that guarantees that when a thread R atomically reads a variable V that was atomically written by some other thread W, then all other writes performed by W before it wrote to V - whether those writes were atomic or not - are also seen by the thread R. Most of your reads and writes can be non-atomic, it's just the synchronization that must be atomic, and this is all well-defined.

In addition, futexWait acts as a "read" and futexWake acts as a "write", so anything written by a thread before it calls futexWake becomes visible to a thread that was waiting in futexWait and was woken by that futexWake.
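
As a sketch of what that means in practice (hypothetical names; sync and frameData are typed arrays on shared memory):

// Thread W:
frameData[0] = pixel | 0;               // plain write
Atomics.futexWake(sync, 0, 1);          // acts as the "write" side of the synchronization

// Thread R (assuming it was actually asleep in the wait and woken by the call above):
Atomics.futexWait(sync, 0, 0);          // acts as the "read" side of the synchronization
var seen = frameData[0] | 0;            // guaranteed to observe W's plain write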

taisel commented Feb 4, 2016

In addition, futexWait acts as a "read" and futexWake acts as a "write", so anything written by a thread before it calls futexWake becomes visible to a thread that was waiting in futexWait and was woken by that futexWake.

And that's what's not happening when it should here with IonMonkey in this extreme optimizer edge case. I was presuming this was by design, which was driving me slightly insane.

@lars-t-hansen
Collaborator

In addition, futexWait acts as a "read" and futexWake acts as a "write", so anything written by a thread before it calls futexWake becomes visible to a thread that was waiting in futexWait and was woken by that futexWake.

And that's what's not happening when it should here with IonMonkey in this extreme optimizer edge case. I was presuming this was by design, which was driving me slightly insane.

That's extremely surprising given the structure of the implementation and I don't know why that would be, but I can investigate further. Do you have a smallish test case for this? (I'm [email protected] if you want to email something or create a bugzilla bug and cc me on it.)

Note it's important for the thread that's waiting to actually go into a wait and be woken up for the causality to exist; if futexWait returns with Atomics.NOTEQUAL or Atomics.TIMEDOUT there's no causality. (In the case of Atomics.NOTEQUAL there might be a causal write->read relationship though, and that may be sufficient. I think the spec is missing prose about that, I'll file a bug.)
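
In code, that caveat shows up in futexWait's return value (a sketch, assuming the draft's Atomics.OK/NOTEQUAL/TIMEDOUT result constants and a shared Int32Array sync):

var r = Atomics.futexWait(sync, 0, 0, 16);   // wait up to ~16 ms for a wake
if (r === Atomics.OK) {
    // Actually slept and was woken: writes made before the futexWake are visible here.
} else {
    // Atomics.NOTEQUAL or Atomics.TIMEDOUT: no causality from this call by itself.
}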

taisel commented Feb 4, 2016

I have no minimal test case yet, but I was trying to figure out the pattern the optimizer was seeing in earlier comments above.

@cosinusoidally

If I understand rightly you are initialising the SharedArrayBuffers here:

https://github.com/taisel/IodineGBA/blob/0af5eaaaf08cc3a1a024974c354e2ae5554c1f3f/IodineGBA/core/graphics/RendererShim.js#L39

this.gfxCommandCounters = getSharedInt32Array(3);
this.gfxLineCounter = getSharedInt32Array(1);

getSharedInt32Array defined here:

https://github.com/taisel/IodineGBA/blob/0af5eaaaf08cc3a1a024974c354e2ae5554c1f3f/IodineGBA/includes/TypedArrayShim.js#L99

function getSharedInt32Array(size_t) {
    try {
        //Compatibility for older Firefox Nightlies:
        return new SharedInt32Array(size_t);
    }
    catch (error) {
        return new Int32Array(new SharedArrayBuffer(size_t << 2));
    }
}

So we have 2 SharedArrayBuffers (one for gfxCommandCounters and one for gfxLineCounter).

As far as I can tell you atomically set values inside gfxCommandCounters and gfxLineCounter here:

https://github.com/taisel/IodineGBA/blob/0af5eaaaf08cc3a1a024974c354e2ae5554c1f3f/IodineGBA/core/graphics/RendererShim.js#L133

Atomics.store(this.gfxCommandCounters, 1, this.end | 0);
Atomics.store(this.gfxLineCounter, 0, this.linesPassed | 0);

and you atomically load values from each here:

https://github.com/taisel/IodineGBA/blob/0af5eaaaf08cc3a1a024974c354e2ae5554c1f3f/IodineGBA/core/graphics/Worker.js#L94

 var end = Atomics.load(gfxCommandCounters, 1) | 0;  //Written by the other thread.
 gfxLinesCPU = Atomics.load(gfxLineCounter, 0) | 0;  //Keep atomic; IonMonkey thinks this is dead code if it's removed.

Since gfxLineCounter and gfxCommandCounters are different SharedArrayBuffers, surely it would be a race condition if you replaced gfxLinesCPU = Atomics.load(gfxLineCounter, 0) | 0 with gfxLinesCPU = gfxLineCounter[0]|0; (and therefore IonMonkey could choose to eliminate the load)? Wouldn't it essentially be equivalent to:

Thread A:

Atomics.store(this.gfxLineCounter, 0, this.linesPassed | 0);

Thread B:

gfxLinesCPU = gfxLineCounter[0]|0;

regardless of whatever is going on with gfxCommandCounters?

taisel commented Feb 5, 2016

The race (if using the non-atomic load instead of the atomic one) is supposed to be bounded by design, such that it's allowed to go out of synchronization with other variables but gets synchronized past a certain code block; see the surrounding futexWait and the atomic loads and stores that should act as memory fences. The lack of updates spans multiple futexes and atomic loads and stores across various code blocks, and the value lags far enough behind gfxCommandCounters (well, it never gets updated at all) to cause funky line-skip logic.

taisel commented Feb 5, 2016

The issue I see is that the optimizer doesn't see the bounding of the race, if present.

taisel commented Feb 5, 2016

Interesting, I can't reproduce the original optimizer bug anymore in the latest nightly.
Lars, I'm wondering if this was an optimizer issue in IonMonkey itself, not really related to sharedmem.

Edit: And looking at the original racy code again, the faulty behavior observed should have been impossible even with load elimination, unless of course the optimizer substituted something like MAX_INT as a default value, or the load was split into multiple loads.

taisel commented Feb 5, 2016

Closing for now, unless I can re-verify or find spookiness again.

@taisel taisel closed this as completed Feb 5, 2016
@lars-t-hansen
Collaborator

Hrm. Spooky. I'll ask around.
