Implement structure Universal with an applicative exception declaration #184
Conversation
This makes it somewhat easier to identify uses of this exception variant in intermediate representations.
Allow the elaboration/implementation of exception declarations to be either generative (the default) or applicative. The default generative behavior follows the Definition of Standard ML: each dynamic evaluation of an `exception C of ty` declaration introduces a fresh exception variant with name `C` that is distinct from any previous evaluation of this `exception` declaration. The implementation of a generative exception declaration is:

* introduce a `C of unit ref * ty` variant for the `exn` datatype
* replace the `exception C of ty` declaration with `val nonce : unit ref = ref ()`
* replace any `C arg` constructor applications with `C (nonce, arg)` constructor applications
* replace any `case e of C x => exp | _ => next ()` pattern matches with `case e of C (n, x) => if nonce = n then exp else next () | _ => next ()`

Note that the freshness of the `ref () : unit ref` allocation is what ensures that each dynamic evaluation of `exception C of ty` is distinct from any previous evaluation of this `exception` declaration.

The (new) applicative behavior simply changes the implementation to use a `unit` nonce rather than a `unit ref` nonce. This avoids the allocation of a fresh `unit ref` at the `exception C of ty` declaration. Because MLton implements exceptions after monomorphisation, an applicative exception declaration essentially introduces a distinct variant for each monomorphic type at which the `exception` declaration is evaluated, allowing distinct evaluations to share the same variant when they share the same monomorphic type. Because MLton performs monomorphisation after SML type checking and elaboration, the sharing of variants is with respect to the *elaborated* types (and ignores any type distinctions that may have been present in the source code due to opaque signature constraints).
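As an illustrative sketch (not MLton's actual intermediate representation), the generative translation above can be mimicked at the source level; the names `make`, `construct`, and `match` here are hypothetical:

```sml
(* Sketch: mimicking the generative exception translation in source SML.
 * Each call to `make` plays the role of one dynamic evaluation of
 * `exception C of int`; the freshly allocated `ref ()` nonce is what
 * keeps distinct evaluations distinct. *)
datatype exn' = C of unit ref * int  (* the single shared variant *)

fun make () =
  let
    val nonce : unit ref = ref ()        (* replaces `exception C of int` *)
    fun construct arg = C (nonce, arg)   (* replaces `C arg` *)
    fun match e =                        (* replaces the pattern match *)
      case e of
        C (n, x) => if nonce = n then SOME x else NONE
  in
    (construct, match)
  end
```

Two calls to `make` yield matchers that reject each other's constructed values, because their nonces are distinct allocations; the applicative variant would collapse all evaluations at the same monomorphic type onto one shared (nonce-free) variant.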
The utility of applicative exception declarations is to slightly optimize the implementation of a universal type using exceptions (see http://mlton.org/UniversalType). In the special case that one can be sure that the use of the universal type will never `inject` at one type and then try to `project` at another type that would be considered distinct in the source code (due to opaque signature constraints) but has the same *elaborated* type, then implementing the universal type with an applicative exception declaration can remove the overhead of allocating the `unit ref`.

Normally, universal types are used sparingly in idiomatic Standard ML code and rarely occur on a hot/fast/critical path. An exception to this is in the implementation of parallelism, such as MaPLe's `pcall`. Consider `pcallFork : (unit -> 'a) * (unit -> 'b) -> 'a * 'b`. Simplifying somewhat, if the second thunk is stolen, then a `'b option ref` must be allocated to communicate the result of the stolen work to the main computation. A simple implementation would be:

```sml
fun pcallFork (f, g) =
  let
    val gres = ref NONE
    fun seq fres = (fres, g ())
    fun par fres = (fres, get gres)
    fun spwn () = (put (gres, g ()) ; exit ())
  in
    pcall (f, seq, par, spwn)
  end
```

where `put` and `get` treat an `'a option ref` as an Id-style I-structure. The disadvantage of this implementation is that it incurs a `ref NONE : 'b option ref` allocation for *every* `pcallFork`, although most evaluations of `pcallFork` will not have the second thunk stolen. To avoid this overhead, we'd like to move the `ref NONE` allocation to the slow path, occurring only when the second thunk is stolen.
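The `put` and `get` operations are not shown in the thread; a minimal single-assignment sketch (ignoring the synchronization a real runtime would need) might look like:

```sml
(* Sketch of Id-style I-structure operations on an 'a option ref.
 * A real implementation must synchronize: `get` would block (or spin)
 * until `put` has filled the cell, and `put` would use an atomic
 * update. This sketch captures only the single-assignment discipline. *)
fun put (r : 'a option ref, x : 'a) : unit =
  case !r of
    NONE => r := SOME x
  | SOME _ => raise Fail "I-structure written twice"

fun get (r : 'a option ref) : 'a =
  case !r of
    SOME x => x
  | NONE => raise Fail "I-structure read before write"
```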
It is for this reason that the lower-level `pcall` operation allows the slow-path stealing code to pass (a pointer to) some data back to the `par` and `spwn` continuations; in particular, we can have the stealing code allocate the `ref NONE`:

```sml
fun pcallFork (f, g) =
  let
    fun seq fres = (fres, g ())
    fun par (fres, gres) = (fres, get gres)
    fun spwn gres = (put (gres, g ()) ; exit ())
  in
    pcall (f, seq, par, spwn)
  end
```

However, the stealing code is *generic* and partially implemented in SML (and, therefore, must integrate with the SML type system). In particular, it only has access to an opaque `Thread.t` representing the interrupted thread that has a `pcall` to steal and has no obvious means of obtaining the type that the to-be-stolen thunk will return in order to properly allocate a `ref NONE : 'b option ref`. One expedient approach is to use a universal type. Now, the stealing code can allocate a `ref NONE : Universal.t option ref` and the `pcallFork` can `inject` to / `project` from the universal type:

```sml
fun pcallFork (f, g) =
  let
    val (inject, project) = Universal.embed ()
    fun seq fres = (fres, g ())
    fun par (fres, gres) = (fres, valOf (project (get gres)))
    fun spwn gres = (put (gres, inject (g ())) ; exit ())
  in
    pcall (f, seq, par, spwn)
  end
```

Unfortunately, when `structure Universal` is implemented with generative exceptions, this reintroduces a `val nonce : unit ref = ref ()` allocation for *every* `pcallFork`. (Arguably, a `unit ref` is "cheaper" than a `'b ref`, since it is expected that a meaningful `pcallFork` will have a second thunk that returns a non-`unit` result. A `unit ref` can be allocated with only a header (and no object data), while a `'b ref` (with `'b` not `unit`) will be allocated with a header and at least 8 bytes of object data (typically, an object pointer).) However, when `structure Universal` is implemented with applicative exceptions, there is only a `val nonce : unit = ()` (which will be optimized away).
Note that a distinct `gres` is created for each stolen `pcall` and is properly passed exactly to the `par` and `spwn` continuations of the stolen `pcall`, so there is no possibility of conflating the `Universal.t` values from one `pcall` with another, and the monomorphic behavior of applicative exceptions is acceptable.
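For reference, the exception-based universal type from http://mlton.org/UniversalType is roughly the following; `embed` declares a fresh exception per call, and this PR changes only the nonce that declaration compiles to:

```sml
(* The classic exception-based universal type (after
 * http://mlton.org/UniversalType). Each call to `embed` evaluates a
 * local `exception E of 'a` declaration; with generative exceptions
 * that costs a `ref ()` nonce, with applicative exceptions it does not. *)
structure Universal :>
  sig
    type t
    val embed : unit -> ('a -> t) * (t -> 'a option)
  end =
  struct
    type t = exn
    fun 'a embed () =
      let
        exception E of 'a
        fun project (e : t) : 'a option =
          case e of E a => SOME a | _ => NONE
      in
        (E, project)
      end
  end
```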
Beautiful! I'll look into testing performance ASAP.
@MatthewFluet there are a couple of missing files:
(force-pushed from 76a9158 to b6c8ef6)
@shwestrick Oops, forgot to add the new files.
A little testing this morning. The largest performance improvement I've found so far is 5-10% on fully parallel fib. On other benchmarks, I'm seeing 0-5% improvement. So, the impact seems fairly small, but measurable.

I looked at the generated code for fib, and confirmed that this change indeed removes a heap allocation at each `pcallFork`. Looking closely at the generated code, I noticed that there are (two?) other heap allocations along the fast path that I didn't expect. I've done some digging, and these appear to be related to the eager fork path of the token management algorithm... my best guess is that the compiler is deciding to heap-allocate a couple of closures at each `pcallFork`.

Fully par fib(40) timings:
By the way, one impact of eliminating heap allocations is that it reduces the number of LGC garbage collections. On fully par fib(40), I'm getting approximately half as many LGCs (down to 7500 LGCs on a single core on average, instead of 13000). It's possible that the reduction in number of LGCs accounts for the bulk of the performance improvement, because the cost of a heap allocation itself is incredibly small. Anyway, a quick summary of my thoughts:
That's a very good point. And it would make a big difference on
I looked at the SSA code and confirmed the same. I didn't look as deeply as you; those other heap allocations would be worth investigating.
That's pretty impressive!
Agreed!
Yeah. I'll go ahead and merge this PR, and then we can investigate these other overheads separately!
@shwestrick @colin-mcd @mikerainey

A simple approach to avoiding the `unit ref` allocation with `structure Universal`. Read a4eb201's commit message for the interesting details.

It will be interesting to see if this has any impact on performance. A `unit ref` allocation ought to be just two instructions: `*frontier = header(unit ref); frontier += 8;`. And, I guess there would be a third instruction to write the object pointer to the freshly allocated `unit ref` to the stack, in order to be live across the `pcall` and accessible in the `par` and `spwn` continuations.

If it doesn't have any impact on performance, then I wonder whether avoiding the `Universal.t` entirely and just doing the `val rightSideResult = 'b option ref` allocation in `pcallFork` is really as bad as feared. (We would still allocate the rest of the join point in the signal handler, passed to the `par` and `spwn` continuations via the data pointer; then we can just pass the `rightSideResult` along with the `jp` fetched via `getData` to the real synchronization operation. Or, maybe it is simpler for the signal handler to allocate a `pre_joinpoint` (monomorphic, with no `rightSideResult` field), which is propagated to the `par` and `spwn` continuations, who combine the `pre_joinpoint` with the `rightSideResult` to yield a `'b joinpoint`.)
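A hedged sketch of the `pre_joinpoint` idea from the parenthetical above; the record fields and the `combine` helper are hypothetical names, not part of any existing MaPLe API:

```sml
(* Hypothetical sketch of splitting the join point: the signal handler
 * allocates a monomorphic pre_joinpoint (so it needs no knowledge of 'b),
 * and the par/spwn continuations, where 'b is statically known, pair it
 * with the 'b option ref to form the full polymorphic join point. *)
type pre_joinpoint =
  {leftDone : bool ref}  (* stand-in for the real bookkeeping fields *)

type 'b joinpoint =
  {pre : pre_joinpoint, rightSideResult : 'b option ref}

(* Performed in the par/spwn continuations after fetching the
 * pre_joinpoint via the pcall data pointer. *)
fun combine (pre : pre_joinpoint, rightSideResult : 'b option ref)
    : 'b joinpoint =
  {pre = pre, rightSideResult = rightSideResult}
```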