Implement structure Universal with an applicative exception declaration #184
Conversation
This makes it somewhat easier to identify uses of this exception variant in intermediate representations.
Allow the elaboration/implementation of exception declarations to be either generative (the default) or applicative. The default generative behavior follows the Definition of Standard ML: each dynamic evaluation of an `exception C of ty` declaration introduces a fresh exception variant with name `C` that is distinct from any previous evaluation of this `exception` declaration. The implementation of a generative exception declaration is:

* introduce a `C of unit ref * ty` variant for the `exn` datatype
* replace the `exception C of ty` declaration with `val nonce : unit ref = ref ()`
* replace any `C arg` constructor applications with `C (nonce, arg)` constructor applications
* replace any `case e of C x => exp | _ => next ()` pattern matches with `case e of C (n, x) => if nonce = n then exp else next () | _ => next ()`

Note that the freshness of the `ref () : unit ref` allocation is what ensures that each dynamic evaluation of `exception C of ty` is distinct from any previous evaluation of this `exception` declaration.

The (new) applicative behavior simply changes the implementation to use a `unit` nonce rather than a `unit ref` nonce. This avoids the allocation of a fresh `unit ref` at the `exception C of ty` declaration. Because MLton implements exceptions after monomorphisation, an applicative exception declaration essentially introduces a distinct variant for each monomorphic type at which the `exception` declaration is evaluated, allowing distinct evaluations to share the same variant when they share the same monomorphic type. Because MLton performs monomorphisation after SML type checking and elaboration, the sharing of variants is with respect to the *elaborated* types (and ignores any type distinctions that may have been present in the source code due to opaque signature constraints).
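As an illustrative sketch (not MLton's actual intermediate representation), the generative translation above can be mimicked at the source level; the names `make`, `construct`, and `match` here are hypothetical:

```sml
(* Sketch: mimicking the generative exception translation in source SML.
 * Each call to `make` plays the role of one dynamic evaluation of
 * `exception C of int`; the freshly allocated `ref ()` nonce is what
 * keeps distinct evaluations distinct. *)
datatype exn' = C of unit ref * int  (* the single shared variant *)

fun make () =
  let
    val nonce : unit ref = ref ()        (* replaces `exception C of int` *)
    fun construct arg = C (nonce, arg)   (* replaces `C arg` *)
    fun match e =                        (* replaces the pattern match *)
      case e of
        C (n, x) => if nonce = n then SOME x else NONE
  in
    (construct, match)
  end
```

Two calls to `make` yield matchers that reject each other's constructed values, because their nonces are distinct allocations; the applicative variant would collapse all evaluations at the same monomorphic type onto one shared (nonce-free) variant.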
The utility of applicative exception declarations is to slightly optimize the implementation of a universal type using exceptions (see http://mlton.org/UniversalType). In the special case that one can be sure that the use of the universal type will never `inject` at one type and then try to `project` at another type that would be considered distinct in the source code (due to opaque signature constraints) but has the same *elaborated* type, then implementing the universal type with an applicative exception declaration can remove the overhead of allocating the `unit ref`.

Normally, universal types are used sparingly in idiomatic Standard ML code and rarely occur on a hot/fast/critical path. An exception to this is in the implementation of parallelism, such as MaPLe's `pcall`. Consider `pcallFork : (unit -> 'a) * (unit -> 'b) -> 'a * 'b`. Simplifying somewhat, if the second thunk is stolen, then a `'b option ref` must be allocated to communicate the result of the stolen work to the main computation. A simple implementation would be:

```sml
fun pcallFork (f, g) =
  let
    val gres = ref NONE
    fun seq fres = (fres, g ())
    fun par fres = (fres, get gres)
    fun spwn () = (put (gres, g ()) ; exit ())
  in
    pcall (f, seq, par, spwn)
  end
```

where `put` and `get` treat an `'a option ref` as an Id-style I-structure. The disadvantage of this implementation is that it incurs a `ref NONE : 'b option ref` allocation for *every* `pcallFork`, although most evaluations of `pcallFork` will not have the second thunk stolen. To avoid this overhead, we'd like to move the `ref NONE` allocation to the slow path, occurring only when the second thunk is stolen.
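The `put` and `get` operations are not shown in the thread; a minimal single-assignment sketch (ignoring the synchronization a real runtime would need) might look like:

```sml
(* Sketch of Id-style I-structure operations on an 'a option ref.
 * A real implementation must synchronize: `get` would block (or spin)
 * until `put` has filled the cell, and `put` would use an atomic
 * update. This sketch captures only the single-assignment discipline. *)
fun put (r : 'a option ref, x : 'a) : unit =
  case !r of
    NONE => r := SOME x
  | SOME _ => raise Fail "I-structure written twice"

fun get (r : 'a option ref) : 'a =
  case !r of
    SOME x => x
  | NONE => raise Fail "I-structure read before write"
```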
It is for this reason that the lower-level `pcall` operation allows the slow-path stealing code to pass (a pointer to) some data back to the `par` and `spwn` continuations; in particular, we can have the stealing code allocate the `ref NONE`:

```sml
fun pcallFork (f, g) =
  let
    fun seq fres = (fres, g ())
    fun par (fres, gres) = (fres, get gres)
    fun spwn gres = (put (gres, g ()) ; exit ())
  in
    pcall (f, seq, par, spwn)
  end
```

However, the stealing code is *generic* and partially implemented in SML (and, therefore, must integrate with the SML type system). In particular, it only has access to an opaque `Thread.t` representing the interrupted thread that has a `pcall` to steal and has no obvious means of obtaining the type that the to-be-stolen thunk will return in order to properly allocate a `ref NONE : 'b option ref`. One expedient approach is to use a universal type. Now, the stealing code can allocate a `ref NONE : Universal.t option ref` and the `pcallFork` can `inject` to / `project` from the universal type:

```sml
fun pcallFork (f, g) =
  let
    val (inject, project) = Universal.embed ()
    fun seq fres = (fres, g ())
    fun par (fres, gres) = (fres, valOf (project (get gres)))
    fun spwn gres = (put (gres, inject (g ())) ; exit ())
  in
    pcall (f, seq, par, spwn)
  end
```

Unfortunately, when `structure Universal` is implemented with generative exceptions, this reintroduces a `val nonce : unit ref = ref ()` allocation for *every* `pcallFork`. (Arguably, a `unit ref` is "cheaper" than a `'b ref`, since it is expected that a meaningful `pcallFork` will have a second thunk that returns a non-`unit` result. A `unit ref` can be allocated with only a header (and no object data), while a `'b ref` (with `'b` not `unit`) will be allocated with a header and at least 8 bytes of object data (typically, an object pointer).) However, when `structure Universal` is implemented with applicative exceptions, there is only a `val nonce : unit = ()` (which will be optimized away).
Note that a distinct `gres` is created for each stolen `pcall` and is properly passed exactly to the `par` and `spwn` continuations of the stolen `pcall`, so there is no possibility of conflating the `Universal.t` values from one `pcall` with another, and the monomorphic behavior of applicative exceptions is acceptable.
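For reference, the exception-based universal type from http://mlton.org/UniversalType is roughly the following; `embed` declares a fresh exception per call, and this PR changes only the nonce that declaration compiles to:

```sml
(* The classic exception-based universal type (after
 * http://mlton.org/UniversalType). Each call to `embed` evaluates a
 * local `exception E of 'a` declaration; with generative exceptions
 * that costs a `ref ()` nonce, with applicative exceptions it does not. *)
structure Universal :>
  sig
    type t
    val embed : unit -> ('a -> t) * (t -> 'a option)
  end =
  struct
    type t = exn
    fun 'a embed () =
      let
        exception E of 'a
        fun project (e : t) : 'a option =
          case e of E a => SOME a | _ => NONE
      in
        (E, project)
      end
  end
```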
Beautiful! I'll look into testing performance ASAP.
@MatthewFluet there are a couple of missing files:
(force-pushed from 76a9158 to b6c8ef6)
@shwestrick Oops, forgot to add the new files.
A little testing this morning. The largest performance improvement I've found so far is 5-10% on fully parallel fib. On other benchmarks, I'm seeing 0-5% improvement. So, the impact seems fairly small, but measurable.

I looked at the generated code for fib, and confirmed that this change indeed removes a heap allocation at each `pcallFork`. Looking closely at the generated code, I noticed that there are (two?) other heap allocations along the fast path that I didn't expect. I've done some digging, and these appear to be related to the eager fork path of the token management algorithm... my best guess is that the compiler is deciding to heap-allocate a couple of closures at each `pcallFork`.

Fully par fib(40) timings:
By the way, one impact of eliminating heap allocations is that it reduces the number of LGC garbage collections. On fully par fib(40), I'm getting approximately half as many LGCs (down to 7500 LGCs on a single core on average, instead of 13000). It's possible that the reduction in number of LGCs accounts for the bulk of the performance improvement, because the cost of a heap allocation itself is incredibly small. Anyway, a quick summary of my thoughts:
That's a very good point. And it would make a big difference on
I looked at the SSA code and confirmed the same. I didn't look as deeply as you; those other heap allocations would be worth investigating.
That's pretty impressive!
Agreed!
Yeah. I'll go ahead and merge this PR, and then we can investigate these other overheads separately!
@shwestrick @colin-mcd @mikerainey

A simple approach to avoiding the `unit ref` allocation with `structure Universal`. Read a4eb201's commit message for the interesting details.

It will be interesting to see if this has any impact on performance. A `unit ref` allocation ought to be just two instructions: `*frontier = header(unit ref); frontier += 8;`. And, I guess there would be a third instruction to write the object pointer to the freshly allocated `unit ref` to the stack, in order to be live across the `pcall` and accessible in the `par` and `spwn` continuations.

If it doesn't have any impact on performance, then I wonder whether avoiding the `Universal.t` entirely and just doing the `val rightSideResult = 'b option ref` allocation in `pcallFork` is really as bad as feared. (We would still allocate the rest of the join point in the signal handler, passed to the `par` and `spwn` continuations via the data pointer; then we can just pass the `rightSideResult` along with the `jp` fetched via `getData` to the real synchronization operation. Or, maybe it is simpler for the signal handler to allocate a `pre_joinpoint` (monomorphic, with no `rightSideResult` field), which is propagated to the `par` and `spwn` continuations, who combine the `pre_joinpoint` with the `rightSideResult` to yield a `'b joinpoint`.)
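A hedged sketch of the `pre_joinpoint` idea from the parenthetical above; the record fields and the `combine` helper are hypothetical names, not part of any existing MaPLe API:

```sml
(* Hypothetical sketch of splitting the join point: the signal handler
 * allocates a monomorphic pre_joinpoint (so it needs no knowledge of 'b),
 * and the par/spwn continuations, where 'b is statically known, pair it
 * with the 'b option ref to form the full polymorphic join point. *)
type pre_joinpoint =
  {leftDone : bool ref}  (* stand-in for the real bookkeeping fields *)

type 'b joinpoint =
  {pre : pre_joinpoint, rightSideResult : 'b option ref}

(* Performed in the par/spwn continuations after fetching the
 * pre_joinpoint via the pcall data pointer. *)
fun combine (pre : pre_joinpoint, rightSideResult : 'b option ref)
    : 'b joinpoint =
  {pre = pre, rightSideResult = rightSideResult}
```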