
Implement structure Universal with an applicative exception declaration #184

Merged
MatthewFluet merged 3 commits into MPLLang:main from applicative-universal on Jun 1, 2024

Conversation

MatthewFluet
Collaborator

@MatthewFluet MatthewFluet commented May 30, 2024

@shwestrick @colin-mcd @mikerainey

A simple approach to avoiding the `unit ref` allocation with `structure Universal`. Read a4eb201's commit message for the interesting details.

It will be interesting to see if this has any impact on performance. A `unit ref` allocation ought to be just two instructions: `*frontier = header(unit ref); frontier += 8;`. And, I guess there would be a third instruction to write the object pointer to the freshly allocated `unit ref` to the stack, in order to be live across the `pcall` and accessible in the `par` and `spwn` continuations.

If it doesn't have any impact on performance, then I wonder whether avoiding the `Universal.t` entirely and simply performing the `val rightSideResult : 'b option ref` allocation in `pcallFork` is really as bad as feared. (We would still allocate the rest of the join point in the signal handler and pass it to the `par` and `spwn` continuations via the data pointer; then we can just pass the `rightSideResult`, along with the jp fetched via `getData`, to the real synchronization operation. Or, maybe it is simpler for the signal handler to allocate a `pre_joinpoint` (monomorphic, with no `rightSideResult` field), which is propagated to the `par` and `spwn` continuations, which combine the `pre_joinpoint` with the `rightSideResult` to yield a `'b joinpoint`.)

This makes it somewhat easier to identify uses of this exception
variant in intermediate representations.

Allow the elaboration/implementation of exception declarations to be
either generative (default) or applicative.

The default generative behavior is according to the Definition of
Standard ML, where each dynamic evaluation of an `exception C of ty`
introduces a fresh exception variant with name `C` but distinct from
any previous evaluations of this `exception` declaration.  The
implementation of a generative exception declaration is:

 * introduce a `C of unit ref * ty` variant for the `exn` datatype
 * replace the `exception C of ty` declaration with
   `val nonce : unit ref = ref ()`
 * replace any `C arg` constructor applications with
   `C (nonce, arg)` constructor applications
 * replace any `case e of C x => exp | _ => next ()` pattern matches
   with
   `case e of C (n, x) => if nonce = n then exp else next () | _ => next ()`

Note that the freshness of the `ref () : unit ref` allocation is what
ensures that each dynamic evaluation of `exception C of ty` is
distinct from any previous evaluations of this `exception`
declaration.
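
For instance (a hypothetical illustration, not code from this patch),
generativity is observable when an `exception` declaration sits inside
a function, so that each call evaluates the declaration afresh:

    fun mkEmbed () =
       let
          exception E of int
       in
          (fn x => E x,
           fn E x => SOME x | _ => NONE)
       end

    val (inj1, prj1) = mkEmbed ()
    val (inj2, prj2) = mkEmbed ()
    val r1 = prj1 (inj1 5)   (* SOME 5: same evaluation, nonces agree *)
    val r2 = prj1 (inj2 5)   (* NONE: distinct evaluations, nonces differ *)

Under the implementation above, `prj1 (inj2 5)` yields `NONE`
precisely because the two evaluations allocated different `unit ref`
nonces.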

The (new) applicative behavior simply changes the implementation to
use a `unit` nonce rather than a `unit ref` nonce.  This avoids the
allocation of a fresh `unit ref` at the `exception C of ty`
declaration.  Because MLton implements exceptions after
monomorphisation, this means that an applicative exception declaration
essentially introduces a distinct variant for each monomorphic type at
which the `exception` declaration is evaluated, allowing distinct
evaluations to share the same variant when they share the same
monomorphic type.  Because MLton implements monomorphisation after SML
type checking and elaboration, the sharing of variants is with respect
to the *elaborated* types (and ignores any type distinctions that may
have been present in the source code due to opaque signature
constraints).
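
To make the caveat concrete, here is a hypothetical example (the
names `Id` and `Age` are invented for illustration): both opaque types
elaborate to `int`, so after monomorphisation two applicative
embeddings would share a single variant.

    structure Id :> sig type t val mk : int -> t end =
       struct type t = int fun mk x = x end
    structure Age :> sig type t val mk : int -> t end =
       struct type t = int fun mk x = x end

    val (injId, prjId) = Universal.embed ()    (* used at Id.t *)
    val (injAge, prjAge) = Universal.embed ()  (* used at Age.t *)

    val u : Universal.t = injId (Id.mk 42)
    val a : Age.t option = prjAge u
    (* With generative exceptions, a = NONE.  With applicative
       exceptions, Id.t and Age.t share the elaborated type int, so
       the two embeddings share a variant and the projection
       succeeds. *)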

The utility of applicative exception declarations is to slightly
optimize the implementation of a universal type using exceptions (see
http://mlton.org/UniversalType).  In the special case that one can be
sure that the use of the universal type will never `inject` at one
type and then try to `project` at another type that would be
considered distinct in the source code (due to opaque signature
constraints) but has the same *elaborated* type, then implementing the
universal type with an applicative exception declaration can remove
the overhead of allocating the `unit ref`.
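
For reference, the exceptions-based universal type from
http://mlton.org/UniversalType looks roughly like this; the
`exception E of 'a` declaration inside `embed` is exactly the
declaration whose nonce allocation is at stake:

    signature UNIVERSAL =
       sig
          type t
          val embed : unit -> ('a -> t) * (t -> 'a option)
       end

    structure Universal :> UNIVERSAL =
       struct
          type t = exn
          fun 'a embed () =
             let
                exception E of 'a
                fun project (e : t) : 'a option =
                   case e of E a => SOME a | _ => NONE
             in
                (fn a => E a, project)
             end
       end

With a generative `exception E of 'a`, every call to `embed` allocates
a fresh `unit ref` nonce; the applicative variant elides it.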

Normally, universal types are used sparingly in idiomatic Standard ML
code and rarely occur on a hot/fast/critical path.  An exception to
this is in the implementation of parallelism, such as MaPLe's `pcall`.

Consider `pcallFork : (unit -> 'a) * (unit -> 'b) -> 'a * 'b`.
Simplifying somewhat, if the second thunk is stolen, then a
`'b option ref` must be allocated to communicate the result
of the stolen work to the main computation.  A simple implementation
would be:

    fun pcallFork (f, g) =
      let
         val gres = ref NONE
         fun seq fres = (fres, g ())
         fun par fres = (fres, get gres)
         fun spwn () = (put (gres, g ()) ; exit ())
      in
         pcall (f, seq, par, spwn)
      end

where `put` and `get` treat an `'a option ref` as an Id-style
I-structure.  The disadvantage of this implementation is that it
incurs a `ref NONE : 'b option ref` allocation for *every*
`pcallFork`, although most evaluations of `pcallFork` will not have
the second thunk stolen.  To avoid this overhead, we'd like to move
the `ref NONE` allocation to the slow path, occurring only when the
second thunk is stolen.  It is for this reason that the lower-level
`pcall` operation allows the slow-path stealing code to pass (a
pointer to) some data back to the `par` and `spwn` continuations; in
particular, we can have the stealing code allocate the `ref NONE`:

    fun pcallFork (f, g) =
      let
         fun seq fres = (fres, g ())
         fun par (fres, gres) = (fres, get gres)
         fun spwn gres = (put (gres, g ()) ; exit ())
      in
         pcall (f, seq, par, spwn)
      end
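
For concreteness, here is a hypothetical sketch of `put` and `get` as
a write-once I-structure cell; the real implementation would need an
atomic update and a parking mechanism rather than a spin loop:

    fun put (r : 'a option ref, v : 'a) : unit =
       case !r of
          NONE => r := SOME v
        | SOME _ => raise Fail "I-structure written twice"

    fun get (r : 'a option ref) : 'a =
       case !r of
          SOME v => v
        | NONE => get r   (* spin until the writer arrives *)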

However, the stealing code is *generic* and partially implemented in
SML (and, therefore, must integrate with the SML type system).  In
particular, it only has access to an opaque `Thread.t` representing
the interrupted thread that has a `pcall` to steal and has no obvious
means of obtaining the type that the to-be-stolen thunk will return in
order to properly allocate a `ref NONE : 'b option ref`.

One expedient approach is to use a universal type.  Now, the stealing
code can allocate a `ref NONE : Universal.t option ref` and the
`pcallFork` can `inject` to / `project` from the universal type:

    fun pcallFork (f, g) =
      let
         val (inject, project) = Universal.embed ()
         fun seq fres = (fres, g ())
         fun par (fres, gres) = (fres, valOf (project (get gres)))
         fun spwn gres = (put (gres, inject (g ())) ; exit ())
      in
         pcall (f, seq, par, spwn)
      end

Unfortunately, when `structure Universal` is implemented with
generative exceptions, this reintroduces a
`val nonce : unit ref = ref ()` allocation for *every* `pcallFork`.
(Arguably, a `unit ref` is "cheaper" than a `'b ref`, since it is
expected that a meaningful `pcallFork` will have a second thunk that
returns a non-`unit` result.  A `unit ref` can be allocated with only
a header (and no object data), while a `'b ref` (with `'b` not `unit`)
will be allocated with a header and at least 8 bytes of object
data (typically, an object pointer).)

However, when `structure Universal` is implemented with applicative
exceptions, then there is only a `val nonce : unit = ()`
(which will be optimized away).  Note that a distinct `gres` is
created for each stolen `pcall` and is properly passed exactly to the
`par` and `spwn` continuations of the stolen `pcall`, so there is no
possibility of conflating the `Universal.t` values from one `pcall`
with another and the monomorphic behavior of applicative exceptions is
acceptable.
@shwestrick
Collaborator

Beautiful! I'll look into testing performance ASAP.

@shwestrick
Collaborator

@MatthewFluet there are a couple of missing files:

mlton/atoms/exn-dec-elab.fun
mlton/atoms/exn-dec-elab.sig

@MatthewFluet MatthewFluet force-pushed the applicative-universal branch from 76a9158 to b6c8ef6 Compare May 30, 2024 23:18
@MatthewFluet
Collaborator Author

@shwestrick Oops, forgot to add the new files. force-pushed with the correct files.

@shwestrick
Collaborator

A little testing this morning. The largest performance improvement I've found so far is 5-10% on fully parallel fib. On other benchmarks, I'm seeing 0-5% improvement. So, the impact seems fairly small, but measurable.

I looked at the generated code for fib, and confirmed that this change indeed removes a heap allocation at each ForkJoin.par. So, that's good!

Looking closely at the generated code, I noticed that there are (two?) other heap allocations along the fast path that I didn't expect. I've done some digging, and these appear to be related to the eager fork path of the token management algorithm... my best guess is that the compiler is deciding to heap-allocate a couple closures at each par. If I implement fully parallel fib just in terms of pcall directly, these additional heap allocations go away, and I get a ~50% performance improvement. So, there's something funky going on here... still looking into it. 🤔

fully par fib(40) timings:

| procs | MPL-v05 | This PR | MPL-v05 + pcall directly | This PR + pcall directly |
|------:|--------:|--------:|-------------------------:|-------------------------:|
|     1 |    5.44 |    5.23 |                     3.79 |                     3.57 |
|    72 |  0.1057 |  0.0982 |                   0.0732 |                   0.0676 |

By the way, one impact of eliminating heap allocations is that it reduces the number of LGC garbage collections. On fully par fib(40), I'm getting approximately half as many LGCs (down to 7500 LGCs on a single core on average, instead of 13000). It's possible that the reduction in number of LGCs accounts for the bulk of the performance improvement, because the cost of a heap allocation itself is incredibly small.

Anyway, a quick summary of my thoughts:

  • This PR is clearly good for performance, with a small but measurable performance improvement due to eliminating one heap allocation along the fast path.
  • There are other overheads along the fast path that still need a closer look.

@MatthewFluet
Collaborator Author

> By the way, one impact of eliminating heap allocations is that it reduces the number of LGC garbage collections. On fully par fib(40), I'm getting approximately half as many LGCs (down to 7500 LGCs on a single core on average, instead of 13000). It's possible that the reduction in number of LGCs accounts for the bulk of the performance improvement, because the cost of a heap allocation itself is incredibly small.

That's a very good point. And it would make a big difference on fib, where there ought to be no allocation from the fib computation itself.

> Looking closely at the generated code, I noticed that there are (two?) other heap allocations along the fast path that I didn't expect. I've done some digging, and these appear to be related to the eager fork path of the token management algorithm... my best guess is that the compiler is deciding to heap-allocate a couple closures at each par.

I looked at the SSA code and confirmed that the `Ref_ref[unit]` was removed. But, I noticed that it was coming immediately after a `GC_currentSpareHeartbeats` C call; in terms of raw instructions, I'm sure that the C call (as trivial as it may be) costs more than the `Ref_ref[unit]` allocation.

I didn't look as deeply as you; those other heap allocations would be worth investigating.

> If I implement fully parallel fib just in terms of pcall directly, these additional heap allocations go away, and I get a ~50% performance improvement. So, there's something funky going on here... still looking into it. 🤔

That's pretty impressive!

> • This PR is clearly good for performance, with a small but measurable performance improvement due to eliminating one heap allocation along the fast path.
> • There are other overheads along the fast path that still need a closer look.

Agreed!

@shwestrick
Collaborator

> I noticed that it was coming immediately after a GC_currentSpareHeartbeats C call; in terms of raw instructions, I'm sure that the C call (as trivial as it may be) costs more than the Ref_ref[unit] allocation.

Yeah, for the overhead of currentSpareHeartbeats, I was thinking it might be worth turning it into a _prim. We could then implement it by either (a) reading directly from the GCState, or (b) caching it, similar to StackTop, Frontier, etc.

I'll go ahead and merge this PR and then we can investigate these other overheads separately!

@shwestrick shwestrick merged commit ff9b96d into MPLLang:main Jun 1, 2024
@MatthewFluet MatthewFluet deleted the applicative-universal branch June 2, 2024 19:05