\part{Mixture of Experts}
\section{Basics}
The $ \Ocal \left( D ^{ 2 } \right) $ FLOPs count due to MLP layers\footnote{The $ \Ocal \left( S
^{ 2 } \right) $ scaling of the self-attention layers is also untenable, but MoE only addresses the
MLP layers.} is untenable past a given point: inference and training just take too long. Mixture of
Experts\footnote{The original MoE research came out of Google: see
\cite{fedus2022switchtransformersscalingtrillion},
\cite{shazeer2017outrageouslylargeneuralnetworks}, and related work by these authors. An
excellent MoE paper with everything open-sourced is
\cite{muennighoff2024olmoeopenmixtureofexpertslanguage}. } (MoE) models address this concern by
splitting the single MLP layer into a number of ``expert'' MLP layers and routing each token to only
a subset of the experts. MoE is a lever for changing the relation between the per-token FLOPs count
and the overall parameter count. For example, comparing a dense and a MoE model at similar parameter
counts, each expert's intermediate dimension is reduced by a factor of $ \Ocal \left( N _{ {\rm ex} } \right) $ (the
number of experts), and the per-token FLOPs count is reduced by the same factor. Perhaps unsurprisingly,
MoE models outperform and train faster than their FLOPs-equivalent dense models (at the cost of more
engineering complexity and a higher memory burden).
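As a concrete illustration (assuming top-$ 1 $ routing and ignoring the small router weights): a
dense MLP with intermediate dimension $ F $ has $ \approx 2DF $ parameters and costs $ \approx 4DF $
FLOPs per token, while a MoE layer with $ N _{ {\rm ex} } $ experts of intermediate dimension
$ F / N _{ {\rm ex} } $ matches the $ \approx 2DF $ parameter count but costs only
$ \approx 4DF / N _{ {\rm ex} } $ FLOPs per token.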
The general form of the MoE layer output is
\begin{align}
z' _{ sd } &=G _{ se }(z _{ sd }, \ldots )E _{ esd } \left ( z _{ sd } \right ) \label{eq_general_moe}
\end{align}
where $ G _{ se }(z _{ sd }, \ldots ) \in \mathbb{R} ^{ S \times N _{ {\rm ex} } } $ is a gating (i.e.,
weighting) function and $ E _{ esd } \left ( z _{ sd } \right ) \in \mathbb{R} ^{ N _{ {\rm ex}
}\times S \times D } $ is the usual MLP operation performed by the $ e $-th expert (the repeated
expert index $ e $ is implicitly summed over). Many of the entries $ G _{ se } $ are zero in
practice, and of course only the computations $ E _{ esd } \left ( z _{ sd } \right )$ corresponding
to non-trivial gating values are performed. Different MoE variants are essentially distinguished by
the specific form of their gating function.
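To make the implicit contraction in \eqref{eq_general_moe} concrete, here is a minimal dense
reference sketch in PyTorch (the names \texttt{gate\_fn} and \texttt{experts} are illustrative
placeholders, not from any particular codebase); a practical implementation would only evaluate the
experts with non-zero gating values:
\begin{verbatim}
import torch

def moe_layer(z, gate_fn, experts):
    # z: (S, D) token activations; gate_fn returns G with shape (S, N_ex);
    # experts: list of N_ex MLP modules, each mapping (S, D) -> (S, D).
    G = gate_fn(z)                                      # (S, N_ex)
    E = torch.stack([expert(z) for expert in experts])  # (N_ex, S, D)
    return torch.einsum('se,esd->sd', G, E)             # z'_{sd} = G_{se} E_{esd}(z_{sd})
\end{verbatim}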
\section{Routing}
Choosing which experts process which tokens is crucial, affecting both downstream model quality and
engineering (i.e., throughput) performance. There are two dominant schemes:
\begin{enumerate}
\item \textbf{Token Choice}: each token selects a fixed number of experts. $ G _{ se } $
is sparse over the expert index; see \eqref{eq_general_moe}.
\item \textbf{Expert Choice}: each expert selects a fixed number of tokens. $ G _{ se } $ is
sparse over the token index; see \eqref{eq_general_moe}.
\end{enumerate}
Layered on top of this choice are the details of the routing mechanisms.
\subsection{Token Choice vs Expert Choice}
Token and expert choice both introduce a tensor $W _{ de } \in \mathbb{R} ^{ D\times N _{ {\rm ex}
} }$ which is used to produce a score between each token and expert: $ S _{ se } = z _{ sd } W _{ de
} $. In each case, we perform a \texttt{topk} computation and output a weighted sum of expert
outputs: the two methods just differ in the dimension over which the \texttt{topk} is performed.
For token choice, the gating function is:
\begin{align}
G ^{ {\rm token} }_{ se }(z _{ sd }, W) &= \Sm_{ e } \left ( \texttt{topk} _{ e } \left ( z _{ sd } \cdot W _{ de } \right ) \right ) \ , \label{eq_token_choice}
\end{align}
where this \texttt{topk} just sets all non-top-$ k $ entries to $ -\infty $. $ G _{ se } $
is sparse in its expert dimension and has $ Sk $ non-trivial elements. While every token will get
routed to $ k $ experts with token choice routing, the per-expert load can be very unbalanced. Some
token-choice implementations require setting a maximum-tokens-per-expert limit, parameterized by
the capacity factor $ c $: $ \texttt{maxtok} = c \times \frac{ S }{ N _{ {\rm ex} } } $. Tokens
exceeding this limit are just not sent through the expert MLP at all (but remain in the residual
stream, of course).
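A minimal PyTorch sketch of \eqref{eq_token_choice}, assuming no capacity limit (so no tokens are
dropped):
\begin{verbatim}
import torch
import torch.nn.functional as F

def token_choice_gate(z, W, k):
    # z: (S, D) tokens; W: (D, N_ex) router weights; k: experts per token.
    # Returns G: (S, N_ex) with exactly k non-zero entries per row (token).
    scores = z @ W                                   # S_{se} = z_{sd} W_{de}
    topk_vals, topk_idx = scores.topk(k, dim=-1)     # topk over the expert dimension
    masked = torch.full_like(scores, float('-inf'))  # non-top-k entries -> -inf
    masked.scatter_(-1, topk_idx, topk_vals)
    return F.softmax(masked, dim=-1)                 # softmax over the expert dimension
\end{verbatim}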
Expert choice just performs the \Sm and \texttt{topk} on the sequence dimension,
instead. The gating function is
\begin{align}
G ^{ {\rm expert} }_{ se }(z _{ sd }, W) &= \Sm_{ s } \left ( \texttt{topk} _{ s } \left ( z _{ sd } \cdot W _{ de } \right ) \right ) \ , \label{eq_expert_choice}
\end{align}
with \texttt{topk} acting as in the token choice case. $G _{ se } $ is sparse along the sequence
dimension and has $ N _{ {\rm ex} }k $ non-trivial elements. A (potential) disadvantage of expert
choice is that some tokens may not be routed to any expert at all, but every expert is at least
guaranteed an equal load. In this case, we effectively have $ k = c \times \frac{ S }{ N _{ {\rm
ex} } } $, with $ c $ the capacity factor above.
\section{MegaBlocks}
The MoE computation maps awkwardly to the typical GPU primitives. Ideally the expert computations in
\eqref{eq_general_moe} are parallelized as much as possible, but
\href{https://pytorch.org/docs/stable/generated/torch.bmm.html}{batched matrix multiplies} (the
closest common primitive) enforce equal token counts per expert, which is overly restrictive.
MegaBlocks \cite{gale2022megablocksefficientsparsetraining} introduces the proper sparse kernels to
handle general MoE computations without the need to enforce any hard per-expert token limits or
introduce unnecessary padding. They call their method dropless MoE (dMoE).
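The kernel details are beyond the scope of these notes, but the logical effect of dropless routing
can be sketched with a plain per-expert loop over variably sized token groups (this Python loop only
illustrates the computation; MegaBlocks realizes it with block-sparse kernels rather than a loop):
\begin{verbatim}
def dropless_moe(z, G, experts):
    # z: (S, D) tokens; G: (S, N_ex) sparse gating values; experts: list of MLP modules.
    # Every token with a non-zero gate is processed; the batch size varies per expert
    # and no padding or per-expert capacity limit is needed.
    out = torch.zeros_like(z)
    for e, expert in enumerate(experts):
        rows = G[:, e].nonzero(as_tuple=True)[0]     # tokens routed to expert e
        if rows.numel() > 0:
            out[rows] += G[rows, e].unsqueeze(-1) * expert(z[rows])
    return out
\end{verbatim}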
\section{MoE Variants}
A collection of other MoE architecture choices.
\subsection{Shared Experts}
The shared-expert variant forces one particular expert to always be used, with the motivation of
having this always-on expert serve as a common pool of knowledge while the routed experts specialize.
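A minimal sketch of this arrangement (names are illustrative; whether and how the shared expert's
output is itself weighted is an implementation choice):
\begin{verbatim}
def shared_expert_moe(z, G, routed_experts, shared_expert):
    # Every token passes through shared_expert; the routed experts are gated as usual.
    routed = torch.einsum(
        'se,esd->sd', G, torch.stack([ex(z) for ex in routed_experts]))
    return shared_expert(z) + routed
\end{verbatim}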