Description
Motivation.
The primary goal of this refactoring is to make Mixture of Experts (MoE) models fully compatible with the ACL Graph execution mode on Ascend NPUs. MoE models often require different communication strategies depending on the context of the execution step (e.g., a token-intensive prefill phase versus a latency-sensitive decode phase). However, not all communication primitives are capturable by a static graph. For instance, communication like AlltoAll might be efficient but incompatible with ACL Graph (due to D2H operations), whereas a simpler AllGather can be graphed but may be less performant.
The previous MoE implementation was monolithic, tightly coupling the computation with a single communication method. This made it impossible to switch strategies at runtime or to isolate graph-compatible components from dynamic ones. Consequently, MoE layers could not be effectively accelerated with ACL Graph, creating a significant performance bottleneck.
This RFC introduces a comprehensive refactoring to address this. It decouples communication from computation using a strategy pattern, enabling dynamic selection of the most appropriate communication method for each forward pass. This allows us to use graph-captured kernels for compatible strategies (like AllGather) while falling back to eager execution for more complex, non-graphable strategies (like AlltoAll), achieving both correctness and optimal performance across all scenarios.
Proposed Change.
The refactoring introduces a flexible, strategy-based architecture for MoE communication and integrates it into the vLLM-Ascend execution flow.
- **Strategy Pattern for MoE Communication (`MoECommMethod`):**
  - An abstract base class, `MoECommMethod`, was introduced to define a common interface for all MoE communication strategies.
  - This interface standardizes the communication flow into three phases: `prepare()` (pre-computation data permutation/communication), `permute()`/`unpermute()` (core data shuffling logic), and `finalize()` (post-computation data aggregation/reduction).
  - Concrete implementations for different strategies are provided:
    - `AllGatherCommImpl`: a baseline, graph-compatible strategy.
    - `MC2CommImpl`: a high-performance strategy optimized for scenarios with low token counts (activated when `num_input_tokens < tp_size * 512`).
    - `DummyCommImpl`: a no-op implementation for single-device or graph-compilation warm-up runs.
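For illustration, the three-phase interface might be sketched as follows. The method signatures and argument names here are assumptions for readability, not the actual vllm-ascend API:

```python
from abc import ABC, abstractmethod
from typing import Any, Tuple


class MoECommMethod(ABC):
    """Sketch of the common strategy interface (signatures are illustrative)."""

    @abstractmethod
    def prepare(self, hidden_states: Any, router_logits: Any) -> Tuple[Any, Any]:
        """Pre-computation data permutation/communication."""

    @abstractmethod
    def permute(self, hidden_states: Any, topk_ids: Any) -> Any:
        """Shuffle tokens so each expert receives its assigned rows."""

    @abstractmethod
    def unpermute(self, expert_output: Any, topk_ids: Any) -> Any:
        """Restore the original token order after expert computation."""

    @abstractmethod
    def finalize(self, hidden_states: Any) -> Any:
        """Post-computation data aggregation/reduction."""


class DummyCommImpl(MoECommMethod):
    """No-op strategy, e.g. for single-device or graph warm-up runs."""

    def prepare(self, hidden_states, router_logits):
        return hidden_states, router_logits

    def permute(self, hidden_states, topk_ids):
        return hidden_states

    def unpermute(self, expert_output, topk_ids):
        return expert_output

    def finalize(self, hidden_states):
        return hidden_states
```

Because every strategy implements the same four hooks, the surrounding MoE layer can stay identical regardless of which communication pattern is active.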
- **Dynamic Strategy Dispatch Mechanism:**
  - A new `AscendFusedMoE` class replaces the previous MoE implementation and is registered as an OOT custom operator.
  - Upon initialization, `AscendFusedMoE` pre-instantiates all available communication strategy objects (e.g., `self.allgathercommimpl`, `self.mc2commimpl`).
  - The `ModelRunner` is now responsible for selecting the appropriate strategy for each step. It determines which method to use based on runtime conditions (e.g., using `"mc2"` if the token count is below the threshold, otherwise falling back to `"allgather"`).
  - The name of the chosen strategy (as a string) is passed to the forward pass via `ascend_forward_context`.
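A minimal sketch of the selection rule described above. The function name and signature are hypothetical; only the `num_input_tokens < tp_size * 512` threshold comes from this RFC:

```python
def select_moe_comm_method(num_input_tokens: int, tp_size: int) -> str:
    """Pick a MoE communication strategy name for the current step
    (hypothetical helper mirroring the ModelRunner's dispatch logic)."""
    # MC2 is profitable only below this per-step token budget; above it,
    # fall back to the graph-compatible all-gather baseline.
    if num_input_tokens < tp_size * 512:
        return "mc2"
    return "allgather"
```

The returned string would then be stashed in `ascend_forward_context`, so the layer itself never needs to re-evaluate the runtime conditions.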
- **Refactored MoE Forward Pass:**
  - Inside `AscendFusedMoE.forward_impl`, the strategy name from the context is used to dynamically fetch the corresponding pre-instantiated communication object using `getattr`.
  - The forward pass is restructured around the strategy object:
    - `moe_comm_method.prepare()`: prepares inputs and performs initial communication.
    - `quant_method.apply()`: executes the core expert computation (MLPs).
    - `moe_comm_method.finalize()`: gathers results from experts and finalizes the output.
  - This design cleanly separates the communication logic, which may be dynamic, from the core computation, which can be a graph-captured kernel.
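The prepare/apply/finalize flow can be sketched as below. The `_EchoComm` stand-in strategy and the `_apply_experts` placeholder are purely illustrative; the attribute-naming scheme (`<name>commimpl`) and the `getattr` dispatch follow the description above:

```python
class _EchoComm:
    """Stand-in strategy: identity prepare/finalize (illustration only)."""

    def prepare(self, hidden_states, router_logits):
        return hidden_states, router_logits

    def finalize(self, hidden_states):
        return hidden_states


class AscendFusedMoE:
    """Hypothetical sketch of the strategy-dispatched forward pass."""

    def __init__(self):
        # Pre-instantiate every available strategy up front, so the forward
        # pass only selects among existing objects (nothing is allocated
        # inside a graph-captured region).
        self.allgathercommimpl = _EchoComm()
        self.mc2commimpl = _EchoComm()

    def _apply_experts(self, hidden_states, router_logits):
        # Placeholder for quant_method.apply(): the graph-capturable
        # expert MLP computation.
        return [x * 2 for x in hidden_states]

    def forward_impl(self, hidden_states, router_logits, moe_comm_method_name):
        # Fetch the pre-instantiated strategy by the name carried in the
        # forward context.
        comm = getattr(self, f"{moe_comm_method_name}commimpl")
        h, logits = comm.prepare(hidden_states, router_logits)  # communication
        h = self._apply_experts(h, logits)                      # computation
        return comm.finalize(h)                                 # aggregation
```

Only the middle step needs to be graph-captured; the surrounding prepare/finalize calls can run eagerly when the chosen strategy is not graphable.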
This new architecture successfully enables MoE models to leverage ACL Graph for acceleration by selecting graph-compatible communication methods when possible. It also unlocks significant performance gains by using specialized, high-performance communication strategies like MC2 in scenarios where they are most effective. The next step in our roadmap is to extend this flexible framework to support quantized MoE models.
Plan / Roadmap
- Introduce `MoECommMethod`, implement `AllGatherImpl`, and adapt ACL Graph handling to cover all scenarios ([1/N][Feat] Support MoE models with ACL Graph and refactor MoE communication logic #2125).
- Implement `MC2CommImpl` and enable communication switching ([2/N][Feat] Add MC2 communication method for MoE layers #2469).
- Enable W8A8 / Int8 models to use `unified_fused_experts` ([3/N][Feat][Graph] Support all-to-all and quantized models with ACL Graph #2614).
Outstanding items
- Merge `moe_comm_method` and `token_dispatcher`.
- Add support for quantized models with the `all-gather` communication pattern.
- Additional items to be specified (TBD).
Feedback Period.
Two weeks.
CC List.
Any Other Things.
No response