[Docs] Add moe kernel features doc #25297

bnellnm · 2025-09-20T00:33:28Z

Purpose

Add documentation describing the features supported by each moe kernel (modular or non-modular) and each PrepareAndFinalize class.

Test Plan

N/A

Test Result

N/A

cc @tlrmchlsmth , @robertgshaw2-redhat , @mgoin , @varun-sundar-rabindranath , @simon-mo , @WoosukKwon

varun-sundar-rabindranath · 2025-09-23T21:41:04Z

Nice. Thanks Bill. The vllm/docs/design/fused_moe_modular_kernel.md doc tries to list all the available prepare-finalize and fused-experts implementations. Do you think we could remove that part and link this features doc there ?

docs/design/moe_kernel_features.md

varun-sundar-rabindranath

Reviewed the content. Thanks Bill!

hmellor · 2025-09-24T16:58:38Z

Preview available here https://vllm--25297.org.readthedocs.build/en/25297/design/moe_kernel_features.html

2 things:

Not all of the API cross references appear to be correct
The tables are quite wide which hurts readability a little. There are CSS tricks we can do to make them more compact but I'm not sure that'll be enough

bnellnm · 2025-09-25T15:07:03Z

Preview available here https://vllm--25297.org.readthedocs.build/en/25297/design/moe_kernel_features.html

2 things:

Not all of the API cross references appear to be correct

The tables are quite wide which hurts readability a little. There are CSS tricks we can do to make them more compact but I'm not sure that'll be enough

I think I've fixed all the links. One is for a PR that hasn't landed yet so will not work.

hmellor · 2025-09-29T10:53:48Z

One is for a PR that hasn't landed yet so will not work.

That entry should be added in that PR. Including a broken link will cause the docs build to fail on main.

The tables are still very wide, could you try something like

vllm/docs/features/quantization/README.md

Lines 26 to 44 in edbaadd

    
           <style> 
        
           td:not(:first-child) { 
        
             text-align: center !important; 
        
           } 
        
           td { 
        
             padding: 0.5rem !important; 
        
             white-space: nowrap; 
        
           } 
        
           th { 
        
             padding: 0.5rem !important; 
        
             min-width: 0 !important; 
        
           } 
        
           th:not(:first-child) { 
        
             writing-mode: vertical-lr; 
        
             transform: rotate(180deg) 
        
           } 
        
           </style>

that we use for other wide tables?

docs/design/fused_moe_modular_kernel.md

bnellnm · 2025-09-29T15:44:56Z

The <style> section is triggering the linter. I'm not sure how to fix it (or if it can since I copied it directly from README.md).

docs/design/moe_kernel_features.md

docs/design/fused_moe_modular_kernel.md

bnellnm · 2025-09-29T16:45:42Z

@hmellor , I think I've addressed all the comments. Can you take another look when you get a chance?

hmellor · 2025-09-29T17:54:37Z

Thanks for making the changes so far. I'm going to have a look locally to see if I can make the tables render nicely

hmellor · 2025-09-29T18:07:50Z

I can't push changes directly to this PR because it comes from an organisation's fork. Please apply the following patch

fix.patch

diff --git a/docs/design/fused_moe_modular_kernel.md b/docs/design/fused_moe_modular_kernel.md
index f865b764e..ee5701989 100644
--- a/docs/design/fused_moe_modular_kernel.md
+++ b/docs/design/fused_moe_modular_kernel.md
@@ -242,8 +242,8 @@ Example: `python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kern
 
 ## FusedMoEPrepareAndFinalize Implementations
 
-See [Fused MoE Kernel features](./moe_kernel_features.md#Fused-MoE-Modular-All2All-backends) for a list of all the available modular prepare and finalize subclasses.
+See [Fused MoE Kernel features](./moe_kernel_features.md#fused-moe-modular-all2all-backends) for a list of all the available modular prepare and finalize subclasses.
 
 ## FusedMoEPermuteExpertsUnpermute
 
-See [Fused MoE Kernel features](./moe_kernel_features.md#Fused-MoE-Experts-Kernels) for a list of all the available modular experts.
+See [Fused MoE Kernel features](./moe_kernel_features.md#fused-moe-experts-kernels) for a list of all the available modular experts.
diff --git a/docs/design/moe_kernel_features.md b/docs/design/moe_kernel_features.md
index 6e2727a57..6f3210fa7 100644
--- a/docs/design/moe_kernel_features.md
+++ b/docs/design/moe_kernel_features.md
@@ -19,9 +19,6 @@ Certain models require the topk weights to be applied to the input activations r
 unless otherwise specified, backends are controlled via `VLLM_ALL2ALL_BACKEND`.  All backends except `flashinfer` only work with EP+DP or EP+TP. `Flashinfer` can work with EP or DP w/o EP.
 
 <style>
-td:not(:first-child) {
-  text-align: center !important;
-}
 td {
   padding: 0.5rem !important;
   white-space: nowrap;
@@ -31,11 +28,6 @@ th {
   padding: 0.5rem !important;
   min-width: 0 !important;
 }
-
-th:not(:first-child) {
-  writing-mode: vertical-lr;
-  transform: rotate(180deg)
-}
 </style>
 
 | Backend                               | Output act. format | Quant. types    | Quant. format          | Async | Apply Weight On Input | Sub-class                                                                                                                                                     |
@@ -44,27 +36,25 @@ th:not(:first-child) {
 | pplx                                  | batched            | fp8,int8        | G,A,T                  | Y     | Y                     | [`PplxPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.pplx_prepare_finalize.PplxPrepareAndFinalize]                                                 |
 | deepep_high_throughput                | standard           | fp8             | G(128),A,T<sup>2</sup> | Y     | Y                     | [`DeepEPLLPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ll_prepare_finalize.DeepEPLLPrepareAndFinalize]                                    |
 | deepep_low_latency                    | batched            | fp8             | G(128),A,T<sup>3</sup> | Y     | Y                     | [`DeepEPHTPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ht_prepare_finalize.DeepEPHTPrepareAndFinalize]                                    |
-| flashinfer_all2allv                   | standard           | nvfp4,fp8       | G,A,T                  | N     | N                     | [`FlashInferAllToAllMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferAllToAllMoEPrepareAndFinalize] |
 | flashinfer<sup>4</sup>                | standard           | nvfp4,fp8       | G,A,T                  | N     | N                     | [`FlashInferCutlassMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferCutlassMoEPrepareAndFinalize]   |
 | flashinfer<sup>4</sup>                | standard           | nvfp4,fp8       | G,A,T                  | N     | N                     | [`FlashInferCutlassMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferCutlassMoEPrepareAndFinalize]   |
 | MoEPrepareAndFinalizeNoEP<sup>5</sup>    | standard           | fp8,int8        | G,A,T                  | N     | Y                     | [`MoEPrepareAndFinalizeNoEP`][vllm.model_executor.layers.fused_moe.prepare_finalize.MoEPrepareAndFinalizeNoEP]                                                      |
 | BatchedPrepareAndFinalize<sup>5</sup> | batched            | fp8,int8        | G,A,T                  | N     | Y                     | [`BatchedPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.fused_batched_moe.BatchedPrepareAndFinalize]                                               |
 
-1. All types: mxfp4, nvfp4, int4, int8, fp8
-2. A,T quantization occurs after dispatch.
-3. All quantization happens after dispatch.
-4. Controlled by different env vars (`VLLM_FLASHINFER_MOE_BACKEND` "throughput" or "latency")
-5. This is a no-op dispatcher that can be used to pair with any modular experts to produce a modular kernel that runs w/o dispatch or combine.  These cannot be selected via environment variable.  These are generally use for testing or adapting an expert subclass to the `fused_experts` API.
-6. This depends on the experts implementation.
+!!! info "Table key"
+    1. All types: mxfp4, nvfp4, int4, int8, fp8
+    2. A,T quantization occurs after dispatch.
+    3. All quantization happens after dispatch.
+    4. Controlled by different env vars (`VLLM_FLASHINFER_MOE_BACKEND` "throughput" or "latency")
+    5. This is a no-op dispatcher that can be used to pair with any modular experts to produce a modular kernel that runs w/o dispatch or combine.  These cannot be selected via environment variable.  These are generally use for testing or adapting an expert subclass to the `fused_experts` API.
+    6. This depends on the experts implementation.
 
-### Quantization format key
+    ---
 
-| Quantization formats        | Symbol |
-|-----------------------------|--------|
-| Grouped                     | G      |
-| Grouped w/block size N      | G(N)   |
-| Per activation token        | A      |
-| Per tensor                  | T      |
+    - G - Grouped
+    - G(N) - Grouped w/block size N
+    - A - Per activation token
+    - T - Per tensor
 
 Modular kernels are supported by the following `FusedMoEMethodBase` classes.
 
@@ -93,28 +83,29 @@ To be used with a particular `FusedMoEPrepareAndFinalize` sub-class, MoE kernels
 
 | Kernel                       | Input act. format | Quant. types    | Quant. format | Activation function                         | Apply Weight On Input | Modular | Source                                                                                                                                                                                                 |
 |------------------------------|-------------------|-----------------|---------------|---------------------------------------------|-----------------------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| triton                       | standard          | all<sup>1</sup> | G,A,T         | silu, gelu, swigluoai, silu_no_mul, gelu_no_mul | Y                     | Y       | [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts], [`TritonExperts`][vllm.model_executor.layers.fused_moe.fused_moe.TritonExperts]                                                                                                                  |
+| triton                       | standard          | all<sup>1</sup> | G,A,T         | silu, gelu,</br>swigluoai,</br>silu_no_mul,</br>gelu_no_mul | Y                     | Y       | [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts],</br>[`TritonExperts`][vllm.model_executor.layers.fused_moe.fused_moe.TritonExperts]                                                                                                                  |
 | triton (batched)             | batched           | all<sup>1</sup> | G,A,T         | silu, gelu                                   | <sup>6</sup>          | Y       | [`BatchedTritonExperts`][vllm.model_executor.layers.fused_moe.fused_batched_moe.BatchedTritonExperts]                                                                                                                    |
-| deep gemm                    | standard, batched | fp8             | G(128),A,T    | silu, gelu                                   | <sup>6</sup>          | Y       | [`deep_gemm_moe_fp8`][vllm.model_executor.layers.fused_moe.deep_gemm_moe.deep_gemm_moe_fp8], [`DeepGemmExperts`][vllm.model_executor.layers.fused_moe.deep_gemm_moe.DeepGemmExperts], [`BatchedDeepGemmExperts`][vllm.model_executor.layers.fused_moe.batched_deep_gemm_moe.BatchedDeepGemmExperts]             |
-| cutlass_fp4                  | standard, batched | nvfp4           | A,T           | silu                                        | Y                     | Y       | [`cutlass_moe_fp4`][vllm.model_executor.layers.fused_moe.cutlass_moe.cutlass_moe_fp4], [`CutlassExpertsFp4`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp4]                                                                                                          |
-| cutlass_fp8                  | standard, batched | fp8             | A,T           | silu, gelu                                   | Y                     | Y       | [`cutlass_moe_fp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.cutlass_moe_fp8], [`CutlassExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp8], [`CutlasBatchedExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassBatchedExpertsFp8]                                                                               |
-| flashinfer                   | standard          | nvfp4,fp8       | T             | <sup>5</sup>                                | N                     | Y       | [`flashinfer_cutlass_moe_fp4`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.flashinfer_cutlass_moe_fp4], [`FlashInferExperts`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.FlashInferExperts]                                                                                    |
-| gpt oss triton               | batched           | N/A             | N/A           | <sup>5</sup>                                | Y                     | Y       | [`triton_kernel_fused_experts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.triton_kernel_fused_experts], [`BatchedOAITritonExperts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.BatchedOAITritonExperts]                                                                         |
-| deep gemm+triton<sup>2</sup> | standard, batched | all<sup>1</sup> | G(128),A,T    | silu, gelu                                   | <sup>6</sup>          | Y       | [`TritonOrDeepGemmExperts`][vllm.model_executor.layers.fused_moe.triton_deep_gemm_moe.TritonOrDeepGemmExperts], [`BatchedTritonOrDeepGemmExperts`][vllm.model_executor.layers.fused_moe.batched_triton_or_deep_gemm_moe.BatchedTritonOrDeepGemmExperts] |
-| marlin                       | standard          | <sup>3</sup>    | <sup>3</sup>  | silu, swigluoai                              | Y                     | N       | [`fused_marlin_moe`][vllm.model_executor.layers.fused_moe.fused_marlin_moe.fused_marlin_moe]                                                                                                                         |
-| trtllm                       | standard          | mxfp4,nvfp4     | G(16),G32)    | <sup>5</sup>                                | N                     | Y       | [`TrtLlmGenExperts`][vllm.model_executor.layers.fused_moe.trtllm_moe.TrtLlmGenExperts]                                                                                                                               |
+| deep gemm                    | standard,</br>batched | fp8             | G(128),A,T    | silu, gelu                                   | <sup>6</sup>          | Y       | [`deep_gemm_moe_fp8`][vllm.model_executor.layers.fused_moe.deep_gemm_moe.deep_gemm_moe_fp8],</br>[`DeepGemmExperts`][vllm.model_executor.layers.fused_moe.deep_gemm_moe.DeepGemmExperts],</br>[`BatchedDeepGemmExperts`][vllm.model_executor.layers.fused_moe.batched_deep_gemm_moe.BatchedDeepGemmExperts]             |
+| cutlass_fp4                  | standard,</br>batched | nvfp4           | A,T           | silu                                        | Y                     | Y       | [`cutlass_moe_fp4`][vllm.model_executor.layers.fused_moe.cutlass_moe.cutlass_moe_fp4],</br>[`CutlassExpertsFp4`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp4]                                                                                                          |
+| cutlass_fp8                  | standard,</br>batched | fp8             | A,T           | silu, gelu                                   | Y                     | Y       | [`cutlass_moe_fp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.cutlass_moe_fp8],</br>[`CutlassExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp8],</br>[`CutlasBatchedExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassBatchedExpertsFp8]                                                                               |
+| flashinfer                   | standard          | nvfp4,</br>fp8       | T             | <sup>5</sup>                                | N                     | Y       | [`flashinfer_cutlass_moe_fp4`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.flashinfer_cutlass_moe_fp4],</br>[`FlashInferExperts`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.FlashInferExperts]                                                                                    |
+| gpt oss triton               | batched           | N/A             | N/A           | <sup>5</sup>                                | Y                     | Y       | [`triton_kernel_fused_experts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.triton_kernel_fused_experts],</br>[`BatchedOAITritonExperts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.BatchedOAITritonExperts]                                                                         |
+| deep gemm+triton<sup>2</sup> | standard,</br>batched | all<sup>1</sup> | G(128),A,T    | silu, gelu                                   | <sup>6</sup>          | Y       | [`TritonOrDeepGemmExperts`][vllm.model_executor.layers.fused_moe.triton_deep_gemm_moe.TritonOrDeepGemmExperts],</br>[`BatchedTritonOrDeepGemmExperts`][vllm.model_executor.layers.fused_moe.batched_triton_or_deep_gemm_moe.BatchedTritonOrDeepGemmExperts] |
+| marlin                       | standard          | <sup>3</sup>    | <sup>3</sup>  | silu,</br>swigluoai                              | Y                     | N       | [`fused_marlin_moe`][vllm.model_executor.layers.fused_moe.fused_marlin_moe.fused_marlin_moe]                                                                                                                         |
+| trtllm                       | standard          | mxfp4,</br>nvfp4     | G(16),G(32)    | <sup>5</sup>                                | N                     | Y       | [`TrtLlmGenExperts`][vllm.model_executor.layers.fused_moe.trtllm_moe.TrtLlmGenExperts]                                                                                                                               |
 | pallas                       | standard          | N/A             | N/A           | silu                                        | N                     | N       | [`fused_moe`][vllm.model_executor.layers.fused_moe.moe_pallas.fused_moe]                                                                                                                                      |
 | iterative                    | standard          | N/A             | N/A           | silu                                        | N                     | N       | [`fused_moe`][vllm.model_executor.layers.fused_moe.moe_torch_iterative.fused_moe]                                                                                                                             |
 | rocm aiter moe               | standard          | fp8             | G(128),A,T    | silu, gelu                                   | Y                     | N       | [`rocm_aiter_fused_experts`][vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe.rocm_aiter_fused_moe_impl]                                                                                                             |
 | cpu_fused_moe                | standard          | N/A             | N/A           | silu                                        | N                     | N       | [`CPUFusedMOE`][vllm.model_executor.layers.fused_moe.cpu_fused_moe.CPUFusedMOE]                                                                                                                                 |
-| naive batched<sup>4</sup>    | batched           | int8,fp8        | G,A,T         | silu, gelu                                   | <sup>6</sup>          | Y       | [`NaiveBatchedExperts`][vllm.model_executor.layers.fused_moe.fused_batched_moe.NaiveBatchedExperts]                                                                                                    |
+| naive batched<sup>4</sup>    | batched           | int8,</br>fp8        | G,A,T         | silu, gelu                                   | <sup>6</sup>          | Y       | [`NaiveBatchedExperts`][vllm.model_executor.layers.fused_moe.fused_batched_moe.NaiveBatchedExperts]                                                                                                    |
 
-1. All types: mxfp4, nvfp4, int4, int8, fp8
-2. A dispatcher wrapper around triton and deep gemm experts.  Will select based on type + shape + quantization params
-3. uint4, uint8, fp8, fp4
-4. This is a naive implementation of experts that supports batched format. Mainly used for testing.
-5. The `activation` parameter is ignored and SwiGlu is used by default instead.
-6. Only handled by or supported when used with modular kernels.
+!!! info "Table key"
+    1. All types: mxfp4, nvfp4, int4, int8, fp8
+    2. A dispatcher wrapper around triton and deep gemm experts.  Will select based on type + shape + quantization params
+    3. uint4, uint8, fp8, fp4
+    4. This is a naive implementation of experts that supports batched format. Mainly used for testing.
+    5. The `activation` parameter is ignored and SwiGlu is used by default instead.
+    6. Only handled by or supported when used with modular kernels.
 
 ## Modular Kernel "families"
 
@@ -122,6 +113,6 @@ The following table shows "families" of modular kernels that are intended to wor
 
 | backend                      | `FusedMoEPrepareAndFinalize` subclasses                | `FusedMoEPermuteExpertsUnpermute` subclasses                                                                   |
 |------------------------------|--------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
-| deepep_high_throughput, pplx | `DeepEPHTPrepareAndFinalize`, `PplxPrepareAndFinalize` | `BatchedDeepGemmExperts`, `BatchedTritonExperts`, `BatchedTritonOrDeepGemmExperts`, `CutlassBatchedExpertsFp8` |
-| deepep_low_latency           | `DeepEPLLPrepareAndFinalize`                           | `DeepGemmExperts`, `TritonExperts`, `TritonOrDeepGemmExperts`, `CutlassExpertsFp8`                             |
+| deepep_high_throughput,</br>pplx | `DeepEPHTPrepareAndFinalize`,</br>`PplxPrepareAndFinalize` | `BatchedDeepGemmExperts`,</br>`BatchedTritonExperts`,</br>`BatchedTritonOrDeepGemmExperts`,</br>`CutlassBatchedExpertsFp8` |
+| deepep_low_latency           | `DeepEPLLPrepareAndFinalize`                           | `DeepGemmExperts`,</br>`TritonExperts`,</br>`TritonOrDeepGemmExperts`,</br>`CutlassExpertsFp8`                             |
 | flashinfer                   | `FlashInferCutlassMoEPrepareAndFinalize`               | `FlashInferExperts`                                                                                            |

Signed-off-by: Bill Nell <[email protected]>

hmellor

LGTM! Just a couple of tiny nits

docs/design/moe_kernel_features.md

Co-authored-by: Harry Mellor <[email protected]> Signed-off-by: bnellnm <[email protected]>

Signed-off-by: Bill Nell <[email protected]> Signed-off-by: bnellnm <[email protected]> Co-authored-by: Harry Mellor <[email protected]>

Signed-off-by: Bill Nell <[email protected]> Signed-off-by: bnellnm <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Signed-off-by: yewentao256 <[email protected]>

Signed-off-by: Bill Nell <[email protected]> Signed-off-by: bnellnm <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Signed-off-by: Tomer Asida <[email protected]>

Signed-off-by: Bill Nell <[email protected]> Signed-off-by: bnellnm <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>

Signed-off-by: Bill Nell <[email protected]> Signed-off-by: bnellnm <[email protected]> Co-authored-by: Harry Mellor <[email protected]>

Signed-off-by: Bill Nell <[email protected]> Signed-off-by: bnellnm <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>

Signed-off-by: Bill Nell <[email protected]> Signed-off-by: bnellnm <[email protected]> Co-authored-by: Harry Mellor <[email protected]>

mergify bot added the documentation Improvements or additions to documentation label Sep 20, 2025

bnellnm marked this pull request as ready for review September 23, 2025 21:13

hmellor reviewed Sep 24, 2025

View reviewed changes

docs/design/moe_kernel_features.md Outdated Show resolved Hide resolved