
Conversation

@luccafong
Collaborator

@luccafong luccafong commented Nov 12, 2025

Purpose

Models with int4 weights, such as Kimi-K2-Thinking, currently use CompressedTensorsWNA16MarlinMoEMethod, which returns None from get_fused_moe_quant_config and has no GEMM implementation selection. They therefore cannot go through the prepare/finalize path with DP/EP; only the naive all2all backend can be used in DP/EP mode, which is slow.

This PR adds the missing quant config and GEMM selection so that other all2all backends, such as DeepEP, can be used.
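
Concretely, the change boils down to two new methods on CompressedTensorsWNA16MarlinMoEMethod, sketched below. This is condensed from the diff quoted in the review thread further down and relies on vLLM-internal helpers, so it is not a standalone snippet; the non-batched branch is truncated in the quoted diff, so its exact constructor call here is an assumption.

```python
def get_fused_moe_quant_config(self, layer):
    # Describe the int4-weight / 16-bit-activation scheme so the modular
    # fused-MoE kernel path knows how the expert weights are quantized.
    if self.num_bits != 4:
        return None
    return int4_w4a16_moe_quant_config(
        w1_scale=layer.w13_weight_scale,
        w2_scale=layer.w2_weight_scale,
        w1_zp=None,
        w2_zp=None,
        block_shape=[0, self.group_size],
    )

def select_gemm_impl(self, prepare_finalize, layer):
    # Pick a Marlin expert implementation matching the activation format
    # produced by the selected all2all backend (e.g. DeepEP).
    if (
        prepare_finalize.activation_format
        == mk.FusedMoEActivationFormat.BatchedExperts
    ):
        return BatchedMarlinExperts(
            max_num_tokens=prepare_finalize.max_num_tokens_per_rank(),
            num_dispatchers=prepare_finalize.num_dispatchers(),
            quant_config=self.moe_quant_config,
        )
    # Non-batched branch is truncated in the quoted diff; assumed to
    # construct MarlinExperts similarly.
    return MarlinExperts(quant_config=self.moe_quant_config)
```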

Test Plan

VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve /data/local/models/oss/Kimi-K2-Thinking -dp 8  --max-model-len 16384 --max-num-seqs 32 --block-size 64 --trust-remote-code --enable-expert-parallel 
VLLM_ALL2ALL_BACKEND=deepep_high_throughput vllm serve /data/local/models/oss/Kimi-K2-Thinking -dp 8  --max-model-len 16384 --max-num-seqs 32 --block-size 64 --trust-remote-code --enable-expert-parallel 

Test Result

lm_eval (gsm8k) results are on par with the main branch using the naive a2a backend:

VLLM_ALL2ALL_BACKEND=naive vllm serve /data/local/models/oss/Kimi-K2-Thinking -dp 8 --max-model-len 32768 --max-num-seqs 32 --block-size 64 --trust-remote-code --enable-expert-parallel

Baseline (main, naive a2a backend)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9416|±  |0.0065|
|     |       |strict-match    |     5|exact_match|↑  |0.9416|±  |0.0065|

deepep_high_throughput (this PR)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9454|±  |0.0063|
|     |       |strict-match    |     5|exact_match|↑  |0.9447|±  |0.0063|

deepep_low_latency (this PR)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9424|±  |0.0064|
|     |       |strict-match    |     5|exact_match|↑  |0.9409|±  |0.0065|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify

mergify bot commented Nov 12, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @luccafong.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 12, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to add support for DeepEP for Kimi2-thinking models by enabling gemm selection for CompressedTensorsWNA16MarlinMoEMethod. The changes correctly implement get_fused_moe_quant_config and select_gemm_impl to handle int4 weights and select the appropriate Marlin expert implementation. The supporting changes in other files are also correct. However, I've found a critical issue in the implementation of select_gemm_impl which will cause a runtime error. Please see the detailed comment below.

@luccafong luccafong changed the title Support DeepEP for Kimi2-thinking through enabling gemm selection for compressed-tensor marlin wna16 Support DeepEP for Kimi-k2-thinking through enabling gemm selection for compressed-tensor marlin wna16 Nov 12, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

@luccafong
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

@mgoin
Member

mgoin commented Nov 12, 2025

cc @varun-sundar-rabindranath since you looked into mxfp4 for gpt-oss

Signed-off-by: Lu Fang <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 1562 to 1595
def get_fused_moe_quant_config(
    self, layer: torch.nn.Module
) -> FusedMoEQuantConfig | None:
    if self.num_bits != 4:
        return None
    return int4_w4a16_moe_quant_config(
        w1_scale=layer.w13_weight_scale,
        w2_scale=layer.w2_weight_scale,
        w1_zp=None,
        w2_zp=None,
        block_shape=[0, self.group_size],
    )

def select_gemm_impl(
    self,
    prepare_finalize: mk.FusedMoEPrepareAndFinalize,
    layer: torch.nn.Module,
) -> mk.FusedMoEPermuteExpertsUnpermute:
    layer.w13_weight = layer.w13_weight_packed
    layer.w2_weight = layer.w2_weight_packed
    assert all([w is not None for w in [layer.w13_weight, layer.w2_weight]])
    assert self.moe_quant_config is not None
    if (
        prepare_finalize.activation_format
        == mk.FusedMoEActivationFormat.BatchedExperts
    ):
        return BatchedMarlinExperts(
            max_num_tokens=prepare_finalize.max_num_tokens_per_rank(),
            num_dispatchers=prepare_finalize.num_dispatchers(),
            quant_config=self.moe_quant_config,

P1: Lose act-order indices when routing through modular Marlin experts

The new select_gemm_impl now returns BatchedMarlinExperts/MarlinExperts for CompressedTensorsWNA16MarlinMoEMethod, but those experts never forward the g_idx* and sort_indices* tensors that are passed to fused_marlin_moe in the non-modular path (apply still calls fused_marlin_moe(..., g_idx1=..., g_idx2=..., sort_indices1=..., sort_indices2=...)). For models quantized with grouped activation ordering (which populate these tensors during process_weights_after_loading), the modular kernel used for DP/EP will silently drop the act-order permutation information, causing incorrect expert outputs when the DeepEP/prepare-finalize path is enabled. Consider wiring the g‑index tensors through the modular Marlin experts or gating the modular path off for act-ordered weights.


Contributor


@luccafong - This call-out seems reasonable. Looks like you'd need to plumb through g_idx1, g_idx2, sort_indices1, and sort_indices2? Can you please take a look? Thanks.

Collaborator Author

@luccafong luccafong Nov 12, 2025


Hmm, it seems we would either need to touch the base method signature of FusedMoEPrepareAndFinalize to add them, or initialize them in MarlinExperts.

Collaborator Author


Let me try the latter approach.
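
A minimal, self-contained sketch of that latter approach, using hypothetical names (SketchMarlinExperts and a stub standing in for fused_marlin_moe; the actual fix landed in 1e18fc8): the experts object captures the act-order tensors at construction time and forwards them whenever it invokes the fused Marlin kernel, so nothing in the FusedMoEPrepareAndFinalize interface has to change.

```python
import torch


def _fused_marlin_moe_stub(hidden_states, **kwargs):
    # Placeholder standing in for the real fused_marlin_moe kernel call,
    # so this sketch runs on its own.
    return hidden_states


class SketchMarlinExperts:
    """Illustrative only: capture act-order tensors when the experts are built."""

    def __init__(self, quant_config, g_idx1=None, g_idx2=None,
                 sort_indices1=None, sort_indices2=None):
        self.quant_config = quant_config
        # Act-order permutation info, populated from the tensors created
        # during weight post-processing (None for non-act-order models).
        self.g_idx1 = g_idx1
        self.g_idx2 = g_idx2
        self.sort_indices1 = sort_indices1
        self.sort_indices2 = sort_indices2

    def apply(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Forward the captured tensors so grouped act-order models keep
        # their permutation on the DP/EP (prepare/finalize) path.
        return _fused_marlin_moe_stub(
            hidden_states,
            quant_config=self.quant_config,
            g_idx1=self.g_idx1,
            g_idx2=self.g_idx2,
            sort_indices1=self.sort_indices1,
            sort_indices2=self.sort_indices2,
        )
```

With this shape, select_gemm_impl can hand the g_idx/sort_indices tensors produced during process_weights_after_loading to the experts object it constructs.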

Collaborator Author


resolved in 1e18fc8

@varun-sundar-rabindranath
Contributor

@luccafong - along with

 VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA vllm serve /data/local/models/oss/Kimi-K2-Thinking -dp 8 --kv-cache-dtype fp8 --max-model-len 16384 --max-num-seqs 32 --block-size 64 --trust-remote-code --enable-expert-parallel 

Can you also try running deepep_high_throughput and provide an lm_eval comparison against main? Thanks.

Signed-off-by: Lu Fang <[email protected]>
@luccafong
Collaborator Author

> @luccafong - along with
>
> VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA vllm serve /data/local/models/oss/Kimi-K2-Thinking -dp 8 --kv-cache-dtype fp8 --max-model-len 16384 --max-num-seqs 32 --block-size 64 --trust-remote-code --enable-expert-parallel
>
> Can you also try running deepep_high_throughput and provide an lm_eval comparison against main? Thanks.

updated with test plan and results

Signed-off-by: Lu Fang <[email protected]>
Contributor

@varun-sundar-rabindranath varun-sundar-rabindranath left a comment


LGTM! Thanks @luccafong

@luccafong
Collaborator Author

@mgoin @houseroad could you review to get committer approval? thanks!

Signed-off-by: Lu Fang <[email protected]>
@mgoin mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 13, 2025
Member

@mgoin mgoin left a comment


LGTM, nice!

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 13, 2025
@luccafong luccafong merged commit 7e082bc into vllm-project:main Nov 13, 2025
56 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 13, 2025
geodavic pushed a commit to geodavic/vllm that referenced this pull request Nov 16, 2025
…or compressed-tensor marlin wna16 (vllm-project#28574)

Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: George D. Torres <[email protected]>
bwasti pushed a commit to bwasti/vllm that referenced this pull request Nov 17, 2025
…or compressed-tensor marlin wna16 (vllm-project#28574)

Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Bram Wasti <[email protected]>
@luccafong luccafong deleted the kimi_moe_marlin branch November 18, 2025 23:59