Support DeepEP for Kimi-k2-thinking through enabling gemm selection for compressed-tensor marlin wna16 #28574
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request aims to add DeepEP support for Kimi-K2-Thinking models by enabling gemm selection for CompressedTensorsWNA16MarlinMoEMethod. The changes correctly implement get_fused_moe_quant_config and select_gemm_impl to handle int4 weights and select the appropriate Marlin expert implementation. The supporting changes in other files are also correct. However, I've found a critical issue in the implementation of select_gemm_impl which will cause a runtime error. Please see the detailed comment below.
@codex review
cc @varun-sundar-rabindranath since you looked into mxfp4 for gpt-oss
In `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py`:

```diff
     def get_fused_moe_quant_config(
         self, layer: torch.nn.Module
     ) -> FusedMoEQuantConfig | None:
-        return None
+        if self.num_bits != 4:
+            return None
+        return int4_w4a16_moe_quant_config(
+            w1_scale=layer.w13_weight_scale,
+            w2_scale=layer.w2_weight_scale,
+            w1_zp=None,
+            w2_zp=None,
+            block_shape=[0, self.group_size],
+        )
+
+    def select_gemm_impl(
+        self,
+        prepare_finalize: mk.FusedMoEPrepareAndFinalize,
+        layer: torch.nn.Module,
+    ) -> mk.FusedMoEPermuteExpertsUnpermute:
+        layer.w13_weight = layer.w13_weight_packed
+        layer.w2_weight = layer.w2_weight_packed
+        assert all([w is not None for w in [layer.w13_weight, layer.w2_weight]])
+        assert self.moe_quant_config is not None
+        if (
+            prepare_finalize.activation_format
+            == mk.FusedMoEActivationFormat.BatchedExperts
+        ):
+            return BatchedMarlinExperts(
+                max_num_tokens=prepare_finalize.max_num_tokens_per_rank(),
+                num_dispatchers=prepare_finalize.num_dispatchers(),
+                quant_config=self.moe_quant_config,
```
Lose act-order indices when routing through modular Marlin experts
The new select_gemm_impl now returns BatchedMarlinExperts/MarlinExperts for CompressedTensorsWNA16MarlinMoEMethod, but those experts never forward the g_idx* and sort_indices* tensors that are passed to fused_marlin_moe in the non-modular path (apply still calls fused_marlin_moe(..., g_idx1=..., g_idx2=..., sort_indices1=..., sort_indices2=...)). For models quantized with grouped activation ordering (which populate these tensors during process_weights_after_loading), the modular kernel used for DP/EP will silently drop the act-order permutation information, causing incorrect expert outputs when the DeepEP/prepare-finalize path is enabled. Consider wiring the g‑index tensors through the modular Marlin experts or gating the modular path off for act-ordered weights.
@luccafong - This callout seems reasonable. Looks like you'd need to plumb through g_idx1, g_idx2, sort_indices1, and sort_indices2? Can you please take a look. Thanks.
Hmm, it seems we will need to touch the base method signature of FusedMoEPrepareAndFinalize to add them, or initialize them in MarlinExperts instead.
Let me try the latter approach.
resolved in 1e18fc8
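For readers following along, here is a minimal sketch of what the "initialize them in MarlinExperts" approach could look like. The class and attribute names are hypothetical and this is not the code from 1e18fc8; the idea is simply that the experts object captures the act-order tensors at construction, so the FusedMoEPrepareAndFinalize interface stays untouched.

```python
# Hypothetical sketch only -- names are illustrative, not the actual change.
from __future__ import annotations

from dataclasses import dataclass

import torch


@dataclass
class MarlinExpertsSketch:
    quant_config: object
    # Act-order metadata captured at construction instead of being plumbed
    # through prepare/finalize; None means the weights are not act-ordered.
    g_idx1: torch.Tensor | None = None
    g_idx2: torch.Tensor | None = None
    sort_indices1: torch.Tensor | None = None
    sort_indices2: torch.Tensor | None = None

    def run_gemm(self) -> None:
        # A real implementation would forward the captured tensors to the
        # fused Marlin kernel, along the lines of:
        #   fused_marlin_moe(..., g_idx1=self.g_idx1, g_idx2=self.g_idx2,
        #                    sort_indices1=self.sort_indices1,
        #                    sort_indices2=self.sort_indices2)
        raise NotImplementedError
```

select_gemm_impl would then pass the layer's g-index and sort-index tensors when constructing the experts object, keeping the modular path consistent with the non-modular fused_marlin_moe call.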
@luccafong - along with can you also try running
updated with test plan and results
varun-sundar-rabindranath left a comment:
LGTM! Thanks @luccafong
@mgoin @houseroad could you review to get committer approval? thanks!
mgoin left a comment:
LGTM, nice!
Purpose
Current models with int4 weights such as `Kimi-K2-Thinking` use `CompressedTensorsWNA16MarlinMoEMethod`, which returns a `None` fused_moe_quant_config and has no gemm implementation, so they cannot go through the prepare/finalize path with DP/EP; only the naive all2all backend can be used in DP/EP mode, which is slow.

This PR adds the missing quant config and gemm selection to enable other all2all backends such as DeepEP.
Test Plan
Test Result
lm_eval (gsm8k) results are on par with the main branch using the naive a2a backend.
Baseline:

```bash
VLLM_ALL2ALL_BACKEND=naive vllm serve /data/local/models/oss/Kimi-K2-Thinking -dp 8 --max-model-len 32768 --max-num-seqs 32 --block-size 64 --trust-remote-code --enable-expert-parallel
```
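For the DeepEP comparison, the same serve command can be pointed at a DeepEP all2all backend. The command below is an illustrative example (the backend value is one of vLLM's `VLLM_ALL2ALL_BACKEND` options), not necessarily the exact invocation used for the reported runs:

```bash
# Illustrative DeepEP run; deepep_low_latency is the low-latency variant.
VLLM_ALL2ALL_BACKEND=deepep_high_throughput vllm serve /data/local/models/oss/Kimi-K2-Thinking \
  -dp 8 --max-model-len 32768 --max-num-seqs 32 --block-size 64 \
  --trust-remote-code --enable-expert-parallel
```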