enable moe marlin fp8 for Ampere GPU#9754
Conversation
Summary of Changes
Hello @ehuaa, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the FP8 quantization capabilities by enabling Marlin FP8 support for Mixture-of-Experts (MoE) models. The primary goal is to resolve a known issue and allow these highly efficient models to run on sm8x hardware. The changes involve integrating Marlin MoE specific utilities, implementing logic for automatic activation based on hardware support, and adapting the weight processing and inference paths to utilize the optimized Marlin kernels.
Highlights
- Marlin FP8 MoE Integration: This PR integrates Marlin FP8 quantization support specifically for Mixture-of-Experts (MoE) layers, leveraging vLLM's fused_marlin_moe kernel for improved performance.
- Automatic Marlin Activation: The system now automatically enables Marlin FP8 for supported CUDA architectures (like sm8x, sm9x, and sm100) or when explicitly forced via an environment variable, streamlining its adoption.
- Optimized Weight Processing: Modifications to the weight processing pipeline ensure that MoE layers are correctly prepared for Marlin, including the necessary handling and removal of activation scales that are not utilized by the Marlin kernel.
- Direct Marlin Kernel Integration: The core inference path now directly calls the fused_marlin_moe kernel, which is optimized for FP8 MoE operations, with a current constraint that only SiLU activation functions are supported.
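The auto-activation behavior described above can be sketched as a small capability check. Note that the function name and the environment-variable name below are illustrative assumptions, not the PR's actual identifiers:

```python
import os

def should_use_marlin_fp8(capability: tuple) -> bool:
    """Decide whether to enable the Marlin FP8 path.

    Hypothetical helper mirroring the auto-activation logic described
    above: enable on supported architectures (sm8x, sm9x, sm100),
    or when forced via an environment variable (name assumed here).
    """
    # Explicit override via environment variable (assumed name).
    if os.environ.get("SGLANG_FORCE_FP8_MARLIN", "0") == "1":
        return True
    major, _minor = capability
    # major >= 8 covers Ampere (sm8x), Hopper (sm9x), and sm100.
    return major >= 8
```

In practice the capability tuple would come from `torch.cuda.get_device_capability()`; the sketch keeps it as a plain argument so the decision logic is testable without a GPU.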
Code Review
This pull request enables FP8 quantization for Mixture-of-Experts (MoE) layers using the Marlin kernel. The changes are concentrated in python/sglang/srt/layers/quantization/fp8.py, where the Fp8MoEMethod is updated to support Marlin. This includes adding logic to detect Marlin availability, preparing weights for the Marlin kernel, and implementing the forward pass using the fused_marlin_moe operator.
The implementation looks solid, but I've identified a potential runtime error in the apply method. The code assumes the topk_output from the router can always be unpacked into three tensors, which is not always the case. I've added a comment with a suggestion to make the code more robust by checking the type of topk_output before unpacking.
```python
    moe_runner_config.activation == "silu"
), "Only SiLU activation is supported."

topk_weights, topk_ids, router_logits = topk_output
```
The topk_output is assumed to be unpackable into three values (topk_weights, topk_ids, router_logits). However, topk_output is of type TopKOutput, which is a protocol. If the TopK router uses Triton kernels (use_triton_kernels=True), it will return a TritonKernelTopKOutput object, which is not a tuple and cannot be unpacked this way. This will lead to a TypeError at runtime.
You should add a check to ensure topk_output is of the expected type before unpacking.
```python
from sglang.srt.layers.moe.topk import StandardTopKOutput

assert isinstance(topk_output, StandardTopKOutput), (
    f"Marlin moe requires StandardTopKOutput, but got {type(topk_output).__name__}"
)
topk_weights, topk_ids, router_logits = topk_output
```
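The failure mode the reviewer describes can be demonstrated in isolation. The two classes below are stand-ins for sglang's types (field names assumed), showing why tuple-style unpacking works for one and raises `TypeError` for the other:

```python
from typing import NamedTuple

class StandardTopKOutput(NamedTuple):
    """Stand-in for sglang's StandardTopKOutput (field names assumed)."""
    topk_weights: list
    topk_ids: list
    router_logits: list

class TritonKernelTopKOutput:
    """Stand-in for the Triton-kernel result: a plain object, not a tuple."""
    def __init__(self, routing_data):
        self.routing_data = routing_data

# A NamedTuple unpacks cleanly into three values:
w, ids, logits = StandardTopKOutput([0.7], [3], [0.1])

# The same unpacking on a non-iterable object raises TypeError,
# which is exactly the runtime error the review comment warns about:
try:
    w, ids, logits = TritonKernelTopKOutput(None)
    unpack_failed = False
except TypeError:
    unpack_failed = True
```

This is why the suggested `isinstance` assertion is preferable: it turns an obscure `TypeError` deep in the forward pass into an explicit, actionable message.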
Hi @zhyncs, can you help review this PR? It enables Ampere users to deploy DeepSeek and Qwen3 MoE FP8 checkpoints with fewer cards and lower latency.
Hey @ehuaa, thanks for working on it! Is this somewhat stable?
Great work! I tested it with Qwen3-30B on an NVIDIA A5000 and it worked gracefully.
Hi @Anaudia, I have tested Qwen MoE on an A40 and DeepSeek on an A100. The CI test failed; I'll check it this Sunday when I finish my vacation, thanks.
Any update on this? Would love this feature.
Would also prefer this to be merged. Closes #12887.
@ehuaa
@aidendle94 @ehfd Sure, I'll resolve the conflicts this week. In the current PR, after testing, the Qwen3 MoE FP8 version maintains the same accuracy as BF16, but DeepSeek V3 FP8 shows an accuracy drop of approximately 10% compared to BF16. I will try to address this issue this week.
A few things:
```python
    prepare_fp8_layer_for_marlin,
    prepare_moe_fp8_layer_for_marlin,
)
from vllm.scalar_type import scalar_types
```
It looks like this will continue to introduce vllm dependencies. A good solution would be to directly implement them in sglang.
@FlamingoPg OK, I'll try to re-implement them for MoE FP8 models. And I noticed that in #13524 you have fixed the vLLM dependencies for dense models, right?
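Until the helpers are re-implemented in sglang, one interim option is a guarded import so the Marlin FP8 path degrades gracefully when vLLM is absent. The module path below is taken from the diff's import block but is an assumption about vLLM's layout, not something verified here:

```python
# Sketch of a guarded optional import: if vLLM is not installed, the
# symbol is set to None and a flag records that the Marlin MoE FP8
# path is unavailable, so callers can fall back or raise a clear error.
try:
    from vllm.model_executor.layers.quantization.utils.marlin_utils_fp8 import (
        prepare_moe_fp8_layer_for_marlin,
    )
    HAS_VLLM_MARLIN = True
except ImportError:
    prepare_moe_fp8_layer_for_marlin = None
    HAS_VLLM_MARLIN = False
```

A caller would then check `HAS_VLLM_MARLIN` before enabling the Marlin path, rather than failing with an `ImportError` at module load time.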
Any news?
Motivation
Update the Marlin MoE FP8 implementation to resolve #8990 (comment), allowing MoE FP8 models to run on sm8x hardware.
This PR has been verified on Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 with 1, 2, and 4 A40 GPUs, and on Qwen3-235B-A22B-Thinking-2507-FP8 and DeepSeek-V3.1-FP8 on 2*8 GPUs.
Modifications
Add Marlin MoE support in fp8.py and remove some early returns before the Marlin MoE path. Auto-enable Marlin for supported architectures.
Checklist