enable moe marlin fp8 for Ampere GPU#9754

Open
ehuaa wants to merge 6 commits into sgl-project:main from ehuaa:moe-marlin-fp8

Conversation

@ehuaa
Contributor

@ehuaa ehuaa commented Aug 28, 2025

Motivation

Update the Marlin MoE FP8 implementation to resolve #8990 (comment), allowing MoE FP8 models to run on sm8x hardware.
This PR has been verified on Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 with 1, 2, and 4 A40 GPUs, and on Qwen3-235B-A22B-Thinking-2507-FP8 and DeepSeek-V3.1-FP8 on 2×8 GPUs.

Modifications

Add Marlin MoE support in fp8.py and remove some early returns that previously bypassed the Marlin MoE path. Auto-enable Marlin for supported architectures.
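The auto-enable logic could look roughly like this (a minimal sketch; the function name and the `SGLANG_FORCE_FP8_MARLIN` environment variable below are illustrative assumptions, not the actual PR code):

```python
# Hypothetical sketch of auto-enabling Marlin FP8 based on compute capability.
import os

def should_use_marlin_fp8(capability: tuple) -> bool:
    """Auto-enable Marlin FP8 on supported archs (sm8x/sm9x/sm100)."""
    if os.environ.get("SGLANG_FORCE_FP8_MARLIN", "0") == "1":
        return True  # explicit override via env var (name is illustrative)
    major, minor = capability
    # Ampere (sm80/sm86) lacks native FP8 tensor cores, so Marlin's
    # weight-only FP8 GEMM is the fast path there.
    return (major, minor) >= (8, 0)
```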

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @ehuaa, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the FP8 quantization capabilities by enabling Marlin FP8 support for Mixture-of-Experts (MoE) models. The primary goal is to resolve a known issue and allow these highly efficient models to run on sm8x hardware. The changes involve integrating Marlin MoE specific utilities, implementing logic for automatic activation based on hardware support, and adapting the weight processing and inference paths to utilize the optimized Marlin kernels.

Highlights

  • Marlin FP8 MoE Integration: This PR integrates Marlin FP8 quantization support specifically for Mixture-of-Experts (MoE) layers, leveraging vLLM's fused_marlin_moe kernel for improved performance.
  • Automatic Marlin Activation: The system now automatically enables Marlin FP8 for supported CUDA architectures (like sm8x, sm9x, and sm100) or when explicitly forced via an environment variable, streamlining its adoption.
  • Optimized Weight Processing: Modifications to the weight processing pipeline ensure that MoE layers are correctly prepared for Marlin, including the necessary handling and removal of activation scales that are not utilized by the Marlin kernel.
  • Direct Marlin Kernel Integration: The core inference path now directly calls the fused_marlin_moe kernel, which is optimized for FP8 MoE operations, with a current constraint that only SiLU activation functions are supported.
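The weight-processing highlight above can be sketched as follows (a minimal illustration with stand-in names; `MoELayerStub`, the scale attribute names, and the helper function are assumptions, not the actual sglang code):

```python
# Hypothetical illustration: Marlin's FP8 MoE kernel is weight-only, so
# activation (input) scales loaded from the checkpoint are unused and can
# be dropped before repacking the weights.

class MoELayerStub:
    """Minimal stand-in for an FP8 MoE layer after weight loading."""
    def __init__(self):
        self.w13_weight_scale = [1.0]  # per-channel weight scales: kept
        self.w13_input_scale = [1.0]   # activation scales: unused by Marlin
        self.w2_input_scale = [1.0]

def drop_unused_activation_scales(layer) -> None:
    # Remove attributes the Marlin kernel never reads.
    for name in ("w13_input_scale", "w2_input_scale"):
        if hasattr(layer, name):
            delattr(layer, name)

layer = MoELayerStub()
drop_unused_activation_scales(layer)
```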

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables FP8 quantization for Mixture-of-Experts (MoE) layers using the Marlin kernel. The changes are concentrated in python/sglang/srt/layers/quantization/fp8.py, where the Fp8MoEMethod is updated to support Marlin. This includes adding logic to detect Marlin availability, preparing weights for the Marlin kernel, and implementing the forward pass using the fused_marlin_moe operator.

The implementation looks solid, but I've identified a potential runtime error in the apply method. The code assumes the topk_output from the router can always be unpacked into three tensors, which is not always the case. I've added a comment with a suggestion to make the code more robust by checking the type of topk_output before unpacking.

assert (
    moe_runner_config.activation == "silu"
), "Only SiLU activation is supported."

topk_weights, topk_ids, router_logits = topk_output
Contributor

Severity: high

The topk_output is assumed to be unpackable into three values (topk_weights, topk_ids, router_logits). However, topk_output is of type TopKOutput, which is a protocol. If the TopK router uses Triton kernels (use_triton_kernels=True), it will return a TritonKernelTopKOutput object, which is not a tuple and cannot be unpacked this way. This will lead to a TypeError at runtime.

You should add a check to ensure topk_output is of the expected type before unpacking.

from sglang.srt.layers.moe.topk import StandardTopKOutput

assert isinstance(
    topk_output, StandardTopKOutput
), f"Marlin moe requires StandardTopKOutput, but got {type(topk_output).__name__}"
topk_weights, topk_ids, router_logits = topk_output
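The failure mode the reviewer describes is easy to reproduce with plain Python types (the classes below are stand-ins for the sglang types, not the real implementations):

```python
# A NamedTuple unpacks like a tuple; an arbitrary object does not, raising
# TypeError at runtime -- exactly the hazard flagged in the review comment.
from typing import NamedTuple

class StandardTopKOutput(NamedTuple):  # tuple-like: unpacking works
    topk_weights: list
    topk_ids: list
    router_logits: list

class TritonKernelTopKOutput:  # not iterable: unpacking fails
    def __init__(self, data):
        self.data = data

topk_weights, topk_ids, router_logits = StandardTopKOutput([0.6], [3], [0.1])

unpack_failed = False
try:
    topk_weights, topk_ids, router_logits = TritonKernelTopKOutput([0.6])
except TypeError:
    unpack_failed = True  # "cannot unpack non-iterable ..." at runtime
```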

@ehuaa ehuaa changed the title enable moe marlin fp8 enable moe marlin fp8 for Ampere GPU Aug 29, 2025
@ehuaa
Contributor Author

ehuaa commented Aug 29, 2025

Hi @zhyncs, can you help review this PR? It enables Ampere users to deploy DeepSeek and Qwen3 MoE FP8 checkpoints with fewer GPUs and lower latency.

@PaulRoeseler

Hey @ehuaa thanks for working on it! Is this somewhat stable?

@hauck-jvsh

hauck-jvsh commented Sep 4, 2025

Great work, I tested it with Qwen3-30B using an NVIDIA A5000 and it worked gracefully.

@ehuaa
Contributor Author

ehuaa commented Sep 8, 2025

Hey @ehuaa thanks for working on it! Is this somewhat stable?

Hi @Anaudia, I have tested Qwen MoE on A40 and DeepSeek on A100. The CI test failed; I'll check it this Sunday when I finish my vacation. Thanks.

@ehuaa
Contributor Author

ehuaa commented Sep 17, 2025

Hey @ehuaa thanks for working on it! Is this somewhat stable?

Hi @Anaudia, I have fixed the CI bugs and it's stable now. @zhyncs, please retrigger the CI tests. Thanks.

@aidendle94

Any update on this? Would love this feature

@ehfd

ehfd commented Nov 9, 2025

Would also prefer this to be merged.

Closes #12887

@ehfd

ehfd commented Nov 14, 2025

@ehuaa
Are you able to resolve the conflicts? Would really like this feature.

@ehuaa
Contributor Author

ehuaa commented Nov 16, 2025

@ehuaa Are you able to resolve the conflicts? Would really like this feature.

@aidendle94 @ehfd Sure, I'll resolve the conflicts this week. In the current PR, after testing, the Qwen3 MoE FP8 version maintains the same accuracy as BF16, but DeepSeek-V3 FP8 shows an accuracy drop of approximately 10% compared to BF16. I will try to address this issue this week.

@ehfd

ehfd commented Nov 16, 2025

A few things:
  • vLLM already has the Marlin MoE FP8 implementation for Ampere, so it's worth comparing accuracy with it.
  • GLM-4.5V, GLM-4.6, and MiniMax M2 all have official FP8 quantizations. They should all be tested as well; note that the GLM models don't work well with tensor parallelism on A100 GPUs in vLLM.
  • Since DeepSeek is native FP8, it should be compared with vLLM, and also on Ada or Hopper.

prepare_fp8_layer_for_marlin,
prepare_moe_fp8_layer_for_marlin,
)
from vllm.scalar_type import scalar_types
Collaborator

It looks like this will continue to introduce vllm dependencies. A good solution would be to directly implement them in sglang.

Contributor Author

It looks like this will continue to introduce vllm dependencies. A good solution would be to directly implement them in sglang.

@FlamingoPg OK, I'll try to re-implement them for MoE FP8 models. And I noticed that in #13524 you have fixed the vLLM dependencies for dense models, right?

@FlamingoPg FlamingoPg self-assigned this Nov 18, 2025
@ehfd

ehfd commented Dec 30, 2025

Any news?
