enable moe marlin fp8 for Ampere GPU#9754

Open
ehuaa wants to merge 6 commits into sgl-project:main from ehuaa:moe-marlin-fp8

Conversation

@ehuaa
Contributor

@ehuaa ehuaa commented Aug 28, 2025

Motivation

Update the Marlin MoE FP8 implementation to resolve #8990 (comment), allowing MoE FP8 models to run on sm8x hardware.
This PR has been verified on Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 with 1, 2, and 4 A40 GPUs, and on Qwen3-235B-A22B-Thinking-2507-FP8 and DeepSeek-V3.1-FP8 on 2×8 GPUs.

Modifications

Add Marlin MoE support in fp8.py and remove some early returns that previously bypassed the Marlin MoE path. Auto-enable Marlin for supported architectures.
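The auto-enable logic could look roughly like this (a minimal sketch; the function name and the `SGLANG_FORCE_FP8_MARLIN` environment variable below are illustrative assumptions, not the actual PR code):

```python
# Hypothetical sketch of auto-enabling Marlin FP8 based on compute capability.
import os

def should_use_marlin_fp8(capability: tuple) -> bool:
    """Auto-enable Marlin FP8 on supported archs (sm8x/sm9x/sm100)."""
    if os.environ.get("SGLANG_FORCE_FP8_MARLIN", "0") == "1":
        return True  # explicit override via env var (name is illustrative)
    major, minor = capability
    # Ampere (sm80/sm86) lacks native FP8 tensor cores, so Marlin's
    # weight-only FP8 GEMM is the fast path there.
    return (major, minor) >= (8, 0)
```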

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @ehuaa, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the FP8 quantization capabilities by enabling Marlin FP8 support for Mixture-of-Experts (MoE) models. The primary goal is to resolve a known issue and allow these highly efficient models to run on sm8x hardware. The changes involve integrating Marlin MoE specific utilities, implementing logic for automatic activation based on hardware support, and adapting the weight processing and inference paths to utilize the optimized Marlin kernels.

Highlights

  • Marlin FP8 MoE Integration: This PR integrates Marlin FP8 quantization support specifically for Mixture-of-Experts (MoE) layers, leveraging vLLM's fused_marlin_moe kernel for improved performance.
  • Automatic Marlin Activation: The system now automatically enables Marlin FP8 for supported CUDA architectures (like sm8x, sm9x, and sm100) or when explicitly forced via an environment variable, streamlining its adoption.
  • Optimized Weight Processing: Modifications to the weight processing pipeline ensure that MoE layers are correctly prepared for Marlin, including the necessary handling and removal of activation scales that are not utilized by the Marlin kernel.
  • Direct Marlin Kernel Integration: The core inference path now directly calls the fused_marlin_moe kernel, which is optimized for FP8 MoE operations, with a current constraint that only SiLU activation functions are supported.
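The weight-processing highlight above can be sketched as follows (a minimal illustration with stand-in names; `MoELayerStub`, the scale attribute names, and the helper function are assumptions, not the actual sglang code):

```python
# Hypothetical illustration: Marlin's FP8 MoE kernel is weight-only, so
# activation (input) scales loaded from the checkpoint are unused and can
# be dropped before repacking the weights.

class MoELayerStub:
    """Minimal stand-in for an FP8 MoE layer after weight loading."""
    def __init__(self):
        self.w13_weight_scale = [1.0]  # per-channel weight scales: kept
        self.w13_input_scale = [1.0]   # activation scales: unused by Marlin
        self.w2_input_scale = [1.0]

def drop_unused_activation_scales(layer) -> None:
    # Remove attributes the Marlin kernel never reads.
    for name in ("w13_input_scale", "w2_input_scale"):
        if hasattr(layer, name):
            delattr(layer, name)

layer = MoELayerStub()
drop_unused_activation_scales(layer)
```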

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables FP8 quantization for Mixture-of-Experts (MoE) layers using the Marlin kernel. The changes are concentrated in python/sglang/srt/layers/quantization/fp8.py, where the Fp8MoEMethod is updated to support Marlin. This includes adding logic to detect Marlin availability, preparing weights for the Marlin kernel, and implementing the forward pass using the fused_marlin_moe operator.

The implementation looks solid, but I've identified a potential runtime error in the apply method. The code assumes the topk_output from the router can always be unpacked into three tensors, which is not always the case. I've added a comment with a suggestion to make the code more robust by checking the type of topk_output before unpacking.

assert (
    moe_runner_config.activation == "silu"
), "Only SiLU activation is supported."

topk_weights, topk_ids, router_logits = topk_output
Contributor

Severity: high

The topk_output is assumed to be unpackable into three values (topk_weights, topk_ids, router_logits). However, topk_output is of type TopKOutput, which is a protocol. If the TopK router uses Triton kernels (use_triton_kernels=True), it will return a TritonKernelTopKOutput object, which is not a tuple and cannot be unpacked this way. This will lead to a TypeError at runtime.

You should add a check to ensure topk_output is of the expected type before unpacking.

from sglang.srt.layers.moe.topk import StandardTopKOutput

assert isinstance(
    topk_output, StandardTopKOutput
), f"Marlin moe requires StandardTopKOutput, but got {type(topk_output).__name__}"
topk_weights, topk_ids, router_logits = topk_output
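The failure mode the reviewer describes is easy to reproduce with plain Python types (the classes below are stand-ins for the sglang types, not the real implementations):

```python
# A NamedTuple unpacks like a tuple; an arbitrary object does not, raising
# TypeError at runtime -- exactly the hazard flagged in the review comment.
from typing import NamedTuple

class StandardTopKOutput(NamedTuple):  # tuple-like: unpacking works
    topk_weights: list
    topk_ids: list
    router_logits: list

class TritonKernelTopKOutput:  # not iterable: unpacking fails
    def __init__(self, data):
        self.data = data

topk_weights, topk_ids, router_logits = StandardTopKOutput([0.6], [3], [0.1])

unpack_failed = False
try:
    topk_weights, topk_ids, router_logits = TritonKernelTopKOutput([0.6])
except TypeError:
    unpack_failed = True  # "cannot unpack non-iterable ..." at runtime
```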

@ehuaa ehuaa changed the title enable moe marlin fp8 enable moe marlin fp8 for Ampere GPU Aug 29, 2025
@ehuaa
Contributor Author

ehuaa commented Aug 29, 2025

Hi @zhyncs, can you help review this PR? It enables Ampere users to deploy DeepSeek and Qwen3 MoE FP8 checkpoints with fewer GPUs and lower latency.

@PaulRoeseler

Hey @ehuaa thanks for working on it! Is this somewhat stable?

@hauck-jvsh

hauck-jvsh commented Sep 4, 2025

Great work, I tested it with Qwen3-30B using an NVIDIA A5000 and it worked gracefully.

@ehuaa
Contributor Author

ehuaa commented Sep 8, 2025

Hey @ehuaa thanks for working on it! Is this somewhat stable?

Hi @Anaudia, I have tested Qwen MoE on A40 and DeepSeek on A100. The CI test failed; I'll check it this Sunday when I finish my vacation. Thanks.

@ehuaa
Contributor Author

ehuaa commented Sep 17, 2025

Hey @ehuaa thanks for working on it! Is this somewhat stable?

Hi @Anaudia, I have fixed the CI bugs and it's stable now. @zhyncs, please retrigger the CI tests. Thanks.

@aidendle94

Any update on this? Would love this feature

@ehfd

ehfd commented Nov 9, 2025

Would also prefer this to be merged.

Closes #12887

@ehfd

ehfd commented Nov 14, 2025

@ehuaa
Are you able to resolve the conflicts? Would really like this feature.

@ehuaa
Contributor Author

ehuaa commented Nov 16, 2025

@ehuaa Are you able to resolve the conflicts? Would really like this feature.

@aidendle94 @ehfd Sure, I'll resolve the conflicts this week. In the current PR, after testing, the Qwen3 MoE FP8 version maintains the same accuracy as BF16, but DeepSeek-V3 FP8 shows an accuracy drop of approximately 10% compared to BF16. I will try to address this issue this week.

@ehfd

ehfd commented Nov 16, 2025

A few things:
  • vLLM already has the Marlin MoE FP8 implementation for Ampere, so it's worth comparing accuracy with it.
  • GLM-4.5V, GLM-4.6, and MiniMax M2 all have official FP8 quantizations. They should all be tested as well; note that the GLM models don't work well with tensor parallelism on A100 GPUs in vLLM.
  • Since DeepSeek is native FP8, it should be compared with vLLM, and also on Ada or Hopper.

prepare_fp8_layer_for_marlin,
prepare_moe_fp8_layer_for_marlin,
)
from vllm.scalar_type import scalar_types
Collaborator

It looks like this will continue to introduce vllm dependencies. A good solution would be to directly implement them in sglang.

Contributor Author

It looks like this will continue to introduce vllm dependencies. A good solution would be to directly implement them in sglang.

@FlamingoPg OK, I'll try to re-implement them for MoE FP8 models. And I noticed that in #13524 you have fixed the vLLM dependencies for dense models, right?

@FlamingoPg FlamingoPg self-assigned this Nov 18, 2025
@ehfd

ehfd commented Dec 30, 2025

Any news?
