Refactor Marlin MoeRunner #14554
Merged by ch-wan: 15 commits from trangdough:refactor-marlin-moe-path into sgl-project:main on Dec 11, 2025.
Commits (15):
- 5199746 add marlin.py (trangdough)
- 38808f4 register MarlinMoeRunner runner.py utils.py, refactor GPTQMarlinMoEMe… (trangdough)
- a49e3b7 modified marlin.py, integrated new MoeRunner into awq.py and gptq.py (trangdough)
- 8da2ece added float16 awq test to test_awq.py (trangdough)
- 69c5392 fixed lint issues (trangdough)
- b1a408a Update python/sglang/srt/layers/quantization/gptq.py (trangdough)
- 90f2012 added assertion get_moe_runner_backend().is_auto() (trangdough)
- e97026f move workspace from MoeQuantInfo to MarlinRunnerCore (trangdough)
- c296ddb cleaned up awq.py (trangdough)
- cf18fef fixed marlin_make_workspace circular import error (trangdough)
- d76eae8 registered fused func (trangdough)
- 27ac1d3 Merge branch 'main' into refactor-marlin-moe-path (ch-wan)
- bb6b2ca removed dead code, changed MARLIN_MOE_WORKSPACE to global buffer (trangdough)
- ac8336b removed MarlinRunnerCore (trangdough)
- 0f24355 Merge branch 'main' into refactor-marlin-moe-path (ch-wan)
marlin.py (new file, +125 lines):

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING, Optional

import torch

from sglang.srt.layers.moe.moe_runner.base import (
    MoeQuantInfo,
    MoeRunnerConfig,
    RunnerInput,
    RunnerOutput,
    register_fused_func,
)
from sglang.srt.layers.moe.utils import MoeRunnerBackend

if TYPE_CHECKING:
    from sglang.srt.layers.moe.token_dispatcher import (
        StandardCombineInput,
        StandardDispatchOutput,
    )

MARLIN_MOE_WORKSPACE: Optional[torch.Tensor] = None


@dataclass
class MarlinRunnerInput(RunnerInput):
    """Input bundle passed to the Marlin runner core."""

    hidden_states: torch.Tensor
    topk_weights: torch.Tensor
    topk_ids: torch.Tensor
    router_logits: torch.Tensor

    @property
    def runner_backend(self) -> MoeRunnerBackend:
        return MoeRunnerBackend.MARLIN


@dataclass
class MarlinRunnerOutput(RunnerOutput):
    """Output bundle returned from the Marlin runner core."""

    hidden_states: torch.Tensor

    @property
    def runner_backend(self) -> MoeRunnerBackend:
        return MoeRunnerBackend.MARLIN


@dataclass
class MarlinMoeQuantInfo(MoeQuantInfo):
    """Quantization payload consumed by the Marlin backend."""

    w13_qweight: torch.Tensor
    w2_qweight: torch.Tensor
    w13_scales: torch.Tensor
    w2_scales: torch.Tensor
    w13_g_idx_sort_indices: Optional[torch.Tensor]
    w2_g_idx_sort_indices: Optional[torch.Tensor]
    weight_bits: int

    # GPTQ specific (Optional)
    w13_g_idx: Optional[torch.Tensor] = None
    w2_g_idx: Optional[torch.Tensor] = None
    is_k_full: bool = True

    # AWQ specific (Optional)
    w13_qzeros: Optional[torch.Tensor] = None
    w2_qzeros: Optional[torch.Tensor] = None

    # Optional
    expert_map: Optional[torch.Tensor] = None


@register_fused_func("none", "marlin")
def fused_experts_none_to_marlin(
    dispatch_output: StandardDispatchOutput,
    quant_info: MarlinMoeQuantInfo,
    runner_config: MoeRunnerConfig,
) -> StandardCombineInput:
    global MARLIN_MOE_WORKSPACE
    from sglang.srt.layers.moe.fused_moe_triton.fused_marlin_moe import fused_marlin_moe
    from sglang.srt.layers.moe.token_dispatcher.standard import StandardCombineInput
    from sglang.srt.layers.quantization.marlin_utils import marlin_make_workspace

    hidden_states = dispatch_output.hidden_states
    topk_output = dispatch_output.topk_output

    assert runner_config.activation == "silu", "Only SiLU activation is supported."

    if (
        MARLIN_MOE_WORKSPACE is None
        or MARLIN_MOE_WORKSPACE.device != hidden_states.device
    ):
        MARLIN_MOE_WORKSPACE = marlin_make_workspace(
            hidden_states.device, max_blocks_per_sm=4
        )

    output = fused_marlin_moe(
        hidden_states=hidden_states,
        w1=quant_info.w13_qweight,
        w2=quant_info.w2_qweight,
        w1_scale=quant_info.w13_scales,
        w2_scale=quant_info.w2_scales,
        gating_output=topk_output.router_logits,
        topk_weights=topk_output.topk_weights,
        topk_ids=topk_output.topk_ids,
        expert_map=quant_info.expert_map,
        g_idx1=quant_info.w13_g_idx,
        g_idx2=quant_info.w2_g_idx,
        sort_indices1=quant_info.w13_g_idx_sort_indices,
        sort_indices2=quant_info.w2_g_idx_sort_indices,
        w1_zeros=quant_info.w13_qzeros,
        w2_zeros=quant_info.w2_qzeros,
        workspace=MARLIN_MOE_WORKSPACE,
        num_bits=quant_info.weight_bits,
        is_k_full=quant_info.is_k_full,
        inplace=runner_config.inplace,
        routed_scaling_factor=runner_config.routed_scaling_factor,
    ).to(hidden_states.dtype)

    return StandardCombineInput(
        hidden_states=output,
    )
```
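The `@register_fused_func("none", "marlin")` decorator registers the fused-experts kernel under a (dispatch format, backend) key so the runner can look it up at dispatch time. A minimal, self-contained sketch of how such a registry can work, assuming a plain dict keyed by string pairs (the actual sglang `register_fused_func` implementation may differ; the placeholder body is purely illustrative):

```python
from typing import Callable, Dict, Tuple

# Registry mapping (dispatch_format, backend) pairs to fused-experts functions.
_FUSED_FUNCS: Dict[Tuple[str, str], Callable] = {}


def register_fused_func(dispatch_name: str, backend_name: str) -> Callable:
    """Decorator factory that records a function under its (dispatch, backend) key."""

    def decorator(func: Callable) -> Callable:
        _FUSED_FUNCS[(dispatch_name, backend_name)] = func
        return func  # return unchanged so the module-level name still works

    return decorator


@register_fused_func("none", "marlin")
def fused_experts_none_to_marlin(x):
    return x * 2  # placeholder body, not the real kernel


def get_fused_func(dispatch_name: str, backend_name: str):
    """Look up a registered function; None if no match is registered."""
    return _FUSED_FUNCS.get((dispatch_name, backend_name))
```

Registration happens as a side effect of importing the module, which is why backend modules like this one only need to be imported for their kernels to become dispatchable.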