
[4/N] Quantization Refactor: Quark MoE schemes#18252

Merged
sglang-npu-bot merged 12 commits into
sgl-project:main from
TamirBaydasov:quark_moe_schemes
Feb 18, 2026

Conversation

@TamirBaydasov
Contributor

Motivation

Add MoE schemes to Quark instead of storing all classes in a single file. Follow-up to #17503.

Images and motivation for this PR can be viewed in our roadmap: #15194

Modifications

  • Moved all classes from quark_moe.py to new schemes in quantization/quark/schemes/
  • Removed the quark_moe.py file
  • Added a get_moe_scheme function to quark.py
  • Changed the MoriEPMoE class to reflect the scheme refactor
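The shape of the new dispatch can be sketched roughly as follows. This is a minimal illustration only: the class names QuarkW4A4MXFp4MoE, QuarkW8A8FP8MoE, and get_moe_scheme come from this PR, but the dtype strings and config layout below are assumptions, not the actual quark.py implementation.

```python
from typing import Optional


class QuarkScheme:
    """Minimal stand-in for the QuarkScheme interface."""


class QuarkW4A4MXFp4MoE(QuarkScheme):
    """Scheme for MX FP4 weights and activations (sketch)."""


class QuarkW8A8FP8MoE(QuarkScheme):
    """Scheme for FP8 weights and activations (sketch)."""


def get_moe_scheme(layer_quant_config: dict) -> Optional[QuarkScheme]:
    # Dispatch on the (weight, activation) dtype pair found in the layer's
    # quant config; the dtype strings here are illustrative guesses.
    weight = layer_quant_config.get("weight") or {}
    inputs = layer_quant_config.get("input_tensors") or {}
    if weight.get("dtype") == "mx_fp4" and inputs.get("dtype") == "mx_fp4":
        return QuarkW4A4MXFp4MoE()
    if weight.get("dtype") == "fp8_e4m3" and inputs.get("dtype") == "fp8_e4m3":
        return QuarkW8A8FP8MoE()
    return None
```

The point of the refactor is that each branch now returns a scheme object living in its own file, rather than a method class defined in one monolithic quark_moe.py.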

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello @TamirBaydasov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the quantization implementation for Mixture-of-Experts (MoE) layers within the Quark framework. The primary goal is to enhance code organization and maintainability by separating different MoE quantization schemes into distinct, dedicated files. This change centralizes the scheme selection logic and streamlines how quantized MoE layers are initialized and processed, laying a clearer foundation for future quantization method expansions.

Highlights

  • MoE Scheme Refactoring: The core logic for Mixture-of-Experts (MoE) quantization schemes has been refactored into dedicated files, moving classes from a single quark_moe.py file to a new quantization/quark/schemes/ directory. This improves modularity and organization.
  • New MoE Scheme Classes: Two new classes, QuarkW4A4MXFp4MoE and QuarkW8A8FP8MoE, have been introduced to represent specific MoE quantization schemes. These classes encapsulate the logic for creating weights, processing them after loading, and applying the forward pass for their respective quantization types.
  • Centralized MoE Scheme Selection: A new get_moe_scheme function has been added to quark.py. This function is responsible for identifying the appropriate MoE quantization scheme (e.g., W4A4MXFp4MoE or W8A8FP8MoE) based on the layer's configuration and returning an instance of the corresponding scheme class.
  • Updated MoE Layer Integration: The MoriEPMoE class has been updated to integrate with the new scheme-based approach. It now uses the QuarkW4A4MXFp4MoE class directly for type checking and scheme management, rather than the older QuarkW4A4MXFp4MoEMethod.
  • Introduction of QuarkFusedMoEMethod: A new QuarkFusedMoEMethod class has been added, which acts as a wrapper to manage the lifecycle of MoE schemes, including weight creation, processing, and application, leveraging the newly defined QuarkScheme instances.
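The delegation pattern described above can be sketched as follows. Only the class name QuarkFusedMoEMethod and the apply_weights rename come from this PR; the method signatures below are simplified assumptions, not the real FusedMoEMethodBase interface.

```python
class QuarkFusedMoEMethod:
    """Sketch of the wrapper: every lifecycle call is forwarded to the
    QuarkScheme instance it was constructed with (signatures simplified)."""

    def __init__(self, scheme):
        self.scheme = scheme

    def create_weights(self, layer, **kwargs):
        return self.scheme.create_weights(layer, **kwargs)

    def process_weights_after_loading(self, layer):
        return self.scheme.process_weights_after_loading(layer)

    def apply(self, layer, x):
        # The scheme-level entry point is apply_weights, per the rename
        # noted in this PR's changelog.
        return self.scheme.apply_weights(layer, x)
```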


Changelog
  • python/sglang/srt/layers/moe/ep_moe/layer.py
    • Updated import path for Quark MoE schemes from quark_moe to schemes.
    • Modified isinstance check to use the new QuarkW4A4MXFp4MoE class instead of QuarkW4A4MXFp4MoEMethod.
  • python/sglang/srt/layers/quantization/quark/quark.py
    • Added imports for TYPE_CHECKING, MoeRunnerConfig, FusedMoEMethodBase, QuarkW4A4MXFp4MoE, and QuarkW8A8FP8MoE.
    • Removed import for QuarkMoEMethod.
    • Added QuarkFusedMoEMethod to the __all__ export list.
    • Modified get_quant_method to assign the determined MoE scheme to the layer and return a QuarkFusedMoEMethod instance.
    • Introduced a new get_moe_scheme method to dynamically select and instantiate the correct MoE quantization scheme based on configuration.
    • Updated docstrings for create_weights and apply methods to refer to QuarkScheme instead of CompressedTensorsScheme.
    • Added the QuarkFusedMoEMethod class, which implements FusedMoEMethodBase and delegates operations to the specific QuarkScheme assigned to a layer.
  • python/sglang/srt/layers/quantization/quark/schemes/__init__.py
    • Imported and added QuarkW4A4MXFp4MoE and QuarkW8A8FP8MoE to the __all__ list, making them discoverable.
  • python/sglang/srt/layers/quantization/quark/schemes/quark_w4a4_mxfp4_moe.py
    • New file added, defining the QuarkW4A4MXFp4MoE class.
    • This class implements the QuarkScheme interface for MX(FP4) Fused MoE layers, including methods for weight creation, post-loading processing, MoE runner creation, and applying weights.
  • python/sglang/srt/layers/quantization/quark/schemes/quark_w8a8_fp8_moe.py
    • File renamed from python/sglang/srt/layers/quantization/quark/quark_moe.py.
    • The QuarkMoEMethod and QuarkW4A4MXFp4MoEMethod classes were removed from this file.
    • The QuarkW8A8FP8MoEMethod class was renamed to QuarkW8A8FP8MoE and now inherits from QuarkScheme.
    • The apply method within QuarkW8A8FP8MoE was renamed to apply_weights to align with the QuarkScheme interface.
Activity
  • This pull request is part of a series of quantization refactoring efforts, indicated by the title '[4/N]'.
  • It is a follow-up to a previous pull request ([2/N] Quantization Refactor: Compressed tensors MoE schemes #17503), suggesting an ongoing architectural improvement.
  • The provided checklist items (e.g., format code, add unit tests, update documentation, provide benchmarks) are currently unchecked, indicating that these tasks are either pending or not yet completed for this PR.
  • No human review comments or specific progress updates are available in the provided context.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request is a well-executed refactoring that improves the structure of Quark's MoE quantization schemes. Moving the schemes into their own files under a dedicated directory and using a common QuarkScheme interface is a good step towards better modularity and maintainability.

My review has identified a few areas for improvement:

  • There is some code duplication in quark.py that could be refactored.
  • A docstring in QuarkFusedMoEMethod is incorrect and should be updated.
  • A more significant design issue is that the new MoE schemes do not correctly implement the QuarkScheme abstract base class interface, which could lead to maintenance issues. I've suggested creating a separate base class for MoE schemes.

Addressing these points will further enhance the quality of this refactoring.

Comment thread python/sglang/srt/layers/quantization/quark/schemes/quark_w4a4_mxfp4_moe.py Outdated
Comment thread python/sglang/srt/layers/quantization/quark/schemes/quark_w8a8_fp8_moe.py Outdated
Comment on lines +362 to +369
if layer_quant_config.get("output_tensors") or layer_quant_config.get("bias"):
    raise NotImplementedError(
        "Currently, Quark models with "
        "output_tensors and bias "
        "quantized are not supported"
    )
weight_config = layer_quant_config.get("weight")
input_config = layer_quant_config.get("input_tensors")
Contributor


Severity: medium

There's some code duplication between this new get_moe_scheme method and the existing _get_scheme_from_config method (lines 319-325). Specifically, the logic for checking output_tensors and bias, and for getting weight_config and input_config is repeated. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, consider extracting this common logic into a helper method.

Comment thread python/sglang/srt/layers/quantization/quark/quark.py Outdated
@ping1jing2 self-assigned this Feb 4, 2026
@HandH1998
Collaborator

/rerun-failed-ci

@ping1jing2
Collaborator

/tag-and-rerun-ci

@ping1jing2
Collaborator

I merged it since AniZpZ and HandH1998 reviewed and all CIs passed except three tests. Here are the initial analysis results:

  • For stage-b-test-small-1-gpu-amd (linux-mi325-gpu-1, 11), the first error is OSError: libavutil.so.58: cannot open shared object file: No such file or directory, which obviously has nothing to do with our PR.
  • For stage-b-test-small-1-gpu-amd (linux-mi325-gpu-1, 13), the first error is AssertionError: ROUGE-L score 0.9773755656108598 below tolerance 1.0 for base 'meta-llama/Llama-2-7b-hf'. I believe this has nothing to do with our PR, as I found many instances of the same error in AMD-related CI runners.
  • For stage-c-test-4-gpu-h100 (0), I have confirmed with Kangyan-Zhou that this issue is a known regression.

Please let me know if there are any issues.

@sglang-npu-bot merged commit 150ed88 into sgl-project:main Feb 18, 2026
267 of 284 checks passed
@HaiShaw
Collaborator

HaiShaw commented Feb 20, 2026

@kkHuang-amd @BowenBao Please have a review (from AMD/Quark).

@HaiShaw
Collaborator

HaiShaw commented Feb 20, 2026

Please tag AMDiers to review Quark (code owner).

@ping1jing2
Collaborator

> Please tag AMDiers to review Quark (code owner).

Hi, could you please send me the list of AMDiers?

@@ -600,7 +600,7 @@ def forward(
output_dtype = hidden_states.dtype
scale = None
is_fp8_quant = isinstance(self.quant_method, Fp8MoEMethod)
Collaborator


Why does FP8 go through self.quant_method while quark_w4a4 goes through self.scheme? What's the difference between quant_method and scheme?

Contributor Author


As per our quantization roadmap, we have made a commitment to move existing MoE methods into the scheme format. The scheme structure allows for an easier implementation and review process.

Practically, the difference is that for MoE layers in Quark the quant_method is now the QuarkFusedMoEMethod class, which calls a specific scheme under the hood. We add an abstraction layer here (and call it a scheme) to improve the code structure and readability.

Motivation for this change came from one of the TODOs in the compressed-tensors structure. The problem is that with more and more MoE quantization types being supported, the files where all the classes are stored (compressed_tensors_moe.py, modelslim_moe.py, quark_moe.py) get excessively large, at 2k+ lines. This PR is one of three intended to fix this problem.

Just like for compressed-tensors, scheme logic was already implemented for Quark linear layers, so I decided to refactor the MoE layer logic the same way.

Fp8MoEMethod has not yet been refactored, because we have not yet gotten to refactoring other quantization formats that do not have a built-in scheme structure.

magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026