
[4/N] Quantization Refactor: Quark MoE schemes#18252

Merged
sglang-npu-bot merged 12 commits into
sgl-project:main from
TamirBaydasov:quark_moe_schemes
Feb 18, 2026

Conversation

@TamirBaydasov
Contributor

Motivation

Add MoE schemes to Quark instead of storing all classes in a single file. Follow-up to #17503.

Images and motivation for this PR can be viewed in our roadmap: #15194

Modifications

  • Moved all classes from quark_moe.py to new schemes in quantization/quark/schemes/
  • Removed the quark_moe.py file
  • Added a get_moe_scheme function to quark.py
  • Changed the MoriEPMoE class to reflect the scheme refactor
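The shape of the new dispatch can be sketched roughly as follows. This is a minimal illustration only: the class names QuarkW4A4MXFp4MoE, QuarkW8A8FP8MoE, and get_moe_scheme come from this PR, but the dtype strings and config layout below are assumptions, not the actual quark.py implementation.

```python
from typing import Optional


class QuarkScheme:
    """Minimal stand-in for the QuarkScheme interface."""


class QuarkW4A4MXFp4MoE(QuarkScheme):
    """Scheme for MX FP4 weights and activations (sketch)."""


class QuarkW8A8FP8MoE(QuarkScheme):
    """Scheme for FP8 weights and activations (sketch)."""


def get_moe_scheme(layer_quant_config: dict) -> Optional[QuarkScheme]:
    # Dispatch on the (weight, activation) dtype pair found in the layer's
    # quant config; the dtype strings here are illustrative guesses.
    weight = layer_quant_config.get("weight") or {}
    inputs = layer_quant_config.get("input_tensors") or {}
    if weight.get("dtype") == "mx_fp4" and inputs.get("dtype") == "mx_fp4":
        return QuarkW4A4MXFp4MoE()
    if weight.get("dtype") == "fp8_e4m3" and inputs.get("dtype") == "fp8_e4m3":
        return QuarkW8A8FP8MoE()
    return None
```

The point of the refactor is that each branch now returns a scheme object living in its own file, rather than a method class defined in one monolithic quark_moe.py.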

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello @TamirBaydasov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the quantization implementation for Mixture-of-Experts (MoE) layers within the Quark framework. The primary goal is to enhance code organization and maintainability by separating different MoE quantization schemes into distinct, dedicated files. This change centralizes the scheme selection logic and streamlines how quantized MoE layers are initialized and processed, laying a clearer foundation for future quantization method expansions.

Highlights

  • MoE Scheme Refactoring: The core logic for Mixture-of-Experts (MoE) quantization schemes has been refactored into dedicated files, moving classes from a single quark_moe.py file to a new quantization/quark/schemes/ directory. This improves modularity and organization.
  • New MoE Scheme Classes: Two new classes, QuarkW4A4MXFp4MoE and QuarkW8A8FP8MoE, have been introduced to represent specific MoE quantization schemes. These classes encapsulate the logic for creating weights, processing them after loading, and applying the forward pass for their respective quantization types.
  • Centralized MoE Scheme Selection: A new get_moe_scheme function has been added to quark.py. This function is responsible for identifying the appropriate MoE quantization scheme (e.g., W4A4MXFp4MoE or W8A8FP8MoE) based on the layer's configuration and returning an instance of the corresponding scheme class.
  • Updated MoE Layer Integration: The MoriEPMoE class has been updated to integrate with the new scheme-based approach. It now uses the QuarkW4A4MXFp4MoE class directly for type checking and scheme management, rather than the older QuarkW4A4MXFp4MoEMethod.
  • Introduction of QuarkFusedMoEMethod: A new QuarkFusedMoEMethod class has been added, which acts as a wrapper to manage the lifecycle of MoE schemes, including weight creation, processing, and application, leveraging the newly defined QuarkScheme instances.
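The delegation pattern described above can be sketched as follows. Only the class name QuarkFusedMoEMethod and the apply_weights rename come from this PR; the method signatures below are simplified assumptions, not the real FusedMoEMethodBase interface.

```python
class QuarkFusedMoEMethod:
    """Sketch of the wrapper: every lifecycle call is forwarded to the
    QuarkScheme instance it was constructed with (signatures simplified)."""

    def __init__(self, scheme):
        self.scheme = scheme

    def create_weights(self, layer, **kwargs):
        return self.scheme.create_weights(layer, **kwargs)

    def process_weights_after_loading(self, layer):
        return self.scheme.process_weights_after_loading(layer)

    def apply(self, layer, x):
        # The scheme-level entry point is apply_weights, per the rename
        # noted in this PR's changelog.
        return self.scheme.apply_weights(layer, x)
```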


Changelog
  • python/sglang/srt/layers/moe/ep_moe/layer.py
    • Updated import path for Quark MoE schemes from quark_moe to schemes.
    • Modified isinstance check to use the new QuarkW4A4MXFp4MoE class instead of QuarkW4A4MXFp4MoEMethod.
  • python/sglang/srt/layers/quantization/quark/quark.py
    • Added imports for TYPE_CHECKING, MoeRunnerConfig, FusedMoEMethodBase, QuarkW4A4MXFp4MoE, and QuarkW8A8FP8MoE.
    • Removed import for QuarkMoEMethod.
    • Added QuarkFusedMoEMethod to the __all__ export list.
    • Modified get_quant_method to assign the determined MoE scheme to the layer and return a QuarkFusedMoEMethod instance.
    • Introduced a new get_moe_scheme method to dynamically select and instantiate the correct MoE quantization scheme based on configuration.
    • Updated docstrings for create_weights and apply methods to refer to QuarkScheme instead of CompressedTensorsScheme.
    • Added the QuarkFusedMoEMethod class, which implements FusedMoEMethodBase and delegates operations to the specific QuarkScheme assigned to a layer.
  • python/sglang/srt/layers/quantization/quark/schemes/__init__.py
    • Imported and added QuarkW4A4MXFp4MoE and QuarkW8A8FP8MoE to the __all__ list, making them discoverable.
  • python/sglang/srt/layers/quantization/quark/schemes/quark_w4a4_mxfp4_moe.py
    • New file added, defining the QuarkW4A4MXFp4MoE class.
    • This class implements the QuarkScheme interface for MX(FP4) Fused MoE layers, including methods for weight creation, post-loading processing, MoE runner creation, and applying weights.
  • python/sglang/srt/layers/quantization/quark/schemes/quark_w8a8_fp8_moe.py
    • File renamed from python/sglang/srt/layers/quantization/quark/quark_moe.py.
    • The QuarkMoEMethod and QuarkW4A4MXFp4MoEMethod classes were removed from this file.
    • The QuarkW8A8FP8MoEMethod class was renamed to QuarkW8A8FP8MoE and now inherits from QuarkScheme.
    • The apply method within QuarkW8A8FP8MoE was renamed to apply_weights to align with the QuarkScheme interface.
Activity
  • This pull request is part of a series of quantization refactoring efforts, indicated by the title '[4/N]'.
  • It is a follow-up to a previous pull request ([2/N] Quantization Refactor: Compressed tensors MoE schemes #17503), suggesting an ongoing architectural improvement.
  • The provided checklist items (e.g., format code, add unit tests, update documentation, provide benchmarks) are currently unchecked, indicating that these tasks are either pending or not yet completed for this PR.
  • No human review comments or specific progress updates are available in the provided context.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request is a well-executed refactoring that improves the structure of Quark's MoE quantization schemes. Moving the schemes into their own files under a dedicated directory and using a common QuarkScheme interface is a good step towards better modularity and maintainability.

My review has identified a few areas for improvement:

  • There is some code duplication in quark.py that could be refactored.
  • A docstring in QuarkFusedMoEMethod is incorrect and should be updated.
  • A more significant design issue is that the new MoE schemes do not correctly implement the QuarkScheme abstract base class interface, which could lead to maintenance issues. I've suggested creating a separate base class for MoE schemes.

Addressing these points will further enhance the quality of this refactoring.

Comment thread python/sglang/srt/layers/quantization/quark/schemes/quark_w4a4_mxfp4_moe.py Outdated
Comment thread python/sglang/srt/layers/quantization/quark/schemes/quark_w8a8_fp8_moe.py Outdated
Comment on lines +362 to +369
if layer_quant_config.get("output_tensors") or layer_quant_config.get("bias"):
    raise NotImplementedError(
        "Currently, Quark models with "
        "output_tensors and bias "
        "quantized are not supported"
    )
weight_config = layer_quant_config.get("weight")
input_config = layer_quant_config.get("input_tensors")
Contributor


Severity: medium

There's some code duplication between this new get_moe_scheme method and the existing _get_scheme_from_config method (lines 319-325). Specifically, the logic for checking output_tensors and bias, and for getting weight_config and input_config is repeated. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, consider extracting this common logic into a helper method.

Comment thread python/sglang/srt/layers/quantization/quark/quark.py Outdated
@ping1jing2 self-assigned this Feb 4, 2026
@HandH1998
Collaborator

/rerun-failed-ci

@ping1jing2
Collaborator

/tag-and-rerun-ci

@ping1jing2
Collaborator

I merged it since AniZpZ and HandH1998 reviewed and all CIs passed except three tests. Here are the initial analysis results:

  • For stage-b-test-small-1-gpu-amd (linux-mi325-gpu-1, 11), the first error is OSError: libavutil.so.58: cannot open shared object file: No such file or directory, which obviously has nothing to do with our PR.
  • For stage-b-test-small-1-gpu-amd (linux-mi325-gpu-1, 13), the first error is AssertionError: ROUGE-L score 0.9773755656108598 below tolerance 1.0 for base 'meta-llama/Llama-2-7b-hf'. I believe this has nothing to do with our PR, as I found many instances of the same error in AMD-related CI runners.
  • For stage-c-test-4-gpu-h100 (0), I have confirmed with Kangyan-Zhou that this issue is a known regression.

Please let me know if there are any issues.

@sglang-npu-bot merged commit 150ed88 into sgl-project:main Feb 18, 2026
267 of 284 checks passed
@HaiShaw
Collaborator

HaiShaw commented Feb 20, 2026

@kkHuang-amd @BowenBao Please have a review (from AMD/Quark).

@HaiShaw
Collaborator

HaiShaw commented Feb 20, 2026

Please tag AMDiers to review Quark (code owner).

@ping1jing2
Collaborator

> Please tag AMDiers to review Quark (code owner).

Hi, could you please send me the list of AMDiers?

@@ -600,7 +600,7 @@ def forward(
output_dtype = hidden_states.dtype
scale = None
is_fp8_quant = isinstance(self.quant_method, Fp8MoEMethod)
Collaborator


Why does FP8 go through self.quant_method while quark_w4a4 goes through self.scheme? What's the difference between quant_method and scheme?

Contributor Author


As per our quantization roadmap, we have made a commitment to move existing MoE methods into the scheme format. The scheme structure allows for an easier implementation and review process.

Practically, the difference is that for MoE layers in Quark the quant_method is now the QuarkFusedMoEMethod class, which calls a specific scheme under the hood. We add an abstraction layer here (and call it a scheme) to improve the code structure and readability.

Motivation for this change came from one of the TODOs in the compressed-tensors structure. The problem is that with more and more MoE quantization types being supported, the files where all the classes are stored (compressed_tensors_moe.py, modelslim_moe.py, quark_moe.py) get excessively large, at 2k+ lines. This PR is one of three intended to fix this problem.

Just like for compressed-tensors, scheme logic was already implemented for Quark linear layers, so I decided to refactor the MoE layer logic the same way.

Fp8MoEMethod has not yet been refactored, because we have not yet gotten to refactoring other quantization formats that do not have a built-in scheme structure.

magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026