[4/N] Quantization Refactor: Quark MoE schemes #18252
Conversation
Summary of Changes
Hello @TamirBaydasov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the quantization implementation for Mixture-of-Experts (MoE) layers within the Quark framework. The primary goal is to enhance code organization and maintainability by separating different MoE quantization schemes into distinct, dedicated files. This change centralizes the scheme selection logic and streamlines how quantized MoE layers are initialized and processed, laying a clearer foundation for future quantization method expansions.
Activity
Code Review
This pull request is a well-executed refactoring that improves the structure of Quark's MoE quantization schemes. Moving the schemes into their own files under a dedicated directory and using a common QuarkScheme interface is a good step towards better modularity and maintainability.
My review has identified a few areas for improvement:
- There is some code duplication in `quark.py` that could be refactored.
- A docstring in `QuarkFusedMoEMethod` is incorrect and should be updated.
- A more significant design issue is that the new MoE schemes do not correctly implement the `QuarkScheme` abstract base class interface, which could lead to maintenance issues. I've suggested creating a separate base class for MoE schemes (see the sketch after this list).
Addressing these points will further enhance the quality of this refactoring.
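To make that last point concrete, here is a minimal sketch of what a dedicated MoE base class could look like. The class name, method names, and signatures below are illustrative assumptions, not the interface actually proposed in this PR:

```python
from abc import ABC, abstractmethod

import torch


class QuarkMoESchemeBase(ABC):
    # Hypothetical base class: separates the MoE interface from the
    # linear-layer QuarkScheme, so each ABC only declares methods its
    # subclasses actually implement.

    @abstractmethod
    def create_weights(self, layer: torch.nn.Module, **extra_weight_attrs):
        """Register the quantized expert weight tensors on the layer."""
        raise NotImplementedError

    @abstractmethod
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        """Repack or convert weights once checkpoint loading finishes."""
        raise NotImplementedError

    @abstractmethod
    def apply(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        topk_weights: torch.Tensor,
        topk_ids: torch.Tensor,
    ) -> torch.Tensor:
        """Run the fused-MoE forward pass for this quantization format."""
        raise NotImplementedError
```

With a split like this, MoE schemes would no longer inherit abstract methods they cannot meaningfully implement from the linear-layer interface.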
```python
if layer_quant_config.get("output_tensors") or layer_quant_config.get("bias"):
    raise NotImplementedError(
        "Currently, Quark models with "
        "output_tensors and bias "
        "quantized are not supported"
    )
weight_config = layer_quant_config.get("weight")
input_config = layer_quant_config.get("input_tensors")
```
There's some code duplication between this new `get_moe_scheme` method and the existing `_get_scheme_from_config` method (lines 319-325). Specifically, the logic for checking `output_tensors` and `bias`, and for getting `weight_config` and `input_config`, is repeated. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, consider extracting this common logic into a helper method.
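A minimal sketch of such a helper, assuming both call sites only need the validated weight and input configs (the helper name and placement are hypothetical; the error message is copied from the existing check above):

```python
def _parse_layer_quant_config(layer_quant_config: dict) -> tuple:
    """Validate unsupported options, then return (weight_config, input_config)."""
    if layer_quant_config.get("output_tensors") or layer_quant_config.get("bias"):
        raise NotImplementedError(
            "Currently, Quark models with output_tensors and bias "
            "quantized are not supported"
        )
    return (
        layer_quant_config.get("weight"),
        layer_quant_config.get("input_tensors"),
    )
```

Both `get_moe_scheme` and `_get_scheme_from_config` could then replace their duplicated blocks with a single call to this helper.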
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/rerun-failed-ci
/tag-and-rerun-ci
I merged it since AniZpZ and HandH1998 reviewed and all CIs passed except three tests. Here are the initial analysis results: for stage-b-test-small-1-gpu-amd (linux-mi325-gpu-1, 11), the first error is `OSError: libavutil.so.58: cannot open shared object file: No such file or directory`, which clearly has nothing to do with this PR. Please let me know if there are any issues.
@kkHuang-amd @BowenBao Please review (from AMD/Quark).
Please tag AMDiers to review Quark (code owner). |
Hi, could you please send me the list of AMDiers?
```diff
@@ -600,7 +600,7 @@ def forward(
     output_dtype = hidden_states.dtype
     scale = None
     is_fp8_quant = isinstance(self.quant_method, Fp8MoEMethod)
```
Why does fp8 use `self.quant_method`, while quark_w4a4 uses `self.scheme`? What's the difference between `quant_method` and `scheme`?
As per our quantization roadmap, we have committed to moving existing MoE methods into the scheme format. The scheme structure allows for an easier implementation and review process.
Practically, the difference is that for MoE layers in Quark, the `quant_method` is now the `QuarkMoEScheme` class, which calls a specific scheme under the hood. We add an abstraction layer here (and call it a scheme) to improve code structure and readability.
Motivation for this change came from one of the TODOs in the compressed-tensors structure. The problem statement is that with more and more MoE quantization types being supported, the files where all the classes are stored (`compressed_tensors_moe.py`, `modelslim_moe.py`, `quark_moe.py`) get excessively large, at 2k+ lines. This PR is one of three intended to fix this problem.
Just like for compressed-tensors, scheme logic was already implemented for Quark linear layers, so I decided to refactor the MoE layer logic the same way.
`Fp8MoEMethod` has not yet been refactored, because we have not yet gotten to refactoring other quantization formats that do not have a built-in scheme structure.
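For readers unfamiliar with the pattern, here is a simplified sketch of the delegation described above: the fused-MoE layer keeps talking to a single quant-method object, while the format-specific logic lives in the scheme it wraps. The class name, attribute names, and signatures are illustrative assumptions, not the exact code in this PR:

```python
import torch


class QuarkMoEMethod:
    # Simplified illustration of the method-vs-scheme split (names are
    # assumptions for illustration, not the PR's actual classes).

    def __init__(self, scheme):
        self.scheme = scheme  # e.g. a W4A4 or FP8 MoE scheme instance

    def create_weights(self, layer: torch.nn.Module, **kwargs):
        # Weight creation is delegated to the concrete scheme.
        return self.scheme.create_weights(layer, **kwargs)

    def apply(self, layer: torch.nn.Module, x: torch.Tensor,
              **kwargs) -> torch.Tensor:
        # The forward pass is delegated too; the scheme picks the kernel.
        return self.scheme.apply(layer, x, **kwargs)
```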
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Peng Zhang <aniz1905@gmail.com> Co-authored-by: ronnie_zheng <zl19940307@163.com>
Motivation
Add MoE schemes to Quark instead of storing all classes in a single file. Follow-up to #17503
Images and motivation for this PR can be viewed in our roadmap: #15194
Modifications
- Moved all classes from `quark_moe.py` to new schemes in `quantization/quark/schemes/`
- Removed the `quark_moe.py` file
- Added a `get_moe_scheme` function to the `quark.py` file (see the sketch after this list)
- Changed the `MoriEPMoE` class to reflect the scheme refactor
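As a rough illustration of the new selection flow, here is a hypothetical sketch of `get_moe_scheme`. The scheme class names follow the FP8 and W4A4 formats discussed in this thread, but the exact names and the config fields matched on are assumptions, not the PR's actual code:

```python
# Stand-ins for the per-format scheme classes under
# quantization/quark/schemes/ (real names may differ).
class QuarkW8A8Fp8MoEScheme:
    def __init__(self, weight_config: dict, input_config: dict):
        self.weight_config, self.input_config = weight_config, input_config


class QuarkW4A4MoEScheme:
    def __init__(self, weight_config: dict, input_config: dict):
        self.weight_config, self.input_config = weight_config, input_config


def get_moe_scheme(layer_quant_config: dict):
    """Pick a MoE scheme from the layer's quantization config."""
    # Reuses the validation helper sketched earlier in this thread.
    weight_config, input_config = _parse_layer_quant_config(layer_quant_config)
    if weight_config.get("dtype") == "fp8_e4m3":  # assumed config field/value
        return QuarkW8A8Fp8MoEScheme(weight_config, input_config)
    if weight_config.get("dtype") == "fp4":  # assumed config field/value
        return QuarkW4A4MoEScheme(weight_config, input_config)
    raise NotImplementedError(
        f"Unsupported Quark MoE quantization config: {weight_config}"
    )
```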
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`