[AMD][Quantization] Add int4fp8_moe online quantization on ROCm #7392

HaiShaw merged 43 commits into sgl-project:main.
Summary of Changes
Hello @fxmarty-amd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new quark_int4fp8_moe online quantization scheme, primarily targeting Mixture-of-Experts (MoE) models on ROCm-enabled AMD GPUs. It provides the core implementation for quantizing both linear and MoE layers to INT4/FP8 formats during model loading, along with necessary configuration and utility functions to integrate this new method into the system.
Highlights
- New Quantization Scheme: Introduced `quark_int4fp8_moe` online quantization, enabling models to be quantized on-the-fly during loading, specifically designed for Mixture-of-Experts (MoE) models.
- ROCm-Specific MoE Support: Added `QuarkInt4Fp8MoEMethod` for MoE layers, which handles INT4 weights and FP8 computation, with specific optimizations and support exclusively for AMD GPUs (ROCm).
- Core Quantization Utilities: Implemented new utility functions for tensor-wise FP8 quantization, column-wise INT4 quantization, and efficient packing of INT4 values into INT32.
- System Integration: Integrated the new `quark_int4fp8_moe` method across various configuration files and the model loader, making it selectable via command-line arguments and compatible with models like Mixtral.
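The INT4-into-INT32 packing mentioned in the utilities above can be sketched as follows (a minimal NumPy illustration with hypothetical function names, not the PR's actual implementation):

```python
import numpy as np

def pack_int4_to_int32(values: np.ndarray) -> np.ndarray:
    """Pack each group of 8 signed int4 values (range [-8, 7]) into one int32.

    Illustrative sketch: the last dimension of `values` must be divisible by 8.
    """
    assert values.shape[-1] % 8 == 0
    # Keep only the low 4 bits (two's complement) so negatives fit in a nibble.
    nibbles = (values.astype(np.int64) & 0xF).astype(np.uint32)
    grouped = nibbles.reshape(*values.shape[:-1], -1, 8)
    packed = np.zeros(grouped.shape[:-1], dtype=np.uint32)
    for i in range(8):
        packed |= grouped[..., i] << np.uint32(4 * i)
    # Reinterpret the bits as int32 without changing them.
    return packed.view(np.int32)

def unpack_int32_to_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4_to_int32, sign-extending each nibble."""
    u = packed.view(np.uint32)
    nibbles = np.stack(
        [(u >> np.uint32(4 * i)) & np.uint32(0xF) for i in range(8)], axis=-1
    ).astype(np.int32)
    signed = np.where(nibbles >= 8, nibbles - 16, nibbles)
    return signed.reshape(*packed.shape[:-1], -1)
```

Packing eight nibbles per int32 is what makes the stored MoE expert weights 8x smaller than an int32 layout; the real kernels unpack on the fly.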
Code Review
This pull request introduces quark_int4fp8_moe online quantization, primarily targeting ROCm. It adds a new configuration class (QuarkInt4Fp8Config) and corresponding methods for linear layers (QuarkInt4Fp8LinearMethod) and MoE layers (QuarkInt4Fp8MoEMethod). Utility functions for FP8 and INT4 quantization and packing are also included. The changes seem well-structured, but there are a few areas for improvement, particularly around error handling in the quantization utilities and attribute initialization.
@HaiShaw @fxmarty-amd It seems the test file "test_int4fp8_moe.py" in this PR should be added to the 'not_in_ci' block, otherwise the "per-commit-amd" CI test case will error.
The failing tests come from startup errors, as in https://github.com/sgl-project/sglang/actions/runs/20846115308/job/59890510312?pr=7392.
@fxmarty-amd Would you please add/update the use case examples in the PR message body (the one there was outdated), and show accuracy data as well?
Updated
Co-authored-by: Dehua Tang <dehtang@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: YC Tseng <yctseng@amd.com>
As per title, this PR supersedes #6238.
This PR implements loading MoE model checkpoints in high precision (fp16, bf16) and quantizing them online during loading: the MoE experts to int4 and the attention projections to float8.
During inference, the int4 MoE weights are upcast to float8 in order to use fp8 math.
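That quantize-then-upcast flow can be roughly illustrated as below (a NumPy stand-in: the actual PR uses torch float8 dtypes and ROCm fp8 kernels, and all function names here are hypothetical):

```python
import numpy as np

def quantize_int4_columnwise(w: np.ndarray):
    """Symmetric per-column int4 quantization; returns (q, scale).

    Illustrative sketch only: values are mapped into [-7, 7] using a
    per-column max-abs scale.
    """
    scale = np.abs(w).max(axis=0) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard against all-zero columns
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def upcast_for_fp8_math(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantize the int4 weights before the matmul; float32 stands in
    here for the float8 compute type used on ROCm."""
    return q.astype(np.float32) * scale
```

With a symmetric max-abs scale, the round-trip error per element is bounded by half the column's scale, which is the usual trade-off of storing expert weights in int4 while computing in a wider (fp8) type.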
Runnable with
Left to do: