
[AMD][Quantization] Add int4fp8_moe online quantization on ROCm#7392

Merged
HaiShaw merged 43 commits into sgl-project:main from fxmarty-amd:int4fp8_moe_new
Jan 14, 2026

Conversation

@fxmarty-amd
Contributor

@fxmarty-amd fxmarty-amd commented Jun 20, 2025

As per title, this PR supersedes #6238.

This PR implements loading MoE model checkpoints in high precision (fp16, bf16), quantizing the MoE experts online to int4 and the attention projections to float8.

During inference, the int4 MoE weights are upcast to float8 in order to use fp8 math.
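The int4-storage/fp8-compute flow can be illustrated with a minimal numpy sketch (hypothetical helper names, not the PR's actual kernels; numpy has no fp8 dtype, so float32 stands in for the fp8 compute path): weights are quantized column-wise to int4 with symmetric scales, then dequantized at inference time before the matmul.

```python
import numpy as np

def quantize_int4_columnwise(w: np.ndarray):
    # Symmetric per-column int4 quantization: values land in [-8, 7].
    scale = np.abs(w).max(axis=0) / 7.0            # one scale per column
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def upcast_and_matmul(x: np.ndarray, q: np.ndarray, scale: np.ndarray):
    # At inference the int4 weights are dequantized to the fp8 compute
    # dtype; float32 stands in for fp8 in this sketch.
    w_deq = q.astype(np.float32) * scale
    return x @ w_deq

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)
x = rng.standard_normal((4, 64)).astype(np.float32)

q, scale = quantize_int4_columnwise(w)
y_ref = x @ w
y_q = upcast_and_matmul(x, q, scale)
# int4 is coarse, so expect only approximate agreement with y_ref.
```

The trade-off sketched here is storage vs. compute: int4 halves the weight memory relative to fp8, while the upcast keeps the matmul on the fast fp8 path.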

Runnable with

export MODEL_ID="/large_models/mistralai_Mixtral-8x7B-Instruct-v0.1"
python3 -m sglang.launch_server --model-path ${MODEL_ID} --tensor-parallel-size 1 --disable-cuda-graph --quantization quark_int4fp8_moe

Left to do:

  • better tests
  • shard non-quantized weights before online quantization
  • eval grok1
  • benchmark latency/throughput
  • doc
  • merge main
  • Add grok-1 test

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @fxmarty-amd, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new quark_int4fp8_moe online quantization scheme, primarily targeting Mixture-of-Experts (MoE) models on ROCm-enabled AMD GPUs. It provides the core implementation for quantizing both linear and MoE layers to INT4/FP8 formats during model loading, along with necessary configuration and utility functions to integrate this new method into the system.

Highlights

  • New Quantization Scheme: Introduced quark_int4fp8_moe online quantization, enabling models to be quantized on-the-fly during loading, specifically designed for Mixture-of-Experts (MoE) models.
  • ROCm-Specific MoE Support: Added QuarkInt4Fp8MoEMethod for MoE layers, which handles INT4 weights and FP8 computation, with specific optimizations and support exclusively for AMD GPUs (ROCm).
  • Core Quantization Utilities: Implemented new utility functions for tensor-wise FP8 quantization, column-wise INT4 quantization, and efficient packing of INT4 values into INT32.
  • System Integration: Integrated the new quark_int4fp8_moe method across various configuration files and the model loader, making it selectable via command-line arguments and compatible with models like Mixtral.
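The packing utility mentioned in the highlights can be illustrated with a small pure-Python sketch (hypothetical helper names, not the PR's actual implementation): eight signed int4 values are stored as the 4-bit nibbles of one int32 word.

```python
def pack_int4_to_int32(values):
    """Pack 8 signed int4 values (each in [-8, 7]) into one int32 word."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        assert -8 <= v <= 7
        word |= (v & 0xF) << (4 * i)          # store two's-complement nibble
    # Reinterpret the 32-bit pattern as a signed int32.
    return word - (1 << 32) if word >= (1 << 31) else word

def unpack_int32_to_int4(word):
    """Recover the 8 signed int4 values from a packed int32 word."""
    word &= 0xFFFFFFFF
    out = []
    for i in range(8):
        nib = (word >> (4 * i)) & 0xF
        out.append(nib - 16 if nib >= 8 else nib)  # sign-extend 4 bits
    return out

vals = [-8, -3, -1, 0, 1, 3, 7, -5]
packed = pack_int4_to_int32(vals)
assert unpack_int32_to_int4(packed) == vals
```

Packing this way keeps the int4 weights at their native 4 bits per value in memory; a kernel then unpacks nibbles on the fly before the fp8 dequantize-and-multiply step.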
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces quark_int4fp8_moe online quantization, primarily targeting ROCm. It adds new configuration classes (QuarkInt4Fp8Config) and corresponding methods for linear layers (QuarkInt4Fp8LinearMethod) and MoE layers (QuarkInt4Fp8MoEMethod). Utility functions for FP8 and INT4 quantization and packing are also included. The changes seem well-structured, but there are a few areas for improvement, particularly around error handling in the quantization utilities and attribute initialization.

@fxmarty-amd fxmarty-amd marked this pull request as ready for review June 25, 2025 10:46
@HaiShaw HaiShaw self-assigned this Jun 25, 2025
@HaiShaw HaiShaw marked this pull request as draft June 25, 2025 22:08
@fxmarty-amd fxmarty-amd marked this pull request as ready for review June 26, 2025 11:15
@yctseng0211
Collaborator

@HaiShaw @fxmarty-amd It seems the test file "test_int4fp8_moe.py" in this PR should be added to the not_in_ci block or the "per-commit-amd" suite in sglang/test/srt/run_suite.py.

CI test case error:

File "/sglang-checkout/test/srt/run_suite.py", line 453, in _sanity_check_suites
    assert len(missing_files) == 0, (
AssertionError: Some test files are not in test suite. If this is intentional, please add the following to `not_in_ci` section:
TestFile("test_int4fp8_moe.py"),
args=Namespace(timeout_per_file=1200, suite='per-commit-amd', auto_partition_id=0, auto_partition_size=12, continue_on_error=False)

@fxmarty-amd
Contributor Author

@HaiShaw
Collaborator

HaiShaw commented Jan 6, 2026

@fxmarty-amd

@fxmarty-amd
Contributor Author

The failing tests come from startup errors such as:

  • DeepEP error: CPU recv timeout
  • TimeoutError: Server failed to start within the timeout period
  • The action 'Run test' has timed out after 30 minutes
  • ValueError: Global server args is not set yet!
  • All retry attempts exhausted for request. Returning empty response.
  • Rate limit exception so wait and retry 2 after 4 sec Connection error.
  • [2026-01-09 09:37:27 TP0] Prefill transfer failed for request rank=0 req.rid='c227824606f144a9a7b58d1c20b2550a' req.bootstrap_room=5552253286653449896 with exception KVTransferError(bootstrap_room=5552253286653449896): Decode instance could be dead, remote mooncake session 10.192.69.120:16274 is not alive

These do not look related to this PR; however, the main branch is green, so I am not sure.

https://github.com/sgl-project/sglang/actions/runs/20846115308/job/59890510312?pr=7392 (PR Test / stage-b-test-small-1-gpu (9)) also has illegal memory access errors.

@HaiShaw
Collaborator

HaiShaw commented Jan 12, 2026

@fxmarty-amd Would you please add/update the usage examples in the PR message body (the one there was outdated), and show accuracy data together?

@HaiShaw
Collaborator

HaiShaw commented Jan 14, 2026

Updated int4fp8_moe to quark_int4fp8_moe in PR body

@HaiShaw HaiShaw changed the title [Quantization] Add int4fp8_moe online quantization on ROCm [AMD][Quantization] Add int4fp8_moe online quantization on ROCm Jan 14, 2026
@HaiShaw HaiShaw merged commit 5af84c8 into sgl-project:main Jan 14, 2026
126 of 140 checks passed
yingluosanqian pushed a commit to AichenF/sglang that referenced this pull request Jan 14, 2026
…l-project#7392)

Co-authored-by: Dehua Tang <dehtang@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: YC Tseng <yctseng@amd.com>

Labels

documentation (Improvements or additions to documentation), quant (LLM Quantization), run-ci

4 participants