
[AMD][Quantization] Add int4fp8_moe online quantization on ROCm#7392

Merged
HaiShaw merged 43 commits into sgl-project:main from fxmarty-amd:int4fp8_moe_new
Jan 14, 2026

Conversation

@fxmarty-amd
Contributor

@fxmarty-amd fxmarty-amd commented Jun 20, 2025

As per title, this PR supersedes #6238.

This PR implements loading MoE model checkpoints in high precision (fp16, bf16), quantizing the MoE experts online to int4 and the attention projections to float8.

During inference, the int4 MoE weights are upcast to float8 in order to use fp8 math.
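The int4-storage/fp8-compute flow can be illustrated with a minimal numpy sketch (hypothetical helper names, not the PR's actual kernels; numpy has no fp8 dtype, so float32 stands in for the fp8 compute path): weights are quantized column-wise to int4 with symmetric scales, then dequantized at inference time before the matmul.

```python
import numpy as np

def quantize_int4_columnwise(w: np.ndarray):
    # Symmetric per-column int4 quantization: values land in [-8, 7].
    scale = np.abs(w).max(axis=0) / 7.0            # one scale per column
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def upcast_and_matmul(x: np.ndarray, q: np.ndarray, scale: np.ndarray):
    # At inference the int4 weights are dequantized to the fp8 compute
    # dtype; float32 stands in for fp8 in this sketch.
    w_deq = q.astype(np.float32) * scale
    return x @ w_deq

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)
x = rng.standard_normal((4, 64)).astype(np.float32)

q, scale = quantize_int4_columnwise(w)
y_ref = x @ w
y_q = upcast_and_matmul(x, q, scale)
# int4 is coarse, so expect only approximate agreement with y_ref.
```

The trade-off sketched here is storage vs. compute: int4 halves the weight memory relative to fp8, while the upcast keeps the matmul on the fast fp8 path.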

Runnable with

export MODEL_ID="/large_models/mistralai_Mixtral-8x7B-Instruct-v0.1"
python3 -m sglang.launch_server --model-path ${MODEL_ID} --tensor-parallel-size 1 --disable-cuda-graph --quantization quark_int4fp8_moe

Left to do:

  • better tests
  • shard non-quantized weights before online quantization
  • eval grok1
  • benchmark latency/throughput
  • doc
  • merge main
  • Add grok-1 test

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @fxmarty-amd, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new quark_int4fp8_moe online quantization scheme, primarily targeting Mixture-of-Experts (MoE) models on ROCm-enabled AMD GPUs. It provides the core implementation for quantizing both linear and MoE layers to INT4/FP8 formats during model loading, along with necessary configuration and utility functions to integrate this new method into the system.

Highlights

  • New Quantization Scheme: Introduced quark_int4fp8_moe online quantization, enabling models to be quantized on-the-fly during loading, specifically designed for Mixture-of-Experts (MoE) models.
  • ROCm-Specific MoE Support: Added QuarkInt4Fp8MoEMethod for MoE layers, which handles INT4 weights and FP8 computation, with specific optimizations and support exclusively for AMD GPUs (ROCm).
  • Core Quantization Utilities: Implemented new utility functions for tensor-wise FP8 quantization, column-wise INT4 quantization, and efficient packing of INT4 values into INT32.
  • System Integration: Integrated the new quark_int4fp8_moe method across various configuration files and the model loader, making it selectable via command-line arguments and compatible with models like Mixtral.
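The packing utility mentioned in the highlights can be illustrated with a small pure-Python sketch (hypothetical helper names, not the PR's actual implementation): eight signed int4 values are stored as the 4-bit nibbles of one int32 word.

```python
def pack_int4_to_int32(values):
    """Pack 8 signed int4 values (each in [-8, 7]) into one int32 word."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        assert -8 <= v <= 7
        word |= (v & 0xF) << (4 * i)          # store two's-complement nibble
    # Reinterpret the 32-bit pattern as a signed int32.
    return word - (1 << 32) if word >= (1 << 31) else word

def unpack_int32_to_int4(word):
    """Recover the 8 signed int4 values from a packed int32 word."""
    word &= 0xFFFFFFFF
    out = []
    for i in range(8):
        nib = (word >> (4 * i)) & 0xF
        out.append(nib - 16 if nib >= 8 else nib)  # sign-extend 4 bits
    return out

vals = [-8, -3, -1, 0, 1, 3, 7, -5]
packed = pack_int4_to_int32(vals)
assert unpack_int32_to_int4(packed) == vals
```

Packing this way keeps the int4 weights at their native 4 bits per value in memory; a kernel then unpacks nibbles on the fly before the fp8 dequantize-and-multiply step.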
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces quark_int4fp8_moe online quantization, primarily targeting ROCm. It adds new configuration classes (QuarkInt4Fp8Config) and corresponding methods for linear layers (QuarkInt4Fp8LinearMethod) and MoE layers (QuarkInt4Fp8MoEMethod). Utility functions for FP8 and INT4 quantization and packing are also included. The changes seem well-structured, but there are a few areas for improvement, particularly around error handling in the quantization utilities and attribute initialization.

@fxmarty-amd fxmarty-amd marked this pull request as ready for review June 25, 2025 10:46
@HaiShaw HaiShaw self-assigned this Jun 25, 2025
@HaiShaw HaiShaw marked this pull request as draft June 25, 2025 22:08
@fxmarty-amd fxmarty-amd marked this pull request as ready for review June 26, 2025 11:15
@yctseng0211
Collaborator

@HaiShaw @fxmarty-amd It seems the test file "test_int4fp8_moe.py" in this PR should be added to the not_in_ci block or the "per-commit-amd" suite in sglang/test/srt/run_suite.py.

CI test case error:

File "/sglang-checkout/test/srt/run_suite.py", line 453, in _sanity_check_suites
    assert len(missing_files) == 0, (
AssertionError: Some test files are not in test suite. If this is intentional, please add the following to `not_in_ci` section:
TestFile("test_int4fp8_moe.py"),
args=Namespace(timeout_per_file=1200, suite='per-commit-amd', auto_partition_id=0, auto_partition_size=12, continue_on_error=False)

@fxmarty-amd
Contributor Author

@HaiShaw
Collaborator

HaiShaw commented Jan 6, 2026

@fxmarty-amd

@fxmarty-amd
Contributor Author

The failing tests come from startup errors such as:

  • DeepEP error: CPU recv timeout
  • TimeoutError: Server failed to start within the timeout period
  • The action 'Run test' has timed out after 30 minutes
  • ValueError: Global server args is not set yet!
  • All retry attempts exhausted for request. Returning empty response.
  • Rate limit exception so wait and retry 2 after 4 sec Connection error.
  • [2026-01-09 09:37:27 TP0] Prefill transfer failed for request rank=0 req.rid='c227824606f144a9a7b58d1c20b2550a' req.bootstrap_room=5552253286653449896 with exception KVTransferError(bootstrap_room=5552253286653449896): Decode instance could be dead, remote mooncake session 10.192.69.120:16274 is not alive

These do not look related to this PR; however, the main branch is green, so I am not sure.

https://github.com/sgl-project/sglang/actions/runs/20846115308/job/59890510312?pr=7392 (PR Test / stage-b-test-small-1-gpu (9)) also has illegal memory access errors.

@HaiShaw
Collaborator

HaiShaw commented Jan 12, 2026

@fxmarty-amd Would you please add/update the usage examples in the PR message body (the one there was outdated), and show accuracy data together?

@HaiShaw
Collaborator

HaiShaw commented Jan 14, 2026

Updated int4fp8_moe to quark_int4fp8_moe in PR body

@HaiShaw HaiShaw changed the title [Quantization] Add int4fp8_moe online quantization on ROCm [AMD][Quantization] Add int4fp8_moe online quantization on ROCm Jan 14, 2026
@HaiShaw HaiShaw merged commit 5af84c8 into sgl-project:main Jan 14, 2026
126 of 140 checks passed
yingluosanqian pushed a commit to AichenF/sglang that referenced this pull request Jan 14, 2026
…l-project#7392)

Co-authored-by: Dehua Tang <dehtang@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: YC Tseng <yctseng@amd.com>

Labels

documentation (Improvements or additions to documentation), quant (LLM Quantization), run-ci

4 participants