[Diffusion] [AMD] Online MXFP4 and FP8 Quantization for Multimodal Generation#21431

Merged
mickqian merged 20 commits into
sgl-project:mainfrom
ColinZ22:mxfp4
May 14, 2026

Conversation

@ColinZ22 ColinZ22 commented Mar 25, 2026

Motivation

Adds online MXFP4 (for AMD GPUs) and FP8 quantization for multimodal (image and video) generation with models such as Z-Image-Turbo and Wan 2.2.

Modifications

  • New --quantization server argument for loading an unquantized model and quantizing its weights and activations to MXFP4.
  • New --quantization-ignored-layers server argument for excluding selected layers from online quantization, keeping them in full precision.
  • New Mxfp4Config and Mxfp4LinearMethod classes that use AITER dynamic MXFP4 quantization and MXFP4 GEMM kernels.
  • FP8 online quantization enabled via --quantization.
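
For readers unfamiliar with the format, here is a minimal pure-Python sketch of MXFP4 block quantization as laid out in the OCP Microscaling (MX) format: 32 elements share one power-of-two scale (E8M0), and each element is stored as a 4-bit E2M1 value. This is an illustration only, not the AITER kernel behind Mxfp4LinearMethod. The function names are hypothetical, and ties are broken toward the smaller magnitude for simplicity, which real kernels may not do.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # number of elements that share one scale
E2M1_EMAX = 2     # exponent of the largest E2M1 value (6 = 1.5 * 2**2)
BITS_PER_ELEMENT = 4 + 8 / BLOCK_SIZE  # 4-bit value + amortized 8-bit scale


def quantize_block(block):
    """Quantize one block; return (shared_exponent, E2M1 values)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # E8M0 exponent range
    scale = 2.0 ** shared_exp
    codes = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_VALUES[-1])  # saturate at 6
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - mag))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Recover approximate float values from a quantized block."""
    return [c * 2.0 ** shared_exp for c in codes]


def quantize_tensor(values):
    """Quantize a flat list of floats block by block."""
    return [
        quantize_block(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]
```

At 4 bits per element plus one 8-bit scale per 32 elements, the format costs 4.25 bits per value, compared with 16 bits for bf16.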

Usage Example

To quantize a diffusion model online to FP8 or MXFP4, add the --quantization argument:

sglang generate \
  --model-path Tongyi-MAI/Z-Image-Turbo \
  --prompt "A beautiful sunset over the mountains" \
  --save-output \
  --quantization fp8

sglang generate \
  --model-path Tongyi-MAI/Z-Image-Turbo \
  --prompt "A beautiful sunset over the mountains" \
  --save-output \
  --quantization mxfp4
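
The commands above do not use --quantization-ignored-layers. As a rough illustration of what skipping layers could look like, the sketch below splits layer names into those to quantize and those to keep in full precision. The glob-style matching, the helper name, and the example layer names are all assumptions made for illustration; the PR description does not state how ignored layers are matched.

```python
from fnmatch import fnmatchcase


def select_layers_to_quantize(layer_names, ignored_patterns):
    """Split layer names into (quantized, kept_in_full_precision).

    Hypothetical helper: assumes ignored layers are given as glob patterns
    matched against full module names.
    """
    quantized, kept = [], []
    for name in layer_names:
        if any(fnmatchcase(name, pattern) for pattern in ignored_patterns):
            kept.append(name)
        else:
            quantized.append(name)
    return quantized, kept


# Example layer names are made up for illustration.
layers = ["blocks.0.attn.to_q", "blocks.0.ffn.fc1", "final_layer.linear"]
quantized, kept = select_layers_to_quantize(layers, ["final_layer.*"])
```

Keeping a few sensitive layers in full precision usually costs little memory when those layers are a small share of the model's parameters.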

Generation Quality Comparison

Prompt 1: "A cat sitting at the top of a mountain looking down at a futuristic city"

[Images: FP16, FP8, MXFP4]

Prompt 2: "A crowd of people of various age at a busy outdoor marketplace"

[Images: FP16, FP8, MXFP4]

Prompt 3: "A young child blowing dandelion seeds, golden hour lighting"

[Images: FP16, FP8, MXFP4]

Prompt 4: "A city street at sunset with snow-capped mountain in the distant background"

[Images: FP16, FP8, MXFP4]

Performance Benchmarking

Model: Z-Image-Turbo
Dataset: 200 images from HuggingFace Parti-Prompts

| Online Quant Method | Transformer Size (GB) | Peak Mem Size (GB) | Total Gen Time (sec) | Denoise Time (sec) | Avg CLIP Score (↑) |
|---|---|---|---|---|---|
| bf16 (baseline) | 11.46 | 19.00 | 201.06 | 132.54 | 32.20 |
| fp8 | 5.86 (-49%) | 13.42 (-29%) | 191.91 (-5%) | 131.74 (-1%) | 32.31 |
| mxfp4 | 3.23 (-72%) | 10.77 (-43%) | 165.05 (-18%) | 104.78 (-21%) | 32.22 |
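
The percentages in the table can be reproduced from the raw numbers. A short sketch follows; the dictionary keys and function names are illustrative and do not come from the PR.

```python
# Raw values copied from the benchmark table above.
baseline = {
    "transformer_gb": 11.46,
    "peak_mem_gb": 19.00,
    "total_sec": 201.06,
    "denoise_sec": 132.54,
}
runs = {
    "fp8": {
        "transformer_gb": 5.86,
        "peak_mem_gb": 13.42,
        "total_sec": 191.91,
        "denoise_sec": 131.74,
    },
    "mxfp4": {
        "transformer_gb": 3.23,
        "peak_mem_gb": 10.77,
        "total_sec": 165.05,
        "denoise_sec": 104.78,
    },
}


def percent_change(new, base):
    """Relative change versus the baseline, rounded to a whole percent."""
    return round((new - base) / base * 100)


def relative_changes(run):
    """Percent change for every metric of one run against the bf16 baseline."""
    return {key: percent_change(run[key], baseline[key]) for key in baseline}


for method, run in runs.items():
    print(method, relative_changes(run))
```

The times are totals over the 200-image run, so dividing by 200 gives an approximate per-image figure.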

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the diffusion label Mar 25, 2026
@ColinZ22 ColinZ22 changed the title from "Online MXFP4 and FP8 Quantization for Multimodal Generation" to "[Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation" Mar 26, 2026
Comment thread python/sglang/multimodal_gen/runtime/loader/fsdp_load.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/server_args.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/layers/quantization/mxfp4.py Outdated
ColinZ22 and others added 3 commits March 26, 2026 16:34

@mickqian mickqian left a comment

could you also mention this new server arg in cli.md, quantization.md or other related places?

Comment thread python/sglang/multimodal_gen/runtime/layers/quantization/mxfp4.py
Comment thread python/sglang/multimodal_gen/runtime/server_args.py
@github-actions github-actions Bot added the documentation and quant labels Apr 7, 2026

ColinZ22 commented Apr 7, 2026

> could you also mention this new server arg in cli.md, quantization.md or other related places?

Added documentation in cli.md and quantization.md

@ColinZ22 ColinZ22 changed the title from "[Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation" to "[ROCM][Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation" Apr 16, 2026
@ColinZ22 ColinZ22 changed the title from "[ROCM][Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation" to "[ROCM] [Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation" Apr 16, 2026
@ColinZ22 ColinZ22 changed the title from "[ROCM] [Diffusion] Online MXFP4 and FP8 Quantization for Multimodal Generation" to "[Diffusion] [ROCM] Online MXFP4 and FP8 Quantization for Multimodal Generation" Apr 16, 2026
@ColinZ22 ColinZ22 changed the title from "[Diffusion] [ROCM] Online MXFP4 and FP8 Quantization for Multimodal Generation" to "[Diffusion] [AMD] Online MXFP4 and FP8 Quantization for Multimodal Generation" Apr 16, 2026
…fig fix for zimage, and mxfp4 perf improvements
@ColinZ22 ColinZ22 requested a review from DarkSharpness as a code owner April 16, 2026 22:06

ColinZ22 commented Apr 22, 2026

@mickqian Friendly ping for review, all comments addressed, hoping to land this PR soon!

@BowenBao BowenBao left a comment

LGTM.

@mickqian , @HaiShaw , @avjves This PR covers a superset of the functionality in #23373. Would it make sense to consolidate the effort and land this one instead? We’re happy to rebase if #23373 lands first, though it seems unnecessary from our perspective.

avjves commented Apr 22, 2026

> LGTM.
>
> @mickqian , @HaiShaw , @avjves This PR covers a superset of the functionality in #23373. Would it make sense to consolidate the effort and land this one instead? We’re happy to rebase if #23373 lands first, though it seems unnecessary from our perspective.

Definitely, I'm happy either way as long as the functionality lands! I originally didn't notice this PR before I had already created a new one.

@ColinZ22 ColinZ22 requested a review from mickqian April 24, 2026 18:29

## Online Quantization

Online quantization applies quantization to unquantized models at load time. This is useful for when pre-quantized checkpoints are not available.

nit: add (on-the-fly / load-time quantization) as well

@mickqian

/tag-and-rerun-ci

avjves commented May 11, 2026

@ColinZ22 PR #20922 was merged, which adds the initial support for online quantization, including FP8 quantization. It's missing MXFP4 quantization still though. Are you planning on updating this PR to match the current state to get it merged? :)

@ColinZ22

> @ColinZ22 PR #20922 was merged, which adds the initial support for online quantization, including FP8 quantization. It's missing MXFP4 quantization still though. Are you planning on updating this PR to match the current state to get it merged? :)

@avjves Updated, thanks for letting me know!

@BowenBao

@ColinZ22 please fix lint checks

@ColinZ22

Fixed, @mickqian @wisclmy0611 Re-review would be greatly appreciated! Hoping to land this PR soon.

@BowenBao

@amd-bot ci-status

@ColinZ22

@amd-bot ci-status

@mickqian mickqian merged commit 34c0029 into sgl-project:main May 14, 2026
119 of 145 checks passed

Labels

diffusion, documentation, jit-kernel, quant, run-ci
