:sparkles: [diffusion][npu][quant] Add MXFP8 quantization support for Wan2.2 Diffusion on Ascend NPU by TallMessiWu · Pull Request #20922 · sgl-project/sglang

TallMessiWu · 2026-03-19T08:31:52Z

Summary

This PR adds MXFP8 (Microscaling FP8) quantization support for Wan2.2 diffusion models on Ascend NPU. It closes part of the NPU MXFP8 gap tracked in issue #14424.

Hardware requirement: Ascend A5 series or newer. npu_dynamic_mx_quant is not available on A2/A3.

Two modes are supported:

Online quantization (--quantization mxfp8)

Adds MXFP8Config + NPUMXFP8DiffusionLinearMethod (multimodal_gen/runtime/layers/quantization/mxfp8_npu.py) for the diffusion subsystem.
At load time, FP16/BF16 weights are quantized online to MXFP8 via npu_dynamic_mx_quant; at inference, activations are quantized per-token and the matmul is executed by npu_quant_matmul with group_sizes=[1,1,32] (block_size=32).
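For readers unfamiliar with the format, here is a minimal pure-Python sketch of what MXFP8 block quantization means numerically. It assumes the OCP Microscaling convention: 32 elements share one power-of-two E8M0 scale, and elements are stored as E4M3. It is an illustration only. It does not call `torch_npu`, the helper names are hypothetical, and the actual rounding and saturation behavior of `npu_dynamic_mx_quant` may differ.

```python
import math

BLOCK_SIZE = 32    # elements sharing one scale (cf. group_sizes=[1, 1, 32])
E4M3_MAX = 448.0   # largest finite float8_e4m3fn value
E4M3_EMAX = 8      # exponent of the top E4M3 binade (448 = 1.75 * 2**8)
E8M0_BIAS = 127    # E8M0 stores a biased power-of-two exponent in one byte


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest E4M3-representable value (3 mantissa bits), saturating at 448."""
    if v == 0.0:
        return 0.0
    mag = min(abs(v), E4M3_MAX)
    _, e = math.frexp(mag)            # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)              # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)           # spacing between neighbouring values
    q = min(round(mag / step) * step, E4M3_MAX)
    return math.copysign(q, v)


def quantize_block(block):
    """Return (e8m0_scale_byte, e4m3_values) for one block of up to BLOCK_SIZE floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return E8M0_BIAS, [0.0] * len(block)   # scale of 2**0
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E4M3_EMAX           # floor(log2(amax)) - emax
    shared_exp = max(-E8M0_BIAS, min(shared_exp, E8M0_BIAS))
    scale = 2.0 ** shared_exp
    return shared_exp + E8M0_BIAS, [round_to_e4m3(v / scale) for v in block]


def dequantize_block(scale_byte, values):
    """Invert quantize_block: multiply every element by the shared scale."""
    scale = 2.0 ** (scale_byte - E8M0_BIAS)
    return [v * scale for v in values]
```

For example, `quantize_block([1.0, -0.5, 3.0, 0.0])` picks the scale from the largest magnitude (3.0), and because every input is a small dyadic number the round trip through `dequantize_block` is exact. Arbitrary values would show a rounding error bounded by the 3-bit mantissa.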

Offline quantization (msmodelslim pre-quantized weights)

Adds ModelSlimMXFP8Scheme (multimodal_gen/runtime/layers/quantization/modelslim_mxfp8_scheme.py) for loading weights pre-quantized by msmodelslim (float8_e4m3fn weights + uint8 scale in float8_e8m0fnu encoding).

wan_repack.py refactor

Refactors multimodal_gen/tools/wan_repack.py into a one-step repack tool: copies the original HF Diffusers model, converts msmodelslim quant weights (renaming keys to Diffusers format), and restores config.json — replacing a multi-step manual workflow. Fixes multiple bugs in the original script (glob patterns passed as literal paths, unconditional quant_config key update causing KeyError). Supports Wan2.2-TI2V-5B (single transformer) and Wan2.2-T2V-A14B / Wan2.2-I2V-A14B (Cascade dual-transformer).

Key NPU APIs used

API	Purpose
`torch_npu.npu_dynamic_mx_quant(x, dst_type=torch_npu.float8_e4m3fn)`	Dynamic MXFP8 quantization of activations/weights
`torch_npu.npu_quant_matmul(..., group_sizes=[1,1,32])`	MXFP8 quantized matmul (block_size=32)
`torch_npu.float8_e4m3fn` / `torch_npu.float8_e8m0fnu`	FP8 weight dtype / scale factor dtype

Files Changed

New files

File	Change
`multimodal_gen/runtime/layers/quantization/mxfp8_npu.py`	New — online MXFP8 (`MXFP8Config` + `NPUMXFP8DiffusionLinearMethod`) for Wan2.2 diffusion
`multimodal_gen/runtime/layers/quantization/modelslim_mxfp8_scheme.py`	New — offline MXFP8 (`ModelSlimMXFP8Scheme`) for msmodelslim pre-quantized weights

Modified — MXFP8 registration & dispatch

File	Change
`multimodal_gen/runtime/layers/quantization/__init__.py`	Register `MXFP8Config`; add `"mxfp8"` to `QuantizationMethods` literal
`multimodal_gen/runtime/layers/quantization/modelslim.py`	Add `W8A8_MXFP8` branch → `ModelSlimMXFP8Scheme` in `_get_scheme_from_parts()`

Modified — CLI & loader support

File	Change
`multimodal_gen/runtime/server_args.py`	Add `--quantization` CLI arg (explicit method override, e.g. `--quantization mxfp8`)
`multimodal_gen/runtime/loader/component_loaders/transformer_loader.py`	Honor `--quantization` flag; takes priority over auto-detection from config.json / metadata
`multimodal_gen/runtime/loader/fsdp_load.py`	Add `weight_scale` to FSDP unused-key list (prevents crash on offline MXFP8 weight load)
`multimodal_gen/runtime/utils/quantization_utils.py`	Glob fallback for `quant_model_description*.json` (supports repacked filenames)
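The loader behavior described in this table (explicit flag first, then auto-detection, with a glob fallback for the description file) can be sketched as follows. This is a hedged illustration and not the code in `transformer_loader.py` or `quantization_utils.py`: the function names and the returned tuples are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple


def find_quant_description(model_dir: Path) -> Optional[Path]:
    """Prefer the exact filename, then fall back to quant_model_description*.json."""
    exact = model_dir / "quant_model_description.json"
    if exact.is_file():
        return exact
    matches = sorted(model_dir.glob("quant_model_description*.json"))
    return matches[0] if matches else None


def choose_quantization(
    cli_quantization: Optional[str], model_dir: Path
) -> Tuple[str, Optional[str]]:
    """Return (source, value); an explicit --quantization value always wins."""
    if cli_quantization is not None:
        return ("flag", cli_quantization)
    description = find_quant_description(model_dir)
    if description is not None:
        return ("description", description.name)
    return ("none", None)
```

With this ordering, passing `--quantization mxfp8` selects online quantization even when the model directory also contains a description file.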

Modified — tooling

File	Change
`multimodal_gen/tools/wan_repack.py`	Refactor — one-step repack CLI; fix glob/KeyError bugs; add T2V-A14B, I2V-A14B, TI2V-5B support

Modified — minor refactor (srt)

File	Change
`srt/layers/quantization/fp8.py`	Import cleanup (`apply_fp8_marlin_linear` → `torch.ops.sglang.apply_fp8_marlin_linear`); restructure `Fp8MoEMethod.process_weights_after_loading()` to move weight shuffle inside scale-processing block

wan_repack.py: Design Details

Bug Fixes

The original script contained four bugs that made it entirely non-functional:

#	Location	Bug	Root Cause
1	`load_sharded_safetensors()`	`pathlib.Path(dir, "*.safetensors")` passed directly to `load_file()`	`pathlib.Path(dir, "*.safetensors")` creates a literal path with `*` in the filename, not a glob. `load_file()` does not expand globs, so every run raises `FileNotFoundError`.
2	`convert_transformer()`	Same pattern applied to `open(pathlib.Path(dir, "quant_model_description*.json"))`	Same root cause as above.
3	`get_transformer_config()`	No `else` branch — unknown `model_type` causes `NameError: name 'RENAME_DICT' is not defined`	Local variable only assigned inside `if model_type == "Wan-T2V-14B"`, then referenced unconditionally outside.
4	`convert_transformer()`	`update_dict_(original_quant_config, key, new_key)` called unconditionally for every key	The quant description JSON only contains entries for quantized Linear layers, not all model keys. Calling `dict.pop()` on a missing key raises `KeyError`.

Fix for bugs 1 & 2: replaced glob-as-literal-path with directory.glob(pattern), which returns a proper file list. Added existence and uniqueness checks with descriptive error messages.
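A minimal sketch of the pattern behind this fix, assuming a helper that resolves exactly one file per pattern. The name `find_unique_file` and the error messages are illustrative and are not taken from `wan_repack.py`.

```python
from pathlib import Path


def find_unique_file(directory: Path, pattern: str) -> Path:
    """Expand `pattern` inside `directory` and require exactly one match."""
    matches = sorted(directory.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No file matching {pattern!r} in {directory}")
    if len(matches) > 1:
        names = ", ".join(p.name for p in matches)
        raise ValueError(
            f"Expected one file matching {pattern!r} in {directory}, found: {names}"
        )
    return matches[0]
```

By contrast, `Path(directory, "*.safetensors")` only joins strings: it names a file literally called `*.safetensors`, which is why the original code failed on every run.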

Fix for bug 3: added else: raise ValueError(...) and extended support to Wan2.2-I2V-A14B and Wan2.2-TI2V-5B.

Fix for bug 4: added if key in quant_config guard before updating the quant description dict.
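The fixes for bugs 3 and 4 can be sketched together. The rename table below is a made-up placeholder, since the real key mappings are not shown in this description, and `rename_key_if_present` is a hypothetical stand-in for the guarded call to `update_dict_`.

```python
from typing import Dict


def get_rename_dict(model_type: str) -> Dict[str, str]:
    """Pick a rename table, failing loudly on unknown model types."""
    if model_type in ("Wan2.2-T2V-A14B", "Wan2.2-I2V-A14B"):
        rename_dict = {"example.old_name": "example.new_name"}  # placeholder
    elif model_type == "Wan2.2-TI2V-5B":
        rename_dict = {"example.old_name": "example.new_name"}  # placeholder
    else:
        # Without this branch, rename_dict would be unbound below (bug 3).
        raise ValueError(f"Unsupported model_type: {model_type!r}")
    return rename_dict


def rename_key_if_present(mapping: dict, old_key: str, new_key: str) -> None:
    """Rename a key only when it exists, so non-quantized layers are skipped (bug 4)."""
    if old_key in mapping:
        mapping[new_key] = mapping.pop(old_key)
```

The guard matters because the quant description lists only quantized Linear layers, while the rename loop visits every key in the checkpoint.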

One-Step Repack Workflow

Before — users had to run these steps manually:

# 1. Copy original HF model
cp -r Wan2.2-TI2V-5B-Diffusers Wan2.2-TI2V-5B-Diffusers-MXFP8

# 2. Delete transformer dir(s) from the copy
rm -rf Wan2.2-TI2V-5B-Diffusers-MXFP8/transformer

# 3. Run weight conversion (also broken — see bugs above)
python wan_repack.py \
    --input-path Wan2.2-TI2V-5B-quantized \
    --output-path Wan2.2-TI2V-5B-Diffusers-MXFP8

# 4. Restore config.json that was deleted in step 2
cp Wan2.2-TI2V-5B-Diffusers/transformer/config.json \
   Wan2.2-TI2V-5B-Diffusers-MXFP8/transformer/config.json

After — single command:

python wan_repack.py \
    --model-type Wan2.2-TI2V-5B \
    --original-model-path Wan2.2-TI2V-5B-Diffusers \
    --quant-path        Wan2.2-TI2V-5B-quantized \
    --output-path       Wan2.2-TI2V-5B-Diffusers-MXFP8

Internally, the new repack() orchestrator runs three steps:

1. shutil.copytree(original, output, ignore=transformer_dirs): copies the full model (VAE, text encoder, scheduler, etc.) to the output path, skipping transformer dirs.
2. convert_transformer() for each transformer dir: converts quantized weights to diffusion_pytorch_model.safetensors and renames keys to HF Diffusers format.
3. shutil.copy2(original/transformer/config.json, output/transformer/config.json): restores the architecture config that was excluded in step 1.

For Cascade models (Wan2.2-T2V-A14B, Wan2.2-I2V-A14B), steps 2–3 repeat for both transformer/ (sourced from quant_path/high_noise_model/) and transformer_2/ (sourced from quant_path/low_noise_model/). The cascade vs. single-model dispatch is driven by CASCADE_MODEL_TYPES.
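The three steps can be sketched with the standard library alone. This is an assumption-laden outline and not the actual `repack()`: the function name `repack_sketch`, its parameters, and the `convert` callback (standing in for `convert_transformer()`) are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict


def repack_sketch(
    original: Path,
    quant_sources: Dict[str, Path],
    output: Path,
    convert: Callable[[Path, Path], None],
) -> None:
    """quant_sources maps a transformer dir name to its quantized source dir."""
    # Step 1: copy everything except the transformer directories.
    shutil.copytree(
        original, output, ignore=shutil.ignore_patterns(*quant_sources)
    )
    for name, quant_dir in quant_sources.items():
        target = output / name
        target.mkdir(parents=True, exist_ok=True)
        # Step 2: write converted weights into the new transformer dir.
        convert(quant_dir, target)
        # Step 3: restore the architecture config skipped in step 1.
        shutil.copy2(original / name / "config.json", target / "config.json")
```

For a Cascade model the mapping would contain two entries, for example `{"transformer": quant_path / "high_noise_model", "transformer_2": quant_path / "low_noise_model"}`. A single-transformer model would pass only `"transformer"`.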

Summary

Aspect	Before	After
Glob file matching	`pathlib.Path(dir, "*.safetensors")` — broken, `FileNotFoundError` on every run	`dir.glob("*.safetensors")` with existence and uniqueness checks
`get_transformer_config()`	Only `"Wan-T2V-14B"`; crashes with `NameError` for any other type	Supports `Wan2.2-T2V-A14B`, `Wan2.2-I2V-A14B`, `Wan2.2-TI2V-5B`; raises `ValueError` on unsupported type
`quant_config` key update	Unconditional `dict.pop()` — `KeyError` for non-quantized layers	`if key in quant_config` guard
Workflow	4 manual steps (copy model, delete transformer, convert, restore config.json)	Single `repack()` call handles all steps
CLI arguments	`--input-path`, `--output-path` only	`--model-type`, `--original-model-path`, `--quant-path`, `--output-path`

Performance Comparison Report

1. Scripts

# Base Model
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path /home/weights/Wan2.2-TI2V-5B-Diffusers \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--height 704 --width 1280 --num-gpus 1 --num-frames 81 --num-inference-steps 40  --warmup \
--perf-dump-path baseline.json

sleep 15

# Inference using Offline Modelslim Pre-Quantized Weights
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path /home/weights/Wan2.2-TI2V-5B-Diffusers-mxfp8 \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--height 704 --width 1280 --num-gpus 1 --num-frames 81 --num-inference-steps 40  --warmup \
--perf-dump-path offline.json

sleep 15

# Online Quantization Inference
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path /home/weights/Wan2.2-TI2V-5B-Diffusers \
--quantization mxfp8 \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--height 704 --width 1280 --num-gpus 1 --num-frames 81 --num-inference-steps 40  --warmup \
--perf-dump-path online.json

2. High-level Summary

Metric	Baseline	offline.json	online.json
E2E Latency	102922.86 ms	91619.75 ms (-11.0%) ✅	91413.05 ms (-11.2%) ✅

3. Stage Breakdown

Stage Name	Baseline (ms)	offline.json (ms)	online.json (ms)
InputValidationStage	0.11	0.08 (-30.0%) ⚪️	0.06 (-49.2%) ⚪️
TextEncodingStage	12573.16	12653.25 (+0.6%) ⚪️	12476.38 (-0.8%) ⚪️
LatentPreparationStage	0.27	0.21 (-24.7%) ⚪️	0.16 (-42.3%) ⚪️
TimestepPreparationStage	1.21	0.95 (-21.0%) ⚪️	0.90 (-25.5%) ⚪️
DenoisingStage	83386.55	72910.67 (-12.6%) 🟢	72868.55 (-12.6%) 🟢
DecodingStage	6952.78	6046.59 (-13.0%) 🟢	6058.86 (-12.9%) 🟢
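The percentages in these tables are relative changes against the baseline. A small sketch of that arithmetic, using the E2E and DenoisingStage numbers reported above (the function names are illustrative and not part of the perf-dump tooling):

```python
def percent_change(baseline_ms: float, candidate_ms: float) -> float:
    """Relative change in percent; negative means the candidate is faster."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# E2E latency: baseline vs. offline and online MXFP8.
offline_e2e = percent_change(102922.86, 91619.75)
online_e2e = percent_change(102922.86, 91413.05)
# DenoisingStage: baseline vs. offline MXFP8.
offline_denoise = percent_change(83386.55, 72910.67)
print(f"{offline_e2e:.1f}% {online_e2e:.1f}% {offline_denoise:.1f}%")
```

Expressed as a ratio, the same E2E numbers correspond to roughly a 1.12x to 1.13x speedup.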

Metadata

Baseline Commit: ef874c0a1c92bf29a35e7f2e7efaf2bdaed748fa
offline.json Commit: ef874c0a1c92bf29a35e7f2e7efaf2bdaed748fa
online.json Commit: ef874c0a1c92bf29a35e7f2e7efaf2bdaed748fa
Timestamp: 2026-03-25T16:09:56.257827

Related Issues

Closes part of #14424 (MXFP8/MXFP4 support on Ascend NPU for SGLang).

…th B) Add NPUMXFP8LinearMethod that enables --quantization mxfp8 on Ascend NPU, supporting both online (FP16/BF16 → MXFP8) and offline (serialized FP8 checkpoint) quantization via torch_npu APIs (npu_dynamic_mx_quant + npu_quant_matmul with group_sizes=[1,1,32]).

…n Ascend NPU Add MXFP8Config and NPUMXFP8DiffusionLinearMethod for the diffusion subsystem (multimodal_gen), enabling --quantization mxfp8 for Wan2.2 and other diffusion models on Ascend NPU. Also adds explicit quantization field to diffusion ServerArgs so online quantization can be specified without pre-quantized weights.

- Ensure weight tensor is on NPU device before npu_dynamic_mx_quant call
- Flatten input x to 2D before quantization so input_scale is 3D (required by npu_quant_matmul)
- Simplify output shape restoration logic

Fixes: dimension of x1Scale(pertoken_scale) should be 3 but was 4

gemini-code-assist · 2026-03-19T08:32:13Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for online MXFP8 quantization on Ascend NPUs, significantly enhancing the efficiency and performance of both Large Language Models (LLMs) and Wan2.2 Diffusion models within the SGLang framework. By integrating NPU-specific quantization methods and updating the model loading mechanisms, it allows users to leverage the --quantization mxfp8 flag for optimized inference, streamlining the deployment of models on Ascend hardware.

Highlights

MXFP8 Quantization for Ascend NPU: Implemented online MXFP8 quantization support for SGLang on Ascend NPU (Path B), enabling --quantization mxfp8 for both LLM and Wan2.2 Diffusion inference.
Core LLM Serving Implementation: Added sglang/srt/hardware_backend/npu/quantization/mxfp8_method_npu.py which implements NPUMXFP8LinearMethod for online MXFP8 quantization, supporting FP16/BF16 to MXFP8 conversion and offline FP8 checkpoints using torch_npu APIs.
Wan2.2 Diffusion Support: Introduced sglang/multimodal_gen/runtime/layers/quantization/mxfp8_npu.py to extend MXFP8 support to Wan2.2 video diffusion models and updated model loader and server arguments to handle this quantization method.
Quantization Method Dispatch: Modified python/sglang/srt/layers/quantization/fp8.py to dispatch to the NPU MXFP8 backend when running on NPU and use_mxfp8 is enabled.
Bug Fixes and Cleanup: Addressed MXFP8 scale dimension mismatch in Diffusion models and NPU method call compatibility issues. The ModelSlim MXFP8 Scheme (Path A) was removed in favor of the unified online MXFP8 (Path B).
Testing: Added test/srt/ascend/test_ascend_mxfp8_quantization.py to cover online quantization correctness and performance for Ascend NPU.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service, and learn how to configure Gemini Code Assist in GitHub. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist

Code Review

This pull request adds support for MXFP8 quantization on Ascend NPU for both LLM serving and Diffusion models. It introduces new quantization methods (NPUMXFP8LinearMethod and NPUMXFP8DiffusionLinearMethod) that leverage torch_npu for online and offline quantization. The changes also include updates to server arguments and model loading logic to enable this feature, along with a new test suite. The implementation appears solid, with separate, clean implementations for the LLM and diffusion model paths. I have one suggestion to improve robustness in the LLM quantization path by ensuring the weight tensor is on the NPU device before quantization, mirroring the practice in the diffusion path.

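Refactor the architecture layering as suggested by the reviewer:
- Add MXFP8LinearAscendMethod to fp8.py, responsible for weight definition (__init__, create_weights)
- Simplify NPUMXFP8LinearMethod in mxfp8_method_npu.py so that it only handles weight processing and kernel invocation
- Improve the layering to match the existing NPU INT8 method pattern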

… Wan2.2 TI2V

Fix weight loading for msmodelslim pre-quantized MXFP8 weights:
- Change weight dtype from int8 to float8_e4m3fn (actual storage format in safetensors)
- Fix weight_scale shape from [out, in/32*2] to [out, in/32] (actual msmodelslim export)
- Update process_weights_after_loading to reshape weight_scale [out, in/32] -> [out, -1, 2]
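The shape bookkeeping in this commit can be illustrated without tensors. The sketch below uses nested lists in place of a `weight_scale` tensor and mirrors what `reshape(out, -1, 2)` does to a `[out, in/32]` array. The helper names are hypothetical, and an even number of blocks per row is assumed.

```python
def stored_scale_shape(out_features: int, in_features: int, block_size: int = 32):
    """Shape of weight_scale as exported: one scale per block of 32 inputs."""
    return (out_features, in_features // block_size)


def pair_block_scales(weight_scale):
    """[out, in/32] nested list -> [out, in/64, 2], like reshape(out, -1, 2)."""
    paired = []
    for row in weight_scale:
        if len(row) % 2:
            raise ValueError("each row needs an even number of block scales")
        paired.append([row[i:i + 2] for i in range(0, len(row), 2)])
    return paired
```

For instance, a layer with 128 input features has 4 scales per output row, which regroup into 2 pairs per row.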

… scripts

ping1jing2 · 2026-04-30T09:55:14Z

/rerun-failed-ci

TallMessiWu · 2026-05-04T08:16:58Z

CI failure analysis

I went through every failed/cancelled job. None of the failures are caused by changes in this PR. Breakdown by root cause:

1. NPU runner infra issue

stage-b-test-2-npu-a2 (0) (FAILURE) / (1) (CANCELLED) — failed during Install dependencies, before any test code ran:
```
error: could not download file from 'https://rsproxy.cn/dist/channel-rust-stable.toml'
       ... connection reset
##[error]Executing the custom container implementation failed.
```
Rust toolchain mirror (rsproxy.cn) network failure on the self-hosted NPU runner. Pure infra.

2. Pre-existing NPU test failure (exists on `main`, not introduced by this PR)

multimodal-gen-test-8-npu-a3 — test_diffusion_generation[wan2_2_t2v_14b_w8a8_8npu] fails with:
```
NotImplementedError: No modelslim compatible scheme was found.
```
This PR only adds a W8A8_MXFP8 branch to _get_scheme_from_parts(); it does not remove or alter any existing branch. The failure is reproducible on the unrelated PR bbuf/hunyuan3d-dit-fusions at the same line — same model, same trace. It looks like the Wan2.2-T2V-A14B-Diffusers-w8a8 checkpoint's quant_type doesn't match any of W8A8_DYNAMIC / W8A8 / W4A4_DYNAMIC / W8A8_MXFP8 and needs a separate scheme registered (out of scope here).
pr-test-npu-finish — rollup of the two NPU jobs above.

3. AMD runner environment issues

multimodal-gen-test-2-gpu-amd (linux-mi300-2gpu-sglang, 0) — all 11 tests ERROR at setup:

torch.distributed.DistBackendError: NCCL error in: NCCLUtils.cpp:94,
    unhandled cuda error, NCCL version 2.26.6
ncclUnhandledCudaError: Cuda failure 'invalid argument'

Distributed init blows up on the MI300 runner — environment, not code.

stage-b-test-1-gpu-small-amd-mi35x (linux-mi35x-gpu-1) — VRAM cleanup failed before tests started:

ERROR: VRAM usage exceeds threshold (5%) on some GPUs:
    GPU[0] : GPU Memory Allocated (VRAM%): 94
=== FAILED: VRAM cleanup unsuccessful after 3 attempts ===

Stale process from a prior run holding the device.

stage-b-test-1-gpu-large-amd (linux-mi300-1gpu-sglang, 1) — test_score_api_latency_throughput on Qwen/Qwen3-Reranker-0.6B:
```
AssertionError: 51.71446999302134 not less than 50
```
Latency exceeded the 50 ms threshold by 1.7 ms. Unrelated to this PR (no Qwen reranker / score API changes).

4. Pre-existing flaky multimodal-gen GPU failures (also fail on unrelated PRs)

call-multimodal-gen-tests / multimodal-gen-test-2-gpu (0)/(1)/(2) — 12 cases failing the [consistency] or [performance] checks:
- consistency (extremely poor SSIM/PSNR, e.g. ssim=0.03, psnr=9.6): wan2_2_i2v_a14b_2gpu, wan2_2_t2v_a14b_2gpu, wan2_2_t2v_a14b_teacache_2gpu, mova_360p_ring1_uly2, mova_360p_tp2, wan2_1_i2v_14b_480P_2gpu, wan2_1_i2v_14b_720P_2gpu, wan2_1_i2v_14b_lora_2gpu, fsdp-inference, ltx_2_3_two_stage_ti2v_2gpus
- performance: ltx_2.3_one_stage_ti2v (983 ms vs 949 ms limit, ~3.5% over), ltx_2.3_two_stage_t2v_2gpus (322.5 ms vs 320.7 ms limit)
None of these models/paths are touched by this PR (this PR only adds NPU MXFP8 for Wan2.2 diffusion + a tiny SRT FP8 import refactor that doesn't change Wan2.x/LTX/mova/fsdp behavior).

The same Wan2.2/LTX consistency failures reproduce on the unrelated PR codex/default-vae-channels-last-3d — same wan2_2_t2v_a14b_2gpu (clip=0.52, ssim=0.03) and ltx_2.3_one_stage_ti2v performance miss. Strongly looks like CI-data drift / flakiness on main.
call-multimodal-gen-tests / diffusion-coverage-check — fails because the test jobs above failed.

5. Cascading fast-fail (auto-skipped after a root cause failed)

All of these explicitly say Fast-fail: skipping — root cause job(s): call-multimodal-gen-tests / multimodal-gen-test-2-gpu (0), wait-for-stage-b:

wait-for-stage-b
stage-b-test-1-gpu-small (3)..(7)
stage-b-test-1-gpu-large (3)..(13)
stage-b-test-2-gpu-large (2), (3)
pr-test-finish

Summary

Category	Jobs	Cause
NPU infra	2	Rust mirror network failure on self-hosted runner
Pre-existing NPU test	2	`wan2_2_t2v_14b_w8a8_8npu` quant_type unmapped (also fails on unrelated PRs)
AMD environment	3	NCCL init / VRAM cleanup / 1.7 ms latency overshoot
Pre-existing flaky GPU	4	Wan2.x / LTX / mova / fsdp consistency or performance (also fails on unrelated PRs)
Cascading fast-fail	20	Auto-skipped due to the above

No action required on this PR's code. Re-running CI once the runner infra recovers should clear the cascading failures; the NPU wan2_2_t2v_14b_w8a8_8npu and the GPU consistency drifts need a separate fix on main.

Restore docs_new/index.mdx to upstream/main state. This file was modified by an automated github-actions[bot] commit (762f21f) that ran the LMSYS blog sync workflow against the fork's junlin branch, unrelated to the diffusion MXFP8 changes in this PR.

ping1jing2 · 2026-05-07T18:30:19Z

I merged it, since we have already analyzed all the CI failures and the failed tests are unrelated to our change. Please let me know if there are any other issues.

OrangeRedeng · 2026-05-07T19:18:55Z

it won't work until #24540 is merged

…iffusion on Ascend NPU (sgl-project#20922) Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

…#22352 conflicts Resolved 10 conflicts: - Diffusion side (7 files): used upstream/main (includes merged sgl-project#20922/sgl-project#22338) - LLM side fp8.py: kept both NPU and MUSA capability bypasses - LLM side modelslim.py: added W8A8_MXFP8 to upstream's table-driven scheme dispatch - LLM side transformers.py: used upstream/main (MoE refactoring with proper prefix)

… prerequisites - Prerequisites sgl-project#20922 (Diffusion MXFP8) and sgl-project#22338 (Diffusion MXFP4) merged upstream — accept their canonical versions and remove our duplicate diffusion modifications from this PR. - Adapt offline MXFP8 dispatch to upstream's table-driven ModelSlimConfig.get_linear_scheme by registering W8A8_MXFP8 → ModelSlimMXFP8Scheme; add a no-op __init__ on the scheme so its signature matches the other entries. - Keep NPU bypass in Fp8Config.get_min_capability alongside the new upstream _is_musa branch; revert an unrelated MoE _use_aiter style change that drifted in via earlier merges.

TallMessiWu added 4 commits March 18, 2026 15:53

🐛 fix(diffusion): fix npu method call error

c838ade

TallMessiWu requested review from AniZpZ, BBuf, Edwardf0t1, FlamingoPg, HaiShaw, b8zhong, ch-wan, iforgetmyname, mickqian, ping1jing2, yhyang201 and yingluosanqian as code owners March 19, 2026 08:31

github-actions Bot added quant LLM Quantization npu diffusion SGLang Diffusion labels Mar 19, 2026

gemini-code-assist Bot reviewed Mar 19, 2026

Comment thread python/sglang/srt/hardware_backend/npu/quantization/mxfp8_method_npu.py Outdated

ping1jing2 self-assigned this Mar 19, 2026

OrangeRedeng mentioned this pull request Mar 19, 2026

[NPU] [Roadmap] NPU quantization 2026 Q2 Roadmap #14424

Open

34 tasks

TamirBaydasov reviewed Mar 19, 2026

Comment thread python/sglang/srt/hardware_backend/npu/quantization/mxfp8_method_npu.py Outdated

TallMessiWu added 2 commits March 20, 2026 09:38

🔀 merge: sync from upstream

df61b29

TallMessiWu force-pushed the junlin branch from c5d303c to df61b29 Compare March 20, 2026 01:45

TallMessiWu added 2 commits March 20, 2026 16:47

✨ feat(diffusion): add offline MXFP8 pre-quantized weight support for…

490ad0b

… Wan2.2 TI2V

TallMessiWu changed the title ~~🚧 WIP: feat(npu): add MXFP8 quantization support for Ascend NPU~~ ✨ [NPU][MXFP8] Add MXFP8 quantization support for Ascend NPU (LLM + Wan2.2 Diffusion) Mar 24, 2026

TallMessiWu force-pushed the junlin branch from a2eeb08 to 8f54ac3 Compare April 25, 2026 08:12

TallMessiWu and others added 5 commits April 27, 2026 10:11

🐛 fix(ci): fix Windows encoding and path separator bugs in pre-commit…

a0ee0be

… scripts

🔀 Merge remote-tracking branch 'upstream/main' into junlin

f92f820

💚 fix(docs): fix broken relative path to ascend_npu_quantization.md

e1dd1d3

Merge branch 'main' into junlin

1787605

💚 fix(test): add missing quantization attr to _make_server_args

b3cd75b

TallMessiWu force-pushed the junlin branch from 39c1975 to b3cd75b Compare April 29, 2026 07:55

Merge branch 'main' into junlin

4f79522

ping1jing2 and others added 2 commits May 1, 2026 15:11

Merge branch 'main' into junlin

d87905a

docs: sync LMSYS SGLang blog cards

762f21f

github-actions Bot requested a review from JustinTong0323 as a code owner May 1, 2026 13:03

ping1jing2 added 2 commits May 2, 2026 19:32

Merge branch 'main' into junlin

ba014f2

Merge branch 'main' into junlin

9be9672

ping1jing2 approved these changes May 4, 2026

ping1jing2 reviewed May 4, 2026

Comment thread docs_new/index.mdx Outdated

ping1jing2 approved these changes May 7, 2026

sglang-npu-bot merged commit 80a6014 into sgl-project:main May 7, 2026
83 of 123 checks passed

TallMessiWu mentioned this pull request May 11, 2026

📝 docs(diffusion): add MXFP8 quantization docs for Wan2.2 on Ascend NPU #24918

Merged

This was referenced May 11, 2026

[diffusion] quant: Add flag for runtime quantization #23373

Closed

[Diffusion] [AMD] Online MXFP4 and FP8 Quantization for Multimodal Generation #21431

Merged

sglang-npu-bot merged 34 commits into sgl-project:main from TallMessiWu:junlin

6 participants
