:sparkles: [diffusion][npu][quant] Add MXFP4 quantization support for Wan2.2 Diffusion on Ascend NPU by TallMessiWu · Pull Request #22338 · sgl-project/sglang

TallMessiWu · 2026-04-08T07:57:30Z

Summary

This PR adds MXFP4 (Microscaling FP4, dual-level) quantization support for Wan2.2 diffusion models on Ascend NPU. It is a follow-up to #20922 (MXFP8 support).

Hardware requirement: Ascend A5 series or newer. npu_dynamic_dual_level_mx_quant and npu_dual_level_quant_matmul are not available on A2/A3.

Naming note (post-merge): Upstream main merged a ROCm/aiter Mxfp4Config (#24816) that also registered the mxfp4 quantization key for AMD MI350+ (gfx95x). To coexist, this PR's NPU path is registered as mxfp4_npu (consistent with the LLM-side --quantization mxfp4_npu convention). The original mxfp4 key now exclusively targets ROCm; mxfp4_npu targets Ascend. The NPU config class is named NPUMXFP4Config to disambiguate from upstream ROCm Mxfp4Config (which differed only by letter case).

Two modes are supported:

Online quantization (--quantization mxfp4_npu)

Adds NPUMXFP4Config + NPUMXFP4DiffusionLinearMethod (multimodal_gen/runtime/layers/quantization/mxfp4_npu.py) for the diffusion subsystem.
At load time, FP16/BF16 weights are quantized online to MXFP4 via npu_dynamic_dual_level_mx_quant; at inference, activations are quantized per-token and the matmul is executed by npu_dual_level_quant_matmul with dual-level block scales (L0 block size = 512, L1 block size = 32).

Note: The online weight quantization path (npu_dynamic_dual_level_mx_quant applied to weights) is experimental. MindIE-SD only uses an offline (pre-calibrated) path for MXFP4 weights. The online path quantizes FP16/BF16 weights at load time without calibration, which may produce different numerical results than the offline path.

Offline quantization (msmodelslim pre-quantized weights)

Adds ModelSlimMXFP4Scheme (multimodal_gen/runtime/layers/quantization/modelslim_mxfp4_scheme.py) for loading weights pre-quantized by msmodelslim.
Checkpoint tensor formats:
- weight: [out, in] — float8_e4m3fn container for FP4 data (converted to float4_e2m1fn_x2 + FRACTAL_NZ at load time)
- weight_scale: [out, in/32] — uint8 L1 block scales (e8m0 + 127 bias), reshaped to [out, in/64, 2]
- weight_dual_scale: [out, in/512, 1] — float32 L0 coarse scales, transposed to [in/512, out]
- mul_scale: [in] — float32 smooth-quant activation scale from NonFusionSmoothQuantWrapper; must be applied to activations before quantization to preserve numerical alignment with the offline-calibrated weights. Defaults to ones (no-op) if absent.

Key NPU APIs used

API	Purpose
`torch_npu.npu_dynamic_dual_level_mx_quant(x, smooth_scale=None)`	Dual-level MX quantization of activations/weights → `(quant, l0_scale, l1_scale)`
`torch_npu.npu_dual_level_quant_matmul(x1, x2, x1l0, x2l0, x1l1, x2l1, ...)`	Dual-level MXFP4 quantized matmul
`torch_npu.npu_dtype_cast(weight, torch_npu.float4_e2m1fn_x2)`	Cast fp8-container FP4 weights to packed `float4_e2m1fn_x2` dtype
`torch_npu.npu_format_cast(w.view(torch.int8), 29, customize_dtype=torch.int8)`	Convert weight tensor to FRACTAL_NZ format (format 29), required by `npu_dual_level_quant_matmul`

Files Changed

New files

File	Change
`multimodal_gen/runtime/layers/quantization/mxfp4_npu.py`	New — online MXFP4 (`NPUMXFP4Config` + `NPUMXFP4DiffusionLinearMethod`) for Wan2.2 diffusion
`multimodal_gen/runtime/layers/quantization/modelslim_mxfp4_scheme.py`	New — offline MXFP4 (`ModelSlimMXFP4Scheme`) for msmodelslim pre-quantized weights

Modified — MXFP4 registration & dispatch

File	Change
`multimodal_gen/runtime/layers/quantization/__init__.py`	Register `NPUMXFP4Config` under key `"mxfp4_npu"`; add `"mxfp4_npu"` to `QuantizationMethods` literal (coexists with upstream ROCm `"mxfp4"`)
`multimodal_gen/runtime/layers/quantization/modelslim.py`	Add `W4A4_MXFP4` / `W4A4_MXFP4_DUALSCALE` branch → `ModelSlimMXFP4Scheme` in `_get_scheme_from_parts()`; improve `NotImplementedError` message to include layer name and quant type

Modified — supporting infrastructure

File	Change
`multimodal_gen/runtime/loader/transformer_load_utils.py`	Adjust `_resolve_quant_config` priority: `modelslim` flag now loads the per-layer quant description file; add safetensors-metadata fallback when only `--transformer-weights-path` is supplied
`multimodal_gen/runtime/server_args.py`	Update `--quantization` help text: list `mxfp4_npu` alongside `mxfp4`, document hardware targets (ROCm MI350+ vs Ascend A5)
`multimodal_gen/tools/wan_repack.py`	Rename `.linear.` → `.` and `.div.` → `.` so msmodelslim-wrapped Linear / `NonFusionSmoothQuantWrapper` keys match SGLang model parameters; allow loading multi-shard safetensors
`srt/layers/quantization/fp8.py`	Replace `.data =` weight assignment with `torch.no_grad() + copy_()` in the AMD `_use_aiter` block-quant MoE path (per Gemini reviewer suggestion; preserves `Parameter` identity)

Implementation Notes

Dual-Level Scale Layout

MXFP4 uses a two-level block-scale hierarchy:

Level	Block Size	Tensor	Format in Matmul API
L1 (fine)	32 elements	`weight_scale`	`[out, in/64, 2]` (uint8)
L0 (coarse)	512 elements (= 16 × L1 blocks)	`weight_dual_scale`	`[in/512, out]` (float32)

The msmodelslim export uses [out, in/32] for weight_scale and [out, in/512, 1] for weight_dual_scale. process_weights_after_loading reshapes and transposes these to match what npu_dual_level_quant_matmul expects, following the MindIE-SD W4A4MXFP4DualQuantLinear reference.

Smooth-Quant `mul_scale`

msmodelslim wraps quantized layers in NonFusionSmoothQuantWrapper, which exports a per-channel activation scale mul_scale (shape [in]). The activation must be multiplied by this scale before dual-level quantization to stay aligned with the offline-calibrated weights. Omitting this step causes mosaic / corrupted output.

mul_scale is loaded as a BasevLLMParameter with missing_param_init = "ones" so that models exported without smooth-quant (or repacked without the .div. key rename) degrade gracefully to a no-op rather than crashing.

To avoid a GPU→CPU sync on every forward pass, process_weights_after_loading precomputes a layer.use_mul_scale boolean by checking torch.all(mul_scale == 1.0) once at load time (per Gemini reviewer suggestion).

FRACTAL_NZ Requirement

npu_dual_level_quant_matmul requires the weight tensor (x2) to be in FRACTAL_NZ memory format (format 29). The conversion is:

weight = torch_npu.npu_dtype_cast(weight_fp8_container, torch_npu.float4_e2m1fn_x2)
weight = torch_npu.npu_format_cast(weight.view(torch.int8), 29, customize_dtype=torch.int8)

This matches the _init_dynamic_quant_param step in MindIE-SD's W4A4MXFP4DualQuantLinear.

Performance Comparison Report

Numbers contributed by @TheKonka, re-measured on the latest PR head (commit 373fc3f, "rename MXFP4Config to NPUMXFP4Config"). Since the rename is a class-name change only and does not touch kernel paths, the post-rename numbers are essentially identical to the prior pre-merge measurement; both are reported here so the report tracks the current head.

High-level Summary

Metric	Baseline (BF16)	Online MXFP4	Offline MXFP4
E2E Latency	1,447,860.88 ms	1,240,433.51 ms (-14.3%)	1,280,435.48 ms (-11.6%)

Stage Breakdown

Stage	Baseline	Online MXFP4	Offline MXFP4
InputValidationStage	0.09 ms	0.08 ms (-10.1%)	0.07 ms (-20.7%)
TextEncodingStage	8,645.60 ms	8,646.62 ms (+0.0%)	8,654.91 ms (+0.1%)
LatentPreparationStage	0.25 ms	0.20 ms (-17.6%)	0.25 ms (-0.2%)
TimestepPreparationStage	0.97 ms	1.09 ms (+13.0%)	1.39 ms (+43.2%)
DenoisingStage	1,395,104.89 ms	1,201,835.77 ms (-13.9%)	1,246,148.14 ms (-10.7%)
DecodingStage	44,098.92 ms	29,941.37 ms (-32.1%)	25,623.21 ms (-41.9%)

The previous report on the pre-rename head (commit 7c6f431) reported the same trend: E2E -14.5% online / -11.6% offline. Run-to-run variance is well within ±0.5% on E2E.

Related Issues / PRs

Closes part of [NPU] [Roadmap] NPU quantization 2026 Q2 Roadmap #14424 (MXFP8/MXFP4 support on Ascend NPU for SGLang).
Follow-up to ✨ [diffusion][npu][quant] Add MXFP8 quantization support for Wan2.2 Diffusion on Ascend NPU #20922 (MXFP8 Diffusion support, already merged).

CI States

Latest PR Test (Base): ✅ Run #26070807413
Latest PR Test (Extra): ⚠️ Not enabled -- add run-ci-extra label to opt in.

…th B) Add NPUMXFP8LinearMethod that enables --quantization mxfp8 on Ascend NPU, supporting both online (FP16/BF16 → MXFP8) and offline (serialized FP8 checkpoint) quantization via torch_npu APIs (npu_dynamic_mx_quant + npu_quant_matmul with group_sizes=[1,1,32]).

…n Ascend NPU Add MXFP8Config and NPUMXFP8DiffusionLinearMethod for the diffusion subsystem (multimodal_gen), enabling --quantization mxfp8 for Wan2.2 and other diffusion models on Ascend NPU. Also adds explicit quantization field to diffusion ServerArgs so online quantization can be specified without pre-quantized weights.

- Ensure weight tensor is on NPU device before npu_dynamic_mx_quant call - Flatten input x to 2D before quantization so input_scale is 3D (required by npu_quant_matmul) - Simplify output shape restoration logic Fixes: dimension of x1Scale(pertoken_scale) should be 3 but was 4

按 reviewer 建议重构架构分层： - 在 fp8.py 新增 MXFP8LinearAscendMethod，负责权重定义（__init__、create_weights） - 简化 mxfp8_method_npu.py 中的 NPUMXFP8LinearMethod，只保留权重处理和 kernel 调用 - 改进架构分层，符合现有 NPU INT8 方法模式

… Wan2.2 TI2V

Fix weight loading for msmodelslim pre-quantized MXFP8 weights: - Change weight dtype from int8 to float8_e4m3fn (actual storage format in safetensors) - Fix weight_scale shape from [out, in/32*2] to [out, in/32] (actual msmodelslim export) - Update process_weights_after_loading to reshape weight_scale [out, in/32] -> [out, -1, 2]

…weight processing.

- Remove unused __init__ (no quant_config/prefix needed, MXFP8 has only one mode) - Fix weight dtype: float8_e4m3fn (not int8) to match msmodelslim checkpoint format - Fix weight_scale shape: [out, in/32] (not in/32*2) to match actual tensor shape - Add comment explaining weight_scale name must match checkpoint key (not weight_scale_inv) - Improve flatten-to-2D comment to explain NPU kernel requirement

… methods

…mbiguity

…rate PR Revert LLM-side MXFP8 changes to split into a separate PR. This branch now only contains Wan2.2 Diffusion MXFP8 changes. Reverted files: - fp8.py: removed MXFP8LinearAscendMethod class and NPU branch - mxfp8_method_npu.py: deleted (NPU MXFP8 linear method) - test_ascend_mxfp8_quantization.py: deleted (LLM MXFP8 test) LLM MXFP8 code preserved on junlin_llm branch.

…on Ascend NPU

…ding - Add ModelSlimMXFP4Scheme for loading msmodelslim pre-quantized MXFP4 weights - Support dual-level quantization via npu_dual_level_quant_matmul - Register W4A4_MXFP4 quant type in modelslim.py dispatcher - Handle FP4 packed weight casting and scale transformations Weights: float8_e4m3fn (FP4 packed) [out, in/2] Scales: uint8 (e8m0+127) [out, in/32] + bfloat16 dual [out, in/64]

…or matmul

1. Dispatch W4A4_MXFP4_DUALSCALE type to ModelSlimMXFP4Scheme in modelslim.py\n2. Add .linear. key stripping in wan_repack RENAME_DICT for MXFP4 checkpoints\n3. Support multi-shard safetensors loading in load_sharded_safetensors

- Add W4A4_MXFP4_DUALSCALE type to modelslim scheme dispatcher - Support .linear. key stripping in wan_repack for MXFP4 msmodelslim exports - Support multi-shard safetensors loading in repack tool - Fix modelslim quantization config loading from component directory - Add detailed error messages for unsupported quantization schemes

… flag is explicit When --quantization modelslim is explicitly passed, the loader must load the per-layer quant_model_description.json from the transformer directory rather than creating an empty config. This ensures ModelSlimConfig receives the quantization type mappings required for proper scheme dispatch.

…r msmodelslim export - weight: [out, in] float8_e4m3fn (not [out, in/2]) - weight_dual_scale: [out, in/512, 1] float32 (not [out, in/64] bfloat16) L1 scale groups 16 L0 blocks = 512 elements - Fix create_weights allocation and process_weights_after_loading transforms to match actual checkpoint tensor formats from msmodelslim

Bring in MXFP4 offline (ModelSlim) loading support including dual-scale weight format, smooth quant mul_scale, and npu_format_cast fix.

Merge latest upstream changes and migrate modelslim/quantization explicit-flag support to refactored transformer_load_utils.py.

gemini-code-assist

Code Review

This pull request introduces support for MXFP4 and MXFP8 quantization on Ascend NPUs, including both offline schemes for pre-quantized weights and experimental online quantization methods. Key changes include the addition of ModelSlimMXFP4Scheme and ModelSlimMXFP8Scheme, updates to the model loader to support an explicit --quantization flag, and significant enhancements to the wan_repack.py tool for Wan2.2 models. Review feedback focuses on performance optimizations in the MXFP4 forward pass to avoid GPU-to-CPU synchronization and improving the robustness of weight shuffling logic in the FP8 implementation by using standard PyTorch in-place update patterns.

Resolve conflicts across 6 diffusion quant files: 1. quantization/__init__.py: keep mxfp4 registration; add upstream modelopt/modelopt_fp8 to literal 2. modelslim.py: keep W4A4_MXFP4 dispatch + verbose error; dedupe W8A8_MXFP8 branch 3. modelslim_mxfp8_scheme.py: adopt upstream platform-gated torch_npu import 4. mxfp8_npu.py: adopt upstream platform-gated torch_npu import 5. transformer_load_utils.py: keep modelslim special case + safetensors-metadata fallback 6. tools/wan_repack.py: keep .linear./.div. rename rules and sharded safetensors loader Skipping pre-commit (--no-verify): check-no-docs-changes hook blocks docs/ changes, but those are legitimately introduced by 1394 upstream commits, not local edits.

ping1jing2 · 2026-05-11T19:01:39Z

/tag-and-rerun-ci

…ners 1. Guard `import torch_npu` with `if _is_npu:` in mxfp4_npu.py and modelslim_mxfp4_scheme.py -- fixes ModuleNotFoundError on all GPU/AMD/MUSA CI runners\n2. Precompute layer.use_mul_scale flag in process_weights_after_loading to avoid GPU-to-CPU sync on every forward pass\n3. Use torch.no_grad() + copy_() instead of .data= for weight shuffle in fp8.py elif _use_aiter: block

TheKonka · 2026-05-14T00:57:13Z

Performance Comparison Report

1. High-level Summary

Metric	Baseline	online.json	offline.json
E2E Latency	1447979.04 ms	1238595.17 ms (-14.5%) ✅	1280412.90 ms (-11.6%) ✅

2. Stage Breakdown

Stage Name	Baseline	online.json	offline.json
InputValidationStage	0.08	0.07 (-20.3%) ⚪️	0.07 (-16.3%) ⚪️
TextEncodingStage	8634.15	8649.57 (+0.2%) ⚪️	8643.39 (+0.1%) ⚪️
LatentPreparationStage	0.29	0.17 (-41.8%) ⚪️	0.18 (-39.9%) ⚪️
TimestepPreparationStage	1.36	0.82 (-39.6%) ⚪️	0.72 (-46.8%) ⚪️
DenoisingStage	1395298.89	1200083.16 (-14.0%) 🟢	1246111.17 (-10.7%) 🟢
DecodingStage	44033.72	29852.85 (-32.2%) 🟢	25647.92 (-41.8%) 🟢
Scheduler.return_result.spill_arrays	0.06	0.07 (+9.0%) ⚪️	0.06 (-1.5%) ⚪️
SchedulerClient.materialize_file_refs	0.01	0.01 (-17.3%) ⚪️	0.01 (+33.5%) ⚪️

Metadata

Baseline Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
online.json Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
offline.json Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
Timestamp: 2026-05-13T18:34:21.548515

Upstream main introduced Mxfp4Config (ROCm/aiter, MI350+) registered as `--quantization mxfp4` in `mxfp4.py`, colliding with this branch's NPU MXFP4Config that previously used the same key. Resolution: - Rename NPU diffusion MXFP4 key to `mxfp4_npu` (consistent with LLM-side `--quantization mxfp4_npu` convention) - Register both `mxfp4` (ROCm) and `mxfp4_npu` (Ascend) in the quantization registry; deduplicate the dict entry - Update server_args help text and transformer_load_utils comment to list both options and their hardware targets Note: --no-verify used because upstream main contains legacy `docs/` changes that this repo's check-no-docs-changes hook rejects; those changes are inherited from upstream and not introduced by this merge.

Disambiguate from upstream ROCm Mxfp4Config (mxfp4.py) which differs only by letter case. NPU prefix aligns with LLM-side npu_mxfp4 naming convention.

TheKonka · 2026-05-17T02:37:57Z

Performance Comparison Report

1. High-level Summary

Metric	Baseline	online.json	offline.json
E2E Latency	1447860.88 ms	1240433.51 ms (-14.3%) ✅	1280435.48 ms (-11.6%) ✅

2. Stage Breakdown

Stage Name	Baseline	online.json	offline.json
InputValidationStage	0.09	0.08 (-10.1%) ⚪️	0.07 (-20.7%) ⚪️
TextEncodingStage	8645.60	8646.62 (+0.0%) ⚪️	8654.91 (+0.1%) ⚪️
LatentPreparationStage	0.25	0.20 (-17.6%) ⚪️	0.25 (-0.2%) ⚪️
TimestepPreparationStage	0.97	1.09 (+13.0%) ⚪️	1.39 (+43.2%) ⚪️
DenoisingStage	1395104.89	1201835.77 (-13.9%) 🟢	1246148.14 (-10.7%) 🟢
DecodingStage	44098.92	29941.37 (-32.1%) 🟢	25623.21 (-41.9%) 🟢
Scheduler.return_result.spill_arrays	0.07	0.07 (+6.6%) ⚪️	0.07 (+6.9%) ⚪️
SchedulerClient.materialize_file_refs	0.01	0.01 (+21.0%) ⚪️	0.01 (+92.0%) ⚪️

Metadata

Baseline Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
online.json Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
offline.json Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
Timestamp: 2026-05-17T10:35:12.113788

OrangeRedeng · 2026-05-18T09:13:11Z

-                "Note: MXFP4 requires ROCm and MI350+ (gfx95x)."
+                "Options: 'fp8', 'mxfp8', 'mxfp4', 'mxfp4_npu', 'modelslim'. "
+                "Note: 'mxfp4' targets ROCm + MI350+ (gfx95x); "
+                "'mxfp4_npu' / 'mxfp8' target Ascend NPU (A5 series for mxfp4_npu)."


Hi! Why are new quantization entities like mxfp8 or mxpf4_npu being created, shouldn't it be related to modelslim and handled in modelslim_config?

Hi! Great question. The distinction is between online quantization and offline (pre-quantized) loading.

modelslim is the entry point for offline pre-quantized checkpoints produced by Huawei's msmodelslim tool. It loads already-quantized weights (FP8/INT8/INT4) and dispatches to the right scheme based on quant_model_description.json. It does no quantization itself.

mxfp8 and mxfp4_npu are online quantization configs: they start from FP16/BF16 weights and perform real-time quantization inside process_weights_after_loading. This is a fundamentally different weight-loading flow.

Merging them into modelslim would conflate two separate paradigms ("adapt pre-quantized weights" vs. "quantize at runtime"), and modelslim would need to handle cases that aren't about ModelSlim checkpoints at all. This split also mirrors the broader SGLang pattern: fp8 for online FP8, compressed-tensors/quark/modelopt for their respective offline toolchain checkpoints.

Got it! Thank you for your answer!

TallMessiWu · 2026-05-18T10:34:26Z

CI Failure Analysis

I went through all failing jobs in the latest run. None of the failures are related to this PR's changes. This PR only adds MXFP4 quantization support for Wan2.2 on Ascend NPU; all failing tests run on NVIDIA, AMD, or MUSA hardware with quantization: null.

Failing Jobs Summary

NVIDIA (run 26008786983)

Job	Root Cause
`multimodal-gen-test-2-gpu (1)`	Image quality metric assertion failures — flaky, not related to this PR
`base-b-test-2-gpu-large (1)(2)`	Job-level timeout (30 min), no test error output — infrastructure issue
`wait-for-base-b`, `pr-test-finish`	Cascading from above

multimodal-gen-test-2-gpu (1) details — three diffusion tests failed with metric threshold violations, all using quantization: null:

wan2_2_i2v_a14b_2gpu: measured 2155.74 vs threshold 2077.86 (~3.7% over)
ltx_2_two_stage_t2v: measured 10441 vs threshold 9011 (~15.9% over)
ltx_2_3_two_stage_ti2v_2gpus: measured 32327 vs threshold 22075 (~46% over)

These are non-deterministic image quality metrics with tight fixed thresholds — flaky by nature, same tests pass in other partitions (e.g., partition 0 and 2 passed).

AMD (run 26008786876)

Job	Root Cause
`multimodal-gen-test-2-gpu-amd (1)`	MOVA-360p: `uint32_t` HIP JIT compile error on gfx942 — pre-existing ROCm platform issue
`stage-b-test-1-gpu-large-amd (1)`	`test_bench_serving_1gpu_part2.py`: latency `80.41ms > 80ms` / `70.68ms > 70ms` — flaky perf threshold
`stage-b-test-1-gpu-small-amd (9)`	`test_multi_tokenizer.py` timeout after 2400s — test infrastructure timeout
`stage-b-test-2-gpu-large-amd (0)`	Py-spy `Failed to get stack traces` — hang/timeout
`wait-for-stage-b-amd`, `pr-test-amd-finish`	Cascading from above

MOVA-360p error (MI325, gfx942):

RuntimeError: unknown type name 'uint32_t'; did you mean '__hip_internal::uint32_t'?
2 errors generated when compiling for gfx942.

This is a HIP JIT compilation bug in the MOVA model kernel, unrelated to this PR.

MUSA (run 26008786907)

Job	Root Cause
`multimodal-gen-test-1-gpu-musa (0)(1)`	`Failed to connect to github.com port 443` — MUSA runner network failure
`multimodal-gen-test-2-gpu-musa`	Job canceled after 4 hours — runner timeout

Conclusion

All failures are pre-existing flaky tests or infrastructure issues:

Image quality metrics with tight fixed thresholds (non-deterministic diffusion outputs)
Latency benchmarks with sub-1% margin (runner load fluctuations)
MOVA ROCm HIP compilation bug on gfx942
Test timeouts and MUSA runner network failures

Resolve fp8.py conflict in Fp8MoEMethod aiter pre-shuffle block: 1. Adopt upstream's t = shuffle_weight(...); copy_(t); del t pattern (memory-peak optimization, drops .contiguous()) 2. Drop local torch.no_grad() wrapper -- w13/w2 weights are requires_grad=False, so copy_() is autograd-safe without it Note: committed with --no-verify because the check-no-docs-changes hook is not merge-aware and flags upstream-owned legacy docs/ edits.

ping1jing2 · 2026-05-19T04:45:53Z

I merged it as AMD CIs are unrelated to this PR and MUSA CIs started 3.5h ago but still pend

TallMessiWu added 30 commits March 18, 2026 15:53

🐛 fix(diffusion): fix npu method call error

c838ade

🔀 merge: sync from upstream

df61b29

✨ feat(diffusion): add offline MXFP8 pre-quantized weight support for…

490ad0b

… Wan2.2 TI2V

✨ feat(wan22): Redesigned the wan_repack tool. Now support one-click …

b9aa785

…weight processing.

♻️ refactor(mxfp8): hoist imports and replace print with logger

22bee9e

🔀 chore(merge): sync upstream/main, keep MXFP8 and modelopt_fp4 quant…

3bbf703

… methods

🩹 fix(diffusion): register --quantization CLI arg to avoid argparse a…

250fe65

…mbiguity

🐛 fix(mxfp8_npu): move weight to current NPU device before quantization

e146b03

✨ feat(diffusion/mxfp4): add MXFP4 online quantization for Diffusion …

615dda7

…on Ascend NPU

🐛 fix(diffusion/mxfp4): add NZ format cast and dual_scale transpose f…

9baae1c

…or matmul

🐛 fix(diffusion/mxfp4): add NZ format cast and dual_scale transpose f…

a543f68

…or matmul

🐛 fix(mxfp4/modelslim): fix runtime error

9f30028

🐛 fix(mxfp4/modelslim): fix mosaic issue

4b69411

🐛 fix(mxfp4/modelslim): fix runtime error

96032d9

📈 Add temporary debugging logs

a252146

🐛 fix(mxfp4/modelslim): fix mosaic issue

3180079

🔀 merge(diffusion/mxfp4): merge junlin_mxfp4_offline into junlin_mxfp4

70452d0

Bring in MXFP4 offline (ModelSlim) loading support including dual-scale weight format, smooth quant mul_scale, and npu_format_cast fix.

⬆️ chore(deps): sync upstream/main into junlin_mxfp4

59d0b90

Merge latest upstream changes and migrate modelslim/quantization explicit-flag support to refactored transformer_load_utils.py.

TallMessiWu requested review from AniZpZ, FlamingoPg, HaiShaw, b8zhong, mickqian, ping1jing2, yhyang201 and yingluosanqian as code owners April 8, 2026 07:57

github-actions Bot added quant LLM Quantization npu diffusion SGLang Diffusion labels Apr 8, 2026

gemini-code-assist Bot reviewed Apr 8, 2026

View reviewed changes

Comment thread python/sglang/multimodal_gen/runtime/layers/quantization/modelslim_mxfp4_scheme.py

Comment thread python/sglang/multimodal_gen/runtime/layers/quantization/modelslim_mxfp4_scheme.py Outdated

Comment thread python/sglang/srt/layers/quantization/fp8.py Outdated

ping1jing2 self-assigned this Apr 9, 2026

TallMessiWu changed the title ~~🚧 [diffusion][npu][quant] Add MXFP4 quantization support for Wan2.2 Diffusion on Ascend NPU~~ ✨ [diffusion][npu][quant] Add MXFP4 quantization support for Wan2.2 Diffusion on Ascend NPU May 11, 2026

github-actions Bot added the run-ci label May 11, 2026

TallMessiWu added 2 commits May 14, 2026 15:12

♻️ refactor(diffusion/mxfp4): rename MXFP4Config to NPUMXFP4Config

373fc3f

Disambiguate from upstream ROCm Mxfp4Config (mxfp4.py) which differs only by letter case. NPU prefix aligns with LLM-side npu_mxfp4 naming convention.

Merge branch 'main' into junlin_mxfp4

7ed0fed

OrangeRedeng reviewed May 18, 2026

View reviewed changes

ping1jing2 approved these changes May 19, 2026

View reviewed changes

sglang-npu-bot merged commit 4c9f31b into sgl-project:main May 19, 2026
92 of 99 checks passed

YChange01 mentioned this pull request May 19, 2026

[RFC][NPU] Ascend NPU A5 Support for MXFP8/MXFP4 Quantization #21584

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

✨ [diffusion][npu][quant] Add MXFP4 quantization support for Wan2.2 Diffusion on Ascend NPU#22338

✨ [diffusion][npu][quant] Add MXFP4 quantization support for Wan2.2 Diffusion on Ascend NPU#22338
sglang-npu-bot merged 37 commits into
sgl-project:mainfrom
TallMessiWu:junlin_mxfp4

TallMessiWu commented Apr 8, 2026 •

edited by github-actions Bot

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ping1jing2 commented May 11, 2026

Uh oh!

TheKonka commented May 14, 2026

Uh oh!

TheKonka commented May 17, 2026

Uh oh!

OrangeRedeng May 18, 2026

Uh oh!

TallMessiWu May 18, 2026

Uh oh!

OrangeRedeng May 18, 2026

Uh oh!

TallMessiWu commented May 18, 2026 •

edited

Loading

Uh oh!

ping1jing2 commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

TallMessiWu commented Apr 8, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key NPU APIs used

Files Changed

Implementation Notes

Dual-Level Scale Layout

Smooth-Quant mul_scale

FRACTAL_NZ Requirement

Performance Comparison Report

High-level Summary

Stage Breakdown

Related Issues / PRs

CI States

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ping1jing2 commented May 11, 2026

Uh oh!

TheKonka commented May 14, 2026

Performance Comparison Report

1. High-level Summary

2. Stage Breakdown

Uh oh!

TheKonka commented May 17, 2026

Performance Comparison Report

1. High-level Summary

2. Stage Breakdown

Uh oh!

OrangeRedeng May 18, 2026

Choose a reason for hiding this comment

Uh oh!

TallMessiWu May 18, 2026

Choose a reason for hiding this comment

Uh oh!

OrangeRedeng May 18, 2026

Choose a reason for hiding this comment

Uh oh!

TallMessiWu commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI Failure Analysis

Failing Jobs Summary

NVIDIA (run 26008786983)

AMD (run 26008786876)

MUSA (run 26008786907)

Conclusion

Uh oh!

ping1jing2 commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

TallMessiWu commented Apr 8, 2026 •

edited by github-actions Bot

Loading

Smooth-Quant `mul_scale`

TallMessiWu commented May 18, 2026 •

edited

Loading