[NPU] [Bugfix] Wan quantization fix #24540

Merged
sglang-npu-bot merged 22 commits into sgl-project:main from OrangeRedeng:wan_quantization_fix
May 11, 2026

Conversation

@OrangeRedeng
Contributor

@OrangeRedeng OrangeRedeng commented May 6, 2026

Description

Fix Wan2.2-T2V-A14B-Diffusers-w8a8/w8a8/mxfp8 quant scheme recognition on NPU by threading
reverse_param_names_mapping through the ModelSlim config stack.

Motivation

Fixes #24518. After PR #23625 reshaped quantized‑diffusion prefixes, the Wan2.2‑W8A8 model stopped loading and failed with "No modelslim compatible scheme". The cause is that _get_scheme_from_parts looks up the quant description using the model’s internal layer name (e.g. blocks.0.self_attn.to_q), whereas the quant config file is keyed by architecture‑canonical names (e.g. blocks.0.attn1.to_q).

The WanVideo arch‑config already carries a reverse_param_names_mapping that translates internal names back to canonical names, but the ModelSlim initialisation path silently discarded this mapping.
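To make the name mismatch concrete, here is a minimal sketch of a reverse name mapper. It assumes the mapping is a dict of regex patterns to replacement templates; the dict contents and the helper name build_name_mapper are hypothetical and are not taken from the SGLang source, whose get_param_names_mapping may behave differently.

```python
import re

# Hypothetical reverse mapping for illustration only:
# regex on the internal name -> replacement producing the canonical name.
REVERSE_PARAM_NAMES_MAPPING = {
    r"^blocks\.(\d+)\.self_attn\.(.*)$": r"blocks.\1.attn1.\2",
}


def build_name_mapper(mapping):
    """Return a function that rewrites internal names to canonical names.

    Names that match no pattern are returned unchanged.
    """
    compiled = [(re.compile(pattern), repl) for pattern, repl in mapping.items()]

    def mapper(name):
        for pattern, repl in compiled:
            if pattern.match(name):
                return pattern.sub(repl, name)
        return name

    return mapper


to_canonical = build_name_mapper(REVERSE_PARAM_NAMES_MAPPING)
canonical_name = to_canonical("blocks.0.self_attn.to_q")  # "blocks.0.attn1.to_q"
```

If the mapper is dropped anywhere along the initialisation path, the lookup falls back to the internal name and finds no entry, which matches the failure described above.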

Modifications

  • modelslim.py – accept reverse_param_names_mapping in __init__ and from_config, build a mapper via get_param_names_mapping, and use it in _get_scheme_from_parts before querying quant_model_description.json.
  • transformer_load_utils.py – read reverse_param_names_mapping from the arch config and pass it into get_quant_config.
  • quantization_utils.py – accept and forward reverse_param_names_mapping.
  • fsdp_load.py – add missing quant parameter suffixes (input_offset, quant_bias, deq_scale) to the sharding allow‑list; mark sharded tensors with requires_grad=False to avoid meta‑tensor copy errors.
  • npu/utils.py – update a performance warning to mention --dit-cpu-offload false.
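The fsdp_load.py change can be pictured as a suffix check on parameter names. The sketch below is an assumption about the shape of that check, not the actual SGLang code: the three quant suffixes come from the list above, while BASE_SUFFIXES and is_shardable_param are hypothetical names introduced for illustration.

```python
# Suffixes added by this PR (taken from the description above).
QUANT_PARAM_SUFFIXES = ("input_offset", "quant_bias", "deq_scale")

# Assumed pre-existing suffixes; the real allow-list may differ.
BASE_SUFFIXES = ("weight", "bias")

ALLOWED_SUFFIXES = BASE_SUFFIXES + QUANT_PARAM_SUFFIXES


def is_shardable_param(name, allowed=ALLOWED_SUFFIXES):
    """Return True if the last dotted component of `name` is allow-listed."""
    return name.rsplit(".", 1)[-1] in allowed
```

Under this reading, a quantized parameter such as blocks.0.attn1.to_q.deq_scale would have been rejected before the change and is accepted after it.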

Accuracy Tests

Command:
SGLANG_CACHE_DIT_FN=2 SGLANG_CACHE_DIT_BN=1 SGLANG_CACHE_DIT_WARMUP=4 SGLANG_CACHE_DIT_RDT=0.4 SGLANG_CACHE_DIT_MC=4 SGLANG_CACHE_DIT_TAYLORSEER=true SGLANG_CACHE_DIT_TS_ORDER=2 SGLANG_CACHE_DIT_ENABLED=true sglang generate --model-path ./weights/Wan2.2-T2V-A14B-Diffusers-w8a8/ --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." --height 720 --width 1280 --tp-size 2 --sp-degree 2 --num-gpus 4 --num-frames 81 --num-inference-steps 40 --warmup true

Two_anthropomorphic_cats_in_comfy_boxing_gear_and_bright_gloves_fight_intensely_on_a_spotlighted_sta_20260507-001126_b2e8ddbd.mp4

Speed Tests and Profiling

[image: speed test and profiling results]

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the documentation, quant, and diffusion labels May 6, 2026
@OrangeRedeng OrangeRedeng force-pushed the wan_quantization_fix branch from 03aeade to a1316d3 Compare May 6, 2026 19:15
@OrangeRedeng OrangeRedeng marked this pull request as ready for review May 6, 2026 19:15
@OrangeRedeng
Contributor Author

/gemini review

@OrangeRedeng
Contributor Author

/gemini summary

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces parameter name mapping for ModelSlim quantization, allowing for more flexible layer identification during the quantization process. It also updates FSDP loading to handle additional quantization-related parameters like input_offset and deq_scale, while ensuring sharded tensors are loaded with gradients disabled. Review feedback identifies a potential TypeError in ModelSlimConfig.from_config due to a signature change lacking a default value for the new mapping parameter. Additionally, a logic error was found in _get_scheme_from_parts where the .weight suffix is omitted in the fallback path, which would cause configuration lookups to fail for standard layers.
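The two review points can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the ModelSlimConfig implementation: QuantConfigSketch, its name_mapper argument, and the example description entry with its "W8A8" value are all hypothetical. It shows a from_config that keeps the new argument optional, so existing callers do not raise TypeError, and a lookup that appends the .weight suffix whether or not a mapper is supplied.

```python
class QuantConfigSketch:
    """Hypothetical stand-in for a quant config with optional name mapping."""

    def __init__(self, quant_description, name_mapper=None):
        self.quant_description = quant_description
        # Fall back to the identity function when no mapper is supplied.
        self.name_mapper = name_mapper if name_mapper is not None else (lambda n: n)

    @classmethod
    def from_config(cls, config, name_mapper=None):
        # The default of None keeps older call sites, which pass only
        # `config`, working without a TypeError.
        return cls(config, name_mapper)

    def get_scheme(self, layer_name):
        canonical = self.name_mapper(layer_name)
        # Append ".weight" on every path, including the identity fallback.
        return self.quant_description.get(canonical + ".weight")


# Illustrative description entry; the key and value are assumptions.
description = {"blocks.0.attn1.to_q.weight": "W8A8"}

plain = QuantConfigSketch.from_config(description)
mapped = QuantConfigSketch.from_config(
    description, name_mapper=lambda n: n.replace(".self_attn.", ".attn1.")
)
```

Here plain resolves only canonical names, while mapped also resolves the internal name blocks.0.self_attn.to_q to the same entry.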

Comment thread python/sglang/multimodal_gen/runtime/layers/quantization/modelslim.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/layers/quantization/modelslim.py Outdated
@gemini-code-assist
Contributor

Summary of Changes

This pull request addresses a regression in quantization scheme loading for Wan2.2 models on NPU hardware. By ensuring that parameter name mappings are correctly passed through the ModelSlim initialization flow, the system can now accurately resolve internal layer names to canonical architecture names required for quantization configuration lookups. Additionally, the PR improves FSDP robustness by expanding the allowed parameter list and adjusting tensor gradient requirements to avoid runtime errors during model loading.

Highlights

  • Quantization Scheme Recognition: Enabled correct recognition of Wan2.2-T2V-A14B-Diffusers-w8a8 quantization schemes on NPU by propagating reverse_param_names_mapping through the ModelSlim configuration stack.
  • FSDP Loading Improvements: Updated FSDP loading to include missing quantization parameter suffixes (input_offset, quant_bias, deq_scale) and marked sharded tensors with requires_grad=False to prevent meta-tensor copy errors.
  • Performance Diagnostics: Updated NPU performance warnings to explicitly suggest disabling CPU offloading for better compatibility.

Activity
  • Pull request created by OrangeRedeng.
  • Author requested a summary and review via automated commands.
  • Automated review identified potential TypeError risks in from_config and suggested improvements for name mapping logic.

@ping1jing2 ping1jing2 linked an issue May 7, 2026 that may be closed by this pull request
5 tasks
@ping1jing2
Collaborator

/tag-and-rerun-ci

@ping1jing2 ping1jing2 self-assigned this May 7, 2026
@github-actions github-actions Bot added the run-ci label May 7, 2026
@github-actions github-actions Bot added the npu label May 8, 2026
@ping1jing2
Collaborator

This is a diffusion-related PR and all GPU CIs passed, so I merged it.

@sglang-npu-bot sglang-npu-bot merged commit 9ec2880 into sgl-project:main May 11, 2026
103 of 135 checks passed
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026
Co-authored-by: ronnie_zheng <zl19940307@163.com>
xjpang pushed a commit to xjpang/sglang that referenced this pull request May 13, 2026
Co-authored-by: ronnie_zheng <zl19940307@163.com>

Labels

diffusion, documentation, npu, quant, run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [NPU] Wan2.2-T2V-A14B-Diffusers-w8a8 does not work

3 participants