
✨ [diffusion][npu][quant] Add MXFP8 quantization support for Wan2.2 Diffusion on Ascend NPU #20922

Merged
sglang-npu-bot merged 34 commits into sgl-project:main from TallMessiWu:junlin
May 7, 2026

@TallMessiWu
Contributor

@TallMessiWu TallMessiWu commented Mar 19, 2026

Summary

This PR adds MXFP8 (Microscaling FP8) quantization support for Wan2.2 diffusion models on Ascend NPU. It closes part of the NPU MXFP8 gap tracked in issue #14424.

Hardware requirement: Ascend A5 series or newer. npu_dynamic_mx_quant is not available on A2/A3.

Two modes are supported:

Online quantization (--quantization mxfp8)

  • Adds MXFP8Config + NPUMXFP8DiffusionLinearMethod (multimodal_gen/runtime/layers/quantization/mxfp8_npu.py) for the diffusion subsystem.
  • At load time, FP16/BF16 weights are quantized online to MXFP8 via npu_dynamic_mx_quant; at inference, activations are quantized per-token and the matmul is executed by npu_quant_matmul with group_sizes=[1,1,32] (block_size=32).
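
The block-wise scaling implied by block_size=32 can be illustrated without NPU hardware. The following is a minimal NumPy sketch, assuming the shared-scale rule from the OCP Microscaling (MX) format: one power-of-two scale per 32 consecutive elements. The helper name mx_block_scales is hypothetical, the sketch skips the final rounding onto the FP8 grid, and it does not claim to reproduce what npu_dynamic_mx_quant does internally.

```python
import numpy as np

BLOCK = 32          # MX block size named in the PR description
E4M3_MAX = 448.0    # largest finite float8_e4m3fn magnitude
E4M3_EMAX = 8       # exponent of that value (448 = 1.75 * 2**8)


def mx_block_scales(x: np.ndarray):
    """Return (scaled values, per-block exponents) for a 2D array.

    Assumption: the shared exponent is floor(log2(amax)) - E4M3_EMAX,
    as in the OCP MX specification. Values that still exceed the E4M3
    range after scaling are saturated.
    """
    rows, cols = x.shape
    assert cols % BLOCK == 0, "last dimension must be a multiple of 32"
    blocks = x.reshape(rows, cols // BLOCK, BLOCK).astype(np.float32)

    amax = np.abs(blocks).max(axis=-1)
    safe = np.where(amax > 0, amax, 1.0)          # avoid log2(0)
    exp = np.floor(np.log2(safe)).astype(np.int32) - E4M3_EMAX
    exp = np.where(amax > 0, exp, 0)              # all-zero block: scale 1

    scale = np.exp2(exp.astype(np.float32))
    scaled = np.clip(blocks / scale[..., None], -E4M3_MAX, E4M3_MAX)
    return scaled.reshape(rows, cols), exp


x = np.arange(64, dtype=np.float32).reshape(1, 64)
scaled, exp = mx_block_scales(x)
print(exp.tolist())   # one exponent per block of 32 elements
```

Each row of 64 elements yields two exponents, one per block, so the scale tensor has shape [rows, cols / 32].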

Offline quantization (msmodelslim pre-quantized weights)

  • Adds ModelSlimMXFP8Scheme (multimodal_gen/runtime/layers/quantization/modelslim_mxfp8_scheme.py) for loading weights pre-quantized by msmodelslim (float8_e4m3fn weights + uint8 scale in float8_e8m0fnu encoding).
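
To make the "uint8 scale in float8_e8m0fnu encoding" concrete, here is a small sketch of how such a byte maps to a scale factor. It assumes the E8M0 definition from the OCP MX specification (8 exponent bits, no sign, no mantissa, bias 127, 0xFF reserved for NaN); decode_e8m0 is a hypothetical helper, not a function from msmodelslim or torch_npu.

```python
import numpy as np

E8M0_BIAS = 127  # exponent bias of the E8M0 scale format


def decode_e8m0(raw) -> np.ndarray:
    """Convert raw E8M0 bytes to float64 scale factors (2 ** (byte - 127))."""
    raw = np.asarray(raw, dtype=np.uint8)
    scale = np.exp2(raw.astype(np.float64) - E8M0_BIAS)
    # 0xFF is the NaN encoding; every other byte is a power of two.
    return np.where(raw == 0xFF, np.nan, scale)


print(decode_e8m0([126, 127, 128]))  # 0.5, 1.0, 2.0
```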

wan_repack.py refactor

  • Refactors multimodal_gen/tools/wan_repack.py into a one-step repack tool: copies the original HF Diffusers model, converts msmodelslim quant weights (renaming keys to Diffusers format), and restores config.json — replacing a multi-step manual workflow. Fixes multiple bugs in the original script (glob patterns passed as literal paths, unconditional quant_config key update causing KeyError). Supports Wan2.2-TI2V-5B (single transformer) and Wan2.2-T2V-A14B / Wan2.2-I2V-A14B (Cascade dual-transformer).

Key NPU APIs used

| API | Purpose |
| --- | --- |
| torch_npu.npu_dynamic_mx_quant(x, dst_type=torch_npu.float8_e4m3fn) | Dynamic MXFP8 quantization of activations/weights |
| torch_npu.npu_quant_matmul(..., group_sizes=[1,1,32]) | MXFP8 quantized matmul (block_size=32) |
| torch_npu.float8_e4m3fn / torch_npu.float8_e8m0fnu | FP8 weight dtype / scale factor dtype |

Files Changed

New files

| File | Change |
| --- | --- |
| multimodal_gen/runtime/layers/quantization/mxfp8_npu.py | New — online MXFP8 (MXFP8Config + NPUMXFP8DiffusionLinearMethod) for Wan2.2 diffusion |
| multimodal_gen/runtime/layers/quantization/modelslim_mxfp8_scheme.py | New — offline MXFP8 (ModelSlimMXFP8Scheme) for msmodelslim pre-quantized weights |

Modified — MXFP8 registration & dispatch

| File | Change |
| --- | --- |
| multimodal_gen/runtime/layers/quantization/__init__.py | Register MXFP8Config; add "mxfp8" to QuantizationMethods literal |
| multimodal_gen/runtime/layers/quantization/modelslim.py | Add W8A8_MXFP8 branch → ModelSlimMXFP8Scheme in _get_scheme_from_parts() |

Modified — CLI & loader support

| File | Change |
| --- | --- |
| multimodal_gen/runtime/server_args.py | Add --quantization CLI arg (explicit method override, e.g. --quantization mxfp8) |
| multimodal_gen/runtime/loader/component_loaders/transformer_loader.py | Honor --quantization flag; takes priority over auto-detection from config.json / metadata |
| multimodal_gen/runtime/loader/fsdp_load.py | Add weight_scale to FSDP unused-key list (prevents crash on offline MXFP8 weight load) |
| multimodal_gen/runtime/utils/quantization_utils.py | Glob fallback for quant_model_description*.json (supports repacked filenames) |

Modified — tooling

| File | Change |
| --- | --- |
| multimodal_gen/tools/wan_repack.py | Refactor — one-step repack CLI; fix glob/KeyError bugs; add T2V-A14B, I2V-A14B, TI2V-5B support |

Modified — minor refactor (srt)

| File | Change |
| --- | --- |
| srt/layers/quantization/fp8.py | Import cleanup (apply_fp8_marlin_linear → torch.ops.sglang.apply_fp8_marlin_linear); restructure Fp8MoEMethod.process_weights_after_loading() to move weight shuffle inside scale-processing block |

wan_repack.py: Design Details

Bug Fixes

The original script contained four bugs that made it entirely non-functional:

| # | Location | Bug | Root Cause |
| --- | --- | --- | --- |
| 1 | load_sharded_safetensors() | pathlib.Path(dir, "*model*.safetensors") passed directly to load_file() | pathlib.Path(dir, "*.safetensors") creates a literal path with * in the filename — not a glob. load_file() does not expand globs, so every run raises FileNotFoundError. |
| 2 | convert_transformer() | Same pattern applied to open(pathlib.Path(dir, "*quant_model_description*.json")) | Same root cause as above. |
| 3 | get_transformer_config() | No else branch — unknown model_type causes NameError: name 'RENAME_DICT' is not defined | Local variable only assigned inside if model_type == "Wan-T2V-14B", then referenced unconditionally outside. |
| 4 | convert_transformer() | update_dict_(original_quant_config, key, new_key) called unconditionally for every key | The quant description JSON only contains entries for quantized Linear layers, not all model keys. Calling dict.pop() on a missing key raises KeyError. |

Fix for bugs 1 & 2: replaced glob-as-literal-path with directory.glob(pattern), which returns a proper file list. Added existence and uniqueness checks with descriptive error messages.
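
The difference between a literal path and a glob can be reproduced in a few lines. This is a minimal sketch, not the code from wan_repack.py: find_unique is a hypothetical helper that mirrors the described existence and uniqueness checks, and the file name is made up for the demonstration.

```python
import tempfile
from pathlib import Path


def find_unique(directory: Path, pattern: str) -> Path:
    """Expand a glob pattern and insist on exactly one match."""
    matches = sorted(directory.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no file matching {pattern!r} in {directory}")
    if len(matches) > 1:
        raise ValueError(
            f"expected one file matching {pattern!r}, found {len(matches)}"
        )
    return matches[0]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "quant_model_description_w8a8.json").write_text("{}")

    # Path(...) does not expand wildcards: this names a file literally
    # called "*quant_model_description*.json", which does not exist.
    literal = Path(root, "*quant_model_description*.json")
    literal_exists = literal.exists()

    # Path.glob() performs the expansion the original script needed.
    found_name = find_unique(root, "*quant_model_description*.json").name

print(literal_exists, found_name)
```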

Fix for bug 3: added else: raise ValueError(...) and extended support to Wan2.2-I2V-A14B and Wan2.2-TI2V-5B.

Fix for bug 4: added if key in quant_config guard before updating the quant description dict.
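
The two guards described for bugs 3 and 4 can be sketched as follows. The function names and the placeholder key mappings are hypothetical (the real rename tables are not shown in this description); only the control flow is the point.

```python
def select_rename_dict(model_type: str) -> dict:
    """Pick a key-rename table, failing loudly on unknown model types."""
    # Placeholder mappings for illustration only.
    if model_type == "Wan2.2-TI2V-5B":
        rename_dict = {"old.5b.key": "new.5b.key"}
    elif model_type in ("Wan2.2-T2V-A14B", "Wan2.2-I2V-A14B"):
        rename_dict = {"old.a14b.key": "new.a14b.key"}
    else:
        # Without this branch, rename_dict would be unbound below and
        # the caller would see a NameError-style failure instead.
        raise ValueError(f"unsupported model_type: {model_type!r}")
    return rename_dict


def rename_if_present(quant_config: dict, key: str, new_key: str) -> None:
    """Rename a key only when it exists, so dict.pop() cannot raise KeyError."""
    if key in quant_config:
        quant_config[new_key] = quant_config.pop(key)


cfg = {"blocks.0.attn.q.weight": "W8A8_MXFP8"}
rename_if_present(cfg, "blocks.0.attn.q.weight", "blocks.0.attn1.to_q.weight")
rename_if_present(cfg, "blocks.0.norm.weight", "blocks.0.norm1.weight")  # no-op
print(cfg)
```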


One-Step Repack Workflow

Before — users had to run these steps manually:

# 1. Copy original HF model
cp -r Wan2.2-TI2V-5B-Diffusers Wan2.2-TI2V-5B-Diffusers-MXFP8

# 2. Delete transformer dir(s) from the copy
rm -rf Wan2.2-TI2V-5B-Diffusers-MXFP8/transformer

# 3. Run weight conversion (also broken — see bugs above)
python wan_repack.py \
    --input-path Wan2.2-TI2V-5B-quantized \
    --output-path Wan2.2-TI2V-5B-Diffusers-MXFP8

# 4. Restore config.json that was deleted in step 2
cp Wan2.2-TI2V-5B-Diffusers/transformer/config.json \
   Wan2.2-TI2V-5B-Diffusers-MXFP8/transformer/config.json

After — single command:

python wan_repack.py \
    --model-type Wan2.2-TI2V-5B \
    --original-model-path Wan2.2-TI2V-5B-Diffusers \
    --quant-path        Wan2.2-TI2V-5B-quantized \
    --output-path       Wan2.2-TI2V-5B-Diffusers-MXFP8

Internally, the new repack() orchestrator runs three steps:

  1. shutil.copytree(original, output, ignore=transformer_dirs) — copies the full model (VAE, text encoder, scheduler, etc.) to the output path, skipping transformer dirs.
  2. convert_transformer() for each transformer dir — converts quantized weights to diffusion_pytorch_model.safetensors and renames keys to HF Diffusers format.
  3. shutil.copy2(original/transformer/config.json, output/transformer/config.json) — restores the architecture config that was excluded in step 1.

For Cascade models (Wan2.2-T2V-A14B, Wan2.2-I2V-A14B), steps 2–3 repeat for both transformer/ (sourced from quant_path/high_noise_model/) and transformer_2/ (sourced from quant_path/low_noise_model/). The cascade vs. single-model dispatch is driven by CASCADE_MODEL_TYPES.
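
The three-step flow above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actual repack() from wan_repack.py: the convert callable stands in for convert_transformer(), transformer_sources is a hypothetical helper, and for single-transformer models the sketch assumes the quantized weights sit directly in quant_path.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict

CASCADE_MODEL_TYPES = {"Wan2.2-T2V-A14B", "Wan2.2-I2V-A14B"}


def transformer_sources(model_type: str, quant_path: Path) -> Dict[str, Path]:
    """Map each output transformer dir to the directory holding its weights."""
    if model_type in CASCADE_MODEL_TYPES:
        return {
            "transformer": quant_path / "high_noise_model",
            "transformer_2": quant_path / "low_noise_model",
        }
    return {"transformer": quant_path}  # assumption for single-model layouts


def repack(
    model_type: str,
    original: Path,
    quant_path: Path,
    output: Path,
    convert: Callable[[Path, Path], None],
) -> None:
    sources = transformer_sources(model_type, quant_path)

    # Step 1: copy everything except the transformer directories.
    # Note: ignore_patterns matches directory names at every depth.
    shutil.copytree(original, output, ignore=shutil.ignore_patterns(*sources))

    for name, src in sources.items():
        out_dir = output / name
        out_dir.mkdir(parents=True, exist_ok=True)
        # Step 2: write the converted weights into the new transformer dir.
        convert(src, out_dir)
        # Step 3: restore the architecture config skipped in step 1.
        shutil.copy2(original / name / "config.json", out_dir / "config.json")
```

Passing the conversion step in as a callable keeps the orchestration testable without real checkpoints.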


Summary

| Aspect | Before | After |
| --- | --- | --- |
| Glob file matching | pathlib.Path(dir, "*.safetensors") — broken, FileNotFoundError on every run | dir.glob("*.safetensors") with existence and uniqueness checks |
| get_transformer_config() | Only "Wan-T2V-14B"; crashes with NameError for any other type | Supports Wan2.2-T2V-A14B, Wan2.2-I2V-A14B, Wan2.2-TI2V-5B; raises ValueError on unsupported type |
| quant_config key update | Unconditional dict.pop() → KeyError for non-quantized layers | if key in quant_config guard |
| Workflow | 4 manual steps (copy model, delete transformer, convert, restore config.json) | Single repack() call handles all steps |
| CLI arguments | --input-path, --output-path only | --model-type, --original-model-path, --quant-path, --output-path |

Performance Comparison Report

1. Scripts

# Base Model
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path /home/weights/Wan2.2-TI2V-5B-Diffusers \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--height 704 --width 1280 --num-gpus 1 --num-frames 81 --num-inference-steps 40  --warmup \
--perf-dump-path baseline.json

sleep 15

# Inference using Offline Modelslim Pre-Quantized Weights
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path /home/weights/Wan2.2-TI2V-5B-Diffusers-mxfp8 \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--height 704 --width 1280 --num-gpus 1 --num-frames 81 --num-inference-steps 40  --warmup \
--perf-dump-path offline.json

sleep 15

# Online Quantization Inference
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path /home/weights/Wan2.2-TI2V-5B-Diffusers \
--quantization mxfp8 \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--height 704 --width 1280 --num-gpus 1 --num-frames 81 --num-inference-steps 40  --warmup \
--perf-dump-path online.json

2. High-level Summary

| Metric | Baseline | offline.json | online.json |
| --- | --- | --- | --- |
| E2E Latency | 102922.86 ms | 91619.75 ms (-11.0%) ✅ | 91413.05 ms (-11.2%) ✅ |
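
The percentages in parentheses are relative changes against the baseline, (candidate - baseline) / baseline. A short check of the reported end-to-end figures, using only the numbers from the table (pct_change is a hypothetical helper name):

```python
def pct_change(baseline_ms: float, candidate_ms: float) -> float:
    """Relative change in percent; negative means faster than baseline."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


baseline = 102922.86
for name, ms in {"offline": 91619.75, "online": 91413.05}.items():
    print(f"{name}: {pct_change(baseline, ms):+.1f}%")
# offline: -11.0%
# online: -11.2%
```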

3. Stage Breakdown

| Stage Name | Baseline (ms) | offline.json (ms) | online.json (ms) |
| --- | --- | --- | --- |
| InputValidationStage | 0.11 | 0.08 (-30.0%) ⚪️ | 0.06 (-49.2%) ⚪️ |
| TextEncodingStage | 12573.16 | 12653.25 (+0.6%) ⚪️ | 12476.38 (-0.8%) ⚪️ |
| LatentPreparationStage | 0.27 | 0.21 (-24.7%) ⚪️ | 0.16 (-42.3%) ⚪️ |
| TimestepPreparationStage | 1.21 | 0.95 (-21.0%) ⚪️ | 0.90 (-25.5%) ⚪️ |
| DenoisingStage | 83386.55 | 72910.67 (-12.6%) 🟢 | 72868.55 (-12.6%) 🟢 |
| DecodingStage | 6952.78 | 6046.59 (-13.0%) 🟢 | 6058.86 (-12.9%) 🟢 |

Metadata
  • Baseline Commit: ef874c0a1c92bf29a35e7f2e7efaf2bdaed748fa
  • offline.json Commit: ef874c0a1c92bf29a35e7f2e7efaf2bdaed748fa
  • online.json Commit: ef874c0a1c92bf29a35e7f2e7efaf2bdaed748fa
  • Timestamp: 2026-03-25T16:09:56.257827

Related Issues

Closes part of #14424 (MXFP8/MXFP4 support on Ascend NPU for SGLang).

…th B)

Add NPUMXFP8LinearMethod that enables --quantization mxfp8 on Ascend NPU,
supporting both online (FP16/BF16 → MXFP8) and offline (serialized FP8
checkpoint) quantization via torch_npu APIs (npu_dynamic_mx_quant +
npu_quant_matmul with group_sizes=[1,1,32]).
…n Ascend NPU

Add MXFP8Config and NPUMXFP8DiffusionLinearMethod for the diffusion
subsystem (multimodal_gen), enabling --quantization mxfp8 for Wan2.2
and other diffusion models on Ascend NPU. Also adds explicit
quantization field to diffusion ServerArgs so online quantization
can be specified without pre-quantized weights.
- Ensure weight tensor is on NPU device before npu_dynamic_mx_quant call
- Flatten input x to 2D before quantization so input_scale is 3D (required by npu_quant_matmul)
- Simplify output shape restoration logic

Fixes: dimension of x1Scale(pertoken_scale) should be 3 but was 4
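
The flatten-then-restore pattern from this commit message can be shown in isolation. In the sketch below a plain NumPy matmul stands in for the quantize + npu_quant_matmul pair, and linear_flattened is a hypothetical name; the point is only that the kernel sees a 2D [tokens, hidden] input, so any per-token scale derived from it has one dimension fewer than it would for a [batch, seq, hidden] input.

```python
import numpy as np


def linear_flattened(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Apply y = x @ weight.T on a 2D view, then restore leading dims."""
    lead_shape = x.shape[:-1]              # e.g. (batch, seq)
    x2d = x.reshape(-1, x.shape[-1])       # [tokens, hidden]
    out2d = x2d @ weight.T                 # stand-in for the quantized matmul
    return out2d.reshape(*lead_shape, weight.shape[0])


x = np.ones((2, 3, 64), dtype=np.float32)
w = np.ones((5, 64), dtype=np.float32)
print(linear_flattened(x, w).shape)  # (2, 3, 5)
```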
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for online MXFP8 quantization on Ascend NPUs, significantly enhancing the efficiency and performance of both Large Language Models (LLMs) and Wan2.2 Diffusion models within the SGLang framework. By integrating NPU-specific quantization methods and updating the model loading mechanisms, it allows users to leverage the --quantization mxfp8 flag for optimized inference, streamlining the deployment of models on Ascend hardware.

Highlights

  • MXFP8 Quantization for Ascend NPU: Implemented online MXFP8 quantization support for SGLang on Ascend NPU (Path B), enabling --quantization mxfp8 for both LLM and Wan2.2 Diffusion inference.
  • Core LLM Serving Implementation: Added sglang/srt/hardware_backend/npu/quantization/mxfp8_method_npu.py which implements NPUMXFP8LinearMethod for online MXFP8 quantization, supporting FP16/BF16 to MXFP8 conversion and offline FP8 checkpoints using torch_npu APIs.
  • Wan2.2 Diffusion Support: Introduced sglang/multimodal_gen/runtime/layers/quantization/mxfp8_npu.py to extend MXFP8 support to Wan2.2 video diffusion models and updated model loader and server arguments to handle this quantization method.
  • Quantization Method Dispatch: Modified python/sglang/srt/layers/quantization/fp8.py to dispatch to the NPU MXFP8 backend when running on NPU and use_mxfp8 is enabled.
  • Bug Fixes and Cleanup: Addressed MXFP8 scale dimension mismatch in Diffusion models and NPU method call compatibility issues. The ModelSlim MXFP8 Scheme (Path A) was removed in favor of the unified online MXFP8 (Path B).
  • Testing: Added test/srt/ascend/test_ascend_mxfp8_quantization.py to cover online quantization correctness and performance for Ascend NPU.


Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds support for MXFP8 quantization on Ascend NPU for both LLM serving and Diffusion models. It introduces new quantization methods (NPUMXFP8LinearMethod and NPUMXFP8DiffusionLinearMethod) that leverage torch_npu for online and offline quantization. The changes also include updates to server arguments and model loading logic to enable this feature, along with a new test suite. The implementation appears solid, with separate, clean implementations for the LLM and diffusion model paths. I have one suggestion to improve robustness in the LLM quantization path by ensuring the weight tensor is on the NPU device before quantization, mirroring the practice in the diffusion path.

Comment thread python/sglang/srt/hardware_backend/npu/quantization/mxfp8_method_npu.py Outdated
Comment thread python/sglang/srt/hardware_backend/npu/quantization/mxfp8_method_npu.py Outdated
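Refactor the architecture layering as suggested by the reviewer:
- Add MXFP8LinearAscendMethod in fp8.py, responsible for weight definition (__init__, create_weights)
- Simplify NPUMXFP8LinearMethod in mxfp8_method_npu.py to keep only weight processing and kernel invocation
- Improve the architecture layering to match the existing NPU INT8 method pattern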
Fix weight loading for msmodelslim pre-quantized MXFP8 weights:
- Change weight dtype from int8 to float8_e4m3fn (actual storage format in safetensors)
- Fix weight_scale shape from [out, in/32*2] to [out, in/32] (actual msmodelslim export)
- Update process_weights_after_loading to reshape weight_scale [out, in/32] -> [out, -1, 2]
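
The shape change in the last bullet is a pure reshape that groups consecutive per-block scales into pairs. A minimal NumPy illustration with made-up dimensions follows; the commit message does not say why the kernel wants the trailing dimension of 2, so that part is left unexplained here.

```python
import numpy as np

out_features, in_features, block = 8, 128, 32

# One uint8 (E8M0) scale per 32 input elements, as exported offline.
weight_scale = np.full((out_features, in_features // block), 127, dtype=np.uint8)

# Group neighbouring scales into pairs: [out, in/32] -> [out, in/64, 2].
# This only works when in_features / 32 is even.
paired = weight_scale.reshape(out_features, -1, 2)
print(weight_scale.shape, paired.shape)  # (8, 4) (8, 2, 2)
```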
@TallMessiWu TallMessiWu changed the title 🚧 WIP: feat(npu): add MXFP8 quantization support for Ascend NPU ✨ [NPU][MXFP8] Add MXFP8 quantization support for Ascend NPU (LLM + Wan2.2 Diffusion) Mar 24, 2026
@ping1jing2
Collaborator

/rerun-failed-ci

@github-actions github-actions Bot requested a review from JustinTong0323 as a code owner May 1, 2026 13:03
@TallMessiWu
Contributor Author

CI failure analysis

I went through every failed/cancelled job. None of the failures are caused by changes in this PR. Breakdown by root cause:

1. NPU runner infra issue

  • stage-b-test-2-npu-a2 (0) (FAILURE) / (1) (CANCELLED) — failed during Install dependencies, before any test code ran:
    error: could not download file from 'https://rsproxy.cn/dist/channel-rust-stable.toml'
           ... connection reset
    ##[error]Executing the custom container implementation failed.
    
    Rust toolchain mirror (rsproxy.cn) network failure on the self-hosted NPU runner. Pure infra.

2. Pre-existing NPU test failure (exists on main, not introduced by this PR)

  • multimodal-gen-test-8-npu-a3: test_diffusion_generation[wan2_2_t2v_14b_w8a8_8npu] fails with:
    NotImplementedError: No modelslim compatible scheme was found.
    
    This PR only adds a W8A8_MXFP8 branch to _get_scheme_from_parts(); it does not remove or alter any existing branch. The failure is reproducible on the unrelated PR bbuf/hunyuan3d-dit-fusions at the same line — same model, same trace. It looks like the Wan2.2-T2V-A14B-Diffusers-w8a8 checkpoint's quant_type doesn't match any of W8A8_DYNAMIC / W8A8 / W4A4_DYNAMIC / W8A8_MXFP8 and needs a separate scheme registered (out of scope here).
  • pr-test-npu-finish — rollup of the two NPU jobs above.

3. AMD runner environment issues

  • multimodal-gen-test-2-gpu-amd (linux-mi300-2gpu-sglang, 0) — all 11 tests ERROR at setup:
    torch.distributed.DistBackendError: NCCL error in: NCCLUtils.cpp:94,
        unhandled cuda error, NCCL version 2.26.6
    ncclUnhandledCudaError: Cuda failure 'invalid argument'
    
    Distributed init blows up on the MI300 runner — environment, not code.
  • stage-b-test-1-gpu-small-amd-mi35x (linux-mi35x-gpu-1) — VRAM cleanup failed before tests started:
    ERROR: VRAM usage exceeds threshold (5%) on some GPUs:
        GPU[0] : GPU Memory Allocated (VRAM%): 94
    === FAILED: VRAM cleanup unsuccessful after 3 attempts ===
    
    Stale process from a prior run holding the device.
  • stage-b-test-1-gpu-large-amd (linux-mi300-1gpu-sglang, 1): test_score_api_latency_throughput on Qwen/Qwen3-Reranker-0.6B:
    AssertionError: 51.71446999302134 not less than 50
    
    Latency exceeded the 50 ms threshold by 1.7 ms. Unrelated to this PR (no Qwen reranker / score API changes).

4. Pre-existing flaky multimodal-gen GPU failures (also fail on unrelated PRs)

  • call-multimodal-gen-tests / multimodal-gen-test-2-gpu (0)/(1)/(2) — 12 cases failing the [consistency] or [performance] checks:

    • consistency (extremely poor SSIM/PSNR, e.g. ssim=0.03, psnr=9.6): wan2_2_i2v_a14b_2gpu, wan2_2_t2v_a14b_2gpu, wan2_2_t2v_a14b_teacache_2gpu, mova_360p_ring1_uly2, mova_360p_tp2, wan2_1_i2v_14b_480P_2gpu, wan2_1_i2v_14b_720P_2gpu, wan2_1_i2v_14b_lora_2gpu, fsdp-inference, ltx_2_3_two_stage_ti2v_2gpus
    • performance: ltx_2.3_one_stage_ti2v (983 ms vs 949 ms limit, ~3.5% over), ltx_2.3_two_stage_t2v_2gpus (322.5 ms vs 320.7 ms limit)

    None of these models/paths are touched by this PR (this PR only adds NPU MXFP8 for Wan2.2 diffusion + a tiny SRT FP8 import refactor that doesn't change Wan2.x/LTX/mova/fsdp behavior).

    The same Wan2.2/LTX consistency failures reproduce on the unrelated PR codex/default-vae-channels-last-3d — same wan2_2_t2v_a14b_2gpu (clip=0.52, ssim=0.03) and ltx_2.3_one_stage_ti2v performance miss. Strongly looks like CI-data drift / flakiness on main.

  • call-multimodal-gen-tests / diffusion-coverage-check — fails because the test jobs above failed.

5. Cascading fast-fail (auto-skipped after a root cause failed)

All of these explicitly say Fast-fail: skipping — root cause job(s): call-multimodal-gen-tests / multimodal-gen-test-2-gpu (0), wait-for-stage-b:

  • wait-for-stage-b
  • stage-b-test-1-gpu-small (3)..(7)
  • stage-b-test-1-gpu-large (3)..(13)
  • stage-b-test-2-gpu-large (2), (3)
  • pr-test-finish

Summary

| Category | Jobs | Cause |
| --- | --- | --- |
| NPU infra | 2 | Rust mirror network failure on self-hosted runner |
| Pre-existing NPU test | 2 | wan2_2_t2v_14b_w8a8_8npu quant_type unmapped (also fails on unrelated PRs) |
| AMD environment | 3 | NCCL init / VRAM cleanup / 1.7 ms latency overshoot |
| Pre-existing flaky GPU | 4 | Wan2.x / LTX / mova / fsdp consistency or performance (also fails on unrelated PRs) |
| Cascading fast-fail | 20 | Auto-skipped due to the above |

No action required on this PR's code. Re-running CI once the runner infra recovers should clear the cascading failures; the NPU wan2_2_t2v_14b_w8a8_8npu and the GPU consistency drifts need a separate fix on main.

Comment thread docs_new/index.mdx Outdated
Restore docs_new/index.mdx to upstream/main state. This file was
modified by an automated github-actions[bot] commit (762f21f) that
ran the LMSYS blog sync workflow against the fork's junlin branch,
unrelated to the diffusion MXFP8 changes in this PR.
@ping1jing2
Collaborator

I merged it since we have already analyzed all the CI failures and the failed tests are unrelated to our change. Please let me know if there are any other issues.

@sglang-npu-bot sglang-npu-bot merged commit 80a6014 into sgl-project:main May 7, 2026
83 of 123 checks passed
@OrangeRedeng
Contributor

it won't work until #24540 is merged

Dogacel pushed a commit to Dogacel/sglang-fork that referenced this pull request May 8, 2026
…iffusion on Ascend NPU (sgl-project#20922)

Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
…iffusion on Ascend NPU (sgl-project#20922)

Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026
…iffusion on Ascend NPU (sgl-project#20922)

Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
TallMessiWu added a commit to TallMessiWu/sglang that referenced this pull request May 20, 2026
…#22352 conflicts

Resolved 10 conflicts:
- Diffusion side (7 files): used upstream/main (includes merged sgl-project#20922/sgl-project#22338)
- LLM side fp8.py: kept both NPU and MUSA capability bypasses
- LLM side modelslim.py: added W8A8_MXFP8 to upstream's table-driven scheme dispatch
- LLM side transformers.py: used upstream/main (MoE refactoring with proper prefix)
TallMessiWu added a commit to TallMessiWu/sglang that referenced this pull request May 20, 2026
… prerequisites

- Prerequisites sgl-project#20922 (Diffusion MXFP8) and sgl-project#22338 (Diffusion MXFP4) merged
  upstream — accept their canonical versions and remove our duplicate
  diffusion modifications from this PR.
- Adapt offline MXFP8 dispatch to upstream's table-driven
  ModelSlimConfig.get_linear_scheme by registering W8A8_MXFP8 →
  ModelSlimMXFP8Scheme; add a no-op __init__ on the scheme so its
  signature matches the other entries.
- Keep NPU bypass in Fp8Config.get_min_capability alongside the new
  upstream _is_musa branch; revert an unrelated MoE _use_aiter style
  change that drifted in via earlier merges.

Labels

diffusion (SGLang Diffusion), documentation (Improvements or additions to documentation), npu, quant (LLM Quantization), run-ci
