Merged
50 commits
2369818
Update documentation
OrangeRedeng Mar 22, 2026
75de00b
Fix lint issue
OrangeRedeng Mar 22, 2026
49d06e1
Merge remote-tracking branch 'sglang' into update_npu_quantization_doc
OrangeRedeng Mar 23, 2026
50c13a2
Update quantization.md
OrangeRedeng Mar 23, 2026
094145c
Update ascend_npu_quantization.md
TamirBaydasov Mar 23, 2026
1970044
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 23, 2026
c007afd
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 24, 2026
1e528da
Update quantization.md
OrangeRedeng Mar 24, 2026
f4f3cad
Update quantization.md
OrangeRedeng Mar 24, 2026
24318e1
Update quantization.md
OrangeRedeng Mar 24, 2026
1586790
Update ascend_contribution_guide.md
OrangeRedeng Mar 24, 2026
0bc72b8
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 24, 2026
6a6b9d7
Update README.md
OrangeRedeng Mar 24, 2026
dd36297
fix Lint issue
OrangeRedeng Mar 24, 2026
32fd60a
Update ascend_contribution_guide.md
OrangeRedeng Mar 24, 2026
63ce5f1
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 24, 2026
0c1646b
Update ascend_contribution_guide.md
OrangeRedeng Mar 24, 2026
fe897d9
Update ascend_contribution_guide.md
OrangeRedeng Mar 24, 2026
390a7ce
Update ascend_contribution_guide.md
OrangeRedeng Mar 24, 2026
f7460ff
Update ascend_contribution_guide.md
OrangeRedeng Mar 25, 2026
723dd6f
Merge remote-tracking branch 'sglang' into update_npu_quantization_doc
OrangeRedeng Mar 25, 2026
ab78c20
Update quantization.md
OrangeRedeng Mar 25, 2026
055288a
Update quantization.md
OrangeRedeng Mar 25, 2026
67906c5
Update quantization.md
OrangeRedeng Mar 25, 2026
4d9c431
Update quantization.md
OrangeRedeng Mar 25, 2026
0a23f99
Update quantization.md
OrangeRedeng Mar 25, 2026
ca7d7a4
Update quantization.md
OrangeRedeng Mar 25, 2026
4ddda82
Update quantization.md
OrangeRedeng Mar 25, 2026
0d09d78
Update quantization.md
OrangeRedeng Mar 25, 2026
8491d8b
Update quantization.md
OrangeRedeng Mar 25, 2026
30dfef2
Update quantization.md
OrangeRedeng Mar 25, 2026
739e25d
Update quantization.md
OrangeRedeng Mar 25, 2026
fadbe9c
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 25, 2026
ac9831b
fix Lint issue
OrangeRedeng Mar 25, 2026
c8e53d4
Update quantization.md
OrangeRedeng Mar 25, 2026
5fac616
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 25, 2026
d4c5810
fix Lint issue
OrangeRedeng Mar 25, 2026
bb543b7
Update ascend_npu_qwen3_5_examples.md
OrangeRedeng Mar 25, 2026
4302b85
Update ascend_npu_glm5_examples.md
OrangeRedeng Mar 25, 2026
94d862f
Fix lint issue
OrangeRedeng Mar 25, 2026
6251b82
Update ascend_contribution_guide.md
OrangeRedeng Mar 25, 2026
037c347
Update ascend_contribution_guide.md
OrangeRedeng Mar 25, 2026
57cf070
Update ascend_contribution_guide.md
OrangeRedeng Mar 25, 2026
76d2eb9
Update ascend_contribution_guide.md
OrangeRedeng Mar 25, 2026
7c762c5
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 25, 2026
878e0cf
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 26, 2026
ea03e31
Merge branch 'main' into update_npu_quantization_doc
OrangeRedeng Mar 27, 2026
b55dde7
Update SKILL.md
OrangeRedeng Mar 27, 2026
d5744ce
Update README.md
OrangeRedeng Mar 27, 2026
48c6e1e
Update SKILL.md
OrangeRedeng Mar 27, 2026
20 changes: 19 additions & 1 deletion .claude/skills/write-sglang-test/SKILL.md
@@ -92,9 +92,22 @@ Defined in `python/sglang/test/test_utils.py`:
| `stage-c-test-large-8-gpu-amd` | `linux-mi325-8gpu-sglang` | 8-GPU MI325 scaling and integration |
| `stage-c-test-large-8-gpu-amd-mi35x` | `linux-mi35x-gpu-8` | 8-GPU MI35x scaling (2 partitions) |


### Per-commit (Ascend NPU)

| Suite | Runner (label) | Description |
| --- | --- | --- |
| `per-commit-1-npu-a2` | `linux-aarch64-a2-1` | 1-NPU LLM CI machine |
| `per-commit-2-npu-a2` | `linux-aarch64-a2-2` | 2-NPU LLM CI machine |
| `per-commit-4-npu-a3` | `linux-aarch64-a3-4` | 4-NPU LLM CI machine |
| `per-commit-16-npu-a3` | `linux-aarch64-a3-16` | 16-NPU LLM CI machine |
| `multimodal-gen-test-1-npu-a3` | `linux-aarch64-a3-2` | 1-NPU multimodal CI machine |
| `multimodal-gen-test-2-npu-a3` | `linux-aarch64-a3-16` | 2-NPU multimodal CI machine |
| `multimodal-gen-test-8-npu-a3` | `linux-aarch64-a3-16` | 8-NPU multimodal CI machine |

#### Nightly

Nightly suites are listed in `NIGHTLY_SUITES` in [`test/run_suite.py`](../../../test/run_suite.py). They run via `nightly-test-nvidia.yml` and `nightly-test-amd.yml`, not `pr-test.yml`. Examples:
Nightly suites are listed in `NIGHTLY_SUITES` in [`test/run_suite.py`](../../../test/run_suite.py). They run via `nightly-test-nvidia.yml`, `nightly-test-amd.yml` and `nightly-test-npu.yml`, not `pr-test.yml`. Examples:

- `nightly-1-gpu` (CUDA)
- `nightly-kernel-1-gpu` (CUDA, JIT kernel full grids)
@@ -103,6 +116,11 @@ Nightly suites are listed in `NIGHTLY_SUITES` in [`test/run_suite.py`](../../../
- `nightly-eval-vlm-2-gpu` (CUDA)
- `nightly-amd` (AMD)
- `nightly-amd-8-gpu-mi35x` (AMD)
- `nightly-1-npu-a3` (NPU)
- `nightly-2-npu-a3` (NPU)
- `nightly-4-npu-a3` (NPU)
- `nightly-8-npu-a3` (NPU)
- `nightly-16-npu-a3` (NPU)

> **Note**: Multimodal diffusion uses `python/sglang/multimodal_gen/test/run_suite.py`, not `test/run_suite.py`.

121 changes: 96 additions & 25 deletions docs/advanced_features/quantization.md
@@ -19,32 +19,35 @@ to guard against abnormal quantization loss regressions.

## Platform Compatibility

The following table summarizes quantization method support across NVIDIA and AMD GPUs.

| Method | NVIDIA GPUs | AMD GPUs (MI300X/MI325X/MI350X) | Notes |
|--------|:-----------:|:-------------------------------:|-------|
| `fp8` | Yes | Yes | Aiter or Triton backend on AMD |
| `mxfp4` | Yes | Yes | Requires CDNA3/CDNA4 with MXFP support; uses Aiter |
| `blockwise_int8` | Yes | Yes | Triton-based, works on both platforms |
| `w8a8_int8` | Yes | Yes | |
| `w8a8_fp8` | Yes | Yes | Aiter or Triton FP8 on AMD |
| `awq` | Yes | Yes | Uses Triton dequantize on AMD (vs. optimized CUDA kernels on NVIDIA) |
| `gptq` | Yes | Yes | Uses Triton or vLLM kernels on AMD |
| `compressed-tensors` | Yes | Yes | Aiter paths for FP8/MoE on AMD |
| `quark` | Yes | Yes | AMD Quark quantization; Aiter GEMM paths on AMD |
| `auto-round` | Yes | Yes | Platform-agnostic (Intel auto-round) |
| `quark_int4fp8_moe` | No | Yes | AMD-only; online INT4-to-FP8 MoE quantization (CDNA3/CDNA4) |
| `awq_marlin` | Yes | No | Marlin kernels are CUDA-only |
| `gptq_marlin` | Yes | No | Marlin kernels are CUDA-only |
| `gguf` | Yes | No | CUDA-only kernels in sgl-kernel |
| `modelopt` / `modelopt_fp8` | Yes (Hopper/SM90+) | No | [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer); requires NVIDIA hardware |
| `modelopt_fp4` | Yes (Blackwell/SM100+) | No | [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer); native FP4 on Blackwell (B200, GB200) |
| `petit_nvfp4` | No | Yes (MI250/MI300X/MI325X) | Enables NVFP4 on ROCm via [Petit](https://github.com/causalflow-ai/petit-kernel); use `modelopt_fp4` on NVIDIA Blackwell. Auto-selected when loading NVFP4 models on AMD. See [LMSYS blog](https://lmsys.org/blog/2025-09-21-petit-amdgpu/) and [AMD ROCm blog](https://rocm.blogs.amd.com/artificial-intelligence/fp4-mixed-precision/README.html). |
| `bitsandbytes` | Yes | Experimental | Depends on bitsandbytes ROCm support |
| `torchao` (`int4wo`, etc.) | Yes | Partial | `int4wo` not supported on AMD; other methods may work |
The following table summarizes quantization method support across NVIDIA GPUs, AMD GPUs, and Ascend NPUs.

| Method | NVIDIA GPUs | AMD GPUs (MI300X/MI325X/MI350X) | Ascend NPUs (A2/A3) | Notes |
|--------|:-----------:|:-------------------------------:|:-----------------------:|-------|
| `fp8` | Yes | Yes | WIP | Aiter or Triton backend on AMD |
| `mxfp4` | Yes | Yes | WIP | Requires CDNA3/CDNA4 with MXFP support; uses Aiter |
| `blockwise_int8` | Yes | Yes | No | Triton-based; works on NVIDIA and AMD |
| `w8a8_int8` | Yes | Yes | No | |
| `w8a8_fp8` | Yes | Yes | No | Aiter or Triton FP8 on AMD |
| `awq` | Yes | Yes | Yes | Uses Triton dequantize on AMD (vs. optimized CUDA kernels on NVIDIA); CANN kernels on Ascend |
| `gptq` | Yes | Yes | Yes | Uses Triton or vLLM kernels on AMD; CANN kernels on Ascend |
| `compressed-tensors` | Yes | Yes | Partial | Aiter paths for FP8/MoE on AMD; CANN kernels on Ascend (`FP8` not supported yet) |
| `quark` | Yes | Yes | No | AMD Quark quantization; Aiter GEMM paths on AMD |
| `auto-round` | Yes | Yes | Partial | Platform-agnostic (Intel auto-round); CANN kernels on Ascend |
| `quark_int4fp8_moe` | No | Yes | No | AMD-only; online INT4-to-FP8 MoE quantization (CDNA3/CDNA4) |
| `awq_marlin` | Yes | No | No | Marlin kernels are CUDA-only |
| `gptq_marlin` | Yes | No | No | Marlin kernels are CUDA-only |
| `gguf` | Yes | No | WIP | CUDA-only kernels in sgl-kernel |
| `modelopt` / `modelopt_fp8` | Yes (Hopper/SM90+) | No | No | [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer); requires NVIDIA hardware |
| `modelopt_fp4` | Yes (Blackwell/SM100+) | No | No | [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer); native FP4 on Blackwell (B200, GB200) |
| `petit_nvfp4` | No | Yes (MI250/MI300X/MI325X) | No | Enables NVFP4 on ROCm via [Petit](https://github.com/causalflow-ai/petit-kernel); use `modelopt_fp4` on NVIDIA Blackwell. Auto-selected when loading NVFP4 models on AMD. See [LMSYS blog](https://lmsys.org/blog/2025-09-21-petit-amdgpu/) and [AMD ROCm blog](https://rocm.blogs.amd.com/artificial-intelligence/fp4-mixed-precision/README.html). |
| `bitsandbytes` | Yes | Experimental | No | Depends on bitsandbytes ROCm support |
| `torchao` (`int4wo`, etc.) | Yes | Partial | No | `int4wo` not supported on AMD; other methods may work |
| `modelslim` | No | No | Yes | Ascend-only quantization; uses CANN kernels |

On AMD, several of these methods use [Aiter](https://github.com/ROCm/aiter) for acceleration -- set `SGLANG_USE_AITER=1` where noted. See [AMD GPU setup](../platforms/amd_gpu.md) for installation and configuration details.

On Ascend, a variety of per-layer quantization configurations are supported; see [Ascend NPU quantization](../platforms/ascend/ascend_npu_quantization.md) for details.

## GEMM Backends for FP4/FP8 Quantization

:::{note}
@@ -81,7 +84,7 @@ When FlashInfer is unavailable for NVFP4, sgl-kernel CUTLASS is used as an autom

To load already quantized models, simply load the model weights and config. **Again, if the model has been quantized offline,
there's no need to add `--quantization` argument when starting the engine. The quantization method will be parsed from the
downloaded Hugging Face config. For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant parameters.**
downloaded Hugging Face or msModelSlim config. For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant parameters.**

```bash
python3 -m sglang.launch_server \
@@ -319,7 +322,6 @@ For detailed usage and supported model architectures, see [NVIDIA Model Optimize

SGLang includes a streamlined workflow for quantizing models with ModelOpt and automatically exporting them for deployment.


##### Installation

First, install ModelOpt:
@@ -477,6 +479,74 @@ model_loader.load_model(model_config=model_config, device_config=DeviceConfig())
- **Calibration-based**: Uses calibration datasets for optimal quantization quality
- **Production Ready**: Enterprise-grade quantization with NVIDIA support

#### Using [ModelSlim](https://gitcode.com/Ascend/msmodelslim)

MindStudio ModelSlim (msModelSlim) is an offline model quantization and compression tool from MindStudio, optimized for Ascend hardware.

- **Installation**

```bash
# Clone repo and install msmodelslim:
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
```

- **LLM quantization**

Download the original floating-point weights of the model. Taking Qwen3-32B as an example, you can obtain the original model weights from [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B). Then install any remaining model-specific dependencies (refer to the Hugging Face model card).
> Note: You can find pre-quantized validated models on [modelscope/Eco-Tech](https://modelscope.cn/models/Eco-Tech).

_Traditional quantization methods require calibration data files (`.jsonl` format) to be prepared for the calibration step._
```text
Qwen3-32B/       # floating-point model downloaded from the official HF (or ModelScope) repo
msmodelslim/     # msmodelslim repo
|----- lab_calib # calibration data folder (put your dataset here in .jsonl format, or use pre-prepared ones)
       |----- some file (such as laos_calib.jsonl)
|----- lab_practice # best-practice folder with configs for quantization
       |----- model folder (such as qwen3_5_moe) # folder with quantization configs
              |----- quant config (such as qwen3_5_moe_w8a8.yaml) # quantization config
       |----- other folders
output_folder/   # generated by the command below
|----- quant_model_weights-00001-of-0001.safetensors # quantized weights
|----- quant_model_description.json # description of the quantization method for each layer (W4A4_DYNAMIC, etc.)
|----- other files (such as config.json, tokenizer.json, etc.)
```
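The `quant_model_description.json` file can also be inspected programmatically. Below is a minimal sketch, assuming the file is a JSON object mapping layer names to method strings (the exact schema may differ between msmodelslim versions); the layer names here are hypothetical:

```python
import json
from collections import Counter


def summarize_quant_description(path):
    # Assumed schema: JSON object mapping layer names to quantization
    # method strings such as "W8A8" or "W8A8_DYNAMIC".
    with open(path) as f:
        desc = json.load(f)
    return Counter(str(v) for v in desc.values())


# Hypothetical example content, written out for illustration only.
example = {
    "model.layers.0.self_attn.q_proj.weight": "W8A8",
    "model.layers.0.mlp.gate_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.mlp.up_proj.weight": "W8A8_DYNAMIC",
}
with open("quant_model_description.json", "w") as f:
    json.dump(example, f)

print(summarize_quant_description("quant_model_description.json"))
# Counter({'W8A8_DYNAMIC': 2, 'W8A8': 1})
```

A quick summary like this is useful for verifying that the quantization config you selected was actually applied to the expected layers.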
Run one-click quantization (recommended):
```bash
msmodelslim quant \
--model_path ${MODEL_PATH} \
--save_path ${SAVE_PATH} \
--device npu:0,1 \
--model_type Qwen3-32B \
--quant_type w8a8 \
--trust_remote_code True
```

- **Usage Example**
```bash
python3 -m sglang.launch_server \
--model-path $PWD/Qwen3-32B-w8a8 \
--port 30000 --host 0.0.0.0
```

- **Available Quantization Methods**:
  - [x] `W4A4_DYNAMIC` linear with online quantization of activations
  - [x] `W8A8` linear with offline quantization of activations
  - [x] `W8A8_DYNAMIC` linear with online quantization of activations
  - [x] `W4A4_DYNAMIC` MoE with online quantization of activations
  - [x] `W4A8_DYNAMIC` MoE with online quantization of activations
  - [x] `W8A8_DYNAMIC` MoE with online quantization of activations
  - [ ] `W4A8` linear TBD
  - [ ] `W4A16` linear TBD
  - [ ] `W48A16` linear TBD
  - [ ] `W4A16` MoE in progress
  - [ ] `W8A16` MoE in progress
  - [ ] `KV Cache` in progress
  - [ ] `Attention` in progress


For more detailed model quantization examples, as well as information about model support, see the [examples](https://gitcode.com/Ascend/msmodelslim/blob/master/example/README.md) section in the ModelSlim repo.

## Online Quantization

To enable online quantization, you can simply specify `--quantization` in the command line. For example, you can launch the server with the following command to enable `FP8` quantization for model `meta-llama/Meta-Llama-3.1-8B-Instruct`:
@@ -529,3 +599,4 @@ Other layers (e.g. projections in the attention layers) have their weights quant
- [Torchao: PyTorch Architecture Optimization](https://github.com/pytorch/ao)
- [vLLM Quantization](https://docs.vllm.ai/en/latest/quantization/)
- [auto-round](https://github.com/intel/auto-round)
- [ModelSlim](https://gitcode.com/Ascend/msmodelslim)
2 changes: 1 addition & 1 deletion docs/basic_usage/deepseek_v3.md
@@ -74,7 +74,7 @@ Detailed commands for reference:
- [16 x A100 (INT8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization)
- [32 x L40S (INT8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization)
- [Xeon 6980P CPU](../platforms/cpu_server.md#example-running-deepseek-r1)
- [4 x Atlas 800I A3 (int8)](../platforms/ascend_npu_deepseek_example.md#running-deepseek-with-pd-disaggregation-on-4-x-atlas-800i-a3)
- [4 x Atlas 800I A3 (int8)](../platforms/ascend/ascend_npu_deepseek_example.md#running-deepseek-with-pd-disaggregation-on-4-x-atlas-800i-a3)

### Download Weights
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. Please refer to [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) official guide to download the weights.
2 changes: 1 addition & 1 deletion docs/diffusion/installation.md
@@ -84,7 +84,7 @@ pip install -e "python[all_musa]"

## Platform-Specific: Ascend NPU

For Ascend NPU, please follow the [NPU installation guide](../platforms/ascend_npu.md).
For Ascend NPU, please follow the [NPU installation guide](../platforms/ascend/ascend_npu.md).

Quick test:

68 changes: 67 additions & 1 deletion docs/diffusion/quantization.md
@@ -44,7 +44,8 @@ backend.
|------------------|--------------------------------------------------------------------------------------------|------------------------------------------------------|--------------------------------------------------------------|---------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| `fp8` | Quantized transformer component folder, or safetensors with `quantization_config` metadata | `--transformer-path` or `--transformer-weights-path` | ALL | None | Component-folder and single-file flows are both supported |
| `nvfp4-modelopt` | NVFP4 safetensors file, sharded directory, or repo providing transformer weights | `--transformer-weights-path` | FLUX.2 | `comfy-kitchen` optional on Blackwell | Blackwell can use a best-performance kit when available; otherwise SGLang falls back to the generic ModelOpt FP4 path |
| `nunchaku-svdq` | Pre-quantized Nunchaku transformer weights, usually named `svdq-{int4\|fp4}_r{rank}-...` | `--transformer-weights-path` | Model-specific support such as Qwen-Image, FLUX, and Z-Image | `nunchaku` | SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4` |
| `nunchaku-svdq` | Pre-quantized Nunchaku transformer weights, usually named `svdq-{int4\|fp4}_r{rank}-...` | `--transformer-weights-path` | Model-specific support such as Qwen-Image, FLUX, and Z-Image | `nunchaku` | SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4` |
| `msmodelslim` | Pre-quantized msmodelslim transformer weights | `--model-path` | Wan2.2 family | None | Currently only compatible with the Ascend NPU family and supports both `w8a8` and `w4a4` |

## NVFP4

@@ -171,3 +172,68 @@ sglang generate \
as `4` or `8`.
- Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x)
or SM12x GPUs. Hopper (SM90) is currently rejected.

## [ModelSlim](https://gitcode.com/Ascend/msmodelslim)

MindStudio ModelSlim (msModelSlim) is an offline model quantization and compression tool from MindStudio, optimized for Ascend hardware.

- **Installation**

```bash
# Clone repo and install msmodelslim:
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
```

- **Multimodal_sd quantization**

Download the original floating-point weights of the model. Taking Wan2.2-T2V-A14B as an example, you can obtain the original model weights from [Wan2.2-T2V-A14B](https://modelscope.cn/models/Wan-AI/Wan2.2-T2V-A14B). Then install any remaining model-specific dependencies (refer to the ModelScope model card).
> Note: You can find pre-quantized validated models on [modelscope/Eco-Tech](https://modelscope.cn/models/Eco-Tech).

Run one-click quantization (recommended):

```bash
msmodelslim quant \
--model_path /path/to/wan2_2_float_weights \
--save_path /path/to/wan2_2_quantized_weights \
--device npu \
--model_type Wan2_2 \
--quant_type w8a8 \
--trust_remote_code True
```

For more detailed examples of quantization of models, as well as information about their support, see the [examples](https://gitcode.com/Ascend/msmodelslim/blob/master/example/multimodal_sd/README.md) section in ModelSLim repo.

> Note: SGLang does not support quantized embeddings; disable this option when quantizing with msmodelslim.

- **Auto-Detection and different formats**

For msmodelslim checkpoints, it is enough to specify only `--model-path`; the quantization method is detected automatically for each layer by parsing the `quant_model_description.json` config.
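As a rough illustration of this auto-detection, a checkpoint directory can be probed for the description file. This is a hedged sketch of the idea only (the real detection logic in SGLang may differ), using a hypothetical stand-in directory:

```python
import json
from pathlib import Path


def detect_msmodelslim(model_path):
    # A checkpoint is treated as msmodelslim-quantized when a
    # quant_model_description.json sits next to the weights.
    desc_file = Path(model_path) / "quant_model_description.json"
    if not desc_file.is_file():
        return None  # plain floating-point checkpoint
    with open(desc_file) as f:
        # Assumed schema: per-layer method map, e.g. {"layer": "w8a8"}
        return json.load(f)


# Demo with a hypothetical stand-in checkpoint directory.
ckpt = Path("demo_ckpt")
ckpt.mkdir(exist_ok=True)
(ckpt / "quant_model_description.json").write_text('{"proj": "w8a8"}')
print(detect_msmodelslim(ckpt))           # {'proj': 'w8a8'}
print(detect_msmodelslim("no_such_dir"))  # None
```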

For `Wan2.2`, only the `Diffusers` weight storage format is supported, whereas ModelSlim saves the quantized model in the original `Wan2.2` format.
To convert between them, use the `python/sglang/multimodal_gen/tools/wan_repack.py` script:

```bash
python wan_repack.py \
--input-path {path_to_quantized_model} \
--output-path {path_to_converted_model}
```

After that, copy all files from the original `Diffusers` checkpoint except the `transformer`/`transformer_2` folders (the repacked output already provides those).
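The copy step above can be scripted. A minimal sketch with hypothetical stand-in paths, assuming the repacked output already contains the quantized `transformer`/`transformer_2` folders:

```python
import shutil
from pathlib import Path


def copy_non_transformer_files(src, dst, skip=("transformer", "transformer_2")):
    # Copy everything from the original Diffusers checkpoint except the
    # transformer folders, which wan_repack.py has already produced.
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    for entry in src.iterdir():
        if entry.name in skip:
            continue
        if entry.is_dir():
            shutil.copytree(entry, dst / entry.name, dirs_exist_ok=True)
        else:
            shutil.copy2(entry, dst / entry.name)


# Demo with toy directories standing in for the real checkpoints.
src = Path("demo_diffusers_ckpt")
(src / "transformer").mkdir(parents=True, exist_ok=True)
(src / "vae").mkdir(exist_ok=True)
(src / "model_index.json").write_text("{}")
(src / "vae" / "config.json").write_text("{}")
copy_non_transformer_files(src, "demo_converted_ckpt")
```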

- **Usage Example**

With auto-detected flow:

```bash
sglang generate \
--model-path Eco-Tech/Wan2.2-T2V-A14B-Diffusers-w8a8 \
--prompt "a beautiful sunset" \
--save-output
```

- **Available Quantization Methods**:
  - [x] `W4A4_DYNAMIC` linear with online quantization of activations
  - [x] `W8A8` linear with offline quantization of activations
  - [x] `W8A8_DYNAMIC` linear with online quantization of activations
  - [ ] `mxfp8` linear in progress
2 changes: 1 addition & 1 deletion docs/get_started/install.md
@@ -2,7 +2,7 @@

You can install SGLang using one of the methods below.
This page primarily applies to common NVIDIA GPU platforms.
For other or newer platforms, please refer to the dedicated pages for [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [TPU](../platforms/tpu.md), [NVIDIA DGX Spark](https://lmsys.org/blog/2025-11-03-gpt-oss-on-nvidia-dgx-spark/), [NVIDIA Jetson](../platforms/nvidia_jetson.md), [Ascend NPUs](../platforms/ascend_npu.md), and [Intel XPU](../platforms/xpu.md).
For other or newer platforms, please refer to the dedicated pages for [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [TPU](../platforms/tpu.md), [NVIDIA DGX Spark](https://lmsys.org/blog/2025-11-03-gpt-oss-on-nvidia-dgx-spark/), [NVIDIA Jetson](../platforms/nvidia_jetson.md), [Ascend NPUs](../platforms/ascend/ascend_npu.md), and [Intel XPU](../platforms/xpu.md).

## Method 1: With pip or uv

2 changes: 1 addition & 1 deletion docs/index.rst
@@ -99,7 +99,7 @@ Its core features include:
platforms/cpu_server.md
platforms/tpu.md
platforms/nvidia_jetson.md
platforms/ascend_npu_support.rst
platforms/ascend/ascend_npu_support.rst
platforms/xpu.md

.. toctree::