
[NPU] NPU quantization refactoring & more quantization formats support#14504

Merged
iforgetmyname merged 218 commits into sgl-project:main from OrangeRedeng:npu_quantization_refactor
Jan 14, 2026

Conversation

Contributor

@OrangeRedeng OrangeRedeng commented Dec 5, 2025

Motivation

Related to #14424 (you can find the class diagram there). Follows #13664.

Continuation of the refactoring started in #13359 and the feature support started in #11984. To simplify support for various quantization algorithms, the code is being refactored to separate the weight-loading mechanisms from the inference kernels.

Stage 1 (in progress): separate the inference code (kernels) from the code tied to a specific quantization framework (msmodelslim, AWQ, auto-round, etc.), and support schemes for msmodelslim.

Modifications

Refactored:

  • Moved create_weights() for w8a8 linear method
    sglang/srt/hardware_backend/npu/quantization/linear_method_npu.py -> sglang/srt/layers/quantization/msmodelslim/schemes/msmodelslim_w8a8_int8.py
  • Moved create_weights() for w8a8 MOE method
    sglang/srt/hardware_backend/npu/quantization/fused_moe_method_npu.py -> sglang/srt/layers/quantization/msmodelslim/msmodelslim_moe.py
  • Moved create_weights() for w4a8 MOE method
    sglang/srt/hardware_backend/npu/quantization/fused_moe_method_npu.py -> sglang/srt/layers/quantization/msmodelslim/msmodelslim_moe.py
  • Moved and redesigned ModelSlimConfig()
    sglang/srt/hardware_backend/npu/quantization/msmodelslim.py ->
    sglang/srt/layers/quantization/msmodelslim/msmodelslim.py
  • Removed the sglang/srt/hardware_backend/npu/quantization/msmodelslim.py file
  • Moved create_weights() for w4a16 from msmodelslim to compressed-tensors
    sglang/srt/hardware_backend/npu/quantization/fused_moe_method_npu.py ->
    sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py
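To illustrate the direction of the refactor, here is a minimal, hypothetical sketch of how a quantization config could dispatch to per-format scheme classes once weight loading is separated from the kernels (all class and function names below are illustrative, not the actual sglang API):

```python
# Hypothetical sketch: a registry mapping a quantization format to a scheme
# class that knows how to create weights, independent of the NPU kernels
# that later consume them. Names are illustrative only.

class W8A8Int8Scheme:
    """Describes int8 weight/scale parameters for a linear layer."""
    def create_weights(self, out_features, in_features):
        # Real code would allocate torch parameters; shapes shown as tuples.
        return {
            "weight": ("int8", (out_features, in_features)),
            "weight_scale": ("float32", (out_features, 1)),
        }

class W4A4Int4Scheme:
    """Describes packed int4 weights (two 4-bit values per int8 byte)."""
    def create_weights(self, out_features, in_features):
        return {
            "weight_packed": ("int8", (out_features, in_features // 2)),
            "weight_scale": ("float32", (out_features, 1)),
        }

# Config key -> scheme: "what to load" is decided here, "how to run" lives
# in the hardware backend.
SCHEME_REGISTRY = {
    ("w8a8", "int8"): W8A8Int8Scheme,
    ("w4a4", "int4"): W4A4Int4Scheme,
}

def get_scheme(quant_type, dtype):
    try:
        return SCHEME_REGISTRY[(quant_type, dtype)]()
    except KeyError:
        raise ValueError(f"unsupported quantization: {quant_type}/{dtype}")
```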

Added:

  • Add MsModelSlim scheme structure -> sglang/srt/layers/quantization/msmodelslim/
  • Add ModelSlimW4A4Int4() class and NPU_W4A4DynamicLinearMethod() to support W4A4 linear method for NPU
  • Add NPUCompressedTensorsW8A8Int8() and NPUCompressedTensorsW8A8Int8MoEMethod() to support compressed-tensors w8a8 linear/MOE method for NPU
  • Add _find_quant_modelslim_config() method to support automated config detection for msmodelslim
  • Add unit test for w4a4 msmodelslim on NPU
  • Add unit test for w8a8 compressed-tensors on NPU
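As a rough illustration of the automated config detection: msmodelslim checkpoints ship a quant_model_description.json next to the weights, so detection can amount to checking for that file. This is a hypothetical sketch, not the actual _find_quant_modelslim_config() implementation:

```python
import json
from pathlib import Path

def find_quant_modelslim_config(model_path):
    """Hypothetical sketch of msmodelslim config auto-detection: look for
    the quant_model_description.json file that msmodelslim checkpoints
    place alongside the weights, and load it if present."""
    desc = Path(model_path) / "quant_model_description.json"
    if not desc.is_file():
        return None  # not an msmodelslim checkpoint
    with desc.open() as f:
        return json.load(f)
```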

Accuracy Tests & Benchmarking

ModelSlim tests

Server

SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ENABLE_ASCEND_MOE_NZ=1 ASCEND_RT_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path *model* --port 30088 --mem-fraction-static 0.8 --cuda-graph-max-bs 16

Client

python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16

Results

Qwen3-32B-w4a4-LAOS (dynamic)
image

Qwen3-32B-W8A8 (static)
image

Qwen3-32B-W8A8 (dynamic)
image

Qwen3-30B-W8A8 (attn - static / mlp - dynamic)
image

EP MoE Server

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=36
export HCCL_BUFFSIZE=1600
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_USE_FIA_NZ=1
export ENABLE_MOE_NZ=1

For Qwen3-30B-W8A8 (attn - static / mlp - dynamic)

python3 -m sglang.launch_server --model-path *model* --tp 4 --trust-remote-code --attention-backend ascend --device npu --host 127.0.0.1 --port 30088 --mem-fraction-static 0.8 --quantization modelslim --moe-a2a-backend deepep --deepep-mode auto

For DeepSeek-R1-W4A8-pertoken (dynamic)

python3 -m sglang.launch_server --model-path *model* --tp 16 --trust-remote-code --attention-backend ascend --device npu --watchdog-timeout 9000 --host 127.0.0.1 --port 30088 --cuda-graph-bs 8 16 24 28 32 36 --mem-fraction-static 0.71 --max-running-requests 144 --context-length 8188 --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 9000 --moe-a2a-backend deepep --deepep-mode auto

Client

python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16

Results

Qwen3-30B-W8A8 (attn - static / mlp - dynamic)
image

DeepSeek-R1-W4A8-pertoken (attn - static / mlp - dynamic)
image

Compressed-Tensors tests

Server

SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ENABLE_ASCEND_MOE_NZ=1 ASCEND_RT_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path *model* --port 30088 --mem-fraction-static 0.8 --cuda-graph-max-bs 16

Client

python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16

Results

Llama-3.1-8B-Instruct-quantized-W8A8 (dynamic)
image

Qwen3-30B-A3B-Instruct-2507-W8A8 (dynamic)
image

EP MoE Server

SGLANG_DEEPEP_BF16_DISPATCH=1 SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ENABLE_ASCEND_MOE_NZ=1 ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 python3 -m sglang.launch_server --model-path /mnt/share/weights/Kimi-K2-Thinking/ --moe-a2a-backend deepep --deepep-mode auto --tp 16 --mem-fraction-static 0.8 --max-total-tokens 66000 --trust-remote-code --attention-backend ascend --device npu --host 127.0.0.1 --port 30112 --disable-radix-cache --context-length 8192 --chunked-prefill-size 8192 --max-prefill-tokens 8000

Client

python bench_sglang.py --num-questions 200 --port 30112 --data-path /home/swx1199799/gsm8k/test.json

Results

KIMI-K2-Thinking W4A16
image

AWQ tests

Server

SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ASCEND_RT_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path *model* --port 30088 --mem-fraction-static 0.8 --cuda-graph-max-bs 16

Client

python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16

Results

Qwen3-32B-awq W4A16
image

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @OrangeRedeng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly refactors and expands the quantization capabilities for Ascend NPU hardware. It introduces a mechanism to automatically detect and apply NPU-optimized quantization settings and adds support for various quantization schemes (W4A4, W4A8, W8A8) for both standard linear layers and Mixture-of-Experts (MoE) layers. The changes aim to leverage NPU's hardware acceleration for improved performance and efficiency in quantized model inference.

Highlights

  • NPU Quantization Configuration: Introduced logic to detect and apply NPU-specific quantization configurations, specifically looking for a 'quant_model_description.json' file when running on NPU hardware.
  • New NPU Quantization Methods: Added several new quantization methods tailored for Ascend NPU, including W4A4 dynamic linear, W4A8 dynamic linear, W8A8 static and dynamic linear, and W4A8/W8A8 dynamic for Mixture-of-Experts (MoE) layers.
  • NPU MoE Layer Support: Implemented specialized methods for handling quantized Mixture-of-Experts (MoE) layers on NPU, including weight packing/unpacking and fused expert operations using NPU-specific grouped matrix multiplication.
  • Quantization Utility Base Class: Created a base class _NPULinearMethodBase to standardize the implementation of NPU-specific linear quantization methods.
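A base class such as the _NPULinearMethodBase mentioned above typically pins down the contract each quantization method must implement. The sketch below is hypothetical: only the base-class name comes from this PR, while the subclass and its toy arithmetic are purely illustrative.

```python
from abc import ABC, abstractmethod

class _NPULinearMethodBase(ABC):
    """Hypothetical sketch of a base class standardizing NPU linear
    quantization methods: subclasses define how weights are created
    and how the quantized matmul is applied."""

    @abstractmethod
    def create_weights(self, layer, in_features, out_features):
        """Allocate quantized weight/scale parameters on the layer."""

    @abstractmethod
    def apply(self, layer, x):
        """Run the quantized linear computation for input x."""

class FakeW8A8Method(_NPULinearMethodBase):
    # Toy subclass used only to show the contract; a real method would
    # allocate int8 tensors and call NPU grouped-matmul kernels instead.
    def create_weights(self, layer, in_features, out_features):
        layer["weight"] = [[1] * in_features for _ in range(out_features)]

    def apply(self, layer, x):
        # Plain row-by-row dot product standing in for the quantized matmul.
        return [sum(w * xi for w, xi in zip(row, x)) for row in layer["weight"]]
```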

@OrangeRedeng OrangeRedeng changed the title [NPU][2/N] Ascend NPU quantization refactoring [NPU][2/N] Ascend NPU quantization refactoring & more quantization formats support Dec 5, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a refactoring for Ascend NPU quantization support. It adds several new quantization methods for NPU, including W4A4, W4A8, and W8A8, along with their Mixture-of-Experts (MoE) variants. The changes also include updates to the model configuration to detect and apply these NPU-specific quantization schemes. My review focuses on ensuring correctness, consistency, and code quality. I've identified several critical issues such as missing imports, incorrect class inheritance, and improper use of decorators that could lead to runtime errors. I've also provided suggestions to improve code readability and maintainability. Overall, this is a significant step towards enabling efficient quantization on Ascend NPUs, but the identified issues should be addressed before merging.

@OrangeRedeng OrangeRedeng changed the title [NPU][2/N] Ascend NPU quantization refactoring & more quantization formats support [NPU][1/3] Ascend NPU quantization refactoring & more quantization formats support Dec 8, 2025
@ping1jing2 ping1jing2 self-assigned this Dec 9, 2025
@github-actions github-actions bot added the npu label Dec 10, 2025
@ping1jing2
Collaborator

I found that #16953 is better than #16933 and the same as yours.

@ping1jing2
Collaborator

/rerun-failed-ci

1 similar comment
@ping1jing2
Collaborator

/rerun-failed-ci

@ping1jing2
Collaborator

/rerun-failed-ci

@ping1jing2
Collaborator

ping1jing2 commented Jan 14, 2026

https://github.com/sgl-project/sglang/actions/runs/20984256066/job/60316274289?pr=14504

This error looks like an environment-related issue and has nothing to do with our code.

@OrangeRedeng OrangeRedeng changed the title [NPU][1/3] Ascend NPU quantization refactoring & more quantization formats support [NPU][1/*] Ascend NPU quantization refactoring & more quantization formats support Jan 14, 2026
@ping1jing2
Collaborator

/rerun-failed-ci

2 similar comments
@iforgetmyname
Collaborator

/rerun-failed-ci

@ping1jing2
Collaborator

/rerun-failed-ci

@iforgetmyname iforgetmyname changed the title [NPU][1/*] Ascend NPU quantization refactoring & more quantization formats support [NPU] NPU quantization refactoring & more quantization formats support Jan 14, 2026
@iforgetmyname iforgetmyname merged commit 424a380 into sgl-project:main Jan 14, 2026
205 of 214 checks passed
zackyoray pushed a commit to zackyoray/sglang that referenced this pull request Jan 21, 2026
sgl-project#14504)

Co-authored-by: TamirBaydasov <mr.jeijy@gmail.com>
Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com>
Co-authored-by: Савкин Артем <savkinartem@MacBook-Air-Viktoria.local>
Co-authored-by: Edward Shogulin <edward.shogulin@gmail.com>

Labels

deepseek documentation Improvements or additions to documentation hicache Hierarchical Caching for SGLang npu quant LLM Quantization run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [Ascend] [AWQ] AWQ quantization RuntimeError with aclnnAddRmsNorm operator on ascend backend

8 participants