[NPU] NPU quantization refactoring & more quantization formats support #14504
iforgetmyname merged 218 commits into sgl-project:main
Conversation
Summary of Changes: Hello @OrangeRedeng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly refactors and expands the quantization capabilities for Ascend NPU hardware. It introduces a mechanism to automatically detect and apply NPU-optimized quantization settings, and adds support for several quantization schemes (W4A4, W4A8, W8A8) for both standard linear layers and Mixture-of-Experts (MoE) layers. The changes aim to leverage the NPU's hardware acceleration for improved performance and efficiency in quantized model inference.
Code Review
This pull request introduces a refactoring for Ascend NPU quantization support. It adds several new quantization methods for NPU, including W4A4, W4A8, and W8A8, along with their Mixture-of-Experts (MoE) variants. The changes also include updates to the model configuration to detect and apply these NPU-specific quantization schemes. My review focuses on ensuring correctness, consistency, and code quality. I've identified several critical issues such as missing imports, incorrect class inheritance, and improper use of decorators that could lead to runtime errors. I've also provided suggestions to improve code readability and maintainability. Overall, this is a significant step towards enabling efficient quantization on Ascend NPUs, but the identified issues should be addressed before merging.
this error seems like an environment-related issue and has nothing to do with our code
(sgl-project#14504)
Co-authored-by: TamirBaydasov <mr.jeijy@gmail.com>
Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com>
Co-authored-by: Савкин Артем <savkinartem@MacBook-Air-Viktoria.local>
Co-authored-by: Edward Shogulin <edward.shogulin@gmail.com>
Motivation
Related to #14424 (you can find the class diagram there). Follows #13664.
Continuation of the refactoring started in #13359 and of the feature support started in #11984. To simplify support for various quantization algorithms, the code is being refactored to separate the weight-loading mechanisms from the inference kernels.
Stage 1 (in progress): separate the inference code (kernels) from the code tied to a specific quantization framework (msmodelslim, AWQ, auto-round, etc.), and support the schemes for msmodelslim.
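The split described above can be sketched as a minimal method interface in which `create_weights()` (framework/checkpoint-specific weight loading) is kept apart from `apply()` (the hardware inference kernel). The class names below are illustrative stand-ins, not sglang's actual API:

```python
# Illustrative sketch of the weight-loading / kernel split; class names
# here are hypothetical, not sglang's real classes.

class QuantLinearMethod:
    """Base interface: weight creation is separate from the compute kernel."""

    def create_weights(self, in_features, out_features):
        # framework-specific: allocate/describe quantized weights
        raise NotImplementedError

    def apply(self, weights, x):
        # hardware-specific: run the (possibly NPU-accelerated) kernel
        raise NotImplementedError


class W8A8Int8Method(QuantLinearMethod):
    def create_weights(self, in_features, out_features):
        # int8 weight matrix plus one dequantization scale per output channel
        weight = [[0] * in_features for _ in range(out_features)]
        scales = [1.0] * out_features
        return {"weight": weight, "weight_scale": scales}

    def apply(self, weights, x):
        # pure-Python reference matmul; a real backend would dispatch to an
        # NPU kernel here without touching the loading code above
        return [
            sum(w * xi for w, xi in zip(row, x)) * s
            for row, s in zip(weights["weight"], weights["weight_scale"])
        ]
```

With this shape, swapping the backend only changes `apply()`, while supporting another checkpoint format (msmodelslim, compressed-tensors, AWQ) only changes `create_weights()`.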
Modifications
Refactored:
- `create_weights()` for the W8A8 linear method: `sglang/srt/hardware_backend/npu/quantization/linear_method_npu.py` -> `sglang/srt/layers/quantization/msmodelslim/schemes/msmodelslim_w8a8_int8.py`
- `create_weights()` for the W8A8 MoE method: `sglang/srt/hardware_backend/npu/quantization/fused_moe_method_npu.py` -> `sglang/srt/layers/quantization/msmodelslim/msmodelslim_moe.py`
- `create_weights()` for the W4A8 MoE method: `sglang/srt/hardware_backend/npu/quantization/fused_moe_method_npu.py` -> `sglang/srt/layers/quantization/msmodelslim/msmodelslim_moe.py`
- `ModelSlimConfig()`: `sglang/srt/hardware_backend/npu/quantization/msmodelslim.py` -> `sglang/srt/layers/quantization/msmodelslim/msmodelslim.py`
- `create_weights()` for W4A16, moved from msmodelslim to compressed-tensors: `sglang/srt/hardware_backend/npu/quantization/fused_moe_method_npu.py` -> `python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py`
Added:
- `ModelSlimW4A4Int4()` class and `NPU_W4A4DynamicLinearMethod()` to support the W4A4 linear method on NPU
- `NPUCompressedTensorsW8A8Int8()` and `NPUCompressedTensorsW8A8Int8MoEMethod()` to support the compressed-tensors W8A8 linear/MoE methods on NPU
- `_find_quant_modelslim_config()` method to support automated config detection for msmodelslim

Accuracy Tests & Benchmarking
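The gsm8k benchmark used throughout the results below reports accuracy by comparing the model's final numeric answer against the reference. A rough sketch of that kind of answer extraction (a simplification for illustration, not the actual `bench_sglang.py` code):

```python
import re

def extract_last_number(text):
    """Return the last number in a completion (handles '1,319'-style commas).
    Simplified illustration; the real benchmark script may differ."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return float(matches[-1].replace(",", "")) if matches else None

def accuracy(predictions, references):
    # fraction of completions whose final number equals the reference answer
    hits = sum(1 for p, r in zip(predictions, references)
               if extract_last_number(p) == r)
    return hits / len(references)
```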
ModelSlim tests
Server

```shell
SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ENABLE_ASCEND_MOE_NZ=1 ASCEND_RT_VISIBLE_DEVICES=0,1 \
python3 -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path *model* --port 30088 --mem-fraction-static 0.8 --cuda-graph-max-bs 16
```

Client

```shell
python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16
```

Results
Qwen3-32B-w4a4-LAOS (dynamic): *(results screenshot)*

Qwen3-32B-W8A8 (static): *(results screenshot)*

Qwen3-32B-W8A8 (dynamic): *(results screenshot)*

Qwen3-30B-W8A8 (attn - static / mlp - dynamic): *(results screenshot)*
EP MoE Server
```shell
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=36
export HCCL_BUFFSIZE=1600
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_USE_FIA_NZ=1
export ENABLE_MOE_NZ=1
```

For Qwen3-30B-W8A8 (attn - static / mlp - dynamic)

```shell
python3 -m sglang.launch_server --model-path *model* --tp 4 --trust-remote-code --attention-backend ascend --device npu --host 127.0.0.1 --port 30088 --mem-fraction-static 0.8 --quantization modelslim --moe-a2a-backend deepep --deepep-mode auto
```

For DeepSeek-R1-W4A8-pertoken (dynamic)

```shell
python3 -m sglang.launch_server --model-path *model* --tp 16 --trust-remote-code --attention-backend ascend --device npu --watchdog-timeout 9000 --host 127.0.0.1 --port 30088 --cuda-graph-bs 8 16 24 28 32 36 --mem-fraction-static 0.71 --max-running-requests 144 --context-length 8188 --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 9000 --moe-a2a-backend deepep --deepep-mode auto
```

Client

```shell
python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16
```

Results
Qwen3-30B-W8A8 (attn - static / mlp - dynamic): *(results screenshot)*

DeepSeek-R1-W4A8-pertoken (attn - static / mlp - dynamic): *(results screenshot)*
Compressed-Tensors tests
Server

```shell
SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ENABLE_ASCEND_MOE_NZ=1 ASCEND_RT_VISIBLE_DEVICES=0,1 \
python3 -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path *model* --port 30088 --mem-fraction-static 0.8 --cuda-graph-max-bs 16
```

Client

```shell
python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16
```

Results
Llama-3.1-8B-Instruct-quantized-W8A8 (dynamic): *(results screenshot)*

Qwen3-30B-A3B-Instruct-2507-W8A8 (dynamic): *(results screenshot)*
EP MoE Server
```shell
SGLANG_DEEPEP_BF16_DISPATCH=1 SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ENABLE_ASCEND_MOE_NZ=1 ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
python3 -m sglang.launch_server --model-path /mnt/share/weights/Kimi-K2-Thinking/ --moe-a2a-backend deepep --deepep-mode auto --tp 16 --mem-fraction-static 0.8 --max-total-tokens 66000 --trust-remote-code --attention-backend ascend --device npu --host 127.0.0.1 --port 30112 --disable-radix-cache --context-length 8192 --chunked-prefill-size 8192 --max-prefill-tokens 8000
```

Client

```shell
python bench_sglang.py --num-questions 200 --port 30112 --data-path /home/swx1199799/gsm8k/test.json
```

Results
KIMI-K2-Thinking W4A16: *(results screenshot)*
AWQ tests
Server
```shell
SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ASCEND_RT_VISIBLE_DEVICES=0,1 \
python3 -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path *model* --port 30088 --mem-fraction-static 0.8 --cuda-graph-max-bs 16
```

Client

```shell
python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16
```

Results
Qwen3-32B-awq W4A16: *(results screenshot)*
Checklist