[NPU] NPU quantization refactoring & more quantization formats support #14504
iforgetmyname merged 218 commits into sgl-project:main
Conversation
Summary of Changes: Hello @OrangeRedeng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly refactors and expands the quantization capabilities for Ascend NPU hardware. It introduces a mechanism to automatically detect and apply NPU-optimized quantization settings, and adds support for several quantization schemes (W4A4, W4A8, W8A8) for both standard linear layers and Mixture-of-Experts (MoE) layers. The changes aim to leverage the NPU's hardware acceleration for improved performance and efficiency in quantized model inference.
Code Review
This pull request introduces a refactoring for Ascend NPU quantization support. It adds several new quantization methods for NPU, including W4A4, W4A8, and W8A8, along with their Mixture-of-Experts (MoE) variants. The changes also include updates to the model configuration to detect and apply these NPU-specific quantization schemes. My review focuses on ensuring correctness, consistency, and code quality. I've identified several critical issues such as missing imports, incorrect class inheritance, and improper use of decorators that could lead to runtime errors. I've also provided suggestions to improve code readability and maintainability. Overall, this is a significant step towards enabling efficient quantization on Ascend NPUs, but the identified issues should be addressed before merging.
this error seems like an environment-related issue and has nothing to do with our code
(sgl-project#14504)
Co-authored-by: TamirBaydasov <mr.jeijy@gmail.com>
Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com>
Co-authored-by: Савкин Артем <savkinartem@MacBook-Air-Viktoria.local>
Co-authored-by: Edward Shogulin <edward.shogulin@gmail.com>
Motivation
Related to #14424 (you can find the class diagram there). Follows #13664.
Continuation of the refactoring started in #13359 and of the feature support started in #11984. To simplify support for various quantization algorithms, the code is being refactored to separate the weight-loading mechanisms from the inference kernels.
Stage 1 (in progress): separate the inference code (kernels) from the code tied to a specific quantization framework (msmodelslim, AWQ, auto-round, etc.), and support the schemes for msmodelslim.
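The split described above can be sketched as a minimal method interface in which `create_weights()` (framework/checkpoint-specific weight loading) is kept apart from `apply()` (the hardware inference kernel). The class names below are illustrative stand-ins, not sglang's actual API:

```python
# Illustrative sketch of the weight-loading / kernel split; class names
# here are hypothetical, not sglang's real classes.

class QuantLinearMethod:
    """Base interface: weight creation is separate from the compute kernel."""

    def create_weights(self, in_features, out_features):
        # framework-specific: allocate/describe quantized weights
        raise NotImplementedError

    def apply(self, weights, x):
        # hardware-specific: run the (possibly NPU-accelerated) kernel
        raise NotImplementedError


class W8A8Int8Method(QuantLinearMethod):
    def create_weights(self, in_features, out_features):
        # int8 weight matrix plus one dequantization scale per output channel
        weight = [[0] * in_features for _ in range(out_features)]
        scales = [1.0] * out_features
        return {"weight": weight, "weight_scale": scales}

    def apply(self, weights, x):
        # pure-Python reference matmul; a real backend would dispatch to an
        # NPU kernel here without touching the loading code above
        return [
            sum(w * xi for w, xi in zip(row, x)) * s
            for row, s in zip(weights["weight"], weights["weight_scale"])
        ]
```

With this shape, swapping the backend only changes `apply()`, while supporting another checkpoint format (msmodelslim, compressed-tensors, AWQ) only changes `create_weights()`.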
Modifications
Refactored:
- `create_weights()` for the W8A8 linear method: `sglang/srt/hardware_backend/npu/quantization/linear_method_npu.py` -> `sglang/srt/layers/quantization/msmodelslim/schemes/msmodelslim_w8a8_int8.py`
- `create_weights()` for the W8A8 MoE method: `sglang/srt/hardware_backend/npu/quantization/fused_moe_method_npu.py` -> `sglang/srt/layers/quantization/msmodelslim/msmodelslim_moe.py`
- `create_weights()` for the W4A8 MoE method: `sglang/srt/hardware_backend/npu/quantization/fused_moe_method_npu.py` -> `sglang/srt/layers/quantization/msmodelslim/msmodelslim_moe.py`
- `ModelSlimConfig()`: `sglang/srt/hardware_backend/npu/quantization/msmodelslim.py` -> `sglang/srt/layers/quantization/msmodelslim/msmodelslim.py`
- `create_weights()` for W4A16, moved from msmodelslim to compressed-tensors: `sglang/srt/hardware_backend/npu/quantization/fused_moe_method_npu.py` -> `python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py`
Added:
- `ModelSlimW4A4Int4()` class and `NPU_W4A4DynamicLinearMethod()` to support the W4A4 linear method on NPU
- `NPUCompressedTensorsW8A8Int8()` and `NPUCompressedTensorsW8A8Int8MoEMethod()` to support the compressed-tensors W8A8 linear/MoE methods on NPU
- `_find_quant_modelslim_config()` method to support automated config detection for msmodelslim

Accuracy Tests & Benchmarking
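The gsm8k benchmark used throughout the results below reports accuracy by comparing the model's final numeric answer against the reference. A rough sketch of that kind of answer extraction (a simplification for illustration, not the actual `bench_sglang.py` code):

```python
import re

def extract_last_number(text):
    """Return the last number in a completion (handles '1,319'-style commas).
    Simplified illustration; the real benchmark script may differ."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return float(matches[-1].replace(",", "")) if matches else None

def accuracy(predictions, references):
    # fraction of completions whose final number equals the reference answer
    hits = sum(1 for p, r in zip(predictions, references)
               if extract_last_number(p) == r)
    return hits / len(references)
```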
ModelSlim tests
Server

```shell
SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ENABLE_ASCEND_MOE_NZ=1 ASCEND_RT_VISIBLE_DEVICES=0,1 \
python3 -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path *model* --port 30088 --mem-fraction-static 0.8 --cuda-graph-max-bs 16
```

Client

```shell
python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16
```

Results
Qwen3-32B-w4a4-LAOS (dynamic): *(results screenshot)*

Qwen3-32B-W8A8 (static): *(results screenshot)*

Qwen3-32B-W8A8 (dynamic): *(results screenshot)*

Qwen3-30B-W8A8 (attn - static / mlp - dynamic): *(results screenshot)*
EP MoE Server
```shell
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=36
export HCCL_BUFFSIZE=1600
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_USE_FIA_NZ=1
export ENABLE_MOE_NZ=1
```

For Qwen3-30B-W8A8 (attn - static / mlp - dynamic)

```shell
python3 -m sglang.launch_server --model-path *model* --tp 4 --trust-remote-code --attention-backend ascend --device npu --host 127.0.0.1 --port 30088 --mem-fraction-static 0.8 --quantization modelslim --moe-a2a-backend deepep --deepep-mode auto
```

For DeepSeek-R1-W4A8-pertoken (dynamic)

```shell
python3 -m sglang.launch_server --model-path *model* --tp 16 --trust-remote-code --attention-backend ascend --device npu --watchdog-timeout 9000 --host 127.0.0.1 --port 30088 --cuda-graph-bs 8 16 24 28 32 36 --mem-fraction-static 0.71 --max-running-requests 144 --context-length 8188 --disable-radix-cache --chunked-prefill-size -1 --max-prefill-tokens 9000 --moe-a2a-backend deepep --deepep-mode auto
```

Client

```shell
python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16
```

Results
Qwen3-30B-W8A8 (attn - static / mlp - dynamic): *(results screenshot)*

DeepSeek-R1-W4A8-pertoken (attn - static / mlp - dynamic): *(results screenshot)*
Compressed-Tensors tests
Server

```shell
SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ENABLE_ASCEND_MOE_NZ=1 ASCEND_RT_VISIBLE_DEVICES=0,1 \
python3 -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path *model* --port 30088 --mem-fraction-static 0.8 --cuda-graph-max-bs 16
```

Client

```shell
python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16
```

Results
Llama-3.1-8B-Instruct-quantized-W8A8 (dynamic): *(results screenshot)*

Qwen3-30B-A3B-Instruct-2507-W8A8 (dynamic): *(results screenshot)*
EP MoE Server
```shell
SGLANG_DEEPEP_BF16_DISPATCH=1 SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ENABLE_ASCEND_MOE_NZ=1 ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
python3 -m sglang.launch_server --model-path /mnt/share/weights/Kimi-K2-Thinking/ --moe-a2a-backend deepep --deepep-mode auto --tp 16 --mem-fraction-static 0.8 --max-total-tokens 66000 --trust-remote-code --attention-backend ascend --device npu --host 127.0.0.1 --port 30112 --disable-radix-cache --context-length 8192 --chunked-prefill-size 8192 --max-prefill-tokens 8000
```

Client

```shell
python bench_sglang.py --num-questions 200 --port 30112 --data-path /home/swx1199799/gsm8k/test.json
```

Results
KIMI-K2-Thinking W4A16: *(results screenshot)*
AWQ tests
Server
```shell
SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ASCEND_RT_VISIBLE_DEVICES=0,1 \
python3 -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path *model* --port 30088 --mem-fraction-static 0.8 --cuda-graph-max-bs 16
```

Client

```shell
python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ../gsm8k/test.jsonl --parallel 16
```

Results
Qwen3-32B-awq W4A16: *(results screenshot)*
Checklist