[NPU] NPU quantization refactoring & more quantization formats support #14504
Merged
iforgetmyname merged 218 commits into sgl-project:main from OrangeRedeng:npu_quantization_refactor on Jan 14, 2026
dcea881
Automatic quant_model_description.json detection support
OrangeRedeng aa0a0aa
Add w4a4 support
OrangeRedeng 6c845ad
Refactor w8a8
OrangeRedeng dee644b
Add import section
OrangeRedeng 35b8983
Create quantization utils file
OrangeRedeng 311cc28
Create w4a16
OrangeRedeng 6869ebf
Create w4a8.py
OrangeRedeng c7d6dd5
Rename w4a16.py to w4a16_moe.py
OrangeRedeng 7ffe0f6
Rename w4a8.py to w4a8_moe.py
OrangeRedeng e2d8889
Create w8a8_moe
OrangeRedeng 41d3d3f
Create w4a8.py
OrangeRedeng 6d0b035
Create msmodelslim structure, initial commit
TamirBaydasov 66c7517
Working msmodelslim structure, W8A8, W8A8 MoE, W4A4
TamirBaydasov 471ad1a
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng ccfe6f6
Delete w4a16_moe.py
OrangeRedeng 0a48b2b
Delete w4a8.py
OrangeRedeng f4fdb0e
Delete w4a8_moe.py
OrangeRedeng 1f4f870
Delete w8a8.py
OrangeRedeng b5fcf78
Delete w8a8_moe.py
OrangeRedeng ba57bc7
Delete utils.py
OrangeRedeng a5704f1
Move process_weights to kernel-side, add npu compressed-tensors w8a8i…
TamirBaydasov c42c8f1
Added check for empty scheme
OrangeRedeng 25d0d09
Remove unnecessary method
OrangeRedeng ca4895e
Add w4a8 support
OrangeRedeng 28ff8e0
Add w4a8 support (kernel)
OrangeRedeng d9412d4
Update fused_moe_method_npu.py
TamirBaydasov 0f81db3
Fix w8a8_static bug
OrangeRedeng 3175d8b
Improving the code structure
OrangeRedeng 23db53f
Delete print()
OrangeRedeng 393f7d1
Update w4a8 for MOE
OrangeRedeng 5c60c95
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng d4d53e0
Fix w4a4 weights loading
OrangeRedeng 2bb7acf
Update model_config.py
OrangeRedeng 4a05e5d
Add w4a4 test
OrangeRedeng d0a577f
Add compressed-tensors unit-test
OrangeRedeng d9f8a41
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 77a923e
Pre-commit fixes
3917919
Revert "Pre-commit fixes"
OrangeRedeng df01a40
Pre-commit fixes
OrangeRedeng a16b69e
Fix model config loading, add NPU w8a8int8 MoE for compressed-tensors…
TamirBaydasov 238759c
Pre-commit fixes
OrangeRedeng 847e190
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 4640d05
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 5ca19cb
Delete comments
OrangeRedeng 1f18881
Delete comments
OrangeRedeng 2bee5c7
Update model_config.py
TamirBaydasov 2670aa9
Quickfix
OrangeRedeng 1e45ead
Update fused_moe_method_npu.py
TamirBaydasov d3298ec
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng afc11a6
Update CODEOWNERS
TamirBaydasov 168b2a8
Pre-commit fixes
OrangeRedeng 2185718
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng d551652
Update msmodelslim_w8a8_int8.py
TamirBaydasov 1cf18c0
Update msmodelslim.py
TamirBaydasov 3dccf89
Delete python/sglang/srt/hardware_backend/npu/quantization/modelslim.py
OrangeRedeng 1842d0a
Removed unused code
OrangeRedeng 75de787
Remove --quantization modelslim flag from doc
OrangeRedeng e958767
Delete --quantization "modelslim" flag
OrangeRedeng 1567885
Delete --quantization "modelslim" flag
OrangeRedeng d34cb6f
Update test_ascend_hicache_mla.py
OrangeRedeng 09a6d44
Delete --quantization "modelslim" flag
OrangeRedeng 2b7003e
Update test_ascend_mla_w8a8int8.py
OrangeRedeng 43b5d66
Create README.md for msModelSlim
OrangeRedeng 420d6e8
Update README.md
OrangeRedeng f79f9ee
Update README.md
OrangeRedeng a7c43bb
Update fused_moe_method_npu.py
TamirBaydasov ef2fdb8
Update README.md
OrangeRedeng cb95c0a
Update README.md
OrangeRedeng ca38c59
Update layer.py
TamirBaydasov 583cb4d
Update compressed_tensors.py
TamirBaydasov 8af0033
Update compressed_tensors_moe.py
TamirBaydasov 9f8c407
Quickfix
OrangeRedeng d31e96c
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 72efd3a
Update README.md
OrangeRedeng 384835b
Update msmodelslim_moe.py
OrangeRedeng 4ebfb54
Update fused_moe_method_npu.py
OrangeRedeng 0cfbd93
Create test_ascend_w4a4_quantization.py in srt/ascend
TamirBaydasov 87b65a8
Delete test/manual/ascend/test_ascend_w4a4_quantization.py
TamirBaydasov 177102d
Create test_ascend_w8a8_quantization.py
TamirBaydasov 16ca773
Update run_suite.py
TamirBaydasov c6def39
Update test_ascend_w8a8_quantization.py
TamirBaydasov d0dd427
Create ascend_npu_quantization.md
OrangeRedeng 2e1219f
Bugfix
OrangeRedeng 9d6ffbd
Pre-commit fixes
OrangeRedeng 17a6248
Update fused_moe_method_npu.py
OrangeRedeng 0bf3389
Fix missprint
OrangeRedeng 69d3438
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 1d28157
Pre-commit fixes
OrangeRedeng a5b88e9
Update ascend_npu_quantization.md
OrangeRedeng 30f7b10
Update ascend_npu_quantization.md
OrangeRedeng 22c85ce
Update python/sglang/srt/configs/model_config.py
TamirBaydasov 21b9219
Update compressed_tensors.py
TamirBaydasov 52b1088
Update compressed_tensors_moe.py
TamirBaydasov 2a5f745
Update __init__.py
TamirBaydasov 309e5ef
Update compressed_tensors_w8a8_int8.py
TamirBaydasov 611546d
Update README.md
TamirBaydasov d2cc722
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng d2888fd
Update linear_method_npu.py
OrangeRedeng 554027a
Fix group_size
OrangeRedeng ad52cda
Fix group_size
OrangeRedeng 1d0eddb
Update fused_moe_method_npu.py
OrangeRedeng aff9585
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng c2e972f
Update fused_moe_method_npu.py
OrangeRedeng 3bc7faf
Pre-commit fixes
OrangeRedeng c7480fb
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng ff1f793
Fix Qwen3-32B AWQ issue
OrangeRedeng 7cbf964
Update ascend_npu_quantization.md
OrangeRedeng 7b20ccf
Update ascend_npu_quantization.md
OrangeRedeng ed9c68a
Merge branch 'main' into npu_quantization_refactor
ping1jing2 e1cabfa
Update fused_moe_method_npu.py
TamirBaydasov 734ab1d
Update linear_method_npu.py
TamirBaydasov 93533b0
Update base_config.py
TamirBaydasov 0cd79c6
Update compressed_tensors_moe.py
TamirBaydasov a9d4847
Update compressed_tensors_w8a8_int8.py
TamirBaydasov af3756b
Update msmodelslim.py
TamirBaydasov 1ddd8d4
Update msmodelslim_moe.py
TamirBaydasov 76a1e94
Update msmodelslim_w4a4_int4.py
TamirBaydasov f773ee4
Update msmodelslim_w8a8_int8.py
TamirBaydasov a6d1619
Update msmodelslim_moe.py
OrangeRedeng 789c246
Fix lint issue
OrangeRedeng 2cc4db4
Fix lint issue
OrangeRedeng 94827ef
Fix lint issue
OrangeRedeng 1a30a42
Change local path to modelscope
OrangeRedeng f539100
Update test_ascend_w4a4_quantization.py
OrangeRedeng 01f6c58
Temporary fix
OrangeRedeng 0dacfd2
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 07e1f84
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng c9a8122
Update test_ascend_w8a8_quantization.py
OrangeRedeng 6bb9f20
Update run_suite.py
OrangeRedeng 836dc16
Update test_ascend_w4a4_quantization.py
OrangeRedeng 14b6ab8
Update test_ascend_w4a4_quantization.py
OrangeRedeng a8a03a0
Merge branch 'main' into npu_quantization_refactor
AniZpZ 15040cc
Update msmodelslim_moe.py
TamirBaydasov 5a1c7ec
Update msmodelslim_moe.py
TamirBaydasov a26d9e6
Update run_suite.py
OrangeRedeng c560c8f
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 1d44466
Add modelslim to optimized methods
TamirBaydasov 18377d0
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 686966b
Resolve conflicts 1/2
eshoguli 1c888e0
Update test_ascend_w4a4_quantization.py
OrangeRedeng 1830d74
Resolve conflicts 1/2
OrangeRedeng 46a3570
Resolve conflicts 2/2
OrangeRedeng 217536f
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng ffdc7dc
Update compressed_tensors_moe.py
OrangeRedeng c38e16f
Update compressed_tensors_moe.py
OrangeRedeng ef216f4
Update compressed_tensors_moe.py
OrangeRedeng 5d43c4a
Update compressed_tensors_moe.py
OrangeRedeng bee77f0
Update compressed_tensors_moe.py
OrangeRedeng ee59b95
Update fused_moe_method_npu.py
OrangeRedeng 02d7a6a
Update msmodelslim_moe.py
OrangeRedeng ff41d73
Update compressed_tensors_moe.py
OrangeRedeng 8d1bb48
Fix lint issue
OrangeRedeng 6b46093
Fix lint issue
OrangeRedeng 567a771
Update compressed_tensors_moe.py
OrangeRedeng 1b2f289
Update msmodelslim_moe.py
OrangeRedeng fe7067c
Update compressed_tensors_moe.py
OrangeRedeng 2e390e3
Fix lint issue
OrangeRedeng ee17e0c
Update msmodelslim_moe.py
OrangeRedeng 662fada
Update msmodelslim_moe.py
OrangeRedeng 2fb272d
Update compressed_tensors_moe.py
OrangeRedeng ae7875c
Update msmodelslim_moe.py
OrangeRedeng b4c0ebe
Update fused_moe_method_npu.py
OrangeRedeng 897094c
Update msmodelslim_moe.py
OrangeRedeng b463625
Fix lint issue
OrangeRedeng 349dcd0
Fix lint issue
OrangeRedeng b776895
Update fused_moe_method_npu.py
OrangeRedeng 56c8d06
Fix lint issue
OrangeRedeng 4e9c0d0
Fix lint issue
OrangeRedeng f091ab0
Fix lint issue
OrangeRedeng 30ea24e
Fix lint issue
OrangeRedeng b430667
Update fused_moe_method_npu.py
OrangeRedeng 47e8406
Update test_ascend_w4a4_quantization.py
OrangeRedeng 206bb5d
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 97b38e4
Rename MsModelSlim -> ModelSlim
OrangeRedeng 7edefee
Merge branch 'main' into npu_quantization_refactor
ping1jing2 d6f0064
Fix w4a4 test
OrangeRedeng 0aad1d1
Fix link issue
OrangeRedeng e861924
Return run_decode to test_ascend_w4a4_quantization.py
OrangeRedeng a443cf9
Update modelslim_moe.py
OrangeRedeng 373b9c5
Fix link
OrangeRedeng 86093bb
Fix link again
OrangeRedeng 2de91b8
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 70f2fab
Add w4a8 strategy to compressed-tensors
OrangeRedeng f9450c8
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng d5ad3a1
Fix test again
OrangeRedeng 529773d
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 45d6421
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng cb58406
Merge branch 'main' into npu_quantization_refactor
iforgetmyname bf617a5
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng a657e87
Update test order
OrangeRedeng b94d390
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng ff565db
Move w4a4_test to a2-tp1 suite
OrangeRedeng c97c232
Move w4a4_test to a2-tp1 suite
OrangeRedeng c190ea3
Return w4a4 to A3
OrangeRedeng 659fa07
Remove unused is_npu()
OrangeRedeng b3e2021
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 9aec4b9
Merge branch 'main' into npu_quantization_refactor
iforgetmyname 77e9fa8
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 4716b73
Update test_ascend_w4a4_quantization.py
OrangeRedeng 2ace366
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 42d849e
Fix test_ascend_piecewise_graph_prefill test
OrangeRedeng 0a10c5f
Merge branch 'main' into npu_quantization_refactor
ping1jing2 64d25e9
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng de0cd1d
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 9a95ff8
Move w4a4 test to A2
OrangeRedeng d323c6a
Update test_ascend_w4a4_quantization.py
OrangeRedeng 7e3d281
Update run_suite.py
OrangeRedeng 601a349
Update test_ascend_w4a4_quantization.py
OrangeRedeng ef4ce00
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 0d16e53
Update test_ascend_w4a4_quantization.py
OrangeRedeng 6bcf2f2
Merge branch 'main' into npu_quantization_refactor
TamirBaydasov 7b9e614
Fix w4a4 test
OrangeRedeng a79e4b9
Fix w4a4 test
OrangeRedeng c113924
Merge branch 'main' into npu_quantization_refactor
iforgetmyname bfb87cf
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng eeb3875
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng cd881ee
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng 27b373b
Merge branch 'main' into npu_quantization_refactor
OrangeRedeng
@@ -0,0 +1,21 @@
Quantization on Ascend.

To load already quantized models, simply load the model weights and config. Again, if the model has been quantized offline, there's no need to add `--quantization` argument when starting the engine. The quantization method will be automatically parsed from the downloaded `quant_model_description.json` or `config.json` config.

[ModelSlim on Ascend support](https://github.com/sgl-project/sglang/pull/14504):
- [x] W4A4 dynamic linear
- [x] W8A8 static linear
- [x] W8A8 dynamic linear
- [x] W4A8 dynamic MOE
- [x] W8A8 dynamic MOE

[AWQ on Ascend support](https://github.com/sgl-project/sglang/pull/10158):
- [x] W4A16 linear
- [x] W8A16 linear # Need to test
- [x] W4A16 MOE # Need to test

Compressed-tensors (LLM Compressor) on Ascend support:
- [x] [W4A8 dynamic MOE with/without activation clip](https://github.com/sgl-project/sglang/pull/14736) # Need to test
- [x] [W4A16 MOE](https://github.com/sgl-project/sglang/pull/12759)
- [x] [W8A8 dynamic linear](https://github.com/sgl-project/sglang/pull/14504)
- [x] [W8A8 dynamic MOE](https://github.com/sgl-project/sglang/pull/14504)
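
The automatic detection described in the added document could look roughly like the sketch below. This is an illustration only, not the code from this PR: the function name `detect_offline_quantization`, the returned label `"modelslim"`, and the choice to check `quant_model_description.json` before `config.json` are all assumptions. The only convention relied on is the Hugging Face-style `quantization_config` / `quant_method` entry in `config.json`.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def detect_offline_quantization(model_dir: str) -> Optional[str]:
    """Guess the offline quantization method of a checkpoint directory.

    Hypothetical helper for illustration; SGLang's real loader may differ.
    """
    root = Path(model_dir)

    # Assumption: a ModelSlim export ships quant_model_description.json
    # next to the weights, so its presence alone identifies the method.
    if (root / "quant_model_description.json").is_file():
        return "modelslim"

    # Hugging Face-style checkpoints record the method in config.json
    # under quantization_config.quant_method (e.g. "awq").
    config_path = root / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        quant_config = config.get("quantization_config")
        if isinstance(quant_config, dict):
            return quant_config.get("quant_method")

    # Nothing declared: treat the checkpoint as unquantized.
    return None


if __name__ == "__main__":
    # Build a throwaway checkpoint directory to exercise the helper.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "config.json").write_text(
            json.dumps({"quantization_config": {"quant_method": "awq"}}),
            encoding="utf-8",
        )
        print(detect_offline_quantization(tmp))
```

Because the method is read from files that ship with the checkpoint, the caller does not need to pass a `--quantization` argument for a model that was quantized offline.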