Merged
31 commits
- 809028d add triton_fused_moe_int4 kernel (huangtingwei9988, Mar 19, 2025)
- bc56c8b move moe_wna16 to sglang (huangtingwei9988, Mar 19, 2025)
- 2fc6d61 remove unused code (huangtingwei9988, Mar 19, 2025)
- ae83c9f format code (huangtingwei9988, Mar 19, 2025)
- 3e9d24c fix circular import (laixinn, Mar 20, 2025)
- 4c23fdf fork check_marlin_supports_layer to avoid vllm dependency (laixinn, Mar 20, 2025)
- 29c2cf0 add moe_wna16 unit test for w4a16 and w8a16 (laixinn, Mar 20, 2025)
- cc12772 remove vllm dependency on getting device capability (laixinn, Mar 20, 2025)
- 995d82d Merge remote-tracking branch 'origin/main' into add-moe-wna16-kernel (AniZpZ, Mar 20, 2025)
- 0e5dda1 update (AniZpZ, Mar 20, 2025)
- 27dc72f local test (laixinn, Mar 21, 2025)
- a4c6430 format (AniZpZ, Mar 21, 2025)
- 3ae9a7e Merge branch 'main' into add-moe-wna16-kernel (zhyncs, Mar 22, 2025)
- 91133b3 merge (laixinn, Mar 24, 2025)
- db41808 format (laixinn, Mar 24, 2025)
- e6b2884 fix garbage output in test_mla_tp.py (laixinn, Mar 24, 2025)
- 79f3e1a Merge branch 'main' into add-moe-wna16-kernel (zhyncs, Mar 26, 2025)
- 7b59b48 solve conflict (laixinn, Mar 27, 2025)
- b79ff4f Merge remote-tracking branch 'upstream/main' into add-moe-wna16-kernel (laixinn, Mar 27, 2025)
- ab18fbc import AWQConfig from sglang (laixinn, Mar 27, 2025)
- fc4d919 Merge branch 'main' into add-moe-wna16-kernel (laixinn, Mar 27, 2025)
- 1f7cdea remove awq marlin support (laixinn, Mar 27, 2025)
- 354e0ea fix typo (laixinn, Mar 27, 2025)
- 9ee5b42 remove marlin utils (laixinn, Mar 27, 2025)
- 92283b8 Merge branch 'main' into add-moe-wna16-kernel (zhyncs, Apr 1, 2025)
- 7419e4c Merge branch 'main' into add-moe-wna16-kernel (zhyncs, Apr 2, 2025)
- 47d7198 Merge branch 'main' into add-moe-wna16-kernel (AniZpZ, Apr 3, 2025)
- a65a0d2 Merge branch 'main' into add-moe-wna16-kernel (zhyncs, Apr 3, 2025)
- 2d3ea37 upd doc (AniZpZ, Apr 3, 2025)
- 2368183 Merge branch 'main' into add-moe-wna16-kernel (zhyncs, Apr 3, 2025)
- 291ee58 Merge branch 'main' into add-moe-wna16-kernel (zhyncs, Apr 4, 2025)
6 changes: 4 additions & 2 deletions benchmark/deepseek_v3/README.md
@@ -178,10 +178,12 @@ python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1

### Example: Serving with 8 A100/A800 with AWQ Quantization

AWQ does not support BF16, so add the `--dtype half` flag if AWQ is used for quantization.
Review comment (Member): "AWQ does not support BF16" ... may you update this?
Reply (Author): fixed
Add the `--quantization moe_wna16` flag to enable the moe_wna16 kernel for better performance.
One example is as follows:

```bash
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half --quantization moe_wna16
```
1 change: 1 addition & 0 deletions python/sglang/srt/configs/model_config.py
@@ -258,6 +258,7 @@ def _verify_quantization(self) -> None:
```python
            "experts_int8",
            "w8a8_int8",
            "w8a8_fp8",
            "moe_wna16",
        ]
        compatible_quantization_methods = {
            "w8a8_int8": ["compressed-tensors", "compressed_tensors"],
```
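For context, the verification logic this diff extends can be sketched roughly as follows. This is a hypothetical, simplified stand-in for sglang's `_verify_quantization`, not the actual implementation; the function name, error messages, and the shortened method list are illustrative. It shows the two checks the diffed lines participate in: membership in the supported-methods list (which `moe_wna16` now joins) and an optional compatibility map between an override method and the checkpoint's own quantization method.

```python
# Hypothetical sketch of supported-method and compatibility checking.
# Only the two data structures visible in the diff are reproduced;
# the real list in sglang is longer and the real code lives in a class.
SUPPORTED_QUANTIZATION = [
    "experts_int8",
    "w8a8_int8",
    "w8a8_fp8",
    "moe_wna16",  # newly added by this PR
]
# Some override methods only accept checkpoints produced by specific quantizers.
COMPATIBLE_QUANTIZATION_METHODS = {
    "w8a8_int8": ["compressed-tensors", "compressed_tensors"],
}

def verify_quantization(quantization, hf_quant_method=None):
    """Validate a --quantization override, optionally against the
    checkpoint's own quantization method (from its HF config)."""
    quantization = quantization.lower()
    if quantization not in SUPPORTED_QUANTIZATION:
        raise ValueError(f"Unknown quantization method: {quantization}")
    if hf_quant_method is not None and quantization in COMPATIBLE_QUANTIZATION_METHODS:
        if hf_quant_method not in COMPATIBLE_QUANTIZATION_METHODS[quantization]:
            raise ValueError(
                f"{quantization} is not compatible with {hf_quant_method} checkpoints"
            )
    return quantization
```

With this sketch, `verify_quantization("moe_wna16")` passes after the diff's one-line addition, whereas before it would have raised, which is why the README example above can now pass `--quantization moe_wna16`.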