Merged
Changes from all commits (43 commits)
- 70d6f6a addd quark_int4_fp8_moe feature (May 12, 2025)
- 676f854 wip int4fp8_moe version using weight_loader for online quantization (fxmarty-amd, Jun 20, 2025)
- adb7121 simplifications (fxmarty-amd, Jun 20, 2025)
- 25e214b fix accuracy issue, support tp>1 (fxmarty-amd, Jun 20, 2025)
- 2040998 add doc (fxmarty-amd, Jun 25, 2025)
- 3df5e1e fix get_name (fxmarty-amd, Jun 25, 2025)
- a475709 simplifications (fxmarty-amd, Jun 25, 2025)
- a7e6598 pre-shard high precision weight in order to do online quantization pe… (fxmarty-amd, Jun 25, 2025)
- 70e157a Merge branch 'main' into int4fp8_moe_new (fxmarty-amd, Jun 25, 2025)
- d2f6f34 fix merge issues (fxmarty-amd, Jun 25, 2025)
- ba89177 support pre-sharded MOE (fxmarty-amd, Jun 26, 2025)
- 2111f53 Merge branch 'main' into int4fp8_moe_new (HaiShaw, Jul 7, 2025)
- fd1aa21 Merge branch 'main' into int4fp8_moe_new (HaiShaw, Jul 7, 2025)
- ffb4150 Merge branch 'main' into int4fp8_moe_new (HaiShaw, Jul 7, 2025)
- af270c6 Merge branch 'main' into int4fp8_moe_new (fxmarty-amd, Jul 15, 2025)
- c708ccd Merge branch 'main' into int4fp8_moe_new (fxmarty-amd, Jul 23, 2025)
- 33649cc Merge branch 'main' into int4fp8_moe_new (fxmarty-amd, Sep 11, 2025)
- fa1804c fix issues (fxmarty-amd, Sep 12, 2025)
- 0137d68 address comment (fxmarty-amd, Sep 12, 2025)
- 36f79da simplification (fxmarty-amd, Sep 12, 2025)
- 37df93f reuse Fp8LinearMethod instead of reimplementing it (fxmarty-amd, Sep 12, 2025)
- ae947d4 update test model (fxmarty-amd, Sep 12, 2025)
- ad6794f Merge branch 'main' into int4fp8_moe_new (fxmarty-amd, Sep 12, 2025)
- 30dcb1e lint (fxmarty-amd, Sep 12, 2025)
- 686bd32 remove unused imports (fxmarty-amd, Sep 12, 2025)
- 263bc85 Merge branch 'main' into int4fp8_moe_new (fxmarty-amd, Dec 15, 2025)
- f59499b Merge branch 'main' into int4fp8_moe_new (HaiShaw, Dec 16, 2025)
- 5f1fbeb Merge branch 'main' into int4fp8_moe_new (HaiShaw, Dec 16, 2025)
- 9c1da32 Merge branch 'main' into int4fp8_moe_new (HaiShaw, Dec 20, 2025)
- 839cd85 Merge branch 'main' into int4fp8_moe_new (fxmarty-amd, Dec 22, 2025)
- 50aa07f rename int4fp8_moe to quark_int4fp8_moe, add test_int4fp8_moe.py to r… (fxmarty-amd, Dec 22, 2025)
- 3815e81 linting (fxmarty-amd, Dec 22, 2025)
- cf80ab5 Merge branch 'int4fp8_moe_new' of https://github.com/fxmarty-amd/sgla… (fxmarty-amd, Dec 22, 2025)
- 3615eb8 fix names (fxmarty-amd, Dec 22, 2025)
- 847744e Merge branch 'main' into int4fp8_moe_new (HaiShaw, Jan 2, 2026)
- 5221165 Merge branch 'main' into int4fp8_moe_new (HaiShaw, Jan 5, 2026)
- 8c4a8fe fix wrong import (fxmarty-amd, Jan 6, 2026)
- d9122e9 Merge branch 'main' into int4fp8_moe_new (fxmarty-amd, Jan 6, 2026)
- 653874d Merge branch 'int4fp8_moe_new' of https://github.com/fxmarty-amd/sgla… (fxmarty-amd, Jan 6, 2026)
- ba306f5 linting (fxmarty-amd, Jan 6, 2026)
- 46639a0 Merge branch 'main' into int4fp8_moe_new (fxmarty-amd, Jan 8, 2026)
- ca48935 Merge branch 'main' into int4fp8_moe_new (fxmarty-amd, Jan 8, 2026)
- f751cff Merge branch 'main' into int4fp8_moe_new (yctseng0211, Jan 9, 2026)
8 changes: 8 additions & 0 deletions docs/advanced_features/quantization.md
@@ -353,6 +353,8 @@ python3 -m sglang.launch_server \

Our team is working on supporting more online quantization methods. SGLang will soon support methods including but not limited to `["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"]`.

### torchao online quantization method

SGLang also supports quantization methods based on [torchao](https://github.com/pytorch/ao). You can simply specify `--torchao-config` on the command line to enable this feature. For example, to enable `int4wo-128` for the model `meta-llama/Meta-Llama-3.1-8B-Instruct`, launch the server with the following command:

```bash
@@ -374,6 +376,12 @@ python3 -m sglang.launch_server \
--port 30000 --host 0.0.0.0
```

### `quark_int4fp8_moe` online quantization method

On AMD GPUs (CDNA3 or CDNA4 architecture), SGLang supports the quantization method `--quantization quark_int4fp8_moe`. It replaces [MoE layers](https://github.com/sgl-project/sglang/blob/v0.4.8/python/sglang/srt/layers/moe/fused_moe_triton/layer.py#L271) originally stored in high precision (bfloat16, float16 or float32) with weights dynamically quantized to int4. During inference, these int4 weights are upcast to float8 so that compute runs in float8 precision, with activations dynamically quantized to float8 on the fly.

Other layers (e.g. projections in the attention layers) have their weights quantized online to float8 directly.
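For example, the server could be launched as follows (the model path below is a placeholder, not a specific tested checkpoint; any MoE model stored in high precision applies):

```bash
python3 -m sglang.launch_server \
    --model-path <path-to-high-precision-moe-model> \
    --quantization quark_int4fp8_moe \
    --port 30000 --host 0.0.0.0
```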

## Reference

- [GPTQModel](https://github.com/ModelCloud/GPTQModel)
1 change: 1 addition & 0 deletions python/sglang/srt/configs/model_config.py
@@ -716,6 +716,7 @@ def _verify_quantization(self) -> None:
"quark",
"mxfp4",
"auto-round",
"quark_int4fp8_moe",
]
optimized_quantization_methods = [
"fp8",
73 changes: 73 additions & 0 deletions python/sglang/srt/layers/int4fp8_utils.py
@@ -0,0 +1,73 @@
"""
Common utilities for quark.
"""

import logging
from typing import Tuple

import torch

logger = logging.getLogger(__name__)


def quantize_fp8_scale_tensorwise(w: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
FP8_MAX = 448.0
scale = w.abs().amax().float() / FP8_MAX
scaled = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
return scaled, scale


def quantize_int4_scale_columnwise(
w: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
S4_MAX = 7
w_flat = w.reshape(-1, w.shape[-1]).float()
scale = w_flat.abs().amax(axis=-1) / S4_MAX
scaled = torch.round(w_flat / scale[:, None]).to(torch.int8).clamp(-S4_MAX, S4_MAX)
return scaled.reshape(w.shape), scale.reshape(w.shape[:-1])


def pack_int4_to_int32(to_pack: torch.Tensor, reorder: bool = True) -> torch.Tensor:
if to_pack.ndim > 2:
raise ValueError(
"Pack: Only supports tensors with dimensions not greater than 2."
)

if reorder:
order_map = [0, 2, 4, 6, 1, 3, 5, 7]
else:
order_map = [0, 1, 2, 3, 4, 5, 6, 7]
pack_num = 8
if to_pack.ndim == 2:
packed = torch.zeros(
to_pack.shape[0],
to_pack.shape[1] // pack_num,
dtype=torch.int32,
device=to_pack.device,
)
new_c = to_pack.shape[1] // pack_num
for c in range(new_c):
for i in range(pack_num):
# Use -3 as an example, high_position is 11111111,cause bit_or generate errors, so we can't use int4 directly
packed_col = to_pack[:, c * pack_num + order_map[i]].to(torch.int32)
packed_col = packed_col & 0x0F
packed[:, c] = torch.bitwise_or(
packed[:, c], torch.bitwise_left_shift(packed_col, i * 4)
)
elif to_pack.ndim == 0:
packed = to_pack.to(torch.int32)
else:
packed = torch.zeros(
to_pack.shape[0] // pack_num, dtype=torch.int32, device=to_pack.device
)
new_c = to_pack.shape[0] // pack_num
for c in range(new_c):
for i in range(pack_num):
# Use -3 as an example, high_position is 11111111,cause bit_or generate errors, so we can't use int4 directly
packed_col = to_pack[c * pack_num + order_map[i]]
packed_col = packed_col & 0x0F
packed[c] = torch.bitwise_or(
packed[c], torch.bitwise_left_shift(packed_col, i * 4)
)

return packed.view(torch.uint32)
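As a sanity check on the scheme above, the following standalone sketch re-implements the int4 column-wise quantization and a vectorized equivalent of the `reorder=False` packing path, together with a hypothetical `unpack_int4` helper (not part of this PR) that verifies the round trip:

```python
import torch

S4_MAX = 7  # symmetric int4 range [-7, 7], as in quantize_int4_scale_columnwise


def quantize_int4(w: torch.Tensor):
    # Symmetric per-row scales over the last dimension.
    scale = w.abs().amax(dim=-1, keepdim=True) / S4_MAX
    q = torch.round(w / scale).clamp(-S4_MAX, S4_MAX).to(torch.int8)
    return q, scale.squeeze(-1)


def pack_int4(q: torch.Tensor) -> torch.Tensor:
    # Vectorized sketch of pack_int4_to_int32(..., reorder=False) for 2D input:
    # 8 consecutive int4 values share one int32, value i occupying bits [4*i, 4*i+4).
    rows, cols = q.shape
    nibbles = (q.to(torch.int32) & 0x0F).reshape(rows, cols // 8, 8)
    packed = nibbles[..., 0]
    for i in range(1, 8):
        packed = packed | (nibbles[..., i] << (i * 4))
    return packed


def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    # Hypothetical inverse: extract each nibble and sign-extend it back to int8.
    nibbles = torch.stack([(packed >> (i * 4)) & 0x0F for i in range(8)], dim=-1)
    nibbles = nibbles.reshape(packed.shape[0], -1)
    return torch.where(nibbles < 8, nibbles, nibbles - 16).to(torch.int8)


torch.manual_seed(0)
w = torch.randn(4, 16)
q, scale = quantize_int4(w)
packed = pack_int4(q)           # shape (4, 2): two int32 words per row of 16 values
restored = unpack_int4(packed)  # exact round trip back to the int4 values
```

The round trip is exact because packing only rearranges bits; the lossy step is the rounding in `quantize_int4`, whose error is bounded by half a scale step per element.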
1 change: 1 addition & 0 deletions python/sglang/srt/layers/linear.py
@@ -66,6 +66,7 @@
"ModelOptFp4LinearMethod",
"IPEXAWQLinearMethod",
"PetitNvFp4LinearMethod",
"QuarkInt4Fp8LinearMethod",
]

_is_cpu = is_cpu()
2 changes: 2 additions & 0 deletions python/sglang/srt/layers/quantization/__init__.py
@@ -36,6 +36,7 @@ def override_quantization_method(self, *args, **kwargs):
from sglang.srt.layers.quantization.petit import PetitNvFp4Config
from sglang.srt.layers.quantization.qoq import QoQConfig
from sglang.srt.layers.quantization.quark.quark import QuarkConfig
from sglang.srt.layers.quantization.quark_int4fp8_moe import QuarkInt4Fp8Config
from sglang.srt.layers.quantization.w4afp8 import W4AFp8Config
from sglang.srt.layers.quantization.w8a8_fp8 import W8A8Fp8Config
from sglang.srt.layers.quantization.w8a8_int8 import W8A8Int8Config
@@ -68,6 +69,7 @@ def override_quantization_method(self, *args, **kwargs):
"fbgemm_fp8": FBGEMMFp8Config,
"quark": QuarkConfig,
"auto-round": AutoRoundConfig,
"quark_int4fp8_moe": QuarkInt4Fp8Config,
}

