[Feature][ROCM] add online int4_fp8_moe quant feature #6238
DehuaTang wants to merge 1 commit into sgl-project:main
Conversation
```python
    "compressed-tensors",
    "fbgemm_fp8",
    "w8a8_fp8",
    "quark_int4fp8_moe",
```
How is this PR related to quark? At the moment, there is no utility from quark used in this PR.
The quantized weights produced in online_quant should really be generated with Quark's RealQuantizer. Due to time constraints, a simple standalone function was used instead.
Got it. Should we have a code structure like

```
quark/
  schemes/
    quark_scheme.py
    quark_int4fp8.py
  quark.py
```

similar to vllm? It might be easier to extend with mxfp4 then.
No. sglang does not support the quark format yet.
Okay, it can be done in another PR.
```python
)


class QuarkInt4Fp8MoEMethod:
```
#4152 applied changes to the Fp8MoEMethod class so it can load int4 weights, gated by the environment variable SGLANG_INT4_WEIGHT.

I am confused about the need to introduce this new class QuarkInt4Fp8MoEMethod. Once weights are quantized on the fly to int4-fp8, the execution path should be shared with what was done in #4152.

Alternatively, Fp8MoEMethod should be cleaned of its int4-fp8 related code, and int4-fp8 should always be handled by QuarkInt4Fp8MoEMethod, no matter whether the weight quantization is online or offline.
It seems a lot of code is duplicated here
Yes, the current code does duplicate some of the PR you mentioned. But we don't want to reuse the previous path because it relies on environment variables for control; avoiding that is our improvement.
```python
    return []


class QuarkInt4Fp8LinearMethod(LinearMethodBase):
```
In this PR, only w1/w2/w3 use int4-fp8; the linear layers in attention are still fp8, which is why this function is touched.
```python
logger = logging.getLogger(__name__)


def apply_quark_quant_config_to_model(model_config, quark_config):
```
```python
def quantize_fp8_scale_tensorwise(w):
    FP8_MAX = 448.0
    scale = w.abs().amax().float() / FP8_MAX
    scaled = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return scaled, scale
```
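For intuition, here is a minimal pure-Python sketch of the tensorwise scaling above (the fp8 cast itself is omitted, and `scale_tensorwise` is a hypothetical name, not from the PR): dividing by `scale = amax / FP8_MAX` maps the largest-magnitude weight exactly onto the edge of the fp8 range.

```python
# Hedged sketch of tensorwise scaling; no float8 dtype involved.
FP8_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

def scale_tensorwise(values):
    # One scale for the whole tensor, derived from its absolute maximum.
    amax = max(abs(v) for v in values)
    scale = amax / FP8_MAX
    scaled = [max(-FP8_MAX, min(FP8_MAX, v / scale)) for v in values]
    return scaled, scale

scaled, scale = scale_tensorwise([-1.75, 0.875])
# the amax element (-1.75) lands exactly on -FP8_MAX
```

The clamp is a no-op for the amax element but guards the rest against rounding past the fp8 range.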
```python
def quantize_int4_scale_columnwise(w):
    S4_MAX = 7
    w_flat = w.reshape(-1, w.shape[-1]).float()
    scale = w_flat.abs().amax(axis=-1) / S4_MAX
    scaled = torch.round(w_flat / scale[:, None]).to(torch.int8).clamp(-S4_MAX, S4_MAX)
    return scaled.reshape(w.shape), scale.reshape(w.shape[:-1])
```
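The same idea in a one-row pure-Python sketch (`quantize_row_int4` is a hypothetical helper mirroring the per-row logic above): each row gets its own scale, the amax element maps to ±7, and dequantization is simply `q[i] * scale`.

```python
# Hedged sketch of symmetric per-row int4 quantization.
S4_MAX = 7  # signed int4 range used here is [-7, 7]

def quantize_row_int4(row):
    # Row-wise scale so the largest-magnitude entry maps to +/-S4_MAX.
    scale = max(abs(v) for v in row) / S4_MAX
    q = [max(-S4_MAX, min(S4_MAX, round(v / scale))) for v in row]
    return q, scale

q, scale = quantize_row_int4([0.7, -0.3, 0.1])
# q is [7, -3, 1]; dequantized values are q[i] * scale
```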
```python
def pack(to_pack: torch.Tensor, reorder: bool = True) -> torch.Tensor:
    if to_pack.ndim > 2:
        raise ValueError("Pack: only supports tensors with at most 2 dimensions.")

    if reorder:
        order_map = [0, 2, 4, 6, 1, 3, 5, 7]
    else:
        order_map = [0, 1, 2, 3, 4, 5, 6, 7]
    pack_num = 8
    if to_pack.ndim == 2:
        new_c = to_pack.shape[1] // pack_num
        packed = torch.zeros(to_pack.shape[0], new_c, dtype=torch.int32, device=to_pack.device)
        for c in range(new_c):
            for i in range(pack_num):
                # Mask to the low nibble before shifting: e.g. -3 is sign-extended
                # to ...11111101, and without & 0x0F the high bits would corrupt
                # the bitwise_or, so we cannot use the int4 values directly.
                packed_col = to_pack[:, c * pack_num + order_map[i]].to(torch.int32)
                packed_col = packed_col & 0x0F
                packed[:, c] = torch.bitwise_or(packed[:, c], torch.bitwise_left_shift(packed_col, i * 4))
    elif to_pack.ndim == 0:
        packed = to_pack.to(torch.int32)
    else:
        new_c = to_pack.shape[0] // pack_num
        packed = torch.zeros(new_c, dtype=torch.int32, device=to_pack.device)
        for c in range(new_c):
            for i in range(pack_num):
                # Cast to int32 first: shifting an int8 by up to 28 bits overflows.
                packed_col = to_pack[c * pack_num + order_map[i]].to(torch.int32)
                packed_col = packed_col & 0x0F
                packed[c] = torch.bitwise_or(packed[c], torch.bitwise_left_shift(packed_col, i * 4))

    return packed
```
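A quick round-trip check of the nibble packing for one group of 8 values (`pack8`/`unpack8` are hypothetical pure-Python analogues of `pack()`, not part of the PR) makes the `& 0x0F` masking and the reorder map concrete:

```python
# Hedged sketch: pack 8 signed int4 values into one 32-bit word and back.
ORDER_MAP = [0, 2, 4, 6, 1, 3, 5, 7]

def pack8(vals):
    # Mask each signed value to its low nibble (two's complement) before OR-ing.
    word = 0
    for i in range(8):
        word |= (vals[ORDER_MAP[i]] & 0x0F) << (i * 4)
    return word

def unpack8(word):
    vals = [0] * 8
    for i in range(8):
        nib = (word >> (i * 4)) & 0x0F
        # Sign-extend the nibble back to a signed int4 value.
        vals[ORDER_MAP[i]] = nib - 16 if nib >= 8 else nib
    return vals

vals = [-3, 1, 0, 7, -7, 2, -1, 5]
assert unpack8(pack8(vals)) == vals
```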
```python
def quark_quant_weights(weights_dict):
    for name, loaded_weight in tqdm(weights_dict, desc="Quark Online Quantizing"):
        if "w1.weight" in name or "w2.weight" in name or "w3.weight" in name:
            fp8_w, fp8_scale = quantize_fp8_scale_tensorwise(loaded_weight)
            int4_w, int4_scale = quantize_int4_scale_columnwise(loaded_weight)

            int4_w = pack(int4_w)
            int4_scale /= fp8_scale

            yield name, int4_w
            yield name + "_scale", fp8_scale
            yield name + "_scale1", int4_scale
        elif "proj.weight" in name:
            fp8_w, fp8_scale = quantize_fp8_scale_tensorwise(loaded_weight)

            yield name, fp8_w
            yield name + "_scale", fp8_scale
        else:
            yield name, loaded_weight
```
I'd move these functions out.
Yes. If you have time, it would be best to implement this with Quark's own functions.
I meant move them out of apply_quark_quant_config_to_model. But yeah, eventually we might use quark utilities here.
```diff
     @staticmethod
-    def load_weights_and_postprocess(model, weights, target_device):
+    def load_weights_and_postprocess(model, model_config, weights, target_device):
```
DefaultModelLoader.load_weights_and_postprocess(model, iter, target_device) is called in model_runner.py but was not updated to the new signature.
```python
            yield name + "_scale", fp8_scale
            yield name + "_scale1", int4_scale
```
I'd give more explicit names
```python
    return []


class QuarkInt4Fp8LinearMethod(LinearMethodBase):
```
Isn't this just FP8 static quantization for weights with dynamic quantization for activations? The class should probably not be named QuarkInt4Fp8LinearMethod, as int4 is not involved here.
```diff
     @staticmethod
-    def load_weights_and_postprocess(model, weights, target_device):
+    def load_weights_and_postprocess(model, model_config, weights, target_device):
+        weights = online_quant(model_config, weights)
```
It looks a bit odd to me to call a quark/int4-fp8 specific method here.

Couldn't we instead make the weight_loader method handle on-the-fly quantization? WDYT?
Like this in QuarkInt4Fp8MoEMethod:

```python
def create_weights(
    self,
    layer: torch.nn.Module,
    num_experts: int,
    hidden_size: int,
    intermediate_size: int,
    params_dtype: torch.dtype,
    **extra_weight_attrs,
):
    # ...
    original_weight_loader = extra_weight_attrs.get("weight_loader")

    def weight_loader(
        param: torch.nn.Parameter,
        loaded_weight: torch.Tensor,
        weight_name: str,
        shard_id: str,
        expert_id: int,
    ) -> None:
        _, fp8_scale = quantize_fp8_scale_tensorwise(loaded_weight)
        int4_w, int4_scale = quantize_int4_scale_columnwise(loaded_weight)
        original_weight_loader(
            param,
            int4_w,
            weight_name,
            shard_id=shard_id,
            expert_id=expert_id,
        )
        # maybe need to care about TP > 1
        param.fp8_scale = fp8_scale
        param.int4_scale = int4_scale

    w13_weight = ModelWeightParameter(
        data=torch.empty(
            num_experts,
            2 * intermediate_size,
            hidden_size // 8,
            dtype=params_dtype,
        ),
        input_dim=2,
        output_dim=1,
        weight_loader=weight_loader,
    )
    layer.register_parameter("w13_weight", w13_weight)
    set_weight_attrs(w13_weight, extra_weight_attrs)
    # ...

def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    # properly register int4 and fp8 scales here
    ...
```
Maybe you are right. That approach is simple to implement and compatible with the TP > 1 case.
@merrymercy @zhyncs @HaiShaw do you have comments?
Replaced by #7392, feel free to close this.
Quark int4_fp8_moe on-the-fly quantization.

In this PR, we support on-the-fly quantization of int4_fp8_moe using Quark.

```shell
python bench_one_batch.py --model-path /model/mistralai/Mixtral-8x7B-Instruct-v0.1 --correct --quark-config int4fp8_moe
```