w4a16 nvfp4 quant support by JINO-ROHIT · Pull Request #25535 · sgl-project/sglang

JINO-ROHIT · 2026-05-17T17:07:06Z

Add NVFP4 W4A16 quantization support for Blackwell.

CI States

Latest PR Test (Base): ❌ Missing run-ci label — add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked — run-ci is required first.

gemini-code-assist

Code Review

This pull request introduces the CompressedTensorsW4A16Fp4 quantization scheme, including weight initialization, post-loading processing, and the GEMM application logic. The review feedback identifies several critical issues: the derivation of activation scales from weight scales is mathematically incorrect for weight-only quantization, and the implementation lacks necessary padding logic for the FlashInfer/TRTLLM backend to meet kernel alignment requirements. Additionally, the GEMM operation requires activation padding and output slicing to handle these alignments correctly. A code suggestion was also provided to ensure the output shape calculation supports 3D input tensors.

gemini-code-assist · 2026-05-17T17:08:38Z

+        weight_gs = layer.weight_global_scale.max().to(torch.float32)
+        input_gs = (1.0 / weight_gs).to(torch.float32)
+        layer.input_global_scale = Parameter(input_gs, requires_grad=False)
+        layer.weight_global_scale = Parameter(weight_gs, requires_grad=False)


Synthesizing a static activation scale (input_gs) from the weight scale (weight_gs) is mathematically incorrect for w4a16 (weight-only) quantization. Since the checkpoint does not provide activation scales, the model should ideally use dynamic quantization for activations (calculating the scale from the input x at runtime) to maintain accuracy. Using a fixed scale derived from weights will likely result in poor model performance.
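To illustrate the difference, here is a minimal sketch of a dynamically computed global scale. It assumes the common NVFP4 convention of dividing the product of the FP8 E4M3 and FP4 E2M1 maxima by the tensor's absolute maximum; `dynamic_global_scale` is a hypothetical helper, not a function from this PR or from sglang, and it works on a plain Python list rather than a tensor.

```python
# Assumed convention: global_scale = (FP8_E4M3_MAX * FP4_E2M1_MAX) / amax(x).
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
FP4_E2M1_MAX = 6.0    # largest FP4 (E2M1) magnitude


def dynamic_global_scale(x_values, eps=1e-12):
    """Hypothetical helper: per-tensor scale taken from the activations themselves."""
    amax = max((abs(v) for v in x_values), default=0.0)
    # eps guards against division by zero for an all-zero input.
    return (FP8_E4M3_MAX * FP4_E2M1_MAX) / max(amax, eps)


# Two inputs with very different ranges get very different scales,
# which a single static value derived from the weights cannot track.
small = dynamic_global_scale([0.1, -0.5, 0.25])
large = dynamic_global_scale([10.0, -50.0, 25.0])
```

The scale follows the input range at runtime, whereas `1.0 / weight_gs` stays fixed regardless of what the activations look like.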

gemini-code-assist · 2026-05-17T17:08:38Z

+        if get_fp4_gemm_runner_backend().is_flashinfer_trtllm():
+            from flashinfer import shuffle_matrix_a, shuffle_matrix_sf_a
+
+            weight = layer.weight_packed.data
+            weight_scale = layer.weight_scale.data
+
+            epilogue_tile_m = 128
+            weight = shuffle_matrix_a(weight.view(torch.uint8), epilogue_tile_m)
+            weight_scale = (
+                shuffle_matrix_sf_a(weight_scale.view(torch.uint8), epilogue_tile_m)
+                .reshape(weight_scale.shape)
+                .view(torch.float8_e4m3fn)
+            )
+
+            layer.weight_scale = Parameter(weight_scale, requires_grad=False)
+            layer.weight_packed = Parameter(weight, requires_grad=False)


This section is missing the necessary padding logic for the flashinfer_trtllm backend. The FP4 kernels require specific alignments (e.g., N dimension must be a multiple of 128, and K dimension must be a multiple of 32). Without padding, layers with non-aligned dimensions will cause kernel failures or incorrect results. Please refer to the padding implementation in ModelOptFp4LinearMethod.process_weights_after_loading within modelopt_quant.py and apply similar logic here.
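As a rough sketch of the arithmetic involved, the padded dimensions can be computed by rounding up to the alignments quoted above (128 for N, 32 for K). The helper names are hypothetical and this is not the logic from `modelopt_quant.py`, which should remain the reference; the real code also has to account for the packed FP4 layout and the block-scale tensor.

```python
def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def padded_weight_dims(n, k, n_align=128, k_align=32):
    """Hypothetical helper: padded (N, K) plus the amount of padding added.

    The alignment defaults are the example values from the review comment.
    """
    n_pad = round_up(n, n_align)
    k_pad = round_up(k, k_align)
    return n_pad, k_pad, n_pad - n, k_pad - k


aligned = padded_weight_dims(4096, 4096)
unaligned = padded_weight_dims(1000, 4100)
```

An already aligned layer needs no padding, while the unaligned one grows to 1024 x 4128 and must remember the original N so the output can be trimmed later.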

gemini-code-assist · 2026-05-17T17:08:39Z

+        out = fp4_gemm(
+            x_fp4,
+            w,
+            x_blockscale,
+            w_blockscale,
+            layer.alpha,
+            output_dtype,
+            w_n,
+        )


The fp4_gemm call is missing activation padding and output slicing. If the weights are padded to meet alignment requirements (as noted in the process_weights_after_loading feedback), the activations must be padded in the K-dimension to match, and the resulting output must be sliced to remove the N-dimension padding. See ModelOptFp4LinearMethod.apply for reference.
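The pad-then-slice pattern can be sketched on small dense matrices. This is an illustration in plain Python with hypothetical helpers, not the `fp4_gemm` path itself: it only shows that zero-padding K on both operands and zero-padding N on the weight leaves the first N output columns unchanged.

```python
def pad_cols(matrix, width):
    """Zero-pad every row on the right up to `width` columns."""
    return [row + [0.0] * (width - len(row)) for row in matrix]


def pad_rows(matrix, height):
    """Append zero rows until the matrix has `height` rows."""
    width = len(matrix[0])
    return matrix + [[0.0] * width for _ in range(height - len(matrix))]


def gemm_nt(x, w):
    """x: [M, K], w: [N, K] -> [M, N], i.e. x @ w.T."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]


def padded_gemm(x, w, n_pad, k_pad):
    n = len(w)                                 # original N, kept for slicing
    x_p = pad_cols(x, k_pad)                   # activations padded along K
    w_p = pad_rows(pad_cols(w, k_pad), n_pad)  # weights padded along K and N
    out_p = gemm_nt(x_p, w_p)                  # shape [M, n_pad]
    return [row[:n] for row in out_p]          # drop the N padding


x = [[1.0, 2.0, 3.0]]
w = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
reference = gemm_nt(x, w)
result = padded_gemm(x, w, n_pad=4, k_pad=8)
```

In a real linear method the weight padding would happen once at load time, and only the activation padding and output slicing would run per call.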

gemini-code-assist · 2026-05-17T17:08:39Z

+    ) -> torch.Tensor:
+        output_dtype = x.dtype
+        w_n, _ = layer.weight_packed.shape
+        output_shape = [x.shape[0], w_n]


The output_shape calculation assumes a 2D input tensor. If the input x is 3D (e.g., [batch, seq, hidden]), this will lead to an incorrect shape and a runtime error during the view operation. Using x.shape[:-1] ensures compatibility with both 2D and 3D inputs.

Suggested change

-        output_shape = [x.shape[0], w_n]
+        output_shape = list(x.shape[:-1]) + [w_n]
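A small shape-only sketch shows why the slice-based form works for both ranks. It uses plain tuples in place of tensors, and `gemm_shapes` is a hypothetical helper for illustration: the GEMM sees the input flattened to 2D, and the result is viewed back to the leading dimensions plus N.

```python
from math import prod


def gemm_shapes(x_shape, w_n):
    """Hypothetical helper: (flattened 2D input shape, final output shape)."""
    k = x_shape[-1]
    flat_shape = (prod(x_shape[:-1]), k)       # what the GEMM kernel sees
    output_shape = list(x_shape[:-1]) + [w_n]  # shape restored after the GEMM
    return flat_shape, output_shape


two_d = gemm_shapes((8, 4096), 1024)
three_d = gemm_shapes((2, 16, 4096), 1024)
```

With `[x.shape[0], w_n]` the 3D case would ask for 2 x 1024 elements while the GEMM produced 32 x 1024, which is the mismatch the comment warns about.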

w4a16 nvfp14 quant support

5795b5b

JINO-ROHIT requested review from AniZpZ, BBuf, Edwardf0t1, FlamingoPg, HaiShaw and ch-wan as code owners May 17, 2026 17:07

github-actions Bot added the blackwell SM100/SM120 label May 17, 2026

gemini-code-assist Bot reviewed May 17, 2026

JINO-ROHIT wants to merge 1 commit into sgl-project:main from JINO-ROHIT:nvfp4w4a16
