[W8A8 Block Linear Refactor][2/N] Make Fp8 block linear Op use kernel abstraction. #33891

Closed
maralbahari wants to merge 15 commits into vllm-project:main from EmbeddedLLM:2n-block-scaled-rfc-pr

Conversation

@maralbahari (Contributor) commented Feb 5, 2026

Purpose

Closing this PR in favor of #33892.

Test Plan

Does not require testing since the code path is not utilized yet.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: maral <maralbahari.98@gmail.com>
@gemini-code-assist (bot) left a comment:

Code Review

This PR introduces a new kernel abstraction for FP8 block-scaled linear layers, which is a great step towards improving code clarity and maintainability. The changes are extensive and well-documented. However, I've found several critical issues in the implementation of the new DynamicMMLinearKernel and its integration, which could lead to runtime errors. These include logical errors in support checks, typos causing NameError, and type incompatibilities in kernel initialization. Please see the detailed comments for each issue.

maralbahari and others added 7 commits February 5, 2026 18:04

…r.py
…kScaledMMLinearKernel.py
…ement for cutlass and fix type error in dynamic deepgemm/flash-infer

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
@maralbahari (Author) commented:

@robertgshaw2-redhat @ProExpertProg @mgoin could you review this PR? Appreciate it.

@mergify (bot) commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maralbahari.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 24, 2026
Signed-off-by: maral <maralbahari.98@gmail.com>
@mergify mergify bot removed the needs-rebase label Feb 24, 2026
Signed-off-by: maral <maralbahari.98@gmail.com>
@tjtanaa tjtanaa added the rocm Related to AMD ROCm label Feb 25, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 25, 2026
if (
    self.flashinfer_deepgemm_kernel is not None
    and should_use_flashinfer_for_blockscale_fp8_gemm(
        True, output_dtype, input_2d, weight
@tjtanaa (Collaborator) commented Feb 25, 2026

This is set to True because FlashInferFp8DeepGEMMDynamicBlockScaledKernel's support check, is_flashinfer_fp8_blockscale_gemm_supported(), is evaluated in __init__():

        self.flashinfer_deepgemm_kernel: (
            FlashInferFp8DeepGEMMDynamicBlockScaledKernel | None
        ) = None
        if FlashInferFp8DeepGEMMDynamicBlockScaledKernel.is_supported()[0]:

So the condition self.flashinfer_deepgemm_kernel is not None already tests whether FlashInfer is supported, and we can set the first argument of should_use_flashinfer_for_blockscale_fp8_gemm to True.
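The pattern discussed above can be sketched in isolation. All class and function names below are stand-ins for illustration, not vLLM's actual API: the point is that the platform support check runs once in `__init__`, so the per-call path only needs a cheap `is not None` test before the shape-dependent heuristic.

```python
class _FakeFlashInferKernel:
    """Stand-in for FlashInferFp8DeepGEMMDynamicBlockScaledKernel."""

    @classmethod
    def is_supported(cls):
        # The real check returns (bool, reason); hardcoded for the sketch.
        return (True, None)


def should_use_flashinfer_for_blockscale_fp8_gemm(is_supported, m):
    # Shape-dependent heuristic; the first argument can be passed as True
    # because the caller guaranteed support at construction time.
    return is_supported and m >= 128


class DynamicLinearOp:
    def __init__(self):
        # Support is evaluated exactly once, here.
        self.flashinfer_deepgemm_kernel = None
        if _FakeFlashInferKernel.is_supported()[0]:
            self.flashinfer_deepgemm_kernel = _FakeFlashInferKernel()

    def pick(self, m):
        # `is not None` short-circuits: the heuristic never runs on
        # platforms where the kernel was not constructed.
        if (self.flashinfer_deepgemm_kernel is not None
                and should_use_flashinfer_for_blockscale_fp8_gemm(True, m)):
            return "flashinfer"
        return "triton"
```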

@tjtanaa (Collaborator) commented:

The benefit of checking self.flashinfer_deepgemm_kernel is not None first is that it short-circuits the conditions.

@tjtanaa (Collaborator) commented:

We should try static dispatching, either in this PR or in an upcoming one. Doing it in another PR confines this PR's changes to pure refactoring. Either way works for me.
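As a rough illustration of the static-dispatch idea (hypothetical names only, not the vLLM implementation): the kernel choice is resolved once at construction and stored as a bound method, so the forward path carries no branching at all.

```python
class StaticDispatchOp:
    def __init__(self, flashinfer_available: bool):
        # Resolve the apply function once; the hot path is a single call
        # through self._apply with no per-invocation condition checks.
        if flashinfer_available:
            self._apply = self._apply_flashinfer
        else:
            self._apply = self._apply_triton

    def _apply_flashinfer(self, x):
        # Placeholder for the FlashInfer-backed GEMM path.
        return ("flashinfer", x)

    def _apply_triton(self, x):
        # Placeholder for the Triton fallback path.
        return ("triton", x)

    def apply_weights(self, x):
        return self._apply(x)
```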

    and should_use_flashinfer_for_blockscale_fp8_gemm(
        True, output_dtype, input_2d, weight
    )
    and should_use_deepgemm_for_fp8_linear(output_dtype, weight, True)
@tjtanaa (Collaborator) commented:

The reason that the last argument of should_use_deepgemm_for_fp8_linear can be set to True is the same as in https://github.com/vllm-project/vllm/pull/33891/changes#r2851594385

    return self.flashinfer_deepgemm_kernel.apply_weights(layer, x, bias)

if self.deepgemm_kernel is not None and should_use_deepgemm_for_fp8_linear(
    output_dtype, weight, True
@tjtanaa (Collaborator) commented:

The reason that the last argument of should_use_deepgemm_for_fp8_linear can be set to True is the same as in https://github.com/vllm-project/vllm/pull/33891/changes#r2851594385

self.is_deep_gemm_supported = is_deep_gemm_supported()
self.input_quant_op = QuantFP8(
    static=False,
    group_shape=act_scale_descriptor.group_shape,
@tjtanaa (Collaborator) commented:

Missing tma_aligned_scales=envs.VLLM_USE_DEEP_GEMM_TMA_ALIGNED_SCALES,

maralbahari (Author) replied:

@tjtanaa added and updated the 3/N PR as well.

act_scale_descriptor = config.activation_quant_key.scale
self.is_deep_gemm_supported = is_deep_gemm_supported()
self.input_quant_op = QuantFP8(
    static=False,
@tjtanaa (Collaborator) commented:

Missing column_major_scales=True,

maralbahari (Author) replied:

@tjtanaa added.

    return [CutlassFp8BlockScaledMMKernel, TritonFp8BlockScaledMMKernel]

@classmethod
def is_supported(cls, compute_capability=None):
@tjtanaa (Collaborator) commented:

It seems DeepGEMM hardcodes the output_dtype of its output tensor to torch.bfloat16; we can assume that is a condition we should add to is_supported.
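A minimal sketch of folding that constraint into is_supported(), assuming the (ok, reason) tuple convention seen elsewhere in this review. The class name and the string-based dtype are illustrative stand-ins only, so the sketch stays self-contained:

```python
class DeepGemmBlockScaledMMKernel:
    # DeepGEMM hardcodes its output buffer to bfloat16 (string stands in
    # for torch.bfloat16 here to keep the sketch dependency-free).
    SUPPORTED_OUT_DTYPE = "bfloat16"

    @classmethod
    def is_supported(cls, output_dtype="bfloat16", compute_capability=None):
        # Any requested output dtype other than bfloat16 means this
        # kernel cannot be selected for the layer.
        if output_dtype != cls.SUPPORTED_OUT_DTYPE:
            return False, f"unsupported output dtype: {output_dtype}"
        return True, None
```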

@tjtanaa (Collaborator) commented:

FlashInfer and DeepGEMM are not following current abstraction. They are wrapping the quant ops in a direct_register_custom_op as shown in

def run_flashinfer_deepgemm_swapAB(
    input: torch.Tensor,
    weight: torch.Tensor,
    weight_scale: torch.Tensor,
) -> torch.Tensor:
    return flashinfer_fp8_blockscale_gemm(
        input=input,
        weight=weight,
        weight_scale=weight_scale,
        out_dtype=torch.bfloat16,
    )

and

def run_deepgemm(
    input: torch.Tensor,
    weight: torch.Tensor,
    weight_scale: torch.Tensor,
) -> torch.Tensor:
    q_input, input_scale = per_token_group_quant_fp8(
        input,
        group_size=group_size,
        column_major_scales=True,
        use_ue8m0=use_deep_gemm_e8m0,
    )
    output = torch.empty(
        (q_input.shape[0], weight.shape[0]),
        dtype=torch.bfloat16,
        device=q_input.device,
    )
    fp8_gemm_nt(
        (q_input, input_scale),
        (weight, weight_scale),
        output,
        is_deep_gemm_e8m0_used=use_deep_gemm_e8m0,
    )
    return output

self.input_quant_op = QuantFP8(
    static=act_scale_descriptor.static,
    group_shape=act_scale_descriptor.group_shape,
    num_token_padding=self.get_output_padding(),
@tjtanaa (Collaborator) commented:

use_ue8m0: bool | None = None, # for Torch compile

Following the implementation here, it seems we always explicitly set use_ue8m0:

if use_cutlass:
    return self._run_cutlass, (
        QuantFP8(
            False,
            self.act_quant_group_shape,
            column_major_scales=True,
            use_ue8m0=False,
        )
    )
if use_aiter_and_is_supported:
    return self._run_aiter, QuantFP8(
        False,
        self.act_quant_group_shape,
        column_major_scales=False,
        use_ue8m0=False,
    )
return self._run_triton, (
    QuantFP8(
        False,
        self.act_quant_group_shape,
        column_major_scales=False,
        use_ue8m0=False,
    )
)

def __init__(self, config: FP8ScaledMMLinearLayerConfig) -> None:
    super().__init__(config)
    act_scale_descriptor = config.activation_quant_key.scale
    self.input_quant_op = QuantFP8(
@tjtanaa (Collaborator) commented:

I noticed that FlashInferFp8BlockScaledMMKernel is not using this quant op; can you add a comment explaining why it is needed here?

return torch.ops.vllm.flashinfer_fp8_blockscale_gemm(
    A,  # BF16 input
    B,  # FP8 weight
    Bs,  # Weight scales
maralbahari (Author) replied:

@tjtanaa this is just a placeholder; the issue is addressed in the follow-up PR, which independently registers FlashInfer's swap gemm: #33892.

@tjtanaa (Collaborator) commented Feb 25, 2026

Since the abstraction and code introduced in this PR are not yet used and serve only to highlight the core changes of refactoring the FP8 block linear op, we will proceed directly with the 3/N PR #33892, which uses the code introduced here and validates it through CI.

Signed-off-by: maral <maralbahari.98@gmail.com>
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Feb 26, 2026
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Feb 26, 2026

Labels

nvidia, rocm (Related to AMD ROCm)

Projects

AMD: Done
NVIDIA: Done