[W8A8 Block Linear Refactor][1/N] Keep all quantization types in the `QuantFP8` class. #33047
Conversation
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request is a significant refactoring that modularizes the FP8 input quantization logic into a kernel-based architecture. The introduction of an abstract InputQuantKernel and platform-specific implementations is a great step towards better code organization and extensibility. However, I've found a few critical issues in the new kernel implementations that need to be addressed. Specifically, there are bugs in the CudaInputQuantKernel and TritonInputQuantKernel related to handling static quantization and incorrect argument passing. There is also a consistent typo in a key method name across the new abstract class and its implementations.
Review comments (since resolved) on:
- vllm/model_executor/layers/quantization/kernels/input_quant/cuda.py
- vllm/model_executor/layers/quantization/kernels/input_quant/triton.py
- vllm/model_executor/layers/quantization/kernels/input_quant/InputQuantKernel.py
- vllm/model_executor/layers/quantization/kernels/input_quant/__init__.py
@cursor review
Cursor Bugbot has reviewed your changes and found 5 potential issues.
Bugbot review comments (since resolved) on:
- vllm/model_executor/layers/quantization/kernels/input_quant/triton.py
- vllm/model_executor/layers/quantization/kernels/input_quant/aiter.py
- vllm/model_executor/layers/quantization/kernels/input_quant/InputQuantKernel.py
@ProExpertProg @robertgshaw2-redhat Kindly review this PR as part of #31818. Correct me if the conditional support logic is wrong for any of the input quantization kernels, e.g. in terms of per_tensor, per_token, and per_group methods and device specs.
ProExpertProg left a comment
Nice cleanup, thanks!
    # Fallback to native implementation for group quantization.
    if self.is_group_quant:
        assert scale is None, "Dynamic group quantization does not use scale"
        return self._quantize_group_native(x)
Should we fall back to the vLLM hipified CUDA kernel here? Or is the per-token group quant kernel not supported in the ROCm build of vLLM?
The group quant in forward_cuda is either DeepGEMM or the CUDA kernel behind the fp8_utils.per_token_group_quant_fp8 function, which is not supported on ROCm. The fallback is either Triton or native; for Triton we control it with the kwargs. If the code reaches this line, then we fall back to native.
Got it, makes sense! We should try to get fp8_utils.per_token_group_quant_fp8 supported on ROCm if possible.
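The fallback order discussed in this thread (DeepGEMM or the CUDA kernel on CUDA builds, then Triton on ROCm when requested via kwargs, and finally the native implementation) can be sketched as a small dispatch function. This is a toy illustration of the ordering only; apart from per_token_group_quant_fp8, the names and the string results are assumptions, not real vLLM code.

```python
# Toy sketch of the group-quant fallback order from the discussion above:
# CUDA builds use DeepGEMM or fp8_utils.per_token_group_quant_fp8 (not
# supported on ROCm); ROCm tries Triton (selected via kwargs) and then
# falls back to native. Names other than per_token_group_quant_fp8 are
# illustrative, and strings stand in for the actual kernel calls.

def quantize_group(x, *, is_rocm: bool, use_triton: bool) -> str:
    # `x` would be the activation tensor; unused in this dispatch-only sketch.
    if not is_rocm:
        # CUDA path: DeepGEMM or the CUDA kernel, neither available on ROCm.
        return "cuda:per_token_group_quant_fp8"
    if use_triton:
        return "rocm:triton"
    # Reaching this line means neither the CUDA kernel nor Triton applies.
    return "rocm:native"
```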
Btw, I want this to wait for #33293 so that we can run the e2e fusion tests.
    x: torch.Tensor,
    scale: torch.Tensor | None = None,
    scale_ub: torch.Tensor | None = None,
    **kwargs,
Is there a reason to use kwargs and not pass use_triton as a regular arg?
Because it is only used in forward_hip. The forward passes all have to follow the same signature, otherwise there is a mypy error, and this keyword argument is only used for the ROCm platform use case.
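The constraint described here, that every platform override must share one signature or mypy rejects the subclass, with a ROCm-only flag passed through **kwargs, can be illustrated with a minimal sketch. All class and method names here are hypothetical, not the actual vLLM classes.

```python
# Minimal illustration (hypothetical names) of why the ROCm-only flag is
# passed via **kwargs: every platform override keeps the same signature,
# so mypy sees the subclasses as compatible with the base class, while
# only the ROCm path actually consults the extra keyword argument.


class QuantBase:
    def forward(self, x, scale=None, scale_ub=None, **kwargs):
        raise NotImplementedError


class QuantCuda(QuantBase):
    def forward(self, x, scale=None, scale_ub=None, **kwargs):
        return ("cuda", x)  # ignores kwargs entirely


class QuantHip(QuantBase):
    def forward(self, x, scale=None, scale_ub=None, **kwargs):
        # Only the ROCm path reads the platform-specific keyword argument.
        use_triton = kwargs.get("use_triton", False)
        return ("hip-triton" if use_triton else "hip-native", x)
```

The alternative, adding `use_triton` as a regular parameter on the base class, would force every platform's forward pass to carry an argument that only ROCm uses.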
CI failures seem related, please take a look.
#33462 just merged, can you merge from main?
…uantFP8` class. (vllm-project#33047) Signed-off-by: maral <maralbahari.98@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Pai <416932041@qq.com>
Purpose
This PR moves group quantization methods into the `QuantFP8` class. This is PR 1/2 in the series of updates for the block_scale_linear kernels mentioned in #31818.
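A rough picture of what keeping all quantization types in one class means: per-tensor, per-token, and per-group granularities differ mainly in how many scales they produce for a given activation shape, and one class can own that logic behind a single entry point. The enum, class, and method names below are illustrative assumptions, not the actual `QuantFP8` API.

```python
# Illustrative sketch of one class owning every quantization granularity.
# The real vllm QuantFP8 API differs; all names here are assumptions.
import math
from enum import Enum


class Granularity(Enum):
    PER_TENSOR = "per_tensor"  # one scale for the whole tensor
    PER_TOKEN = "per_token"    # one scale per row (token)
    PER_GROUP = "per_group"    # one scale per fixed-size group within a row


class QuantFP8Sketch:
    def __init__(self, granularity: Granularity, group_size: int = 128):
        self.granularity = granularity
        self.group_size = group_size

    def num_scales(self, rows: int, cols: int) -> int:
        """How many FP8 scales a (rows, cols) activation needs."""
        if self.granularity is Granularity.PER_TENSOR:
            return 1
        if self.granularity is Granularity.PER_TOKEN:
            return rows
        # PER_GROUP: each row is split into ceil(cols / group_size) groups.
        return rows * math.ceil(cols / self.group_size)
```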
Test Plan
No functional changes to the quantization behavior. All existing CI/CD tests should pass without test modification.
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.