
[W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections.#33892

Open
maralbahari wants to merge 60 commits into vllm-project:main from EmbeddedLLM:3n-block-scaled-rfc-pr

Conversation

@maralbahari
Contributor

@maralbahari maralbahari commented Feb 5, 2026

Purpose

This PR refactors the block-scaled linear kernels into a kernel abstraction.

Changes:

  • Introduces the MMLinearKernel base interface for all linear kernels.
  • Introduces Params, Fp8Params, and Int8Params classes that expose layer parameters in a structured format.
  • Introduces DynamicMMLinearKernel, an MMLinearKernel whose two main properties are a base and a fallback kernel (each itself an MMLinearKernel); it switches between the two implementations at runtime.
  • Removes the legacy W8A8BlockFp8LinearOp class.
  • Unifies kernel selection for both block and non-block quantization.
  • Updates all consumers (fp8.py, modelopt.py, tests, benchmarks).
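The runtime base/fallback switch described above can be sketched as follows. This is an illustrative pure-Python sketch, not vLLM's actual code: only the names MMLinearKernel and DynamicMMLinearKernel come from this PR; the can_run/apply methods and the FastKernel/ReferenceKernel classes are hypothetical stand-ins.

```python
# Illustrative sketch of the base/fallback kernel dispatch described above.
# Only MMLinearKernel and DynamicMMLinearKernel are names from the PR;
# everything else (method names, the can_run() check) is hypothetical.
from abc import ABC, abstractmethod


class MMLinearKernel(ABC):
    """Base interface that all linear (matmul) kernels implement."""

    @abstractmethod
    def can_run(self, m: int, n: int, k: int) -> bool:
        """Whether this kernel supports the given problem shape."""

    @abstractmethod
    def apply(self, m: int, n: int, k: int) -> str:
        """Run the kernel; returns a label here for illustration."""


class FastKernel(MMLinearKernel):
    def can_run(self, m, n, k):
        # e.g. a tiled kernel that needs block-aligned shapes
        return m % 128 == 0 and n % 128 == 0 and k % 128 == 0

    def apply(self, m, n, k):
        return "fast"


class ReferenceKernel(MMLinearKernel):
    def can_run(self, m, n, k):
        return True  # always usable

    def apply(self, m, n, k):
        return "reference"


class DynamicMMLinearKernel(MMLinearKernel):
    """Holds a base and a fallback kernel; picks one at runtime."""

    def __init__(self, base: MMLinearKernel, fallback: MMLinearKernel):
        self.base = base
        self.fallback = fallback

    def can_run(self, m, n, k):
        return self.base.can_run(m, n, k) or self.fallback.can_run(m, n, k)

    def apply(self, m, n, k):
        # Prefer the base kernel; fall back when the shape is unsupported.
        kernel = self.base if self.base.can_run(m, n, k) else self.fallback
        return kernel.apply(m, n, k)
```

The point of the design is that consumers hold a single MMLinearKernel and never branch on shape or platform themselves; the dynamic wrapper encapsulates that decision.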

Test Plan

CUDA platform:
Run CI/CD tests.

ROCm platform:
lm_eval score for RedHatAI/Qwen3-30B-A3B-FP8-block.
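For reference, an lm-evaluation-harness invocation along these lines produces the gsm8k scores below (a sketch; the exact flags and model_args used for this PR are not recorded here):

```shell
# Hypothetical lm_eval command; arguments used for this PR may differ.
lm_eval --model vllm \
  --model_args pretrained=RedHatAI/Qwen3-30B-A3B-FP8-block \
  --tasks gsm8k \
  --num_fewshot 5
```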

Test Result

ROCm platform:
lm_eval score for RedHatAI/Qwen3-30B-A3B-FP8-block, without AITER:

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.8196 | ± 0.0106 |
|       |         | strict-match     | 5      | exact_match | 0.8954 | ± 0.0084 |

W8A8 Block Linear Refactor PRs:


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: maral <maralbahari.98@gmail.com>
@mergify mergify bot added the performance (Performance-related issues) and nvidia labels Feb 5, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-designed refactoring of the FP8 block-scaled linear kernel integration. By removing the monolithic W8A8BlockFp8LinearOp and introducing a new kernel abstraction layer with MMLinearKernel, the code becomes much more modular, maintainable, and extensible. The new kernel selection mechanism in init_fp8_linear_kernel is clear and correctly dispatches to different kernel implementations based on the quantization configuration. The changes are consistently applied across benchmarks, tests, and model implementation files.

I've found a few issues, including a critical one that would cause a runtime error, and a couple of high-severity issues related to correctness in tests and code robustness. After addressing these, this PR will be a great improvement to the codebase.

maralbahari and others added 8 commits February 5, 2026 18:04
@mergify

mergify bot commented Feb 6, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maralbahari.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 6, 2026
…ement for cutlass and fix type error in dynamic deepgemm/flash-infer

@mergify mergify bot removed the needs-rebase label Feb 9, 2026
@maralbahari maralbahari marked this pull request as ready for review February 23, 2026 02:09
@mergify

mergify bot commented Mar 23, 2026

Hi @maralbahari, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@mergify

mergify bot commented Mar 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maralbahari.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 23, 2026
@mergify mergify bot removed the needs-rebase label Mar 23, 2026
    config = VllmConfig()
    with set_current_vllm_config(config):
        yield config
Contributor Author

@maralbahari maralbahari Mar 24, 2026


@LucasWilkinson After including input_dtype in FP8LinearLayerConfig, as discussed, we can assume the activation dtype is the same as model_config.dtype. However, XXLinearMethod objects then have no access to vllm_config, so I used get_current_vllm_config() to reach the model config.
The tests in quantization/test_fp8.py::test_fp8_reloading and quantization/test_modelopt.py were then failing because vllm_config is not set correctly by the default_vllm_config pytest fixture. I had to make these changes to the fixture and set model_config in the unit tests as needed.
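The fixture problem above comes down to the set-current/get-current config pattern. The following is a toy pure-Python sketch of that pattern, not vLLM's actual implementation: ToyConfig, set_current_config, and get_current_config are illustrative stand-ins for VllmConfig, set_current_vllm_config, and get_current_vllm_config.

```python
# Toy sketch of the set-current/get-current config pattern; the real
# vLLM helpers are more involved. All names here are illustrative.
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfig:
    model_dtype: str = "bfloat16"


_current_config: Optional[ToyConfig] = None


@contextmanager
def set_current_config(config: ToyConfig):
    """Install `config` as the ambient config for the enclosed block."""
    global _current_config
    prev, _current_config = _current_config, config
    try:
        yield config
    finally:
        _current_config = prev  # restore whatever was set before


def get_current_config() -> ToyConfig:
    """What layer code calls when it has no direct config reference."""
    if _current_config is None:
        raise RuntimeError("no config set; wrap the caller in set_current_config")
    return _current_config
```

A fixture that yields before entering such a context manager leaves get_current_config() raising inside the test body, which matches the failures described above.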

Collaborator

@LucasWilkinson LucasWilkinson left a comment


LGTM! Thanks for the contribution and cleanups! This helps a lot.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 24, 2026
maralbahari and others added 3 commits March 24, 2026 05:13
@tjtanaa tjtanaa enabled auto-merge (squash) March 25, 2026 03:41
@mergify

mergify bot commented Mar 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maralbahari.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 26, 2026
auto-merge was automatically disabled March 26, 2026 12:32

Head branch was pushed to by a user without write access

@mergify mergify bot removed the needs-rebase label Mar 26, 2026
@mergify

mergify bot commented Mar 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maralbahari.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 27, 2026

Labels

needs-rebase · nvidia · performance (Performance-related issues) · ready (ONLY add when PR is ready to merge/full CI is needed) · ready-run-all-tests (Trigger CI with all tests for wide-ranging PRs) · rocm (Related to AMD ROCm)

Projects

Status: Todo
Status: Ready

Development


3 participants