
[Refactor] Make FP8 Linear Ops use kernel abstraction#27814

Merged
tjtanaa merged 93 commits into vllm-project:main from EmbeddedLLM:refactor-fp8-linear
Jan 20, 2026

Conversation

@vllmellm
Contributor

@vllmellm vllmellm commented Oct 30, 2025

Purpose

This PR refactors the FP8 linear kernel integration in vLLM to improve code clarity, maintainability, and path consistency.

Changes:
1. Generalize the ScaledMMLinearKernel interface.
2. Introduce FP8ScaledMMLinearKernel and Int8ScaledMMLinearKernel.
3. Remove FP8LinearOp and replace it with the FP8ScaledMMLinearKernel implementations.
4. Update the unit tests to use the ScaledMMLinearKernel interface.
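To illustrate the kernel-abstraction pattern these changes describe, here is a minimal, hypothetical sketch: a `ScaledMMLinearKernel`-style base class plus a priority-ordered selection helper. The class and function names mirror the PR (`ScaledMMLinearKernel`, `init_fp8_linear_kernel`, `apply_weights`), but the bodies are stand-ins, not the actual vLLM implementation, which operates on torch tensors and real backend capability checks.

```python
# Hypothetical sketch of the kernel-abstraction pattern; the real vLLM
# kernels take torch tensors and query platform/backend capabilities.
from abc import ABC, abstractmethod


class ScaledMMLinearKernel(ABC):
    """Base interface: each backend reports whether it can run on the
    current platform and implements the scaled-matmul apply step."""

    @classmethod
    @abstractmethod
    def can_implement(cls) -> bool: ...

    @abstractmethod
    def apply_weights(self, x): ...


class CutlassKernel(ScaledMMLinearKernel):
    @classmethod
    def can_implement(cls) -> bool:
        return False  # pretend CUTLASS is unavailable on this platform

    def apply_weights(self, x):
        raise NotImplementedError("unreachable when can_implement() is False")


class TorchFallbackKernel(ScaledMMLinearKernel):
    @classmethod
    def can_implement(cls) -> bool:
        return True  # a pure-torch path is always available

    def apply_weights(self, x):
        return [2 * v for v in x]  # stand-in for the scaled matmul


# Priority-ordered registry; the first kernel that can run wins.
_FP8_KERNELS = [CutlassKernel, TorchFallbackKernel]


def init_fp8_linear_kernel() -> ScaledMMLinearKernel:
    for kernel_cls in _FP8_KERNELS:
        if kernel_cls.can_implement():
            return kernel_cls()
    raise RuntimeError("no usable FP8 scaled-mm kernel")
```

Quantization schemes then hold one selected kernel and call its `apply_weights`, instead of dispatching per backend at each call site.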

Follow-ups

The following items will need follow-up after this PR is merged:

Test Plan

End-to-end model evaluations using lm_eval.

Test Result

CUDA
RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 (cutlass)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.5027 | ± 0.0138 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.5011 | ± 0.0138 |

ROCm
RedHatAI/Qwen3-8B-FP8-dynamic (torch row-wise)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8795 | ± 0.0090 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.8734 | ± 0.0092 |

Qwen/Qwen3-0.6B (ptpc_fp8 torch per-tensor)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.3798 | ± 0.0134 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.3889 | ± 0.0134 |

RedHatAI/Qwen2-72B-Instruct-FP8 (torch per-tensor)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8855 | ± 0.0088 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.8287 | ± 0.0104 |

RedHatAI/Qwen2-72B-Instruct-FP8 (rocm per-tensor)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8741 | ± 0.0091 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.8143 | ± 0.0107 |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


Note

Introduces a unified kernel abstraction for scaled GEMM and migrates FP8/INT8 linear paths to it.

  • Adds ScaledMMLinearKernel base with FP8ScaledMMLinearKernel and Int8ScaledMMLinearKernel; implements backends: flashinfer, cutlass (CUDA), ROCm (skinny GEMM), and torch variants (per-tensor/per-token/channel-wise), plus Triton/CPU for INT8
  • Replaces Fp8LinearOp across code with init_fp8_linear_kernel/init_int8_linear_kernel and apply_weights (e.g., in fp8.py, fbgemm_fp8.py, modelopt.py, ptpc_fp8.py, compressed-tensors and quark schemes)
  • Adds test helpers TestFP8Layer and TestBlockFP8Layer; updates compilation/fusion tests to select kernels by GroupShape, force kernels, and verify fused RMSNorm/quant and SiLU+Mul quant patterns
  • Refactors ROCm Aiter fusion/quant paths to use the new APIs; introduces per-tensor rowwise/skinny-GEMM paths and capability checks
  • Cleans legacy utilities (e.g., old per-backend dispatch, device identity) and consolidates quantization keys in quant_utils
  • CI: adds small ROCm model list in .buildkite/lm-eval-harness/configs/models-small-rocm.txt

Written by Cursor Bugbot for commit d8ee1b1. This will update automatically on new commits. Configure here.



Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
vllmellm changed the title from "Refactor fp8 linear" to "[Refactor] FP8 Linear Ops" on Oct 30, 2025
vllmellm and others added 4 commits on January 15, 2026
@tjtanaa tjtanaa merged commit 148117e into vllm-project:main Jan 20, 2026
147 checks passed
github-project-automation bot moved this from Ready to Done in NVIDIA on Jan 20, 2026
github-project-automation bot moved this from In review to Done in MoE Refactor on Jan 20, 2026
gopalsarda pushed a commit to gopalsarda/vllm that referenced this pull request Jan 20, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
monajafi-amd pushed a commit to monajafi-amd/vllm that referenced this pull request Jan 23, 2026
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
ProExpertProg mentioned this pull request Jan 30, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

ci/build · cpu (Related to CPU backends) · nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · ready-run-all-tests (Trigger CI with all tests for wide-ranging PRs) · rocm (Related to AMD ROCm)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants