[Refactor] Make FP8 Linear Ops use kernel abstraction#27814
Merged
tjtanaa merged 93 commits into vllm-project:main, Jan 20, 2026
Conversation
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
hangy-amd reviewed on Jan 16, 2026
gopalsarda pushed a commit to gopalsarda/vllm that referenced this pull request on Jan 20, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request on Jan 21, 2026
monajafi-amd pushed a commit to monajafi-amd/vllm that referenced this pull request on Jan 23, 2026
lapy pushed a commit to lapy/vllm that referenced this pull request on Jan 27, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request on Feb 19, 2026
Purpose
This PR refactors the FP8 linear kernel integration in vLLM to improve code clarity, maintainability, and path consistency.
Changes:
1. Make the `ScaledMMLinearKernel` interface more generic.
2. Introduce `FP8ScaledMMLinearKernel` and `Int8ScaledMMLinearKernel`.
3. Remove `FP8LinearOp` and replace it with the various `FP8ScaledMMLinearKernel` implementations.
4. Update the unit tests to use the `ScaledMMLinearKernel` interface.
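The abstraction described in the changes above can be sketched roughly as follows. The names mirror the PR (`ScaledMMLinearKernel`, `FP8ScaledMMLinearKernel`, `init_fp8_linear_kernel`, `apply_weights`), but the bodies are illustrative stand-ins operating on plain Python lists, not vLLM's actual CUDA/ROCm kernels:

```python
# Minimal sketch of the kernel-abstraction pattern; illustrative only.
from abc import ABC, abstractmethod


class ScaledMMLinearKernel(ABC):
    """Backend-agnostic interface for scaled matrix-multiply linear ops."""

    @classmethod
    @abstractmethod
    def is_supported(cls) -> bool:
        """Whether this backend can run on the current platform."""

    @abstractmethod
    def apply_weights(self, x, weight, scale):
        """Compute (x @ weight) scaled by `scale`."""


class FP8ScaledMMLinearKernel(ScaledMMLinearKernel):
    """Toy 'FP8' backend: a per-tensor scaled GEMM on nested lists."""

    @classmethod
    def is_supported(cls) -> bool:
        # The real code would check platform capability (CUDA, ROCm, ...).
        return True

    def apply_weights(self, x, weight, scale):
        cols = list(zip(*weight))  # transpose weight for row-by-column dots
        return [[sum(a * b for a, b in zip(row, col)) * scale for col in cols]
                for row in x]


def init_fp8_linear_kernel() -> ScaledMMLinearKernel:
    """Return the first supported FP8 kernel implementation."""
    for kernel_cls in (FP8ScaledMMLinearKernel,):
        if kernel_cls.is_supported():
            return kernel_cls()
    raise RuntimeError("no supported FP8 scaled-MM kernel")


kernel = init_fp8_linear_kernel()
out = kernel.apply_weights([[1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]], scale=0.5)
print(out)  # [[0.5, 1.0]]
```

The point of the pattern is that callers hold only the abstract interface, while `init_fp8_linear_kernel` picks the first backend whose `is_supported` check passes, so adding a backend does not touch call sites.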
Follow-ups
The following items will need follow-up after this PR is merged:
- `TestFP8Layer`
- Move `VLLM_DISABLED_KERNELS` to `vllm.envs`

Test Plan
End-to-end model evaluations using lm_eval.
Test Result
CUDA
RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 (cutlass)
ROCm
RedHatAI/Qwen3-8B-FP8-dynamic (torch row-wise)
Qwen/Qwen3-0.6B (ptpc_fp8 torch per-tensor)
RedHatAI/Qwen2-72B-Instruct-FP8 (torch per-tensor)
RedHatAI/Qwen2-72B-Instruct-FP8 (rocm per-tensor)
Note
Modernizes quantized linear execution with a unified, backend-agnostic interface (summary written by Cursor Bugbot for commit 6ce94db):
- Adds `ScaledMMLinearKernel` with `FP8ScaledMMLinearKernel` and `Int8ScaledMMLinearKernel`; implements backends: `flashinfer`, `cutlass` (CUDA), ROCm skinny-GEMM, and torch variants (per-tensor/per-token/channel-wise), plus Triton/CPU for INT8
- Replaces `Fp8LinearOp` with `init_fp8_linear_kernel`/`init_int8_linear_kernel` and `apply_weights` across `fp8.py`, `fbgemm_fp8.py`, `modelopt.py`, `ptpc_fp8.py`, compressed-tensors, and quark schemes
- Consolidates quantization keys (e.g. `kFp8StaticTokenSym`); removes legacy dispatch/device-identity logic from `w8a8_utils`
- Tests: removes `TestFP8Layer`/`TestBlockFP8Layer`; adds kernel/group-shape parametrization, expanded ROCm Aiter fusion/quant coverage, SiLU+Mul quant fusion, and updates to distributed fusion/sequence-parallelism
- Updates `configs/models-small-rocm.txt` for lm-eval
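The summary mentions kernel/group-shape parametrization in the tests. As a rough illustration of what a group shape encodes, here is a simplified, assumed mapping; `GroupShape(-1, -1)` as per-tensor and `GroupShape(1, -1)` as per-token follow vLLM's convention, but the helper itself is a hypothetical stand-in, not vLLM's actual dispatch:

```python
# Hypothetical sketch: naming the scaling scheme implied by a GroupShape.
from typing import NamedTuple


class GroupShape(NamedTuple):
    rows: int  # -1 means the scale group spans the whole dimension
    cols: int


PER_TENSOR = GroupShape(-1, -1)  # one scale for the whole tensor
PER_TOKEN = GroupShape(1, -1)    # one scale per row (token)
PER_CHANNEL = GroupShape(-1, 1)  # one scale per column (output channel)


def scheme_for(shape: GroupShape) -> str:
    """Name the scaling scheme implied by a group shape (illustrative)."""
    if shape == PER_TENSOR:
        return "per-tensor"
    if shape == PER_TOKEN:
        return "per-token"
    if shape == PER_CHANNEL:
        return "channel-wise"
    return "blocked"  # e.g. (128, 128) block-wise scales


print(scheme_for(PER_TOKEN))  # per-token
```

Parameterizing tests over such shapes lets one test body cover per-tensor, per-token, channel-wise, and blocked quantization paths.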