
[Refactor] Make FP8 Linear Ops use kernel abstraction#27814

Merged
tjtanaa merged 93 commits into vllm-project:main from EmbeddedLLM:refactor-fp8-linear
Jan 20, 2026

Conversation

@vllmellm
Contributor

@vllmellm vllmellm commented Oct 30, 2025

Purpose

This PR refactors the FP8 linear kernel integration in vLLM to improve code clarity, maintainability, and path consistency.

Changes:
1. Generalize the ScaledMMLinearKernel interface.
2. Introduce FP8ScaledMMLinearKernel and Int8ScaledMMLinearKernel.
3. Remove FP8LinearOp and replace it with the FP8ScaledMMLinearKernel implementations.
4. Update the unit tests to use the ScaledMMLinearKernel interface.
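To illustrate the kernel-abstraction pattern these changes describe, here is a minimal, hypothetical sketch: a `ScaledMMLinearKernel`-style base class plus a priority-ordered selection helper. The class and function names mirror the PR (`ScaledMMLinearKernel`, `init_fp8_linear_kernel`, `apply_weights`), but the bodies are stand-ins, not the actual vLLM implementation, which operates on torch tensors and real backend capability checks.

```python
# Hypothetical sketch of the kernel-abstraction pattern; the real vLLM
# kernels take torch tensors and query platform/backend capabilities.
from abc import ABC, abstractmethod


class ScaledMMLinearKernel(ABC):
    """Base interface: each backend reports whether it can run on the
    current platform and implements the scaled-matmul apply step."""

    @classmethod
    @abstractmethod
    def can_implement(cls) -> bool: ...

    @abstractmethod
    def apply_weights(self, x): ...


class CutlassKernel(ScaledMMLinearKernel):
    @classmethod
    def can_implement(cls) -> bool:
        return False  # pretend CUTLASS is unavailable on this platform

    def apply_weights(self, x):
        raise NotImplementedError("unreachable when can_implement() is False")


class TorchFallbackKernel(ScaledMMLinearKernel):
    @classmethod
    def can_implement(cls) -> bool:
        return True  # a pure-torch path is always available

    def apply_weights(self, x):
        return [2 * v for v in x]  # stand-in for the scaled matmul


# Priority-ordered registry; the first kernel that can run wins.
_FP8_KERNELS = [CutlassKernel, TorchFallbackKernel]


def init_fp8_linear_kernel() -> ScaledMMLinearKernel:
    for kernel_cls in _FP8_KERNELS:
        if kernel_cls.can_implement():
            return kernel_cls()
    raise RuntimeError("no usable FP8 scaled-mm kernel")
```

Quantization schemes then hold one selected kernel and call its `apply_weights`, instead of dispatching per backend at each call site.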

Follow-ups

The following items will need follow-up after this PR is merged:

Test Plan

End-to-end model evaluations using lm_eval.

Test Result

CUDA
RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 (cutlass)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.5027 | ± 0.0138 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.5011 | ± 0.0138 |

ROCm
RedHatAI/Qwen3-8B-FP8-dynamic (torch row-wise)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8795 | ± 0.0090 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.8734 | ± 0.0092 |

Qwen/Qwen3-0.6B (ptpc_fp8 torch per-tensor)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.3798 | ± 0.0134 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.3889 | ± 0.0134 |

RedHatAI/Qwen2-72B-Instruct-FP8 (torch per-tensor)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8855 | ± 0.0088 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.8287 | ± 0.0104 |

RedHatAI/Qwen2-72B-Instruct-FP8 (rocm per-tensor)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8741 | ± 0.0091 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.8143 | ± 0.0107 |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


Note

Introduces a unified kernel abstraction for scaled GEMM and migrates FP8/INT8 linear paths to it.

  • Adds ScaledMMLinearKernel base with FP8ScaledMMLinearKernel and Int8ScaledMMLinearKernel; implements backends: flashinfer, cutlass (CUDA), ROCm (skinny GEMM), and torch variants (per-tensor/per-token/channel-wise), plus Triton/CPU for INT8
  • Replaces Fp8LinearOp across code with init_fp8_linear_kernel/init_int8_linear_kernel and apply_weights (e.g., in fp8.py, fbgemm_fp8.py, modelopt.py, ptpc_fp8.py, compressed-tensors and quark schemes)
  • Adds test helpers TestFP8Layer and TestBlockFP8Layer; updates compilation/fusion tests to select kernels by GroupShape, force kernels, and verify fused RMSNorm/quant and SiLU+Mul quant patterns
  • Refactors ROCm Aiter fusion/quant paths to use the new APIs; introduces per-tensor rowwise/skinny-GEMM paths and capability checks
  • Cleans legacy utilities (e.g., old per-backend dispatch, device identity) and consolidates quantization keys in quant_utils
  • CI: adds small ROCm model list in .buildkite/lm-eval-harness/configs/models-small-rocm.txt

Written by Cursor Bugbot for commit d8ee1b1. This will update automatically on new commits. Configure here.



Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
vllmellm changed the title from "Refactor fp8 linear" to "[Refactor] FP8 Linear Ops" on Oct 30, 2025
vllmellm and others added 4 commits on January 15, 2026
@tjtanaa tjtanaa merged commit 148117e into vllm-project:main Jan 20, 2026
147 checks passed
github-project-automation bot moved this from Ready to Done in NVIDIA on Jan 20, 2026
github-project-automation bot moved this from In review to Done in MoE Refactor on Jan 20, 2026
gopalsarda pushed a commit to gopalsarda/vllm that referenced this pull request Jan 20, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
monajafi-amd pushed a commit to monajafi-amd/vllm that referenced this pull request Jan 23, 2026
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
ProExpertProg mentioned this pull request Jan 30, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

ci/build · cpu (Related to CPU backends) · nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · ready-run-all-tests (Trigger CI with all tests for wide-ranging PRs) · rocm (Related to AMD ROCm)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants