
[W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections.#33892

Open
maralbahari wants to merge 60 commits into vllm-project:main from EmbeddedLLM:3n-block-scaled-rfc-pr

Conversation

@maralbahari
Contributor

@maralbahari maralbahari commented Feb 5, 2026

Purpose

This PR refactors the block-scaled linear kernels into a kernel abstraction.

Changes:

  • Introduces the MMLinearKernel base interface for all linear kernels.
  • Introduces Params, Fp8Params, and Int8Params classes that expose layer parameters in a structured format.
  • Introduces DynamicMMLinearKernel, an MMLinearKernel whose two main properties are a base and a fallback kernel (each itself an MMLinearKernel); it switches between the two implementations at runtime.
  • Removes the legacy W8A8BlockFp8LinearOp class.
  • Unifies kernel selection for both block and non-block quantization.
  • Updates all consumers (fp8.py, modelopt.py, tests, benchmarks).
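The runtime base/fallback switch described above can be sketched as follows. This is an illustrative pure-Python sketch, not vLLM's actual code: only the names MMLinearKernel and DynamicMMLinearKernel come from this PR; the can_run/apply methods and the FastKernel/ReferenceKernel classes are hypothetical stand-ins.

```python
# Illustrative sketch of the base/fallback kernel dispatch described above.
# Only MMLinearKernel and DynamicMMLinearKernel are names from the PR;
# everything else (method names, the can_run() check) is hypothetical.
from abc import ABC, abstractmethod


class MMLinearKernel(ABC):
    """Base interface that all linear (matmul) kernels implement."""

    @abstractmethod
    def can_run(self, m: int, n: int, k: int) -> bool:
        """Whether this kernel supports the given problem shape."""

    @abstractmethod
    def apply(self, m: int, n: int, k: int) -> str:
        """Run the kernel; returns a label here for illustration."""


class FastKernel(MMLinearKernel):
    def can_run(self, m, n, k):
        # e.g. a tiled kernel that needs block-aligned shapes
        return m % 128 == 0 and n % 128 == 0 and k % 128 == 0

    def apply(self, m, n, k):
        return "fast"


class ReferenceKernel(MMLinearKernel):
    def can_run(self, m, n, k):
        return True  # always usable

    def apply(self, m, n, k):
        return "reference"


class DynamicMMLinearKernel(MMLinearKernel):
    """Holds a base and a fallback kernel; picks one at runtime."""

    def __init__(self, base: MMLinearKernel, fallback: MMLinearKernel):
        self.base = base
        self.fallback = fallback

    def can_run(self, m, n, k):
        return self.base.can_run(m, n, k) or self.fallback.can_run(m, n, k)

    def apply(self, m, n, k):
        # Prefer the base kernel; fall back when the shape is unsupported.
        kernel = self.base if self.base.can_run(m, n, k) else self.fallback
        return kernel.apply(m, n, k)
```

The point of the design is that consumers hold a single MMLinearKernel and never branch on shape or platform themselves; the dynamic wrapper encapsulates that decision.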

Test Plan

CUDA platform:
Run CI/CD tests.

ROCm platform:
lm_eval score for RedHatAI/Qwen3-30B-A3B-FP8-block.
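For reference, an lm-evaluation-harness invocation along these lines produces the gsm8k scores below (a sketch; the exact flags and model_args used for this PR are not recorded here):

```shell
# Hypothetical lm_eval command; arguments used for this PR may differ.
lm_eval --model vllm \
  --model_args pretrained=RedHatAI/Qwen3-30B-A3B-FP8-block \
  --tasks gsm8k \
  --num_fewshot 5
```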

Test Result

ROCm platform:
lm_eval score for RedHatAI/Qwen3-30B-A3B-FP8-block, without AITER:

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.8196 | ± 0.0106 |
|       |         | strict-match     | 5      | exact_match | 0.8954 | ± 0.0084 |

W8A8 Block Linear Refactor PRs:


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: maral <maralbahari.98@gmail.com>
@mergify mergify bot added the performance (Performance-related issues) and nvidia labels Feb 5, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-designed refactoring of the FP8 block-scaled linear kernel integration. By removing the monolithic W8A8BlockFp8LinearOp and introducing a new kernel abstraction layer with MMLinearKernel, the code becomes much more modular, maintainable, and extensible. The new kernel selection mechanism in init_fp8_linear_kernel is clear and correctly dispatches to different kernel implementations based on the quantization configuration. The changes are consistently applied across benchmarks, tests, and model implementation files.

I've found a few issues, including a critical one that would cause a runtime error, and a couple of high-severity issues related to correctness in tests and code robustness. After addressing these, this PR will be a great improvement to the codebase.

maralbahari and others added 8 commits February 5, 2026 18:04
@mergify

mergify bot commented Feb 6, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maralbahari.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 6, 2026
…ement for cutlass and fix type error in dynamic deepgemm/flash-infer

@mergify mergify bot removed the needs-rebase label Feb 9, 2026
@maralbahari maralbahari marked this pull request as ready for review February 23, 2026 02:09
@mergify

mergify bot commented Mar 23, 2026

Hi @maralbahari, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@mergify

mergify bot commented Mar 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maralbahari.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 23, 2026
@mergify mergify bot removed the needs-rebase label Mar 23, 2026
    config = VllmConfig()
    with set_current_vllm_config(config):
        yield config
Contributor Author

@maralbahari maralbahari Mar 24, 2026


@LucasWilkinson After including input_dtype in FP8LinearLayerConfig, as discussed, we can assume the activation dtype is the same as model_config.dtype. However, XXLinearMethod objects then have no access to vllm_config, so I used get_current_vllm_config() to reach the model config.
The tests in quantization/test_fp8.py::test_fp8_reloading and quantization/test_modelopt.py were then failing because vllm_config is not set correctly by the default_vllm_config pytest fixture. I had to make these changes to the fixture and set model_config in the unit tests as needed.
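The fixture problem above comes down to the set-current/get-current config pattern. The following is a toy pure-Python sketch of that pattern, not vLLM's actual implementation: ToyConfig, set_current_config, and get_current_config are illustrative stand-ins for VllmConfig, set_current_vllm_config, and get_current_vllm_config.

```python
# Toy sketch of the set-current/get-current config pattern; the real
# vLLM helpers are more involved. All names here are illustrative.
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfig:
    model_dtype: str = "bfloat16"


_current_config: Optional[ToyConfig] = None


@contextmanager
def set_current_config(config: ToyConfig):
    """Install `config` as the ambient config for the enclosed block."""
    global _current_config
    prev, _current_config = _current_config, config
    try:
        yield config
    finally:
        _current_config = prev  # restore whatever was set before


def get_current_config() -> ToyConfig:
    """What layer code calls when it has no direct config reference."""
    if _current_config is None:
        raise RuntimeError("no config set; wrap the caller in set_current_config")
    return _current_config
```

A fixture that yields before entering such a context manager leaves get_current_config() raising inside the test body, which matches the failures described above.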

Collaborator

@LucasWilkinson LucasWilkinson left a comment


LGTM! Thanks for the contribution and cleanups! This helps a lot.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 24, 2026
maralbahari and others added 3 commits March 24, 2026 05:13
@tjtanaa tjtanaa enabled auto-merge (squash) March 25, 2026 03:41
@mergify

mergify bot commented Mar 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maralbahari.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 26, 2026
auto-merge was automatically disabled March 26, 2026 12:32

Head branch was pushed to by a user without write access

@mergify mergify bot removed the needs-rebase label Mar 26, 2026
@mergify

mergify bot commented Mar 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maralbahari.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 27, 2026

Labels

needs-rebase · nvidia · performance (Performance-related issues) · ready (ONLY add when PR is ready to merge/full CI is needed) · ready-run-all-tests (Trigger CI with all tests for wide-ranging PRs) · rocm (Related to AMD ROCm)

Projects

Status: Todo
Status: Ready

Development


3 participants