
[W8A8 Block Linear Refactor][1/N] Keep all quantization types into QuantFP8 class. #33047

Merged
DarkLight1337 merged 21 commits into vllm-project:main from EmbeddedLLM:rfc-quant-fp8
Feb 1, 2026
Conversation

@maralbahari maralbahari (Contributor) commented Jan 26, 2026

Purpose

This PR moves the group quantization methods into the QuantFP8 class.

This is PR 1/2 in the series of updates for the block_scale_linear kernels mentioned in #31818.
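As a rough, hedged illustration of the consolidation being described (the class layout and attribute names below are assumptions for exposition, not the actual vLLM API):

```python
# Hypothetical sketch only; the real vllm QuantFP8 differs. It illustrates
# one class owning per-tensor, per-token, and per-group (block) granularities.
class QuantFP8Sketch:
    def __init__(self, group_shape):
        # group_shape: "per_tensor", "per_token", or a (rows, cols) block shape
        self.group_shape = group_shape

    @property
    def is_group_quant(self):
        # Group (block) quantization is selected by a concrete block shape.
        return isinstance(self.group_shape, tuple)
```

Under this sketch, `QuantFP8Sketch((1, 128))` would take the group-quant path while `QuantFP8Sketch("per_token")` would not.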

Test Plan

No functional changes to quantization behavior; all existing CI/CD tests should pass without modification.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: maral <maralbahari.98@gmail.com>
@mergify mergify bot commented Jan 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maralbahari.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request is a significant refactoring that modularizes the FP8 input quantization logic into a kernel-based architecture. The introduction of an abstract InputQuantKernel and platform-specific implementations is a great step towards better code organization and extensibility. However, I've found a few critical issues in the new kernel implementations that need to be addressed. Specifically, there are bugs in the CudaInputQuantKernel and TritonInputQuantKernel related to handling static quantization and incorrect argument passing. There is also a consistent typo in a key method name across the new abstract class and its implementations.

@maralbahari maralbahari marked this pull request as draft January 26, 2026 02:35
@mergify mergify bot removed the needs-rebase label Jan 26, 2026
@maralbahari maralbahari marked this pull request as ready for review January 26, 2026 03:26
@maralbahari maralbahari (Contributor, Author) commented:

@cursor review

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 5 potential issues.

@maralbahari maralbahari (Contributor, Author) commented:

@ProExpertProg @robertgshaw2-redhat Kindly review this PR as part of #31818. Please correct me if the conditional support logic for any of the input quantization kernels is wrong, i.e. in terms of the per_tensor, per_token, and per_group methods and device specifics.

@ProExpertProg ProExpertProg (Collaborator) left a comment

Nice cleanup, thanks!

# Fallback to native implementation for group quantization.
if self.is_group_quant:
    assert scale is None, "Dynamic group quantization does not use scale"
    return self._quantize_group_native(x)
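For intuition about what a native per-group fallback like `_quantize_group_native` computes, here is a minimal pure-Python sketch; the real implementation operates on torch tensors, and the function name, `FP8_MAX` constant, and 1-D layout here are assumptions for illustration:

```python
# Hypothetical sketch of dynamic per-group FP8-style quantization.
FP8_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def quantize_group_native(x, group_size):
    """Quantize a 1-D list per contiguous group; returns (q, scales)."""
    assert len(x) % group_size == 0
    q, scales = [], []
    for i in range(0, len(x), group_size):
        group = x[i:i + group_size]
        # Dynamic quantization: the scale is derived from the data itself,
        # which is why no external scale argument is accepted.
        amax = max(abs(v) for v in group) or 1.0  # avoid divide-by-zero
        scale = amax / FP8_MAX
        scales.append(scale)
        q.extend(min(max(v / scale, -FP8_MAX), FP8_MAX) for v in group)
    return q, scales
```

For example, `quantize_group_native([1.0, -2.0, 3.0, 4.0], 2)` produces one scale per group of two elements, with each group's max magnitude mapped to ±448.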
@ProExpertProg ProExpertProg (Collaborator) Jan 30, 2026
Should we fall back to the vLLM hipified CUDA kernel here? Or is the per-token group quant kernel not supported in the ROCm build of vLLM?

@maralbahari maralbahari (Contributor, Author) Jan 30, 2026
The group quant in forward_cuda is either DeepGEMM or the CUDA kernel in the fp8_utils.per_token_group_quant_fp8 function, which is not supported on ROCm. The fallback is either Triton or native; for Triton we control it with the kwargs. If the code reaches this line, we fall back to native.
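The fallback order just described can be sketched as a small dispatch helper; the function name and the string labels are my own shorthand, not vLLM identifiers:

```python
def select_group_quant_impl(platform, has_deepgemm, use_triton):
    """Hypothetical shorthand for the group-quant dispatch order described above."""
    if platform == "cuda":
        # forward_cuda: DeepGEMM when available, otherwise the CUDA kernel
        # behind fp8_utils.per_token_group_quant_fp8.
        return "deepgemm" if has_deepgemm else "per_token_group_quant_fp8"
    # ROCm: the CUDA group-quant kernel is unavailable, so fall back to
    # Triton (controlled via kwargs) or the native implementation.
    return "triton" if use_triton else "native"
```

This is only a model of the control flow, but it makes the reviewer's question concrete: on ROCm the CUDA branch is never reachable, so the choice is Triton versus native.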

Collaborator:

Got it, makes sense! We should try to get fp8_utils.per_token_group_quant_fp8 supported on ROCm if possible

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 30, 2026
@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 30, 2026
@ProExpertProg ProExpertProg (Collaborator) commented:

Btw I want this to wait for #33293 so that we can run the e2e fusion tests

x: torch.Tensor,
scale: torch.Tensor | None = None,
scale_ub: torch.Tensor | None = None,
**kwargs,
Contributor:
Is there a reason to use kwargs and not pass use_triton as a regular arg?

@maralbahari maralbahari (Contributor, Author):

Because it is only used in forward_hip. The forward passes all have to follow the same signature, otherwise there is a mypy error; and this keyword argument is only used for the ROCm platform use case.

@ProExpertProg ProExpertProg (Collaborator) commented:

CI failures seem related, please take a look.

@DarkLight1337 DarkLight1337 (Member) commented Jan 31, 2026

They are caused by #33362, should be fixed by #33482

@ProExpertProg ProExpertProg (Collaborator) commented:

#33462 just merged, can you merge from main?

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) February 1, 2026 02:33
@DarkLight1337 DarkLight1337 merged commit b5f8c30 into vllm-project:main Feb 1, 2026
50 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 1, 2026

PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…uantFP8` class. (vllm-project#33047)

Signed-off-by: maral <maralbahari.98@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: Pai <416932041@qq.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…uantFP8` class. (vllm-project#33047)

Signed-off-by: maral <maralbahari.98@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants