[W8A8 Block Linear Refactor][1/N] Keep all quantization types in the `QuantFP8` class. #33047
Conversation
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request is a significant refactoring that modularizes the FP8 input quantization logic into a kernel-based architecture. The introduction of an abstract InputQuantKernel and platform-specific implementations is a great step towards better code organization and extensibility. However, I've found a few critical issues in the new kernel implementations that need to be addressed. Specifically, there are bugs in the CudaInputQuantKernel and TritonInputQuantKernel related to handling static quantization and incorrect argument passing. There is also a consistent typo in a key method name across the new abstract class and its implementations.
Review comments (since resolved) on:
- vllm/model_executor/layers/quantization/kernels/input_quant/cuda.py
- vllm/model_executor/layers/quantization/kernels/input_quant/triton.py
- vllm/model_executor/layers/quantization/kernels/input_quant/InputQuantKernel.py
- vllm/model_executor/layers/quantization/kernels/input_quant/__init__.py
@cursor review
Cursor Bugbot has reviewed your changes and found 5 potential issues.
Bugbot review comments (since resolved) on:
- vllm/model_executor/layers/quantization/kernels/input_quant/triton.py
- vllm/model_executor/layers/quantization/kernels/input_quant/aiter.py
- vllm/model_executor/layers/quantization/kernels/input_quant/InputQuantKernel.py
@ProExpertProg @robertgshaw2-redhat Kindly review this PR as part of #31818. Correct me if the conditional support logic is wrong for any of the input quantization kernels, e.g. in terms of per_tensor, per_token, and per_group methods and device specs.
ProExpertProg left a comment
Nice cleanup, thanks!
    # Fallback to native implementation for group quantization.
    if self.is_group_quant:
        assert scale is None, "Dynamic group quantization does not use scale"
        return self._quantize_group_native(x)
Should we fall back to the vLLM hipified CUDA kernel here? Or is the per-token group quant kernel not supported in the ROCm build of vLLM?
The group quant in forward_cuda is either DeepGEMM or the CUDA kernel behind the fp8_utils.per_token_group_quant_fp8 function, which is not supported on ROCm. The fallback is either Triton or native; for Triton we control it with the kwargs. If the code reaches this line, then we fall back to native.
Got it, makes sense! We should try to get fp8_utils.per_token_group_quant_fp8 supported on ROCm if possible.
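The fallback order discussed in this thread (DeepGEMM or the CUDA kernel on CUDA builds, then Triton on ROCm when requested via kwargs, and finally the native implementation) can be sketched as a small dispatch function. This is a toy illustration of the ordering only; apart from per_token_group_quant_fp8, the names and the string results are assumptions, not real vLLM code.

```python
# Toy sketch of the group-quant fallback order from the discussion above:
# CUDA builds use DeepGEMM or fp8_utils.per_token_group_quant_fp8 (not
# supported on ROCm); ROCm tries Triton (selected via kwargs) and then
# falls back to native. Names other than per_token_group_quant_fp8 are
# illustrative, and strings stand in for the actual kernel calls.

def quantize_group(x, *, is_rocm: bool, use_triton: bool) -> str:
    # `x` would be the activation tensor; unused in this dispatch-only sketch.
    if not is_rocm:
        # CUDA path: DeepGEMM or the CUDA kernel, neither available on ROCm.
        return "cuda:per_token_group_quant_fp8"
    if use_triton:
        return "rocm:triton"
    # Reaching this line means neither the CUDA kernel nor Triton applies.
    return "rocm:native"
```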
Btw, I want this to wait for #33293 so that we can run the e2e fusion tests.
    x: torch.Tensor,
    scale: torch.Tensor | None = None,
    scale_ub: torch.Tensor | None = None,
    **kwargs,
Is there a reason to use kwargs and not pass use_triton as a regular arg?
Because it is only used in forward_hip. The forward passes all have to follow the same signature, otherwise there is a mypy error, and this keyword argument is only used for the ROCm platform use case.
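The constraint described here, that every platform override must share one signature or mypy rejects the subclass, with a ROCm-only flag passed through **kwargs, can be illustrated with a minimal sketch. All class and method names here are hypothetical, not the actual vLLM classes.

```python
# Minimal illustration (hypothetical names) of why the ROCm-only flag is
# passed via **kwargs: every platform override keeps the same signature,
# so mypy sees the subclasses as compatible with the base class, while
# only the ROCm path actually consults the extra keyword argument.


class QuantBase:
    def forward(self, x, scale=None, scale_ub=None, **kwargs):
        raise NotImplementedError


class QuantCuda(QuantBase):
    def forward(self, x, scale=None, scale_ub=None, **kwargs):
        return ("cuda", x)  # ignores kwargs entirely


class QuantHip(QuantBase):
    def forward(self, x, scale=None, scale_ub=None, **kwargs):
        # Only the ROCm path reads the platform-specific keyword argument.
        use_triton = kwargs.get("use_triton", False)
        return ("hip-triton" if use_triton else "hip-native", x)
```

The alternative, adding `use_triton` as a regular parameter on the base class, would force every platform's forward pass to carry an argument that only ROCm uses.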
CI failures seem related, please take a look.
#33462 just merged, can you merge from main?
…uantFP8` class. (vllm-project#33047) Signed-off-by: maral <maralbahari.98@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Pai <416932041@qq.com>
Purpose
This PR moves group quantization methods into the `QuantFP8` class. This is PR 1/2 in the series of updates for the block_scale_linear kernels mentioned in #31818.
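A rough picture of what keeping all quantization types in one class means: per-tensor, per-token, and per-group granularities differ mainly in how many scales they produce for a given activation shape, and one class can own that logic behind a single entry point. The enum, class, and method names below are illustrative assumptions, not the actual `QuantFP8` API.

```python
# Illustrative sketch of one class owning every quantization granularity.
# The real vllm QuantFP8 API differs; all names here are assumptions.
import math
from enum import Enum


class Granularity(Enum):
    PER_TENSOR = "per_tensor"  # one scale for the whole tensor
    PER_TOKEN = "per_token"    # one scale per row (token)
    PER_GROUP = "per_group"    # one scale per fixed-size group within a row


class QuantFP8Sketch:
    def __init__(self, granularity: Granularity, group_size: int = 128):
        self.granularity = granularity
        self.group_size = group_size

    def num_scales(self, rows: int, cols: int) -> int:
        """How many FP8 scales a (rows, cols) activation needs."""
        if self.granularity is Granularity.PER_TENSOR:
            return 1
        if self.granularity is Granularity.PER_TOKEN:
            return rows
        # PER_GROUP: each row is split into ceil(cols / group_size) groups.
        return rows * math.ceil(cols / self.group_size)
```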
Test Plan
No functional changes to the quantization behavior. All existing CI/CD tests should pass without test modification.
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.