[W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections. #33892
Conversation
Signed-off-by: maral <maralbahari.98@gmail.com>
Code Review
This pull request introduces a significant and well-designed refactoring of the FP8 block-scaled linear kernel integration. By removing the monolithic W8A8BlockFp8LinearOp and introducing a new kernel abstraction layer with MMLinearKernel, the code becomes much more modular, maintainable, and extensible. The new kernel selection mechanism in init_fp8_linear_kernel is clear and correctly dispatches to different kernel implementations based on the quantization configuration. The changes are consistently applied across benchmarks, tests, and model implementation files.
I've found a few issues, including a critical one that would cause a runtime error, and a couple of high-severity issues related to correctness in tests and code robustness. After addressing these, this PR will be a great improvement to the codebase.
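To illustrate the dispatch pattern the review describes, here is a minimal, self-contained sketch of a base/fallback kernel abstraction. The class names mirror the PR, but everything here is a simplified stand-in, not vLLM's actual implementation: the length-based `is_supported` check and the `DeepGemmKernel`/`TritonFallbackKernel` classes are assumptions made purely for illustration.

```python
from abc import ABC, abstractmethod


class MMLinearKernel(ABC):
    """Simplified stand-in for the PR's base kernel interface."""

    @classmethod
    @abstractmethod
    def is_supported(cls, m: int) -> bool: ...

    @abstractmethod
    def apply(self, x: list) -> tuple: ...


class DeepGemmKernel(MMLinearKernel):
    # Assumed shape constraint, purely for illustration.
    @classmethod
    def is_supported(cls, m: int) -> bool:
        return m % 128 == 0

    def apply(self, x: list) -> tuple:
        return ("deepgemm", x)


class TritonFallbackKernel(MMLinearKernel):
    @classmethod
    def is_supported(cls, m: int) -> bool:
        return True  # generic fallback always applies

    def apply(self, x: list) -> tuple:
        return ("triton", x)


class DynamicMMLinearKernel(MMLinearKernel):
    """Holds a base and a fallback kernel and picks one per call."""

    def __init__(self, base: MMLinearKernel, fallback: MMLinearKernel):
        self.base = base
        self.fallback = fallback

    @classmethod
    def is_supported(cls, m: int) -> bool:
        return True

    def apply(self, x: list) -> tuple:
        m = len(x)  # runtime property driving the base/fallback switch
        kernel = self.base if self.base.is_supported(m) else self.fallback
        return kernel.apply(x)


kernel = DynamicMMLinearKernel(DeepGemmKernel(), TritonFallbackKernel())
print(kernel.apply([0.0] * 128)[0])  # base path
print(kernel.apply([0.0] * 5)[0])    # fallback path
```

The point of the pattern is that callers hold a single `MMLinearKernel` and the runtime switching stays inside `DynamicMMLinearKernel`, rather than being scattered across a monolithic op.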
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Maral <maralbahari.98@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @maralbahari, the pre-commit checks have failed. Please run `uv pip install "pre-commit>=4.5.1"`, then `pre-commit install` and `pre-commit run --all-files`. Then, commit the changes and push to your branch.
```diff
-    yield
+    config = VllmConfig()
+    with set_current_vllm_config(config):
+        yield config
```
@LucasWilkinson after including input_dtype in FP8LinearLayerConfig, as discussed, we can assume that the activation dtype is the same as model_config.dtype. However, XXLinearMethod objects do not have access to vllm_config, so I used get_current_vllm_config() to access the model config.
Then the tests in quantization/test_fp8.py::test_fp8_reloading and quantization/test_modelopt.py were failing because the vllm_config is not set correctly by this default_vllm_config pytest fixture. I had to make these changes to the fixture and then set model_config in the unit tests as needed.
LucasWilkinson left a comment:
LGTM! Thanks for the contribution and cleanups! This helps a lot.
Head branch was pushed to by a user without write access
Purpose
This PR refactors the FP8 block-scaled linear kernel into a kernel abstraction.
Changes:

- `MMLinearKernel`: base interface for all linear kernels.
- `Params`, `Fp8Params`, and `Int8Params`: classes to access layer params in a structured format.
- `DynamicMMLinearKernel`: a type of `MMLinearKernel` with two main properties, a base kernel and a fallback kernel, each a variant of `MMLinearKernel`. This class switches between the base and fallback implementations at runtime.
- Removes the `W8A8BlockFp8LinearOp` class.

Test Plan
CUDA platform:
Run CI/CD tests.
ROCm platform:
lm_eval score for RedHatAI/Qwen3-30B-A3B-FP8-block
Test Result
ROCm platform:
lm_eval score for RedHatAI/Qwen3-30B-A3B-FP8-block, without AITER
W8A8 Block Linear Refactor PRs:
- #33047: Moves all the quantization ops into the same `QuantFP8` class. (merged)
- This PR: Removes the `W8A8Fp8BlockLinearOp` class and updates all code paths and files that use this class.

Essential Elements of an Effective PR Description Checklist

… `supported_models.md` and `examples` for a new model.