[ROCm] Enable Triton ScaledMM fallback + kernel selection fix#26668
ProExpertProg merged 6 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request addresses an issue where triton_scaled_mm was not being used on ROCm by fixing the kernel selection logic. It correctly adds TritonScaledMMLinearKernel as a fallback for both ROCm and CUDA, and introduces an is_supported check to ensure kernels are compatible with the current platform. The changes are accompanied by a new integration test to verify the fix.
My review focuses on improving the robustness of the kernel selection. I've suggested making the get_min_capability check in the Triton kernel platform-aware to prevent it from being selected on unsupported ROCm hardware. Additionally, I've pointed out a confusing try-except block in the new test file that should be simplified for clarity and to avoid masking potential errors.
vllm/model_executor/layers/quantization/kernels/scaled_mm/triton.py
Outdated
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed 99018da to 4d3a612
Force-pushed d0d088d to 9036316
vllm/model_executor/layers/quantization/kernels/scaled_mm/triton.py
Outdated
Force-pushed 9036316 to 2a6c86c
vllm/model_executor/layers/quantization/kernels/scaled_mm/__init__.py
Outdated
Force-pushed be28ac6 to d2591bf
Is this ready for review again?

@ProExpertProg yes!
ProExpertProg
left a comment
Just one note about is_supported
vllm/model_executor/layers/quantization/kernels/scaled_mm/ScaledMMLinearKernel.py
… entry Signed-off-by: Shivam <shivampr.dev@gmail.com> Signed-off-by: Shivam <shivamprasad91@gmail.com>
Make is_supported() abstract in base class, remove get_min_capability(), and implement is_supported() in all kernels. Move platform checks from can_implement() to is_supported() in AiterScaledMMLinearKernel. Add CPU-compatible tests for kernel selection validation. Signed-off-by: Shivam <shivamprasad91@gmail.com>
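The refactor this commit describes can be sketched as follows. Only the names ScaledMMLinearKernel, AiterScaledMMLinearKernel, and is_supported() come from the PR; the body of each method, the availability flag, and the simplified signature are assumptions for illustration.

```python
# Sketch (not vLLM's actual code) of making is_supported() abstract in
# the base class and moving platform checks out of can_implement().
from abc import ABC, abstractmethod


class ScaledMMLinearKernel(ABC):
    """Base class: every kernel must declare platform support itself."""

    @classmethod
    @abstractmethod
    def is_supported(cls) -> bool:
        """True if this kernel can run on the current platform."""
        raise NotImplementedError


class AiterScaledMMLinearKernel(ScaledMMLinearKernel):
    # Stand-in flag for "running on ROCm with AITER installed"; in the
    # real code this would come from vllm's platform detection.
    _aiter_available = False

    @classmethod
    def is_supported(cls) -> bool:
        return cls._aiter_available
```

Making the check abstract forces every kernel to answer the platform question up front, so the selector can filter kernels before ever calling can_implement().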
Force-pushed 859585f to c8b0b83
@shivampr could you report lm-eval results for a model that uses int8 Triton scaled mm to check this works?
…rd dependency for CI failure Signed-off-by: Shivam <shivamprasad91@gmail.com>
I verified the ROCm int8 Triton ScaledMM path using the following command. From the logs:
Signed-off-by: Shivam <shivamprasad91@gmail.com>
Hi @shivampr, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Shivam <shivamprasad91@gmail.com>
…roject#26668) Signed-off-by: Shivam <shivampr.dev@gmail.com> Signed-off-by: Shivam <shivamprasad91@gmail.com>
…roject#26668) Signed-off-by: Shivam <shivampr.dev@gmail.com> Signed-off-by: Shivam <shivamprasad91@gmail.com> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
…roject#26668) Signed-off-by: Shivam <shivampr.dev@gmail.com> Signed-off-by: Shivam <shivamprasad91@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Fixes #14397: triton_scaled_mm was never used on ROCm due to missing dispatch and checks.

This PR:
- Enables the Triton fallback for ROCm when AITER is unavailable
- Adds the Triton fallback after CUTLASS on CUDA
- Implements is_supported() checks for kernel selection
- Adds a lightweight integration test validating ROCm dispatch logic
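The fallback ordering described in the Purpose (AITER then Triton on ROCm; CUTLASS then Triton on CUDA) amounts to a first-supported-wins selection loop. The sketch below is illustrative; the real selector lives in kernels/scaled_mm/__init__.py and differs in detail.

```python
# Illustrative sketch of first-supported-wins kernel selection; the
# preference table and function names are assumptions, not vLLM's code.

_KERNEL_PREFERENCE = {
    # Preferred kernel first; Triton is the shared fallback on both.
    "cuda": ["cutlass", "triton"],
    "rocm": ["aiter", "triton"],
}


def choose_kernel(platform: str, available: set[str]) -> str:
    """Pick the first kernel in the platform's preference list whose
    support check (modeled here as membership in `available`) passes."""
    for name in _KERNEL_PREFERENCE.get(platform, []):
        if name in available:
            return name
    raise ValueError(f"no scaled-mm kernel supported on {platform}")
```

Under this scheme the bug in #14397 corresponds to "rocm" having no Triton entry at all, so ROCm without AITER had nothing to fall back to.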
Test Plan
1. Mocked test (no GPU)
Result
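A GPU-free dispatch test of this kind typically stubs out platform detection, along these lines. This is a hypothetical sketch, not the PR's actual test; the dispatch function and its injected platform callable are invented for illustration.

```python
# Hypothetical sketch of testing kernel dispatch without a GPU by
# injecting a mocked platform query; not the PR's actual test code.
from unittest import mock


def select_kernel(platform_fn) -> str:
    """Toy dispatch: prefer Triton on ROCm, CUTLASS elsewhere."""
    return "triton" if platform_fn() == "rocm" else "cutlass"


def test_rocm_dispatch_without_gpu():
    # Pretend we are on ROCm without touching any real hardware.
    fake_platform = mock.Mock(return_value="rocm")
    assert select_kernel(fake_platform) == "triton"
    fake_platform.assert_called_once()
```

Because only the selection logic is exercised, such a test can run in CPU-only CI, which is what makes it "CPU-compatible" in the commit message above.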
2. MI300X (ROCm 7.0, vLLM built from this PR)
(a) Triton kernel functional test
(b) OpenAI-compatible API test
Then:
Response
Confirms successful end-to-end inference on ROCm.
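For reference, an OpenAI-compatible completion request against a locally served vLLM instance is shaped roughly as below. The endpoint path is the standard OpenAI one; the port, model name, and helper function are assumptions, and only the payload construction runs here (no server required).

```python
import json

# Assumed local vLLM server address; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1/completions"


def build_completion_request(model: str, prompt: str, max_tokens: int = 32) -> str:
    """Build the JSON body for an OpenAI-compatible /v1/completions call."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    })

# Sending it would look like:
#   curl -s $BASE_URL -H "Content-Type: application/json" -d "$BODY"
```

A non-error completion in the response is what the test plan above uses as evidence of end-to-end inference through the Triton scaled-mm path.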