[ROCm] Enable Triton ScaledMM fallback + kernel selection fix#26668
ProExpertProg merged 6 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request addresses an issue where triton_scaled_mm was not being used on ROCm by fixing the kernel selection logic. It correctly adds TritonScaledMMLinearKernel as a fallback for both ROCm and CUDA, and introduces an is_supported check to ensure kernels are compatible with the current platform. The changes are accompanied by a new integration test to verify the fix.
My review focuses on improving the robustness of the kernel selection. I've suggested making the get_min_capability check in the Triton kernel platform-aware to prevent it from being selected on unsupported ROCm hardware. Additionally, I've pointed out a confusing try-except block in the new test file that should be simplified for clarity and to avoid masking potential errors.
vllm/model_executor/layers/quantization/kernels/scaled_mm/triton.py
Outdated
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed 99018da to 4d3a612
Force-pushed d0d088d to 9036316
vllm/model_executor/layers/quantization/kernels/scaled_mm/triton.py
Outdated
Force-pushed 9036316 to 2a6c86c
vllm/model_executor/layers/quantization/kernels/scaled_mm/__init__.py
Outdated
Force-pushed be28ac6 to d2591bf
Is this ready for review again?

@ProExpertProg yes!
ProExpertProg
left a comment
Just one note about is_supported
vllm/model_executor/layers/quantization/kernels/scaled_mm/ScaledMMLinearKernel.py
… entry Signed-off-by: Shivam <shivampr.dev@gmail.com> Signed-off-by: Shivam <shivamprasad91@gmail.com>
Make is_supported() abstract in base class, remove get_min_capability(), and implement is_supported() in all kernels. Move platform checks from can_implement() to is_supported() in AiterScaledMMLinearKernel. Add CPU-compatible tests for kernel selection validation. Signed-off-by: Shivam <shivamprasad91@gmail.com>
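The refactor this commit describes can be sketched as follows. Only the names ScaledMMLinearKernel, AiterScaledMMLinearKernel, and is_supported() come from the PR; the body of each method, the availability flag, and the simplified signature are assumptions for illustration.

```python
# Sketch (not vLLM's actual code) of making is_supported() abstract in
# the base class and moving platform checks out of can_implement().
from abc import ABC, abstractmethod


class ScaledMMLinearKernel(ABC):
    """Base class: every kernel must declare platform support itself."""

    @classmethod
    @abstractmethod
    def is_supported(cls) -> bool:
        """True if this kernel can run on the current platform."""
        raise NotImplementedError


class AiterScaledMMLinearKernel(ScaledMMLinearKernel):
    # Stand-in flag for "running on ROCm with AITER installed"; in the
    # real code this would come from vllm's platform detection.
    _aiter_available = False

    @classmethod
    def is_supported(cls) -> bool:
        return cls._aiter_available
```

Making the check abstract forces every kernel to answer the platform question up front, so the selector can filter kernels before ever calling can_implement().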
Force-pushed 859585f to c8b0b83
@shivampr could you report lm-eval results for a model that uses int8 Triton scaled mm to check this works?
…rd dependency for CI failure Signed-off-by: Shivam <shivamprasad91@gmail.com>
I verified the ROCm int8 Triton ScaledMM path using the following command. From the logs:
Signed-off-by: Shivam <shivamprasad91@gmail.com>
Hi @shivampr, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Shivam <shivamprasad91@gmail.com>
…roject#26668) Signed-off-by: Shivam <shivampr.dev@gmail.com> Signed-off-by: Shivam <shivamprasad91@gmail.com>
…roject#26668) Signed-off-by: Shivam <shivampr.dev@gmail.com> Signed-off-by: Shivam <shivamprasad91@gmail.com> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
…roject#26668) Signed-off-by: Shivam <shivampr.dev@gmail.com> Signed-off-by: Shivam <shivamprasad91@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Fixes #14397: triton_scaled_mm was never used on ROCm due to missing dispatch and checks.

This PR:
- Enables the Triton fallback for ROCm when AITER is unavailable
- Adds the Triton fallback after CUTLASS on CUDA
- Implements is_supported() checks for kernel selection
- Adds a lightweight integration test validating ROCm dispatch logic
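The fallback ordering described in the Purpose (AITER then Triton on ROCm; CUTLASS then Triton on CUDA) amounts to a first-supported-wins selection loop. The sketch below is illustrative; the real selector lives in kernels/scaled_mm/__init__.py and differs in detail.

```python
# Illustrative sketch of first-supported-wins kernel selection; the
# preference table and function names are assumptions, not vLLM's code.

_KERNEL_PREFERENCE = {
    # Preferred kernel first; Triton is the shared fallback on both.
    "cuda": ["cutlass", "triton"],
    "rocm": ["aiter", "triton"],
}


def choose_kernel(platform: str, available: set[str]) -> str:
    """Pick the first kernel in the platform's preference list whose
    support check (modeled here as membership in `available`) passes."""
    for name in _KERNEL_PREFERENCE.get(platform, []):
        if name in available:
            return name
    raise ValueError(f"no scaled-mm kernel supported on {platform}")
```

Under this scheme the bug in #14397 corresponds to "rocm" having no Triton entry at all, so ROCm without AITER had nothing to fall back to.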
Test Plan
1. Mocked test (no GPU)
Result
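A GPU-free dispatch test of this kind typically stubs out platform detection, along these lines. This is a hypothetical sketch, not the PR's actual test; the dispatch function and its injected platform callable are invented for illustration.

```python
# Hypothetical sketch of testing kernel dispatch without a GPU by
# injecting a mocked platform query; not the PR's actual test code.
from unittest import mock


def select_kernel(platform_fn) -> str:
    """Toy dispatch: prefer Triton on ROCm, CUTLASS elsewhere."""
    return "triton" if platform_fn() == "rocm" else "cutlass"


def test_rocm_dispatch_without_gpu():
    # Pretend we are on ROCm without touching any real hardware.
    fake_platform = mock.Mock(return_value="rocm")
    assert select_kernel(fake_platform) == "triton"
    fake_platform.assert_called_once()
```

Because only the selection logic is exercised, such a test can run in CPU-only CI, which is what makes it "CPU-compatible" in the commit message above.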
2. MI300X (ROCm 7.0, vLLM built from this PR)
(a) Triton kernel functional test
(b) OpenAI-compatible API test
Then:
Response
Confirms successful end-to-end inference on ROCm.
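For reference, an OpenAI-compatible completion request against a locally served vLLM instance is shaped roughly as below. The endpoint path is the standard OpenAI one; the port, model name, and helper function are assumptions, and only the payload construction runs here (no server required).

```python
import json

# Assumed local vLLM server address; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1/completions"


def build_completion_request(model: str, prompt: str, max_tokens: int = 32) -> str:
    """Build the JSON body for an OpenAI-compatible /v1/completions call."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    })

# Sending it would look like:
#   curl -s $BASE_URL -H "Content-Type: application/json" -d "$BODY"
```

A non-error completion in the response is what the test plan above uses as evidence of end-to-end inference through the Triton scaled-mm path.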