
[Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 (#33366)

Merged
gshtras merged 3 commits into vllm-project:main from ROCm:rocm_skinny_dispatch_fix on Jan 31, 2026

Conversation

@gshtras (Collaborator) commented Jan 29, 2026

Fixing 2 issues from the previous PR:

  1. When the wvsplitk kernel is not applicable, the fallback should be hipblaslt through torch._scaled_mm, not cloning and modifying the input tensor. This fixes the +100% performance regression on amd/LLama-3.1-*-FP8-KV models when running at smaller batch sizes.
  2. Weights can be non-contiguous, and often are in the FP8 case, where we explicitly pad them to a multiple of 256. So the contiguity condition needs to be applied only to the activation tensor.

cc @rasmith
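The corrected dispatch described above can be sketched as a small predicate. This is a minimal illustration only: the function name, arguments, and batch threshold are hypothetical, not vLLM's actual API.

```python
def use_skinny_kernel(act_contiguous: bool, n_batch: int, on_radeon: bool) -> bool:
    """Illustrative dispatch guard for the skinny GEMM path.

    Only the activation's contiguity gates the custom kernel; weights may
    legitimately be non-contiguous (e.g. padded to a multiple of 256 in
    the FP8 path). Names and the batch threshold are hypothetical.
    """
    SKINNY_BATCH_LIMIT = 4  # illustrative small-batch cutoff
    if on_radeon:
        # Unchanged from the original logic: these kernels are never
        # dispatched on Radeon (gfx1x).
        return False
    if not act_contiguous:
        # Fall back to hipblaslt via torch._scaled_mm instead of cloning
        # the input into a contiguous copy.
        return False
    return n_batch <= SKINNY_BATCH_LIMIT
```

When the predicate returns False for a non-contiguous activation, the caller would take the hipblaslt fallback rather than materializing a contiguous clone, which is the source of the regression fixed here.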

…o work

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@gshtras gshtras requested a review from tjtanaa as a code owner January 29, 2026 20:57
@mergify mergify bot added the rocm (Related to AMD ROCm) and bug (Something isn't working) labels Jan 29, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Jan 29, 2026
@gshtras gshtras added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 29, 2026
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request correctly refactors the skinny GEMM dispatch logic by removing the unconditional enforcement of contiguous inputs and instead adding checks at the call sites. For the FP8 kernel wvSplitKQ, checking only the activation tensor's contiguity is correct, as the kernel can handle non-contiguous weights.

However, the unquantized kernels (wvSplitK, wvSplitKrc, LLMM1) still seem to require both inputs to be contiguous. The current changes only check the activation tensor, which could lead to incorrect results if non-contiguous weights are passed. I've added comments to vllm/model_executor/layers/utils.py suggesting weight-contiguity checks for these unquantized kernels to prevent this potential issue.
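A guard along the lines the reviewer suggests might look like the following. This is a hypothetical sketch, not vLLM's code: `pick_unquantized_gemm` and the returned path names are made up for illustration.

```python
import torch

def pick_unquantized_gemm(x: torch.Tensor, w: torch.Tensor) -> str:
    """Choose a GEMM path for the unquantized case.

    Per the review comment, the unquantized skinny kernels appear to
    require BOTH tensors to be contiguous, unlike the FP8 wvSplitKQ,
    which only needs a contiguous activation.
    """
    if x.is_contiguous() and w.is_contiguous():
        return "wvSplitK"      # custom skinny kernel would be safe here
    return "torch.matmul"      # generic fallback handles any layout
```

Note that a transposed 2-D weight (`w.t()`) is non-contiguous even when the underlying storage is, so a check like this would route such weights to the fallback.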

@JartX (Contributor) commented Jan 29, 2026

@gshtras Excuse me for asking, but could this PR break inference on gfx1x by removing part of the contiguous logic? There have been several instances where removing contiguous logic caused models, specifically in ViT, to produce nothing but hallucinations. Thank you for your time.

@gshtras (Collaborator, Author) commented Jan 29, 2026

> @gshtras Excuse me for asking, but could this PR break inference on gfx1x by removing part of the contiguous logic? There have been several instances where removing contiguous logic caused the model specifically in ViT to produce nothing but hallucinations. Thank you for your time.

No, the condition to not dispatch these kernels on Radeon is unchanged.

@JartX (Contributor) commented Jan 29, 2026

@gshtras many thanks for the answer

@rasmith (Contributor) commented Jan 30, 2026

@gshtras The tests that were breaking were from

pytest -sv models/language/generation/test_hybrid.py

and I tested them and found no failures.

Also, I tested reverting https://github.com/vllm-project/vllm/pull/32099 with this PR applied and also found no failures in the test_hybrid.py test.

Could you revert that PR in this one as well?

@tjtanaa tjtanaa added this to the v0.15.1 Hotfix milestone Jan 30, 2026
@tjtanaa (Collaborator) commented Jan 30, 2026

I also helped test with the commands used in PR #32831:

vllm bench serve --model state-spaces/mamba-130m-hf
vllm serve state-spaces/mamba-130m-hf

The performance seems good. Falling back is better than just casting to contiguous for this model as well.

@tjtanaa (Collaborator) left a comment


LGTM. Thanks @gshtras for the quick fix.

@tjtanaa (Collaborator) commented Jan 30, 2026

@gshtras can you try rebasing again? I have retried those tests and they still failed.

@gshtras gshtras merged commit 31aedfe into vllm-project:main Jan 31, 2026
50 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Jan 31, 2026
@gshtras gshtras deleted the rocm_skinny_dispatch_fix branch January 31, 2026 01:05
khluu pushed a commit that referenced this pull request Feb 2, 2026

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
(cherry picked from commit 31aedfe)
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…t#32831 (vllm-project#33366)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Pai <416932041@qq.com>
AndreasKaratzas added a commit to ROCm/vllm that referenced this pull request Feb 4, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…t#32831 (vllm-project#33366)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

Labels

bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)

Projects

Status: Done


4 participants