
[Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 (#33366)

Merged
gshtras merged 3 commits into vllm-project:main from ROCm:rocm_skinny_dispatch_fix on Jan 31, 2026

Conversation

@gshtras (Collaborator) commented Jan 29, 2026

Fixing 2 issues from the previous PR:

  1. When the wvsplitk kernel is not applicable, the fallback should be hipblaslt through torch._scaled_mm, not cloning and modifying the input tensor. This fixes the +100% performance regression on amd/LLama-3.1-*-FP8-KV models when running at smaller batch sizes.
  2. Weights can be non-contiguous, and often are in the FP8 case, where we explicitly pad them to a multiple of 256. So the contiguity condition needs to be applied only to the activation tensor.

cc @rasmith
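The corrected dispatch described above can be sketched as a small predicate. This is a minimal illustration only: the function name, arguments, and batch threshold are hypothetical, not vLLM's actual API.

```python
def use_skinny_kernel(act_contiguous: bool, n_batch: int, on_radeon: bool) -> bool:
    """Illustrative dispatch guard for the skinny GEMM path.

    Only the activation's contiguity gates the custom kernel; weights may
    legitimately be non-contiguous (e.g. padded to a multiple of 256 in
    the FP8 path). Names and the batch threshold are hypothetical.
    """
    SKINNY_BATCH_LIMIT = 4  # illustrative small-batch cutoff
    if on_radeon:
        # Unchanged from the original logic: these kernels are never
        # dispatched on Radeon (gfx1x).
        return False
    if not act_contiguous:
        # Fall back to hipblaslt via torch._scaled_mm instead of cloning
        # the input into a contiguous copy.
        return False
    return n_batch <= SKINNY_BATCH_LIMIT
```

When the predicate returns False for a non-contiguous activation, the caller would take the hipblaslt fallback rather than materializing a contiguous clone, which is the source of the regression fixed here.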

…o work

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@gshtras gshtras requested a review from tjtanaa as a code owner January 29, 2026 20:57
@mergify mergify bot added the rocm (Related to AMD ROCm) and bug (Something isn't working) labels Jan 29, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Jan 29, 2026
@gshtras gshtras added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 29, 2026
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request correctly refactors the skinny GEMM dispatch logic by removing the unconditional enforcement of contiguous inputs and instead adding checks at the call sites. For the FP8 kernel wvSplitKQ, checking only the activation tensor's contiguity is correct, as the kernel can handle non-contiguous weights.

However, the unquantized kernels (wvSplitK, wvSplitKrc, LLMM1) still seem to require both inputs to be contiguous. The current changes only check the activation tensor, which could lead to incorrect results if non-contiguous weights are passed. I've added comments to vllm/model_executor/layers/utils.py suggesting weight-contiguity checks for these unquantized kernels to prevent this potential issue.
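A guard along the lines the reviewer suggests might look like the following. This is a hypothetical sketch, not vLLM's code: `pick_unquantized_gemm` and the returned path names are made up for illustration.

```python
import torch

def pick_unquantized_gemm(x: torch.Tensor, w: torch.Tensor) -> str:
    """Choose a GEMM path for the unquantized case.

    Per the review comment, the unquantized skinny kernels appear to
    require BOTH tensors to be contiguous, unlike the FP8 wvSplitKQ,
    which only needs a contiguous activation.
    """
    if x.is_contiguous() and w.is_contiguous():
        return "wvSplitK"      # custom skinny kernel would be safe here
    return "torch.matmul"      # generic fallback handles any layout
```

Note that a transposed 2-D weight (`w.t()`) is non-contiguous even when the underlying storage is, so a check like this would route such weights to the fallback.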

@JartX (Contributor) commented Jan 29, 2026

@gshtras Excuse me for asking, but could this PR break inference on gfx1x by removing part of the contiguous logic? There have been several instances where removing contiguous logic caused models, specifically in ViT, to produce nothing but hallucinations. Thank you for your time.

@gshtras (Collaborator, Author) commented Jan 29, 2026

> @gshtras Excuse me for asking, but could this PR break inference on gfx1x by removing part of the contiguous logic? There have been several instances where removing contiguous logic caused the model specifically in ViT to produce nothing but hallucinations. Thank you for your time.

No, the condition to not dispatch these kernels on Radeon is unchanged.

@JartX (Contributor) commented Jan 29, 2026

@gshtras many thanks for the answer

@rasmith (Contributor) commented Jan 30, 2026

@gshtras The tests that were breaking were from

pytest -sv models/language/generation/test_hybrid.py

and I tested them and found no failures.

Also, I tested reverting https://github.com/vllm-project/vllm/pull/32099 with this PR applied and also found no failures in the test_hybrid.py test.

Could you revert that PR in this one as well?

@tjtanaa tjtanaa added this to the v0.15.1 Hotfix milestone Jan 30, 2026
@tjtanaa (Collaborator) commented Jan 30, 2026

I also helped test with the commands used in PR #32831:

vllm bench serve --model state-spaces/mamba-130m-hf
vllm serve state-spaces/mamba-130m-hf

The performance seems good. Falling back is better than just casting to contiguous for this model as well.

@tjtanaa (Collaborator) left a comment


LGTM. Thanks @gshtras for the quick fix.

@tjtanaa (Collaborator) commented Jan 30, 2026

@gshtras can you try rebasing again? I have retried those tests and they still failed.

@gshtras gshtras merged commit 31aedfe into vllm-project:main Jan 31, 2026
50 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Jan 31, 2026
@gshtras gshtras deleted the rocm_skinny_dispatch_fix branch January 31, 2026 01:05
khluu pushed a commit that referenced this pull request Feb 2, 2026

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
(cherry picked from commit 31aedfe)
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…t#32831 (vllm-project#33366)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Pai <416932041@qq.com>
AndreasKaratzas added a commit to ROCm/vllm that referenced this pull request Feb 4, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…t#32831 (vllm-project#33366)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

Labels

bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)

Projects

Status: Done


4 participants