[ROCm] Cap Triton paged attention block size to fix ROCm shared memory OOM by AndreasKaratzas · Pull Request #38502 · vllm-project/vllm

AndreasKaratzas · 2026-03-30T04:29:36Z

Hybrid Mamba models (e.g. Jamba) inflate block_size to 2048 to align attention and Mamba page sizes. When the ROCm custom paged attention kernel rejects this (it only supports 16/32), the Triton fallback kernel_paged_attention_2d used 2048 as its tile size, requesting 262144 bytes of shared memory and thus exceeding the MI325X hardware limit of 65536 bytes. Cap TRITON_BLOCK_SIZE at 128. The kernel already decouples tile size from physical block size via l_block_idx/internal_offsets addressing, so this is safe.
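The numbers in the description can be checked with a little arithmetic. The sketch below is illustrative only: it assumes shared memory use grows linearly with the tile size (so 262144 bytes at a tile of 2048 implies 128 bytes per tile row), and `is_power_of_two`, `pick_triton_block_size` and `estimated_shared_bytes` are hypothetical helper names, not functions from vLLM.

```python
MI325X_LDS_LIMIT_BYTES = 65536     # hardware limit quoted in the description
BYTES_REQUESTED_AT_2048 = 262144   # request quoted in the description
MAX_TRITON_BLOCK_SIZE = 128        # cap introduced by this PR


def is_power_of_two(n: int) -> bool:
    """Assumed meaning of the is_pow2 flag used in the patch."""
    return n > 0 and (n & (n - 1)) == 0


def pick_triton_block_size(block_size: int) -> int:
    """Mirror of the patched expression: cap power-of-two sizes, else use 32."""
    if not is_power_of_two(block_size):
        return 32
    return min(block_size, MAX_TRITON_BLOCK_SIZE)


def estimated_shared_bytes(tile_size: int) -> int:
    """Assumption: shared memory scales linearly with the tile size."""
    bytes_per_tile_row = BYTES_REQUESTED_AT_2048 // 2048  # 128 bytes per row
    return tile_size * bytes_per_tile_row


for block_size in (16, 32, 2048):
    tile = pick_triton_block_size(block_size)
    needed = estimated_shared_bytes(tile)
    fits = needed <= MI325X_LDS_LIMIT_BYTES
    print(f"block_size={block_size} tile={tile} bytes={needed} fits={fits}")
```

Under that linear assumption, a tile of 128 needs 16384 bytes, well inside the 65536-byte limit, while the uncapped tile of 2048 reproduces the 262144-byte request.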

Test plan

pytest tests/models/language/generation/test_hybrid.py

cc @kenroche

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

AndreasKaratzas · 2026-03-30T04:29:43Z

cc @micah-wil

gemini-code-assist

Code Review

This pull request introduces a cap of 128 on the TRITON_BLOCK_SIZE within the chunked_prefill_paged_decode operation to prevent shared memory OOM errors, particularly for models with large block sizes like hybrid Mamba. Feedback suggests that hardcoding this value is brittle and recommends a more robust approach by dynamically calculating the maximum block size based on the specific device's shared memory capacity to ensure better portability across different hardware.


AndreasKaratzas · 2026-04-01T22:11:06Z

Test group is green: https://buildkite.com/vllm/amd-ci/builds/7193/steps/canvas?jid=019d49fc-7807-4af1-bb3e-96103fc4392b&tab=output

hmellor · 2026-04-14T12:47:21Z

+        is_contiguous_blocks = key_cache.stride(0) == key_cache[0].numel()
+        if block_size in (16, 32) and is_contiguous_blocks:
+            # Normal 16, 32 with contiguous blocks, use vLLM native HIP C++ logic


Could we use https://docs.pytorch.org/docs/stable/generated/torch.Tensor.is_contiguous.html?

is_contiguous() checks tensor-level contiguity, but the reshape_and_cache kernel only assumes block-level contiguity, so the former would add performance overhead that I think is unnecessary here.

I see, so we just want to check that the rows are contiguous?

Yep :) It ensures that the rows along dim 0 (i.e. the blocks) are packed with no gaps between them.
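To make the distinction in this thread concrete, here is a small torch-free sketch that works on plain shape and stride tuples. The shapes and the helper names (`contiguous_strides`, `is_tensor_contiguous`, `is_contiguous_blocks`) are made up for illustration, and the tensor-level check is a simplification of what PyTorch does.

```python
from math import prod


def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def is_tensor_contiguous(shape, strides):
    """Simplified tensor-level check: every stride matches row-major order."""
    return tuple(strides) == contiguous_strides(shape)


def is_contiguous_blocks(shape, strides):
    """Block-level check: stepping along dim 0 skips exactly one block."""
    return strides[0] == prod(shape[1:])


# Hypothetical KV cache: (num_blocks, block_size, num_heads, head_size)
shape = (4, 16, 8, 64)
packed = contiguous_strides(shape)          # (8192, 512, 64, 1)
padded = (2 * packed[0],) + packed[1:]      # gap after every block
permuted_inside = (8192, 64, 1024, 1)       # blocks packed, inner dims swapped

print(is_contiguous_blocks(shape, packed))           # True
print(is_contiguous_blocks(shape, padded))           # False
print(is_contiguous_blocks(shape, permuted_inside))  # True
print(is_tensor_contiguous(shape, permuted_inside))  # False
```

The last two lines show why the two checks differ: a layout can keep blocks packed back to back along dim 0 while not being contiguous at the tensor level.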

hmellor · 2026-04-14T12:47:41Z

+        MAX_TRITON_BLOCK_SIZE = 128
+        TRITON_BLOCK_SIZE = min(block_size, MAX_TRITON_BLOCK_SIZE) if is_pow2 else 32


Is this a ROCm specific limit?

Yep :)

At least for now.

Is this op only used for ROCm? (sorry if that's a dumb question, I'm not familiar with this area of the code)

I didn't know the answer to that question myself before I attempted to resolve the failure here, so I think it's not a dumb question 😅

Answer: Yep :) It's found only in vllm/v1/attention/backends/rocm_attn.py.

Yes. This op is only used for ROCm.

tjtanaa · 2026-04-14T15:37:51Z

+    # (CanonicalizePointers, ConvertToBufferOps) crash when an scf.if
+    # yields pointers with different base addresses. Instead, we compute
+    # both sets of load pointers and use mutually exclusive masks.
+    if HAS_INITSTATES:


@hmellor do you know who is more familiar with this mamba code?

I can review changes to this kernel, but I don't really understand why these changes are related to the rest of the PR?

@AndreasKaratzas can you explain? Thanks

It is a second ROCm/Mamba blocker hit by the same hybrid-model validation path. The PR is aimed at getting hybrid Mamba models, e.g. Jamba, working on ROCm with chunked prefill. Once the attention path gets past the inflated/padded block-size issue, the same hybrid_model tests run the Mamba2 varlen SSD path with initial_states. In the previous Triton code, prev_states_ptr could come from either initstates_ptr or states_ptr through an if. On AMD Triton this lowers to an scf.if yielding pointers with different base addresses, and the ROCm compiler crashes in CanonicalizePointers / ConvertToBufferOps. This change keeps the same semantics by computing both candidate load pointers and using mutually exclusive masks, so only the selected source contributes. Without this fix we get:

FAILED tests/kernels/mamba/test_mamba_ssm_ssd.py::test_mamba_chunk_scan_cont_batch_prefill_chunking[seqlens0-8]
FAILED tests/kernels/mamba/test_mamba_ssm_ssd.py::test_mamba_chunk_scan_cont_batch_prefill_chunking[seqlens0-256]
FAILED tests/kernels/mamba/test_mamba_ssm_ssd.py::test_mamba_chunk_scan_cont_batch_prefill_chunking[seqlens1-8]
FAILED tests/kernels/mamba/test_mamba_ssm_ssd.py::test_mamba_chunk_scan_cont_batch_prefill_chunking[seqlens1-256]
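The "mutually exclusive masks" idea described above can be imitated in plain Python. This is only a sketch of the pattern, not the Triton kernel: `masked_load` is a hypothetical stand-in for a masked load that returns a fill value when the mask is off, and combining the two results by addition is an assumption about how the selected source is recovered.

```python
def masked_load(buffer, index, mask, other=0.0):
    """Stand-in for a masked load: read only when mask is set, else `other`."""
    return buffer[index] if mask else other


def load_prev_state(initstates, states, init_idx, state_idx, use_init):
    """Branch-free selection between two sources with different bases.

    Both loads are always issued, but their masks are mutually exclusive,
    so exactly one of them contributes a real value.
    """
    from_init = masked_load(initstates, init_idx, mask=use_init)
    from_states = masked_load(states, state_idx, mask=not use_init)
    return from_init + from_states


initstates = [1.0, 2.0]
states = [10.0, 20.0]
print(load_prev_state(initstates, states, 0, 1, use_init=True))
print(load_prev_state(initstates, states, 0, 1, use_init=False))
```

Because no pointer is chosen inside a conditional, nothing equivalent to an scf.if yielding pointers with different base addresses is needed in this formulation.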

@AndreasKaratzas Can you share the error? I think this was fixed on the Triton side already with triton-lang/triton#9541.

After updating to the new base image, I realized that this patch is unnecessary. I restored this file and am waiting for the CI run to confirm that this issue has indeed already been solved elsewhere.

gshtras · 2026-04-16T16:06:22Z


-        if block_size in (16, 32):
-            # Normal 16, 32, use vLLM native HIP C++ logic
+        is_contiguous_blocks = key_cache.stride(0) == key_cache[0].numel()


Would this logic also apply to the use_custom for the actual kernel selection?

I think that the custom kernel is stride aware:

vllm/csrc/rocm/attention.cu

Line 3244 in 617d1c2

int kv_block_stride = key_cache.stride(0);

So that logic is not needed there.

We can only go into that kernel if reshape_and_cache was used, not reshape_and_cache_flash.
There is a condition for whether to select the kernel or go with the Triton fallback. It may need to be changed accordingly.

@gshtras I updated the kernel selection to use the same native-layout check as the cache update path. If the KV cache blocks are "strided" and the update path uses reshape_and_cache_flash, use_custom is now forced to false so decode falls back to the Triton path.
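The coupling discussed in this thread can be summarised as a small decision function. This is a hedged sketch, not the code in rocm_attn.py: `select_paths` and `custom_kernel_supported` are hypothetical names, with the latter standing in for whatever other conditions the real use_custom check applies.

```python
def select_paths(block_size, is_contiguous_blocks, custom_kernel_supported):
    """Pick the cache-write op and the decode kernel from one shared check."""
    # Same condition as the cache update path in the diff above.
    native_layout = block_size in (16, 32) and is_contiguous_blocks

    cache_write = "reshape_and_cache" if native_layout else "reshape_and_cache_flash"

    # The custom kernel is only eligible when the native layout was written.
    use_custom = custom_kernel_supported and native_layout
    decode = "custom" if use_custom else "triton"
    return cache_write, decode


print(select_paths(16, True, True))     # native layout, custom decode
print(select_paths(16, False, True))    # strided blocks force the Triton path
print(select_paths(2048, True, True))   # inflated block size, Triton path
```

Deriving both choices from the same `native_layout` flag keeps the decode kernel from reading a cache that was written in the other layout.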

tdoublep · 2026-04-16T16:37:33Z

+        # via the l_block_idx/internal_offsets addressing logic.
+        # TODO: Remove after upgrading from Triton 3.6 on ROCm
+        # See: https://github.com/triton-lang/triton/pull/9541
+        MAX_TRITON_BLOCK_SIZE = 128


is this a hard limit for all ROCm GPUs?

It's not architectural. The constraint is LDS pressure from the kernel's tile, and 128 is just where this kernel fits without per-arch tuning.

Different platforms do have different LDS sizes (e.g. it differs between MI300 and MI355), so we could actually query the current platform's LDS size to calculate the max block size here if we wanted to be more precise. 128 does seem to work universally though.
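A device-aware cap along the lines suggested here could look roughly like the following. It is a sketch under stated assumptions: the LDS size is passed in rather than queried, `max_tile_for_lds` is a hypothetical helper, and the 128 bytes per tile row is only what the PR's own numbers imply (262144 bytes at a tile of 2048) if usage scales linearly.

```python
def max_tile_for_lds(lds_bytes, bytes_per_tile_row, hard_cap=None):
    """Largest power-of-two tile whose estimated footprint fits in LDS."""
    if bytes_per_tile_row <= 0 or lds_bytes < bytes_per_tile_row:
        raise ValueError("LDS budget cannot hold even a single tile row")
    rows = lds_bytes // bytes_per_tile_row
    tile = 1 << (rows.bit_length() - 1)   # round down to a power of two
    if hard_cap is not None:
        tile = min(tile, hard_cap)
    return tile


# 65536 bytes is the MI325X limit quoted in the PR description.
print(max_tile_for_lds(65536, 128))                # capacity bound only
print(max_tile_for_lds(65536, 128, hard_cap=128))  # keeps the PR's fixed cap
```

A pure capacity bound like this ignores occupancy and any other shared buffers the kernel allocates, which is one reason a conservative fixed value such as 128 can still be the simpler choice.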


mergify · 2026-04-29T06:58:04Z

Hi @AndreasKaratzas, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

AndreasKaratzas · 2026-05-04T21:20:24Z

@tjtanaa the code changes from ssd have been reverted. Is this PR good to go?


AndreasKaratzas · 2026-05-05T06:28:36Z

Changed is_contiguous_blocks = key_cache.stride(0) == key_cache[0].numel() to compare against key_cache.shape[1:].numel() instead, because key_cache.shape[1:].numel() is pure metadata and avoids creating a key_cache[0] tensor view in a hot path.


mergify · 2026-05-10T06:37:20Z

Hi @AndreasKaratzas, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

tjtanaa

LGTM

tjtanaa · 2026-05-10T06:52:28Z

@AndreasKaratzas please fix precommit

AndreasKaratzas · 2026-05-10T06:54:39Z

@AndreasKaratzas please fix precommit

Yep, it's broken currently on main, will probably be fixed by: #42197


Cap Triton paged attention block size to fix ROCm shared memory OOM

1fa54b7

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

AndreasKaratzas added the ready and rocm labels Mar 30, 2026

github-project-automation Bot moved this to Todo in AMD Mar 30, 2026

github-project-automation Bot added this to AMD Mar 30, 2026

mergify Bot added the v1 label Mar 30, 2026

gemini-code-assist Bot reviewed Mar 30, 2026


Comment thread vllm/v1/attention/ops/chunked_prefill_paged_decode.py

AndreasKaratzas added 7 commits March 29, 2026 23:36

Cap Triton paged attention block size to fix ROCm shared memory OOM

9d5b0a0

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

ca6e2df

[ROCm] Fix ROCM_ATTN KV cache write for non-contiguous blocks in hybrid models

3b44ad4

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

70a327c

[ROCm][CI] Fix AMD Triton compiler crash in Mamba SSD chunk scan kernel

483debc

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

f3e5e4e

Syncing with upstream states mamba version

3262441

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

mergify Bot added the ci/build label Apr 1, 2026

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

513ada7

AndreasKaratzas marked this pull request as ready for review April 1, 2026 22:10

AndreasKaratzas requested review from gshtras, tdoublep, tjtanaa and tomeras91 as code owners April 1, 2026 22:10

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

311039c

hmellor reviewed Apr 14, 2026


tjtanaa reviewed Apr 14, 2026


Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

bf8a6f5

gshtras reviewed Apr 16, 2026


tdoublep reviewed Apr 16, 2026


AndreasKaratzas added 3 commits April 28, 2026 13:32

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

b038408

Restored ssd

979ad99

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

e7eb924

AndreasKaratzas added 2 commits April 29, 2026 03:40

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

e97ad4a

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

2cf7d11

AndreasKaratzas added 2 commits May 4, 2026 19:08

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

e0a7d20

Optimize contiguous block detection

49945d7

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

AndreasKaratzas mentioned this pull request May 5, 2026

Fix LFM2 decoding on ROCm #41054

Open

AndreasKaratzas added 3 commits May 7, 2026 11:59

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

0fcf335

[ROCm] Updated kernel selection to same native-layout as cache update path

3befaed

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

4743215

AndreasKaratzas requested a review from Harry-Chen as a code owner May 10, 2026 06:32

tjtanaa approved these changes May 10, 2026


Merge remote-tracking branch 'origin/main' into akaratza_chunked_prefill

6f9f1ea

tjtanaa enabled auto-merge (squash) May 10, 2026 08:32

tjtanaa merged commit 0a309b5 into vllm-project:main May 10, 2026
65 of 66 checks passed

github-project-automation Bot moved this from Todo to Done in AMD May 10, 2026

AndreasKaratzas deleted the akaratza_chunked_prefill branch May 10, 2026 20:48

