
[ROCm][MLA] FP8 ASM prefill on gfx950 for AITER MLA backend #42294

Closed

maeehart wants to merge 4 commits into vllm-project:main from maeehart:rocm/fp8-asm-prefill-sparse-mla-aiter-base


Conversation

@maeehart (Contributor) commented May 11, 2026

Summary

Adds an FP8 ASM prefill path for DeepSeek-V3.2 on MI355X (gfx950) to the dense AITER MLA backend that is already in upstream main.

On MI355X TP=4 with an FP8 KV cache, this improves the vllm bench serve numbers (ISL/OSL/MC = 1000/100/4): +6.6 % output throughput and -9.9 % mean TTFT vs the AMD nightly image baseline, which already includes the persistent-decode and sparse-MLA refactor work.

Credits

  • @clintg6 - first version of the FP8 ASM prefill + persistent decode idea on the AMD custom DSV3.2 container.
  • @frida-andersson - lifted that work onto current upstream vLLM and produced the underlying sparse-MLA refactor (AiterMLA-based inheritance) that this PR builds on top of.

What changed since the previous review round

The previous version of this PR touched three files (rocm_aiter_mla.py, rocm_aiter_mla_sparse.py, and vllm/v1/attention/ops/rocm_aiter_mla_sparse.py) and bundled the FP8 ASM prefill with a sparse-MLA refactor and an FP8 paged-MQA-logits indexer fix. The equivalents of the sparse refactor and the indexer fix are now being upstreamed via separate PRs (#41675 and friends), so this branch has been rebased onto current upstream/main and scoped down to just the FP8 ASM prefill changes in rocm_aiter_mla.py.

Concretely:

  • Branch reset to upstream/main (HEAD 6ff7405b8) and a single squashed commit applied on top.
  • rocm_aiter_mla_sparse.py, rocm_aiter_mla_sparse_dsv4.py, and vllm/v1/attention/ops/rocm_aiter_mla_sparse.py are now identical to upstream main - no longer modified by this PR.
  • All of @tjtanaa's review feedback that applies to the dense file is still in place from the previous round (full descriptive variable names, no references to retired env vars like VLLM_ROCM_FP8_MLA, trimmed comments).
  • Gemini-bot findings against the sparse / ops files are now moot (those files match upstream).
  • The head-padding fix in forward_mqa for num_heads < 16 (ChuanLi1101 / Gemini review comment #1) is also moot here, since this PR no longer changes forward_mqa.

What this PR changes

File: vllm/v1/attention/backends/mla/rocm_aiter_mla.py

  • _fp8_mla_prefill_supported() autodetects the FP8 ASM prefill path on gfx950 when AITER ships mla_prefill_ps_asm_fwd + mla_reduce_v1. No env var is needed; the backend falls back silently to flash_attn_varlen_func when either kernel is missing (detection sketched after this list).
  • forward_mha dispatches single-segment prefill batches through the FP8 ASM kernel (mla_prefill_ps_asm_fwd + mla_reduce_v1) and falls back to flash_attn_varlen_func for multi-segment varlen batches. The AITER ASM kernel faults on the multi-segment packed layout produced by chunked prefill (e.g. five 2-3k segments packed into one 16k-token forward), so the dispatcher conservatively keeps the FP8 ASM path for the common single-segment case only (guard sketched after this list).
  • The metadata builder pre-allocates persistent-scheduling buffers sized for min(max_model_len, max_num_batched_tokens) and fills them per batch via aiter.get_ps_metadata_v1.
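
A minimal sketch of the detection and dispatch, for reviewers skimming without the diff (the helper name _fp8_mla_prefill_supported comes from this PR; its body, the _use_fp8_asm_prefill wrapper, and the call sites are assumptions, not the shipped code):

```python
# Sketch only: the real logic lives in
# vllm/v1/attention/backends/mla/rocm_aiter_mla.py and may differ in detail.
from functools import cache

import torch


@cache
def _fp8_mla_prefill_supported() -> bool:
    """True when the gfx950 FP8 ASM prefill kernels can be used."""
    if not torch.version.hip:
        return False  # ROCm builds only
    # gfx950 only; every other arch keeps flash_attn_varlen_func.
    if "gfx950" not in torch.cuda.get_device_properties(0).gcnArchName:
        return False
    try:
        import aiter
    except ImportError:
        return False
    # Both kernels must ship in the installed AITER build; if either is
    # missing, fall back silently -- no env var involved.
    return hasattr(aiter, "mla_prefill_ps_asm_fwd") and hasattr(
        aiter, "mla_reduce_v1"
    )


def _use_fp8_asm_prefill(query_start_loc: torch.Tensor) -> bool:
    """Dispatch guard: the ASM kernel handles one prefill segment per
    launch, so multi-segment chunked-prefill batches take the varlen path."""
    num_prefill_segments = query_start_loc.shape[0] - 1
    return _fp8_mla_prefill_supported() and num_prefill_segments == 1
```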

Verification

Hardware: MI355X (gfx950), TP=4. Model: DeepSeek-V3.2.

Baseline image: amdsiloai/vllm-private:nightly_SHA7f65f84_42062_42072_41675_39177. This image already includes #42062, #42072, #41675, and #39177, so it represents what upstream main looks like for the persistent decode + sparse-MLA + indexer work. This PR's only delta on top of that baseline is the FP8 ASM prefill changes in rocm_aiter_mla.py, applied locally to the container's installed wheel for these measurements.

Server flags:

```
vllm serve deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 4 --async-scheduling --block-size 64 \
  --dtype auto --gpu-memory-utilization 0.8 --host 0.0.0.0 \
  --kv-cache-dtype {auto, bfloat16} --max-num-batched-tokens 16384 \
  --max-num-seqs 256 --no-enable-prefix-caching --port 30003 \
  --trust-remote-code
```

Env: VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4, HIP_VISIBLE_DEVICES=0,1,2,3.

Performance

Two benchmark shapes were run on each of {baseline image, baseline + this PR} × {FP8 KV, BF16 KV} = 4 configurations.

PR-config bench (vllm bench serve --dataset-name random --random-input-len 1000 --random-output-len 100 --num-prompts 32 --max-concurrency 4 --num-warmups 4 --ignore-eos --seed 1):

| KV cache | Metric | Baseline image | + FP8 ASM prefill | Delta |
|---|---|---|---|---|
| FP8 (auto) | Output throughput | 99.34 tok/s | 105.93 tok/s | +6.6 % |
| FP8 (auto) | Mean TTFT | 1 926 ms | 1 736 ms | -9.9 % |
| FP8 (auto) | Mean TPOT | 21.20 ms | 20.59 ms | -2.9 % |
| BF16 | Output throughput | 103.12 tok/s | 103.15 tok/s | +0.0 % |
| BF16 | Mean TTFT | 1 602 ms | 1 689 ms | +5.4 % |
| BF16 | Mean TPOT | 22.99 ms | 22.09 ms | -3.9 % |

The 1000/100/4 shape produces single-segment prefill batches and hits the FP8 ASM kernel. The FP8-KV gain (+6.6 % throughput, -9.9 % TTFT) is the headline. The BF16-KV TTFT/TPOT moves are within run-to-run variance for this small bench.

Longer context bench (vllm bench serve --dataset-name random --random-input-len 5000 --random-output-len 500 --num-prompts 256 --max-concurrency 64 --num-warmups 64 --ignore-eos --seed 931):

| KV cache | Metric | Baseline image | + FP8 ASM prefill | Delta |
|---|---|---|---|---|
| FP8 (auto) | Output throughput | 136.80 tok/s | 136.64 tok/s | -0.1 % |
| FP8 (auto) | Mean TPOT | 215.80 ms | 216.32 ms | +0.2 % |
| BF16 | Output throughput | 136.45 tok/s | 137.31 tok/s | +0.6 % |
| BF16 | Mean TPOT | 210.05 ms | 215.45 ms | +2.6 % |

The 5000/500/64 shape (256 prompts, conc 64, ISL=5000) packs multiple varlen prefill segments per scheduler step to fill --max-num-batched-tokens=16384, so forward_mha's num_prefill_segments > 1 guard intentionally falls back to flash_attn_varlen_func. Performance is neutral (within ~0.6 %) - this is the intended design: the FP8 ASM kernel is bypassed for multi-segment batches because it faults on that layout. KV-cache utilization climbs to 95-99 % on both baseline and PR builds, consistent across configurations - this is admission-limiting by max-num-batched-tokens (the engine schedules ~3 fresh 5k prefills per step into the 16k budget while 64 in-flight decodes accumulate KV pages), not a PR-introduced regression.

Accuracy

```
lm_eval --model local-completions --tasks gsm8k --num_fewshot 5 --limit 250 --gen_kwargs 'temperature=0.0,max_gen_toks=512' --model_args 'model=deepseek-ai/DeepSeek-V3.2,base_url=http://localhost:30003/v1/completions,num_concurrent=4,max_retries=3,tokenized_requests=False,tokenizer_backend=None'
```

| Config | KV cache | strict-match | flexible-extract |
|---|---|---|---|
| Baseline image | FP8 (auto) | 94.4 % | 94.4 % |
| Baseline image | BF16 | 93.2 % | 93.2 % |
| + FP8 ASM prefill | FP8 (auto) | 92.0 % | 92.8 % |
| + FP8 ASM prefill | BF16 | 90.8 % | 91.2 % |

The ~2.4 pp drop from the FP8 ASM prefill kernel is within ~1.5× the 250-sample stderr (~1.6-1.8 pp) and is consistent across both KV-cache dtypes, suggesting it comes from the bf16→fp8 cast inside the prefill kernel itself (Q, K, V are decompressed to bf16 and cast to fp8 before the kernel; the kernel uses one_scale = 1.0 for all three tensor scales rather than per-tensor max-abs scaling). All four configurations remain comfortably above 90 % strict-match. 250/250 requests completed, 0 failed, and the server stayed stable across the full eval window in every configuration.
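
For reference, the quoted stderr is just the binomial standard error at n = 250 and the observed p ≈ 0.93 (a quick sanity check, not a number from the eval harness):

$$\sigma \approx \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.93 \times 0.07}{250}} \approx 0.016 = 1.6\ \text{pp}, \qquad \frac{2.4\ \text{pp}}{1.6\ \text{pp}} \approx 1.5\,\sigma.$$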

Stability sanity checks

  • Repeated short bench (32 prompts, conc 4) + heavy bench (256 prompts, conc 64) + GSM8K 250-sample eval on a single server instance, on all four configurations: server stable, 0 GPU faults, 0 EngineDeadError.
  • Full server logs over the bench + GSM8K + heavy-bench window (~30-40 min per configuration) contain zero `Memory access fault by GPU` entries.

Cross-model verification: amd/Kimi-K2.5-MXFP4

The same FP8 ASM prefill changes were surgically applied on top of amdsiloai/vllm:20262304-kimi-k25-mxfp4-optimized-mi355x (a custom-branch image with kimi-specific MLA decode optimizations) on MI355X TP=2 to verify the PR does not regress other MLA models and that the FP8 ASM kernel actually fires.

Server flags (from the AWS ITT config):

```
vllm serve amd/Kimi-K2.5-MXFP4 \
  --tensor-parallel-size 2 --gpu-memory-utilization 0.96 \
  --kv-cache-dtype fp8 --max-num-batched-tokens 16384 \
  --max_model_len 14976 --no_enable_prefix_caching --trust-remote-code
```

Env: VLLM_ROCM_USE_AITER=1, VLLM_ROCM_AITER_FUSED_MLA_DECODE=1, VLLM_ROCM_USE_AITER_FUSED_AR_RMSNORM=1, VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4.

FP8 ASM prefill launches confirmed: with logging gated on q.shape[0] >= 256, the patched server logged 2196 launches of mla_prefill_ps_asm_fwd with total_q=10000 during the AWS ITT bench (32 prompts × 10000-token prefills × 60 layers × 2 ranks; the scheduler emits each fresh 10k prefill as its own single-segment batch, so forward_mha's num_prefill_segments > 1 guard does not engage on this workload).
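
The instrumentation was roughly the following (a sketch; the helper name and dispatch-site variables are assumptions, not the shipped patch):

```python
# Sketch of the launch logging described above, gated on prefill size.
import logging

import torch

logger = logging.getLogger(__name__)


def log_fp8_asm_launch(q: torch.Tensor) -> None:
    # Gate on large prefills so the many small decode-step forwards
    # don't flood the log; 256 tokens comfortably excludes decode batches.
    if q.shape[0] >= 256:
        logger.info("mla_prefill_ps_asm_fwd launch: total_q=%d", q.shape[0])
```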

ITT bench (vllm bench serve --dataset-name random --random-input-len 10000 --random-output-len 1000 --num-prompts 32 --max-concurrency 4 --num-warmups 4 --seed 2 --ignore-eos), warm-cache runs:

| Metric | Baseline image | + FP8 ASM prefill | Delta |
|---|---|---|---|
| Output throughput | 260.08 tok/s | 267.57 tok/s | +2.9 % |
| Mean TTFT | 1 697 ms | 1 339 ms | -21.1 % |
| Mean TPOT | 13.69 ms | 13.62 ms | -0.5 % |
| Successful requests | 32 / 32 | 32 / 32 | |

No regression and a modest gain (mainly TTFT, consistent with the FP8 ASM prefill kernel being on the critical path of the 10k-token prefill). Both runs use the kimi-image's fused decode (fuses_rope_in_decode) which this PR leaves untouched.

Test plan

  • Server boots cleanly on MI355X TP=4 with --max-num-batched-tokens 16384 and captures both mixed prefill-decode and FULL decode CUDA graphs without faulting.
  • vllm bench serve 1000/100/4 reference shape: 32/32 completed on all four configs; +6.6 % output throughput / -9.9 % mean TTFT vs baseline on FP8 KV.
  • lm_eval gsm8k 5-shot 250 samples: 250/250 completed on all four configs; strict-match ≥ 90 % in every case.
  • Separate 5000/500 heavy bench (256 prompts, MC=64): 256/256 completed on all four configs; multi-segment fallback to flash_attn_varlen_func engaged as designed; throughput and TPOT match the baseline image within ~0.6 %.
  • Branch rebased onto current upstream main as a single squashed commit; only rocm_aiter_mla.py is modified relative to upstream main.
  • Long-context prefill regression test (separate from this PR's load shape) - tracked.

@mergify Bot added the rocm (Related to AMD ROCm) and v1 labels May 11, 2026
@github-project-automation Bot moved this to Todo in AMD May 11, 2026
@maeehart force-pushed the rocm/fp8-asm-prefill-sparse-mla-aiter-base branch from 4d683f7 to e3c5d80 on May 11, 2026 08:27
@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces FP8 MLA prefill support for ROCm backends using persistent-scheduling assembly kernels and refactors the sparse MLA implementation to inherit from a common base for better maintainability. Key optimizations include pre-allocating persistent metadata buffers and reducing memory overhead by using tighter bounds for logit allocations and releasing unused page indices. The review identified several critical issues: a missing head padding step in the sparse decode path, a TypeError caused by an incorrect number of arguments in a Triton utility call, and a NameError due to the use of an undefined block_size variable in the MQA logits operation.

@maeehart force-pushed the rocm/fp8-asm-prefill-sparse-mla-aiter-base branch 5 times, most recently from 6576cec to bd1e44d on May 11, 2026 11:56
maeehart added a commit to maeehart/vllm that referenced this pull request May 11, 2026
… fix)

Two real bugs were causing the disables that landed in 8f33dc8 and
c59b41c:

A) Persistent sparse decode kernel
   (mla_a16w16_qh16_m16x4_n16x1_coex0_mask1_ps) was crashing because
   ROCMAiterMLASparseMetadataBuilder sized the persistent buffers via
   `get_mla_metadata_info_v1(..., is_sparse=True)` but populated them
   via `get_mla_metadata_v1(...)` *without* a `topk` kwarg.  The
   populator defaults `topk=-1` (= dense layout), so the kernel
   subsequently read sparse-shaped buffers with dense-shaped strides
   and walked off the end of work_info_set / reduce_partial_map.  Fix:
   pass `topk=self.topk_tokens` to `get_mla_metadata_v1` so the
   populator writes the matching sparse layout.

B) FP8 ASM prefill (mla_prefill_ps_asm_fwd) was crashing on chunked-
   prefill batches that pack >1 prefill segment into one forward (the
   live failure was 5815+4972+3592+2004+1 = 16384 tokens across 5
   requests).  The kernel only safely handles a single prefill segment
   per launch.  Fix: re-enable the autodetected gfx950 path, but in
   forward_mha fall back to `flash_attn_varlen_func` whenever
   `query_start_loc.shape[0] - 1 > 1`.  Single-request prefill (the
   common DSV3.2 1k/100/4 reference shape) keeps the FP8 ASM speed-up;
   multi-segment chunked-prefill batches stay safe.

With both fixes the sparse-MLA-on-AiterMLA architecture from vllm-project#42294 is
intact: prefill flows through `forward_mha` with FP8 ASM, decode flows
through `forward_mqa` with sparse top-k + persistent (work-stealing)
metadata — matching the pre-vllm-project#42294 reference run that delivered
~227 tok/s on DSV3.2 ISL/OSL/MC=1000/100/4 tp4.

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
maeehart added a commit to maeehart/vllm that referenced this pull request May 11, 2026
… fix)

This commit re-enables both performance paths that PR vllm-project#42294 had to
disable due to GPU memory access faults:

  - Persistent (work-stealing) MLA decode metadata
    (mla_a16w16_qh16_m16x4_n16x1_coex0_mask1_ps).
  - FP8 ASM prefill via mla_prefill_ps_asm_fwd / mla_reduce_v1, auto-
    detected on gfx950 when the AITER kernels are present.

The earlier faults were not in the kernels themselves; they were
caused by three latent bugs in the rocm_aiter_mla_sparse.py refactor
in this branch (e3c5d80), which only manifested under specific
batch shapes and so were not caught by the long-prompt benchmark:

1. ROCMAiterMLASparseImpl.forward_mqa was reading
   attn_metadata.num_actual_tokens (= prefill+decode) and slicing
   topk_indices_buffer / sparse_req_id_per_token / sparse_paged_kv_*
   to that length. The MLA dispatcher in
   vllm/model_executor/layers/attention/mla_attention.py only passes
   mqa_q = q[:num_decode_tokens] here, so the prefill positions were
   indexing into:

     - topk_indices_buffer[num_decode:num_actual] which the indexer
       skips for prefill (uninitialised memory), and
     - decode.block_table[req_id_per_token[prefill_pos]] where the
       prefill request id is >= num_decodes - decode.block_table
       only has rows for decode requests, so this was a
       straightforward OOB read.

   This commit slices everything in forward_mqa to q.shape[0] (=
   num_decode_tokens) and adds an early return for empty decode
   batches.

2. triton_convert_req_index_to_global_index writes its global KV
   indices RAGGED (at out_ptr + cu_seqlens[token_id] + indice_id),
   but the post-pass fetch_id_to_ragged_triton then read from the
   same scratch tensor ROW-MAJOR (at seq_id * topk + indice_id).
   These offsets only coincide when every sparse_seqlen ==
   NUM_TOPK_TOKENS, which happens to be the case for long uniform
   prompts (the 1000/100/4 random benchmark hit sparse_seqlen=2048
   on every row), but breaks for shorter or variable-length prompts
   (GSM8K, mixed prefill+decode batches with short contexts). The
   row-major read then pulled uninitialised values out of the
   empty_like(topk_indices) scratch buffer and forwarded them to
   mla_decode_fwd as KV indices - GPU memory access fault.

   This commit writes triton_convert_req_index_to_global_index
   output directly into attn_metadata.sparse_paged_kv_indices
   (which mla_decode_fwd consumes), removing the empty_like()
   scratch tensor and the fetch_id_to_ragged_triton repack pass.
   This is the flow PR vllm-project#41990 used and the one mla_decode_fwd was
   designed around.

3. The persistent populator (aiter.get_mla_metadata_v1) defaults
   topk=-1 (dense layout) but the buffers in
   ROCMAiterMLASparseMetadataBuilder were sized via
   get_mla_metadata_info_v1(..., is_sparse=True). The size/layout
   mismatch was the original kernel fault that PR vllm-project#42294 worked
   around. This commit:

     - Passes topk=self.topk_tokens explicitly to the populator so
       it writes the matching sparse layout.
     - Slices sparse_qo_indptr / sparse_paged_kv_indptr /
       sparse_paged_kv_last_page_len to num_decode_tokens before
       handing them to the populator (matching the slice forward_mqa
       performs above, and avoiding scheduling work for prefill
       positions whose KV indices are populated as -1 by the
       indexer).

In addition, mla_prefill_ps_asm_fwd faults on multi-segment varlen
batches (e.g. 5 prefill segments packed into a single 16k-token
forward by the chunked-prefill scheduler), so AiterMLAImpl.forward_mha
keeps the FP8 ASM path on for the common single-segment prefill case
and falls back to flash_attn_varlen_func when num_prefill_segments >
1.

Verified on MI355X (gfx950), DeepSeek-V3.2 TP4, with the exact
reference YAML config (num_prompts=32, max_concurrency=4,
random_input_len=1000, random_output_len=100, num_warmups=4):

  - Server stable: bench (32/32, 0 failed), GSM8K via lm_eval
    (250 prompts at concurrency=4, 0 failed), repeated bench runs
    after GSM8K, all on the same server instance.
  - Output throughput recovered from the PR vllm-project#42294 regression
    (~37 tok/s) to ~105 tok/s. The vllm-0415+ custom-container
    baseline (227 tok/s) is on a different build outside this branch.
  - Direct sanity check: model produces correct GSM8K answers
    (e.g. Natalia 48 + 24 = "#### 72" exactly).

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@maeehart force-pushed the rocm/fp8-asm-prefill-sparse-mla-aiter-base branch from 78c33c1 to 5c473e1 on May 11, 2026 18:49
@maeehart marked this pull request as ready for review May 11, 2026 19:59
@maeehart requested a review from tjtanaa as a code owner May 11, 2026 19:59
@claude (Bot) left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@ChuanLi1101 (Collaborator) left a comment


Read it. Quick notes for tomorrow:

  • Gemini #2 (triton_convert signature) and #3 (block_size) are stale — already fixed in 6289a4733b / fbbaeb5a99. Just resolve those threads.
  • Gemini #1 (head padding) is real but inert for DSV3.2 (128 heads — even TP=16 → 8 heads/rank is the only crash case and nobody runs that today). Cheapest fix is to wrap the sparse forward_mqa with AiterMLAHelper.get_mla_padded_q / get_mla_unpadded_o — same idiom as the dense forward_mqa you inherit from. Alternatively assert num_heads >= 16 in ROCMAiterMLASparseImpl.__init__ if you'd rather keep "DSA-only" explicit.
  • CI red is the pre-run-check label gate, not a real lint failure. Needs a maintainer to add the ready label to start full CI. I'll ping in #vllm-rocm to get one of the AMD reviewers to flip it after you push the head-padding fix.

Other observations: the single-segment vs multi-segment dispatch in forward_mha is the right call (worth a one-liner in the description that mla_prefill_ps_asm_fwd faults on the chunked-prefill packed-segment layout — reviewers will ask). The _fp8_mla_prefill_supported() autodetection is clean. Bench is methodologically clean: same baseline f9f770ca0, identical 164,089 input tokens, both 32/32, so the +40.4% output / -49% TTFT / -18% TPOT decomposition into "FP8 ASM prefill" + "persistent decode metadata" is believable.

LGTM after the head-padding fix.

maeehart added a commit to maeehart/vllm that referenced this pull request May 12, 2026
Wraps `ROCMAiterMLASparseImpl.forward_mqa` and `_forward_sparse_mla`
with `AiterMLAHelper.get_mla_padded_q` / `get_mla_unpadded_o`, matching
the dense `AiterMLAImpl.forward_mqa` we inherit from.

The AITER MLA decode kernel requires `num_heads >= 16` (the head-tile
size).  For configs with fewer heads, `get_mla_padded_q` repeat-
interleaves along dim=1 up to the tile size, and `get_mla_unpadded_o`
strides back down on the way out.  For DSV3.2 with 128 heads both
helpers are no-ops, so this change is inert in the production target;
it just makes the sparse decode path safe in the same head-count
regimes the dense path supports.

`_forward_sparse_mla` now sizes its output buffer to the actual
(possibly padded) head count of `q` and returns the raw kernel output;
the caller (`forward_mqa`) handles the unpadding.  Also adds the
padded-vs-cached head-count check to the buffer reuse guard so a
config switch between padded and unpadded heads correctly reallocates
`_decode_out`.

Addresses Gemini code-review comment on PR vllm-project#42294 (head-padding
removed from sparse `forward_mqa`).

Co-authored-by: Cursor <cursoragent@cursor.com>
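
For context, the padding round-trip this commit describes looks roughly like the sketch below (helper names follow the commit text; the bodies are assumptions, not AITER's or vLLM's actual implementation):

```python
# Illustrative sketch of the head-padding idiom from the commit above.
import torch

MIN_HEADS = 16  # AITER MLA decode head-tile size per the commit message


def get_mla_padded_q(q: torch.Tensor) -> tuple[torch.Tensor, int]:
    """Repeat-interleave heads (dim=1) up to the tile size; no-op >= 16."""
    num_heads = q.shape[1]
    if num_heads >= MIN_HEADS:
        return q, num_heads
    repeats = -(-MIN_HEADS // num_heads)  # ceil(MIN_HEADS / num_heads)
    return q.repeat_interleave(repeats, dim=1), num_heads


def get_mla_unpadded_o(o: torch.Tensor, orig_num_heads: int) -> torch.Tensor:
    """Stride back down to the original head count on the way out."""
    if o.shape[1] == orig_num_heads:
        return o  # padding was a no-op (e.g. DSV3.2 with 128 heads)
    stride = o.shape[1] // orig_num_heads
    return o[:, ::stride, :]
```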
maeehart added a commit to maeehart/vllm that referenced this pull request May 12, 2026
Code-review feedback from @tjtanaa on PR vllm-project#42294, plus a small
defensive guard around the deepgemm dispatch:

- ops/rocm_aiter_mla_sparse.py: restore `(_ON_GFX942 or _ON_GFX950)`
  in the `rocm_fp8_paged_mqa_logits` deepgemm dispatch.  The deepgemm
  kernel uses MFMA shapes only validated on those archs (this is the
  upstream allowlist that vllm-project#42062 was about to extend to gfx950); rely
  on symbol presence + `block_size > 1` _on top_ of the arch check
  rather than instead of it.
- backends/mla/rocm_aiter_mla.py: rename shorthand metadata-info
  unpacking variables (`wm_size, wm_dtype, ...`) to their full
  `work_metadata_size, work_metadata_dtype, ...` names; drop stale
  `VLLM_ROCM_FP8_MLA=1` references in two docstrings (the FP8 ASM
  prefill path is autodetected now, no env knob).
- backends/mla/rocm_aiter_mla_sparse.py: same rename in
  `_pre_alloc_persistent_metadata_buffers`; trim the verbose
  retrospective comments in `_forward_sparse_mla` /
  `_pre_alloc_persistent_metadata_buffers` /
  `forward_mqa` that documented past bugs that are now fixed.
- ops/rocm_aiter_mla_sparse.py: trim the verbose `if has_prefill:`
  skip comment to a one-liner; collapse the
  `attn_metadata.max_seq_len` discussion into a short TODO.

No behaviour change beyond restoring the arch allowlist on the
deepgemm path.

Co-authored-by: Cursor <cursoragent@cursor.com>
@tjtanaa added the ready (ONLY add when PR is ready to merge / full CI is needed) label May 12, 2026
@mergify (Bot) commented May 12, 2026

Hi @maeehart, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@tjtanaa (Collaborator) commented May 12, 2026

@maeehart, given that you also updated dense MLA, please also evaluate whether DeepSeek-V3/R1 still works fine. Thanks.
Essentially, the flow is to revalidate the affected popular models. Those large-model tests are usually not run in CI due to their size and resource demands, so please provide your own local test results for them.

Please also address all the remaining CI failures, e.g. the pre-commit checks.

@maeehart (Contributor, Author) commented May 12, 2026

Closing this PR. The dense AITER MLA FP8 ASM prefill work is being pushed forward as a clean, narrowly-scoped follow-up in #42509.

Commit message:

Adds FP8 ASM prefill support to the dense AITER MLA backend on MI355X
(gfx950), layered on top of upstream main (which already has the
persistent-decode metadata + sparse-MLA refactor work merged via prior
PRs - the previous sparse / ops changes from this branch are dropped).

- _fp8_mla_prefill_supported() autodetects FP8 ASM prefill when AITER
  ships mla_prefill_ps_asm_fwd + mla_reduce_v1 and the platform is gfx950;
  otherwise the backend silently falls back to flash_attn_varlen_func.
- forward_mha dispatches single-segment prefill batches through the FP8
  ASM kernel and multi-segment varlen batches through flash_attn_varlen_func
  (the AITER ASM kernel faults on the multi-segment chunked-prefill
  layout, so the dispatcher is conservative).
- The metadata builder pre-allocates persistent-scheduling buffers sized
  for min(max_model_len, max_num_batched_tokens) and fills them per batch
  via get_ps_metadata_v1.

Verified on MI355X TP=4 with DeepSeek-V3.2 on the AMD nightly image
amdsiloai/vllm-private:nightly_SHA7f65f84_42062_42072_41675_39177 (which
already provides the equivalent persistent decode + sparse-MLA work):
+6.6% output throughput, -9.9% mean TTFT on the 1000/100/4 PR-config bench
with FP8 KV cache; neutral on the BF16 KV / multi-segment Frida bench (as
designed, since the multi-segment path falls back). GSM8K 5-shot 250-sample
accuracy: 92.0% / 90.8% strict-match for FP8 / BF16 KV cache respectively,
both within ~1.5 sigma of baseline.

Co-authored-by: clintg6 <clint.greene@amd.com>
Co-authored-by: frida-andersson <frida.andersson@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@maeehart force-pushed the rocm/fp8-asm-prefill-sparse-mla-aiter-base branch from b725e1d to de84014 on May 12, 2026 17:32
@maeehart changed the title from [ROCm][MLA] FP8 ASM prefill option, sparse MLA on Aiter base, indexer fixes to [ROCm][MLA] FP8 ASM prefill on gfx950 for AITER MLA backend May 12, 2026
Commit message:

The previous squash accidentally dropped upstream main's persistent
decode-metadata buffer init from AiterMLAMetadataBuilder.__init__
(self._num_attention_heads, self._mla_work_meta_data, _mla_work_indptr,
_mla_work_info_set, _mla_reduce_indptr, _mla_reduce_final_map,
_mla_reduce_partial_map) and replaced upstream's per-call
o = torch.empty(...) decode-output allocation with a cached
self._decode_out tensor.

Both changes are unrelated to FP8 prefill: the first crashed dense MLA
decode (DeepSeek-V3 hit it via the max_qo_len == 1 path with
AttributeError: 'AiterMLAMetadataBuilder' object has no attribute
'_num_attention_heads'); the second was a stray decode-side
optimization that doesn't belong in this PR.

Restore upstream main verbatim for both blocks. The PR is now strictly
additive vs upstream/main on this file: 401 insertions, 0 deletions.

Co-authored-by: Cursor <cursoragent@cursor.com>
@mergify (Bot) commented May 13, 2026 with the same pre-commit failure notice as above.

Commit message:

forward_mha had a leftover `if self._pad_v: output_prefill =
output_prefill[..., : v.shape[-1]]` block.  AiterMLAImpl never sets
self._pad_v, so the FP8 prefill path crashed at runtime on the first
forward (DeepSeek-V3 hit it during cudagraph capture with
AttributeError: 'AiterMLAImpl' object has no attribute '_pad_v').

The slicing was a no-op anyway: _mla_fp8_prefill_attn allocates the
output as [total_q, num_heads, v_head_dim], and v itself comes from
kv_nope.split([..., self.v_head_dim]), so v.shape[-1] == v_head_dim by
construction.  Drop the gate, replace with a comment explaining why
no trim is needed.

Co-authored-by: Cursor <cursoragent@cursor.com>
@mergify (Bot) commented May 13, 2026 with the same pre-commit failure notice as above.

Commit message:

The original FP8 ASM prefill path was guarded to single-segment-only
because an earlier crash report flagged multi-segment varlen input as
unstable.  Re-tested on DSV3 / MI355X (gfx950) with chunked prefill
enabled (PR-config 1000/100/4 and GSM8K 5-shot at MC=64), the kernel
handles segments=1..7 cleanly and packs up to 8185 tokens per forward
without faults.  Removing the guard so prefill engages under real
serving load.

Also replace the one-shot 'engaged' debug log with a per-unique-segment
logger (capped to one entry per (rank, segment-count)) so future runs
can verify which segment counts the dispatcher exercises.

Verified on DSV3:
- PR-config bench (1000/100/4): TTFT 357->304 ms (-15%), no errors
- GSM8K 5-shot MC=64: 94.5% (within 1 std err of varlen-only fallback)
- Segment distribution observed: 1, 2, 3, 4, 6, 7

Co-authored-by: Cursor <cursoragent@cursor.com>
@mergify (Bot) commented May 13, 2026 with the same pre-commit failure notice as above.

@maeehart (Contributor, Author):

Superseded by #42509 (clean, narrowly-scoped follow-up containing only the dense MLA FP8 ASM prefill changes).

@maeehart closed this May 13, 2026
@github-project-automation Bot moved this from Todo to Done in AMD May 13, 2026