Skip to content

Port DeepSeek-V4-Flash serving to MI300X#11

Closed
fergusfinn wants to merge 6 commits into
mainfrom
port/dsv4-mi300x-bringup
Closed

Port DeepSeek-V4-Flash serving to MI300X#11
fergusfinn wants to merge 6 commits into
mainfrom
port/dsv4-mi300x-bringup

Conversation

@fergusfinn
Copy link
Copy Markdown

@fergusfinn fergusfinn commented May 26, 2026

Summary

Adds the ROCm support needed to serve DeepSeek-V4-Flash on AMD MI300X.

The main changes cover MI300X FP8 handling, DeepSeek-V4 cache layout, ROCm sparse attention fallbacks, MXFP4 expert routing, and HIP-graph-safe decode metadata.

What Changed

  • Handles MI300X's FNUZ FP8 dialect in DeepSeek-V4 cache and compressor paths.
  • Adds ROCm fallbacks for paged MQA and sparse MLA attention paths that are not covered by tuned kernels on MI300X.
  • Fixes SWA K-cache layout and quantized cache insertion on ROCm.
  • Selects the correct expert map for the MXFP4 non-AITER backend under expert parallelism.
  • Masks padded MXFP4 bitmatrix lanes by logical block size.
  • Makes sparse MLA decode metadata safe to use under HIP graph replay.

Test Plan

  • Existing kernel coverage for DeepSeek-V4 ROCm FP8 cache/compressor paths.
  • End-to-end DeepSeek-V4-Flash serving on MI300X under HIP graph replay.
  • Expert-parallel routing correctness on the non-AITER MXFP4 backend.
  • Basic generated-output comparison against the reference bring-up path.

Bring up DeepSeek V4 on MI300X (gfx942) by routing attention through
ROCm-safe paths where the AITER / CUDA fast paths are missing or aren't
safe under HIP-graph capture.

* Provide ROCm fallbacks for paged MQA and sparse MLA prefill/decode.
* Make the paged MQA fallback HIP-graph capture safe by avoiding
  capture-time host->device scalar writes and dynamic allocations.
* Avoid the AITER prefill MQA logits path on gfx942; gate the AITER
  sparse prefill logits path behind a guard.
* Guard the ROCm sparse top-k fast paths so they only run when the
  caller's shape is supported.
* Shrink and bound the sparse-prefill workspace and logits fallback.
* Add correctness coverage for the cache-layout and triton-attn paths.
* Register the DSV4 custom ops with platform guards so import is safe
  on non-ROCm builds.

Squashed from 478a228, 4c3092f, 395d500, fd5672e, d514dae,
6cd3c61, bba75c7, 11acddb, 4d72865, baa13eb.
The DSV4 sliding-window attention K-cache write path does not have an
AITER fast path on ROCm; route it through a ROCm-specific fused
quantise-and-insert helper instead.

Adds `fused_qnorm_rope_quant_insert_k_cache` to the deepseek_v4_ops
cache utilities and uses it from the attention layer on ROCm, with the
CUDA path unchanged.

Squashed from 45ac601.
MI300X uses the `fnuz` FP8 dialect, while later AMD parts and CUDA-
oriented code assume the non-`fnuz` variant. The DSV4 compressor and
KV cache write path were not explicitly ROCm-aware, so cache writes,
scales, dequantisation and fallbacks could disagree on the value
format while all looking locally reasonable.

Make the compressor + fused compress/quant/cache path use
`current_platform.fp8_dtype()` so the format is consistent on
MI300X, and remove the `VLLM_ROCM_DSV4_OVERWRITE_SWA_CACHE_E4NV`
correctness workaround that the previous SWA K-cache PR introduced
as a temporary bridge.

Squashed from 4537832.
… the non-AITER backend

vLLM's non-AITER MXFP4 backend was reading an AITER-style expert mask
purely because ROCm AITER support was globally enabled. The mask is
correct for the AITER backend; for the non-AITER one it mis-routes
tokens under expert parallelism, so each individual matmul stays
locally correct while the model produces garbage.

Two changes:
* FusedMoE.expert_map: return the canonical _expert_map when the active
  MXFP4 backend is not AITER_MXFP4_BF16, regardless of whether ROCm
  AITER is globally enabled.
* convert_weight_to_mxfp4_moe_kernel_format: accept Mxfp4MoeBackend.EMULATION
  so the non-AITER backend can be selected on ROCm.

Split out of an earlier 'routing + direct W2 reduce' commit so the
correctness fix lands separately from the performance optimisation.
The MXFP4 MoE bitmatrix kernel pads its block columns to a convenient
Triton block size, but the padded lanes still need to be masked
against the *logical* block size, not just the global tensor bound.
Under high concurrency the padded lanes can carry stale bits, and
the resulting bitmatrix mis-routes tokens; observed as engine
corruption at serving-scale.

One-line fix: change the mask predicate to use the logical block size.

Not ROCm-specific; affects any deployment hitting the padded path.

Squashed from 1fd5f96.
The ROCm MLA sparse decode metadata path previously rebuilt ragged
allocations and issued host->device scalar writes at decode time.
These are not HIP-graph-safe: under capture, the writes are recorded
once and replayed verbatim, leading to silently wrong metadata for
subsequent decode steps.

Move the metadata into a static, capture-friendly layout:

* Pre-allocate the ragged buffers at warmup time.
* Replace per-step host->device writes with pre-populated tensors
  consumed by indexing.
* Keep the non-graph CUDA path on its existing dynamic codepath.

Allows the high-throughput DPA/EP serving shape to run with HIP
graphs enabled on MI300X.

Squashed from 590e25f.
@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

Copy link
Copy Markdown

@doubleword-code doubleword-code Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This PR (DSV4 on MI300X bring-up) introduces ROCm-specific sparse attention kernels, FP8 FNUZ quantization support, and MXFP4 MoE fixes for AMD MI300X GPUs. The changes span 18 files with ~2137 lines modified, adding:

  1. ROCm sparse MLA attention (rocm.py, rocm_aiter_mla_sparse.py): Triton kernels for prefill/decode with HIP-graph-safe ragged index building
  2. FP8 FNUZ conversion (w8a8_utils.py, cache_utils.py): Correct scale adjustment for E4M3FNUZ dialect (exponent bias 7→8)
  3. MXFP4 MoE bitmatrix fix (gpt_oss_triton_kernels_moe.py): Padded lane masking by logical block size
  4. Expert map property fix (layer.py): Non-AITER MXFP4 backends return _expert_map instead of AITER-style expert_mask

The code is generally well-structured with appropriate platform guards. However, I found one Blocking correctness issue in the expert_map property logic and several Non-blocking concerns around edge-case handling.

Verdict: Needs changes — address the Blocking finding before merging.


Research notes

  • FP8 formats: ONNX spec confirms E4M3FNUZ uses exponent bias 7 vs E4M3FN bias 8. The +1 uint8 clamped adjustment in w8a8_utils.py:128-133 is correct per ONNX float8 documentation.
  • Triton FP8 types: tl.float8e4b8 (FNUZ) and tl.float8e4nv (FN) are the correct type names for ROCm Triton; verified against kernel usage in rocm_aiter_mla_sparse.py:1350-1352.
  • HIP graph capture: The build_ragged_indices_from_dense_out function correctly avoids .item() host sync by using persistent buffers with pre-initialized indptr_out[0]=0 (line 1098-1099 in rocm_aiter_mla_sparse.py).

Suggested next steps

  1. [Blocking] Fix expert_map property in layer.py:1319-1326 — the condition logic is inverted for non-AITER MXFP4 backends
  2. [Non-blocking] Add boundary check in normalize_e4m3fn_to_e4m3fnuz for uint8=255 overflow case (already clamped, but document the behavior)
  3. [Non-blocking] Verify _env_flag import in rocm.py (uses os directly but _env_flag is defined locally — works, but consider centralizing)

General findings

Correctness

Expert map property regression risk (layer.py:1319-1326):
The new condition returns _expert_map for all non-AITER MXFP4 backends, but the original logic was expert_mask if rocm_aiter_fmoe_enabled else _expert_map. This changes behavior for EMULATION backend — verify this is intended.

Performance

Dynamic workspace allocation (rocm.py:798-812): The DSV4_DYNAMIC_PREFILL_KV_WORKSPACE path computes max_N via host-side .item() calls in a loop. This breaks graph capture if enabled accidentally. Consider asserting this flag is off during captured execution.

Testing

Test coverage looks solid (test_compressor_kv_cache.py uses current_platform.fp8_dtype()), but add a test for the expert_map property across all MXFP4 backend variants (AITER, EMULATION, future backends).

General findings (auto-demoted from inline due to pre-validation)

  • Non-blocking vllm/model_executor/layers/quantization/utils/w8a8_utils.py:130 — Scale overflow edge case handled correctly, but worth documenting.
    • (demoted: code self-check failed at vllm/model_executor/layers/quantization/utils/w8a8_utils.py:130: diff has weight_scale = (, model claimed torch.clamp(weight_scale_as_uint8.to(torch.int16) + 1, max=255))
  • Non-blocking vllm/models/deepseek_v4/amd/rocm.py:808 — Host-side .item() in dynamic workspace path breaks HIP graph capture if enabled.
    • (demoted: code self-check failed at vllm/models/deepseek_v4/amd/rocm.py:808: diff has chunk_gather = int(, model claimed chunk_N = int()
  • Nit vllm/v1/attention/ops/rocm_aiter_mla_sparse.py:1350 — Consistent FP8 type naming — consider adding comment for tl.float8e4b8 vs tl.float8e4nv.
    • (demoted: code self-check failed at vllm/v1/attention/ops/rocm_aiter_mla_sparse.py:1350: diff has encoded_scales = tl.load(, model claimed if IS_FNUZ:)

mxfp4_backend is not None
and getattr(mxfp4_backend, "name", "") != "AITER_MXFP4_BF16"
):
return self._expert_map
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: Expert map property logic appears inverted for non-AITER MXFP4 backends.

Why it matters: The condition at line 1321-1324 returns _expert_map when mxfp4_backend.name != "AITER_MXFP4_BF16", but the original intent (before this PR) was to return expert_mask only when rocm_aiter_fmoe_enabled. For the EMULATION backend (mxfp4.name == "EMULATION"), this now returns _expert_map instead of going through the rocm_aiter_fmoe_enabled check at line 1326. This could cause incorrect expert routing if EMULATION backend expects expert_mask semantics.

Trace: FusedMoE.expert_map is consumed by triton_kernel_moe_forward:375-390 which applies expert_map[topk_ids] remapping. If the wrong tensor is returned, tokens route to wrong physical experts.

Suggested fix: Clarify the intended behavior. If EMULATION should use _expert_map, add explicit comment. Otherwise, restore original logic:

if mxfp4_backend is not None and mxfp4_backend.name == "AITER_MXFP4_BF16":
    return self.expert_mask
return self._expert_map

Or keep current structure but verify EMULATION backend compatibility.

@fergusfinn fergusfinn changed the title DSV4 on MI300X: bring-up (port, 6 commits) Port DeepSeek-V4-Flash serving to MI300X Jun 1, 2026
@fergusfinn fergusfinn closed this Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant