Merge upstream and validate #4

Merged
lcskrishna merged 89 commits into deepseekv4-rocm from dsv4-rocm-update-0512
May 12, 2026
@lcskrishna lcskrishna commented May 12, 2026

Owner

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

dzhengAP and others added 30 commits May 9, 2026 11:37
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
… dist-info registration for XPU/ROCm (vllm-project#42040)

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
Signed-off-by: David Zheng <153074367+dzhengAP@users.noreply.github.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Andreas Karatzas <akaratza@amd.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
…parameter (vllm-project#42061)

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Signed-off-by: Terrencezzj <terrence@cohere.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ol parser (vllm-project#42026)

Signed-off-by: Rishapveer Singh <singhrishapveer@gmail.com>
Signed-off-by: Richard Barnes <rbarnes@meta.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
… operand layout with WGMMA (vllm-project#42076)

Signed-off-by: kermit <ckeming@outlook.com>
…ng CUDA graph capture failure (vllm-project#42070)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: ZhanqiuHu <zhu@redhat.com>
…r issue (vllm-project#40708)

Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
…ivations) support (vllm-project#41769)

Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
…atures without `KVCacheConfig` (vllm-project#39832)

The v0.12.0 release contained initial support for HMA in KV Connectors. As part
of these changes, a KVCacheConfig argument was added to KV connector
constructors. Backwards compatibility support for out-of-tree connectors was
included in this change, with a very prominent warning. See vllm-project#25712 and vllm-project#27887.

Since the warning has been in place for over 5 months, the
backwards-compatibility support can now be safely removed.
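
For out-of-tree connector authors, the practical consequence is that the constructor has to accept the KV cache config. A minimal sketch of that contract follows. Every class, function, and parameter name in it is a hypothetical stand-in, not vLLM's actual API; it only illustrates a loader that rejects legacy constructors outright instead of warning and falling back.

```python
import inspect


class LegacyConnector:
    """Hypothetical out-of-tree connector with the old two-argument constructor."""

    def __init__(self, vllm_config, role):
        self.vllm_config = vllm_config
        self.role = role


class UpdatedConnector:
    """Hypothetical connector that accepts the KV cache config."""

    def __init__(self, vllm_config, role, kv_cache_config):
        self.vllm_config = vllm_config
        self.role = role
        self.kv_cache_config = kv_cache_config


def build_connector(cls, vllm_config, role, kv_cache_config):
    """Instantiate a connector, refusing constructors without kv_cache_config.

    Assumption: the check is done by inspecting the constructor signature.
    The real loader may detect legacy connectors differently.
    """
    params = inspect.signature(cls.__init__).parameters
    if "kv_cache_config" not in params:
        raise TypeError(
            f"{cls.__name__} must accept a kv_cache_config constructor argument"
        )
    return cls(vllm_config, role, kv_cache_config)


connector = build_connector(UpdatedConnector, {"model": "demo"}, "worker", {"groups": 1})
```

Under this sketch, passing `LegacyConnector` to `build_connector` raises `TypeError`, which is the behaviour a connector author would need to fix by adding the extra parameter.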

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
…llm-project#41846)

Signed-off-by: Nave Assaf <nassaf@nvidia.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
vllm-project#33322)

Signed-off-by: Xingran Wang <wangxingran123456@outlook.com>
Signed-off-by: Hongjian Zhang <hirokenovo@gmail.com>
Co-authored-by: Hongjian Zhang <hirokenovo@gmail.com>
…0951)

Signed-off-by: Christian Van <cvan20191@gmail.com>
Co-authored-by: Christian Van <cvan20191@gmail.com>
…ject#39306)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Signed-off-by: Itay Etelis <etelis2019@gmail.com>
Signed-off-by: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Itay Etelis <etelis2019@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
)

Signed-off-by: AbhiOnGithub <abhiOnGithub@users.noreply.github.com>
Co-authored-by: AbhiOnGithub <abhiOnGithub@users.noreply.github.com>
chaunceyjiang and others added 29 commits May 11, 2026 11:59
…2272)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
…usable-buffer loaders like runai distributed streaming (vllm-project#42244)

Signed-off-by: Noa Neria <nneria@nvidia.com>
…ect#41928)

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
…t#40392)

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
…dingManager` (vllm-project#41727)

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…modelopt_quant_algo` (vllm-project#42181)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
…42081)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
…mprovement (vllm-project#40408)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Signed-off-by: Vinay R Damodaran <vrdn@hey.com>
Co-authored-by: Russell Bryant <russell.bryant@gmail.com>
…lm-project#41987)

Signed-off-by: alexagriffith <agriffith96@gmail.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…llm-project#42201)

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensen <vensenmu@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ject#41825)

Signed-off-by: Frida Andersson <fanderss@amd.com>
Signed-off-by: Chuan Li <chuali@amd.com>
Co-authored-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Chuan Li <chuali@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Frida Andersson <frida-andersson@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
…ect#41761)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
…uired tool_choice (vllm-project#42292)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
…42217)

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Resolved two conflicts in vllm/model_executor/layers/deepseek_v4_attention.py:

  * Decode path: dropped the
    ``VLLM_ROCM_USE_V4_TRITON_FALLBACK``-gated ``rocm_forward_decode_fallback``
    branch — upstream unified the call to ``flash_mla_with_kvcache`` for
    both CUDA and ROCm. The ROCm path is already routed to our
    ``flash_mla_with_kvcache_rocm`` Triton kernel via
    ``vllm.v1.attention.ops.flashmla`` (which already accepts the new
    ``is_fp8_kvcache``/``extra_k_cache``/``extra_indices_in_kvcache``
    kwargs).

  * Prefill path: dropped the env-gated branch around
    ``flash_mla_sparse_fwd`` and adopted upstream's signature (no longer
    returns a 3-tuple). Our ``flash_mla_sparse_fwd_rocm`` writes via
    ``out=`` so the return value is harmless to ignore.
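
A small sketch of why the return value can be ignored once the call site is unified: when the backend writes through ``out=``, the caller reads the result from the buffer it passed in, so it does not matter what the backend returns. All function names and the doubling arithmetic below are hypothetical placeholders, not the real kernels or their signatures.

```python
def _sparse_fwd_cuda(q, out):
    # Stand-in for a backend that fills `out` and also returns it.
    for i, x in enumerate(q):
        out[i] = 2.0 * x
    return out


def _sparse_fwd_rocm(q, out):
    # Stand-in for a backend that fills `out` in place and returns nothing.
    for i, x in enumerate(q):
        out[i] = 2.0 * x


def sparse_fwd(q, out, *, is_rocm):
    """Single call site; the platform choice happens inside the wrapper."""
    impl = _sparse_fwd_rocm if is_rocm else _sparse_fwd_cuda
    return impl(q, out=out)


q = [1.0, 2.0, 3.0]
out_rocm = [0.0] * len(q)
out_cuda = [0.0] * len(q)

# The caller discards the return value and reads the preallocated buffer.
sparse_fwd(q, out_rocm, is_rocm=True)
sparse_fwd(q, out_cuda, is_rocm=False)
```

Both buffers end up with the same contents, so the caller needs no platform-specific handling of the return value.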

Post-merge cleanup:

  * vllm/platforms/rocm.py: removed our duplicate "deepseek_v4_fp8"
    entry — upstream now adds it as the first member of
    ``supported_quantization``.

  * vllm/envs.py: trimmed the ``VLLM_ROCM_USE_V4_TRITON_FALLBACK``
    docstring from four call sites down to two (SWA K-cache writer and
    sparse indexer). The MLA decode / sparse-prefill paths are now
    permanently routed through the ROCm Triton fallbacks via flashmla.py
    on ROCm — no env-var toggle needed there any more.
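
The remaining env-var gate can be pictured roughly as below. Only the variable name comes from this commit. The accepted values, the default, and the helper and backend names are assumptions made for illustration.

```python
import os

ENV_NAME = "VLLM_ROCM_USE_V4_TRITON_FALLBACK"


def fallback_enabled(environ=os.environ):
    """Return True when the fallback flag is switched on.

    Assumption: "1" or "true" enables the flag and an unset variable
    means disabled. The real parsing in vllm/envs.py may differ.
    """
    return environ.get(ENV_NAME, "0").strip().lower() in {"1", "true"}


def pick_swa_kcache_writer(environ=os.environ):
    """Choose between two hypothetical writer names for the SWA K-cache."""
    if fallback_enabled(environ):
        return "python_reference_writer"
    return "fused_cxx_kernel"


choice_on = pick_swa_kcache_writer({ENV_NAME: "1"})
choice_off = pick_swa_kcache_writer({})
```

Passing the environment as a parameter keeps the gate easy to test without mutating the process environment.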

Kept (still required after the merge):

  * vllm/model_executor/layers/sparse_attn_indexer.py — dispatch to
    ``rocm_sparse_attn_indexer_no_insert`` when
    ``skip_k_cache_insert`` is set, AITER is disabled, and the env var is on.
  * vllm/v1/attention/ops/rocm_sparse_attn_indexer.py (recovered
    pre-rebase orchestration).
  * vllm/v1/attention/ops/rocm_flash_mla_sparse.py +
    flashmla.py ROCm dispatch.
  * vllm/model_executor/models/deepseek_v4.py:
    ``_resolve_deepseek_v4_expert_dtype`` — still required because
    upstream's new cached property only honours an explicit
    ``hf_config.expert_dtype`` and otherwise defaults to ``"fp4"``,
    misrouting FP8 checkpoints that ship without the field.
  * The Python SWA K-cache writer reference + env-gate around the
    HIPified ``fused_deepseek_v4_qnorm_rope_kv_rope_quant_insert``
    C++ kernel (still buggy on MI300X / FNUZ).
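
To make the expert-dtype item above concrete, here is a rough sketch of a resolver that prefers an explicit ``expert_dtype`` and only falls back to ``"fp4"`` when nothing indicates FP8. The function name, the ``quant_method`` parameter, and the substring check are assumptions for illustration; the actual ``_resolve_deepseek_v4_expert_dtype`` may infer the dtype in another way.

```python
from types import SimpleNamespace


def resolve_expert_dtype(hf_config, quant_method=None, default="fp4"):
    """Pick the expert dtype for a checkpoint.

    1. An explicit hf_config.expert_dtype always wins.
    2. Assumption: without that field, an FP8 checkpoint can be recognised
       from its quantization method name.
    3. Otherwise fall back to the default.
    """
    explicit = getattr(hf_config, "expert_dtype", None)
    if explicit is not None:
        return explicit
    if quant_method is not None and "fp8" in quant_method.lower():
        return "fp8"
    return default


with_field = resolve_expert_dtype(SimpleNamespace(expert_dtype="fp8"))
without_field_fp8 = resolve_expert_dtype(SimpleNamespace(), quant_method="deepseek_v4_fp8")
without_field_plain = resolve_expert_dtype(SimpleNamespace())
```

The middle case is the one the kept helper exists for: a checkpoint without the field that would otherwise be treated as FP4.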

Backup tag: pre-upstream-merge-0512.

Co-authored-by: Cursor <cursoragent@cursor.com>
Upstream-added ``mhc_fused_post_pre`` calls three tilelang kernels
(``mhc_fused_tilelang``, ``mhc_post_tilelang``, ``mhc_pre_big_fuse_tilelang``)
that all use Programmatic Dependent Launch (PDL, an NVIDIA feature available
from Hopper onward). On ROCm, tilelang's ``MarkCudaSyncCalls`` raises
``PDL is not supported`` at JIT-compile time, taking down every TP worker
during profile_run:

  [TileLang:...]: TileLang begins to compile kernel `mhc_post_tilelang`
  tvm.error.InternalError: Check failed: ... PDL is not supported

The non-fused ``mhc_pre`` and ``mhc_post`` already carry torch ROCm
fallbacks; this commit composes them to back the fused op on ROCm,
matching the contract (4-tuple of residual_cur / post_mix_cur /
comb_mix_cur / layer_input_cur with the exact same shapes and dtypes
as the tilelang path). The CUDA path is untouched.
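
The shape of such a composed fallback might look like the sketch below. The argument lists, the placeholder arithmetic, and the assumption that the output of the post step feeds the pre step are all illustrative; only the op name and the four output names come from this commit.

```python
def mhc_post_ref(hidden, residual):
    # Placeholder for the non-fused post step (real math not shown here).
    return [h + r for h, r in zip(hidden, residual)]


def mhc_pre_ref(residual_cur):
    # Placeholder for the non-fused pre step; returns three outputs.
    post_mix_cur = [0.5 * r for r in residual_cur]
    comb_mix_cur = [0.25 * r for r in residual_cur]
    layer_input_cur = [r - 1.0 for r in residual_cur]
    return post_mix_cur, comb_mix_cur, layer_input_cur


def mhc_fused_post_pre_sketch(hidden, residual, *, is_rocm, fused_kernel=None):
    """Return the same 4-tuple on both platforms.

    On the non-ROCm path the (hypothetical) fused kernel is called as-is.
    On ROCm the two non-fused steps are chained to produce the same outputs.
    """
    if not is_rocm:
        return fused_kernel(hidden, residual)
    residual_cur = mhc_post_ref(hidden, residual)
    post_mix_cur, comb_mix_cur, layer_input_cur = mhc_pre_ref(residual_cur)
    return residual_cur, post_mix_cur, comb_mix_cur, layer_input_cur


result = mhc_fused_post_pre_sketch([1.0, 2.0], [3.0, 4.0], is_rocm=True)
```

Keeping the return order identical on both branches is what lets the layer forward path stay platform-agnostic.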

This unblocks DSv4-Flash-Base-FP8 profile_run on MI300X after the
upstream merge that wired the fused op into the layer forward path.

Co-authored-by: Cursor <cursoragent@cursor.com>
@lcskrishna lcskrishna merged commit e1f778d into deepseekv4-rocm May 12, 2026
3 of 4 checks passed