
v0.20.1 Cherry pick PRs #41435

Closed
khluu wants to merge 13 commits into releases/v0.20.1 from codex/backport-v0.20.1-milestone-27

Conversation

@khluu
Collaborator

@khluu khluu commented May 1, 2026

Summary

Backport the current closed PR list from milestone 27 (v0.20.1) onto releases/v0.20.1 as one PR. I refreshed the milestone page on May 1, 2026; it has 14 closed PRs.

The commits in this PR are cherry-picked with -x in chronological merge-to-main order, earliest mergedAt first:

  1. Auto-disable expandable_segments around cumem memory pool (#40812)
  2. [Model][DSV4] Support base model (#41006)
  3. [DSV4] Enable Multi-stream for Pre-Attn GEMM (#41061)
  4. [Core] Account for num_gpu_blocks_override in max_model_len checks (#41069)
  5. [DSV4] Align aux stream API with DeepseekV4DecoderLayer (#41171)
  6. [Bugfix] Fix Deepseek V4 import error due to AOT compile cache loading (#41090)
  7. [Bugfix] fix inductor error for dpsk v4 (#41135)
  8. [Bugfix] Fix max_num_batched_token not captured in cuda graph (#40734)
  9. [Bugfix] DSV32/V4 add missing type conversion for non-streaming tool calls (#41198)
  10. [Bugfix] Fix repeated DSv4 RoPE cache initialization (#41148)
  11. [DSv4] Use cvt PTX for FP32->FP4 conversion (#41015)
  12. [Bugfix] Fix persistent_topk cooperative deadlock at TopK=1024 (#41189)
  13. [DSV4] Add BF16 and MXFP8 A2A support for flashinfer a2a one sided (#40960)
  14. Temporary disable persistent topk (#41442) - already present in current releases/v0.20.1

Because releases/v0.20.1 already contains #41442 (05ebca525009a4fe3dc89ff53de6469cb2ac0800), this PR is rebased on that tip and contributes the other 13 cherry-picks.
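The backport flow described above can be sketched as follows. The SHAs in the loop are placeholders, not the real merge-commit hashes from milestone 27:

```shell
# Sketch of the backport flow: branch off the release tip, then
# cherry-pick each merge commit with -x so the new commit message
# records "(cherry picked from commit <sha>)".
git fetch origin releases/v0.20.1
git checkout -b codex/backport-v0.20.1-milestone-27 origin/releases/v0.20.1

# Placeholder SHAs, ordered earliest mergedAt first.
for sha in 2ce95a7 2c8b76c 5aa371d; do
    git cherry-pick -x "$sha"
done
```

Cherry-picking with `-x` is what produces the "(cherry picked from commit ...)" trailers visible in the commit list below.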

Duplicate-work check

This does not duplicate an existing open PR. I checked open PRs with these searches:

  • v0.20.1 in:title,body
  • releases/v0.20.1 in:title,body
  • cherry pick v0.20.1 in:title,body
  • 40734 40812 40960 41006 41015 41061 41069 41090 41135 41148 41171 41189 41198 41442 in:body

After the 14-item refresh, the full PR-number search returns only this draft PR (#41435). There is no single linked issue for gh issue view; this backport is sourced from milestone 27.

Tests

  • git diff --check origin/releases/v0.20.1...HEAD - passed
  • git diff --name-only origin/releases/v0.20.1...HEAD | rg '\.py$' | xargs pre-commit run ruff-check --files - passed
  • git diff --name-only origin/releases/v0.20.1...HEAD | rg '\.py$' | xargs pre-commit run ruff-format --files - passed
  • .venv/bin/python -m pytest tests/tool_parsers/test_deepseekv32_tool_parser.py tests/v1/core/test_kv_cache_utils.py -q - did not run successfully locally; collection stopped in tests/conftest.py because the local venv lacks the full test/runtime dependency set (ModuleNotFoundError: No module named 'tblib').

AI assistance was used to prepare this backport PR; the submitting human should review every changed line and run the relevant release validation before marking ready.

@mergify
Contributor

mergify Bot commented May 1, 2026

Documentation preview: https://vllm--41435.org.readthedocs.build/en/41435/

@mergify mergify Bot added documentation Improvements or additions to documentation deepseek Related to DeepSeek models nvidia v1 labels May 1, 2026
@mergify mergify Bot added the tool-calling label May 1, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes DeepSeek-V4 by parallelizing input GEMMs across multiple CUDA streams, adding support for FP8 expert dtypes alongside FP4, and implementing MXFP4 quantization via PTX inline assembly. It also refactors KV cache sizing to respect block overrides and improves tool parsing with schema-based type conversion. Critical issues were identified in the attention layer where ReplicatedLinear outputs were not correctly unpacked, leading to potential AttributeError exceptions. Additionally, a NameError was found in the MoE preparation logic due to an undefined variable, and a quantization parameter was incorrectly hardcoded to False.


def kv_insert_and_compress() -> None:
def wq_b_kv_insert_and_compress() -> torch.Tensor:
q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)

critical

The self.wq_b(qr) call returns a tuple (output, bias) because wq_b is a ReplicatedLinear layer. Attempting to call .view() directly on the tuple will raise an AttributeError: 'tuple' object has no attribute 'view'. The output tensor must be unpacked first, as correctly done in other parts of this PR (e.g., line 1137).

Suggested change
- q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
+ q, _ = self.wq_b(qr)
+ q = q.view(-1, self.n_local_heads, self.head_dim)
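To illustrate the failure mode outside vLLM, here is a minimal self-contained sketch. FakeReplicatedLinear only mimics the (output, bias) tuple return convention of vLLM's linear layers; it is not the real class, and FakeTensor stands in for a real tensor:

```python
# Minimal stand-in for a vLLM linear layer: calling it returns a
# (output, bias) tuple rather than a bare tensor.
class FakeTensor:
    def view(self, *shape):
        # Pretend reshape; just report the requested shape.
        return shape


class FakeReplicatedLinear:
    def __call__(self, x):
        return FakeTensor(), None  # (output, bias)


wq_b = FakeReplicatedLinear()
qr = FakeTensor()

try:
    q = wq_b(qr).view(-1, 2, 4)  # buggy: a tuple has no .view
except AttributeError as e:
    print("buggy form fails:", e)

q, _ = wq_b(qr)                  # fixed: unpack the output first
q = q.view(-1, 2, 4)
print("fixed form reshapes to", q)
```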

),

def wq_b_kv_insert() -> torch.Tensor:
q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)

critical

The self.wq_b(qr) call returns a tuple (output, bias) because wq_b is a ReplicatedLinear layer. Attempting to call .view() directly on the tuple will raise an AttributeError. The output tensor must be unpacked first.

Suggested change
- q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
+ q, _ = self.wq_b(qr)
+ q = q.view(-1, self.n_local_heads, self.head_dim)

)
else:
# SWA-only layer: no compressor, no overlap.
q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)

critical

The self.wq_b(qr) call returns a tuple (output, bias) because wq_b is a ReplicatedLinear layer. Attempting to call .view() directly on the tuple will raise an AttributeError. The output tensor must be unpacked first.

Suggested change
- q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
+ q, _ = self.wq_b(qr)
+ q = q.view(-1, self.n_local_heads, self.head_dim)

quant_config.block_shape,
is_fp4_scale_swizzled=False, # delay swizzle to after comm
)
if defer_input_quant:

critical

The variable defer_input_quant is used here but it is not defined within the prepare method scope, nor is it passed as an argument in the method signature. This will result in a NameError at runtime. It should likely be added to the prepare method signature or passed via kwargs.
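As a standalone illustration of that failure (with simplified stand-in names, not the real prepare signature):

```python
# Simplified stand-in for the flagged bug: the body references
# defer_input_quant, but the name is neither a parameter nor defined
# locally, so calling the function raises NameError.
def prepare_buggy(tensors):
    if defer_input_quant:  # NameError at call time
        return tensors
    return tensors


def prepare_fixed(tensors, defer_input_quant=False):
    # Fix: thread the flag through the signature.
    if defer_input_quant:
        return tensors            # defer: hand back unquantized input
    return [t * 2 for t in tensors]  # stand-in for quantization


try:
    prepare_buggy([1, 2])
except NameError as e:
    print("buggy form fails:", e)

print(prepare_fixed([1, 2]))
print(prepare_fixed([1, 2], defer_input_quant=True))
```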

per_act_token_quant,
block_shape,
is_sf_swizzled_layout=is_fp4_scale_swizzled,
is_sf_swizzled_layout=False,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The is_sf_swizzled_layout parameter is hardcoded to False here, which effectively ignores the is_fp4_scale_swizzled argument passed to the moe_kernel_quantize_input function. This change might break other callers that rely on the swizzled layout for nvfp4 quantization. It should respect the function argument.

Suggested change
- is_sf_swizzled_layout=False,
+ is_sf_swizzled_layout=is_fp4_scale_swizzled,

@khluu khluu changed the title [codex] Backport v0.20.1 milestone PRs v0.20.1 Cherry pick PRs May 1, 2026
@khluu khluu force-pushed the codex/backport-v0.20.1-milestone-27 branch 2 times, most recently from 3af9d87 to 2b510f5 Compare May 1, 2026 04:35
youkaichao and others added 13 commits April 30, 2026 21:37
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
(cherry picked from commit 2ce95a7)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
(cherry picked from commit 2c8b76c)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
(cherry picked from commit 5aa371d)
#41069)

Signed-off-by: Nick Hill <nickhill123@gmail.com>
(cherry picked from commit e68fa1b)
Signed-off-by: zixi-qi <zixi@inferact.ai>
(cherry picked from commit 6fb3f7b)
#41090)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
(cherry picked from commit 803b9d7)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
(cherry picked from commit 2ae73c7)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: Wei Zhao (Engrg-Hardware 1) <weizha@login-bia02.bia.clusters.nvidia.com>
(cherry picked from commit 8b49cf3)
…calls (#41198)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
(cherry picked from commit 762022c)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
(cherry picked from commit 9d8ad5b)
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
(cherry picked from commit 296741d)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit a749a33)
…40960)

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
Co-authored-by: Zijing Liu <liuzijing2014@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit b4806c8)
@khluu khluu force-pushed the codex/backport-v0.20.1-milestone-27 branch from 2b510f5 to bf04f1f Compare May 1, 2026 04:38
@khluu
Collaborator Author

khluu commented May 1, 2026

Closed because the requested milestone cherry-picks were pushed directly to releases/v0.20.1 in chronological merge order.

@khluu khluu closed this May 1, 2026
@khluu khluu deleted the codex/backport-v0.20.1-milestone-27 branch May 1, 2026 04:39
@github-project-automation github-project-automation Bot moved this to Done in NVIDIA May 1, 2026

Labels

deepseek Related to DeepSeek models documentation Improvements or additions to documentation nvidia tool-calling v1

Projects

Status: Done


10 participants