
v0.20.1 Cherry pick PRs #41435

Closed
khluu wants to merge 13 commits into releases/v0.20.1 from codex/backport-v0.20.1-milestone-27

Conversation

@khluu
Collaborator

@khluu khluu commented May 1, 2026

Summary

Backport the current closed PR list from milestone 27 (v0.20.1) onto releases/v0.20.1 as one PR. I refreshed the milestone page on May 1, 2026; it has 14 closed PRs.

The commits in this PR are cherry-picked with -x in chronological merge-to-main order, earliest mergedAt first:

  1. Auto-disable expandable_segments around cumem memory pool (#40812)
  2. [Model][DSV4] Support base model (#41006)
  3. [DSV4] Enable Multi-stream for Pre-Attn GEMM (#41061)
  4. [Core] Account for num_gpu_blocks_override in max_model_len checks (#41069)
  5. [DSV4] Align aux stream API with DeepseekV4DecoderLayer (#41171)
  6. [Bugfix] Fix Deepseek V4 import error due to AOT compile cache loading (#41090)
  7. [Bugfix] fix inductor error for dpsk v4 (#41135)
  8. [Bugfix] Fix max_num_batched_token not captured in cuda graph (#40734)
  9. [Bugfix] DSV32/V4 add missing type conversion for non-streaming tool calls (#41198)
  10. [Bugfix] Fix repeated DSv4 RoPE cache initialization (#41148)
  11. [DSv4] Use cvt PTX for FP32->FP4 conversion (#41015)
  12. [Bugfix] Fix persistent_topk cooperative deadlock at TopK=1024 (#41189)
  13. [DSV4] Add BF16 and MXFP8 A2A support for flashinfer a2a one sided (#40960)
  14. Temporary disable persistent topk (#41442) - already present in current releases/v0.20.1

Because releases/v0.20.1 already contains #41442 (05ebca525009a4fe3dc89ff53de6469cb2ac0800), this PR is rebased on that tip and contributes the other 13 cherry-picks.
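The backport flow described above can be sketched as follows. The SHAs in the loop are placeholders, not the real merge-commit hashes from milestone 27:

```shell
# Sketch of the backport flow: branch off the release tip, then
# cherry-pick each merge commit with -x so the new commit message
# records "(cherry picked from commit <sha>)".
git fetch origin releases/v0.20.1
git checkout -b codex/backport-v0.20.1-milestone-27 origin/releases/v0.20.1

# Placeholder SHAs, ordered earliest mergedAt first.
for sha in 2ce95a7 2c8b76c 5aa371d; do
    git cherry-pick -x "$sha"
done
```

Cherry-picking with `-x` is what produces the "(cherry picked from commit ...)" trailers visible in the commit list below.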

Duplicate-work check

This does not duplicate an existing open PR. I checked open PRs with these searches:

  • v0.20.1 in:title,body
  • releases/v0.20.1 in:title,body
  • cherry pick v0.20.1 in:title,body
  • 40734 40812 40960 41006 41015 41061 41069 41090 41135 41148 41171 41189 41198 41442 in:body

After the 14-item refresh, the full PR-number search returns only this draft PR (#41435). There is no single linked issue for gh issue view; this backport is sourced from milestone 27.

Tests

  • git diff --check origin/releases/v0.20.1...HEAD - passed
  • git diff --name-only origin/releases/v0.20.1...HEAD | rg '\.py$' | xargs pre-commit run ruff-check --files - passed
  • git diff --name-only origin/releases/v0.20.1...HEAD | rg '\.py$' | xargs pre-commit run ruff-format --files - passed
  • .venv/bin/python -m pytest tests/tool_parsers/test_deepseekv32_tool_parser.py tests/v1/core/test_kv_cache_utils.py -q - did not run successfully locally; collection stopped in tests/conftest.py because the local venv lacks the full test/runtime dependency set (ModuleNotFoundError: No module named 'tblib').

AI assistance was used to prepare this backport PR; the submitting human should review every changed line and run the relevant release validation before marking ready.

@mergify
Contributor

mergify Bot commented May 1, 2026

Documentation preview: https://vllm--41435.org.readthedocs.build/en/41435/

@mergify mergify Bot added documentation Improvements or additions to documentation deepseek Related to DeepSeek models nvidia v1 labels May 1, 2026
@mergify mergify Bot added the tool-calling label May 1, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes DeepSeek-V4 by parallelizing input GEMMs across multiple CUDA streams, adding support for FP8 expert dtypes alongside FP4, and implementing MXFP4 quantization via PTX inline assembly. It also refactors KV cache sizing to respect block overrides and improves tool parsing with schema-based type conversion. Critical issues were identified in the attention layer where ReplicatedLinear outputs were not correctly unpacked, leading to potential AttributeError exceptions. Additionally, a NameError was found in the MoE preparation logic due to an undefined variable, and a quantization parameter was incorrectly hardcoded to False.


def kv_insert_and_compress() -> None:
def wq_b_kv_insert_and_compress() -> torch.Tensor:
q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)

critical

The self.wq_b(qr) call returns a tuple (output, bias) because wq_b is a ReplicatedLinear layer. Attempting to call .view() directly on the tuple will raise an AttributeError: 'tuple' object has no attribute 'view'. The output tensor must be unpacked first, as correctly done in other parts of this PR (e.g., line 1137).

Suggested change
- q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
+ q, _ = self.wq_b(qr)
+ q = q.view(-1, self.n_local_heads, self.head_dim)
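To illustrate the failure mode outside vLLM, here is a minimal self-contained sketch. FakeReplicatedLinear only mimics the (output, bias) tuple return convention of vLLM's linear layers; it is not the real class, and FakeTensor stands in for a real tensor:

```python
# Minimal stand-in for a vLLM linear layer: calling it returns a
# (output, bias) tuple rather than a bare tensor.
class FakeTensor:
    def view(self, *shape):
        # Pretend reshape; just report the requested shape.
        return shape


class FakeReplicatedLinear:
    def __call__(self, x):
        return FakeTensor(), None  # (output, bias)


wq_b = FakeReplicatedLinear()
qr = FakeTensor()

try:
    q = wq_b(qr).view(-1, 2, 4)  # buggy: a tuple has no .view
except AttributeError as e:
    print("buggy form fails:", e)

q, _ = wq_b(qr)                  # fixed: unpack the output first
q = q.view(-1, 2, 4)
print("fixed form reshapes to", q)
```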

),

def wq_b_kv_insert() -> torch.Tensor:
q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)

critical

The self.wq_b(qr) call returns a tuple (output, bias) because wq_b is a ReplicatedLinear layer. Attempting to call .view() directly on the tuple will raise an AttributeError. The output tensor must be unpacked first.

Suggested change
- q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
+ q, _ = self.wq_b(qr)
+ q = q.view(-1, self.n_local_heads, self.head_dim)

)
else:
# SWA-only layer: no compressor, no overlap.
q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)

critical

The self.wq_b(qr) call returns a tuple (output, bias) because wq_b is a ReplicatedLinear layer. Attempting to call .view() directly on the tuple will raise an AttributeError. The output tensor must be unpacked first.

Suggested change
- q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
+ q, _ = self.wq_b(qr)
+ q = q.view(-1, self.n_local_heads, self.head_dim)

quant_config.block_shape,
is_fp4_scale_swizzled=False, # delay swizzle to after comm
)
if defer_input_quant:

critical

The variable defer_input_quant is used here but it is not defined within the prepare method scope, nor is it passed as an argument in the method signature. This will result in a NameError at runtime. It should likely be added to the prepare method signature or passed via kwargs.
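As a standalone illustration of that failure (with simplified stand-in names, not the real prepare signature):

```python
# Simplified stand-in for the flagged bug: the body references
# defer_input_quant, but the name is neither a parameter nor defined
# locally, so calling the function raises NameError.
def prepare_buggy(tensors):
    if defer_input_quant:  # NameError at call time
        return tensors
    return tensors


def prepare_fixed(tensors, defer_input_quant=False):
    # Fix: thread the flag through the signature.
    if defer_input_quant:
        return tensors            # defer: hand back unquantized input
    return [t * 2 for t in tensors]  # stand-in for quantization


try:
    prepare_buggy([1, 2])
except NameError as e:
    print("buggy form fails:", e)

print(prepare_fixed([1, 2]))
print(prepare_fixed([1, 2], defer_input_quant=True))
```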

per_act_token_quant,
block_shape,
is_sf_swizzled_layout=is_fp4_scale_swizzled,
is_sf_swizzled_layout=False,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The is_sf_swizzled_layout parameter is hardcoded to False here, which effectively ignores the is_fp4_scale_swizzled argument passed to the moe_kernel_quantize_input function. This change might break other callers that rely on the swizzled layout for nvfp4 quantization. It should respect the function argument.

Suggested change
- is_sf_swizzled_layout=False,
+ is_sf_swizzled_layout=is_fp4_scale_swizzled,

@khluu khluu changed the title [codex] Backport v0.20.1 milestone PRs v0.20.1 Cherry pick PRs May 1, 2026
@khluu khluu force-pushed the codex/backport-v0.20.1-milestone-27 branch 2 times, most recently from 3af9d87 to 2b510f5 Compare May 1, 2026 04:35
youkaichao and others added 13 commits April 30, 2026 21:37
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
(cherry picked from commit 2ce95a7)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
(cherry picked from commit 2c8b76c)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
(cherry picked from commit 5aa371d)
#41069)

Signed-off-by: Nick Hill <nickhill123@gmail.com>
(cherry picked from commit e68fa1b)
Signed-off-by: zixi-qi <zixi@inferact.ai>
(cherry picked from commit 6fb3f7b)
#41090)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
(cherry picked from commit 803b9d7)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
(cherry picked from commit 2ae73c7)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: Wei Zhao (Engrg-Hardware 1) <weizha@login-bia02.bia.clusters.nvidia.com>
(cherry picked from commit 8b49cf3)
…calls (#41198)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
(cherry picked from commit 762022c)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
(cherry picked from commit 9d8ad5b)
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
(cherry picked from commit 296741d)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit a749a33)
…40960)

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
Co-authored-by: Zijing Liu <liuzijing2014@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit b4806c8)
@khluu khluu force-pushed the codex/backport-v0.20.1-milestone-27 branch from 2b510f5 to bf04f1f Compare May 1, 2026 04:38
@khluu
Collaborator Author

khluu commented May 1, 2026

Closed because the requested milestone cherry-picks were pushed directly to releases/v0.20.1 in chronological merge order.

@khluu khluu closed this May 1, 2026
@khluu khluu deleted the codex/backport-v0.20.1-milestone-27 branch May 1, 2026 04:39
@github-project-automation github-project-automation Bot moved this to Done in NVIDIA May 1, 2026

Labels

deepseek Related to DeepSeek models documentation Improvements or additions to documentation nvidia tool-calling v1

Projects

Status: Done


10 participants