v0.20.1 Cherry pick PRs #41435
Conversation

Documentation preview: https://vllm--41435.org.readthedocs.build/en/41435/
Code Review
This pull request optimizes DeepSeek-V4 by parallelizing input GEMMs across multiple CUDA streams, adding support for FP8 expert dtypes alongside FP4, and implementing MXFP4 quantization via PTX inline assembly. It also refactors KV cache sizing to respect block overrides and improves tool parsing with schema-based type conversion. Critical issues were identified in the attention layer where ReplicatedLinear outputs were not correctly unpacked, leading to potential AttributeError exceptions. Additionally, a NameError was found in the MoE preparation logic due to an undefined variable, and a quantization parameter was incorrectly hardcoded to False.
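For background on the FP4 side of this change: FP4 here is the E2M1 format (1 sign, 2 exponent, 1 mantissa bit), whose non-negative representable values are {0, 0.5, 1, 1.5, 2, 3, 4, 6}; MXFP4 groups 32 such values under a shared power-of-two scale. A minimal Python sketch of nearest-value E2M1 rounding, as an illustration only — the PR itself does this conversion in PTX inline assembly with a `cvt` instruction, and hardware rounding modes may differ in tie-breaking:

```python
# Non-negative magnitudes representable in E2M1 (FP4).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_fp4(value: float) -> float:
    """Round a pre-scaled float to the nearest E2M1 value, preserving sign."""
    sign = -1.0 if value < 0 else 1.0
    magnitude = min(abs(value), 6.0)  # saturate at the largest FP4 magnitude
    return sign * min(FP4_GRID, key=lambda g: abs(g - magnitude))
```

In MXFP4 proper, each block of 32 elements would first be divided by a shared scale (so values land in the E2M1 range) before this per-element rounding.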
```python
def kv_insert_and_compress() -> None:
def wq_b_kv_insert_and_compress() -> torch.Tensor:
    q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
```
The self.wq_b(qr) call returns a tuple (output, bias) because wq_b is a ReplicatedLinear layer. Attempting to call .view() directly on the tuple will raise an AttributeError: 'tuple' object has no attribute 'view'. The output tensor must be unpacked first, as correctly done in other parts of this PR (e.g., line 1137).
Suggested change:

```diff
-q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
+q, _ = self.wq_b(qr)
+q = q.view(-1, self.n_local_heads, self.head_dim)
```
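The failure mode is easy to reproduce with a toy module whose forward returns an `(output, bias)` tuple the way vLLM's `ReplicatedLinear` does; `ToyReplicatedLinear` below is a hypothetical stand-in for illustration, not the real layer:

```python
import torch
import torch.nn as nn


class ToyReplicatedLinear(nn.Module):
    """Stand-in for vLLM's ReplicatedLinear: forward returns (output, bias)."""

    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x: torch.Tensor):
        return self.linear(x), None  # (output, output_bias) tuple


layer = ToyReplicatedLinear(8, 16)
qr = torch.randn(4, 8)

# Buggy pattern from the quoted snippet: .view() called on the tuple.
try:
    q = layer(qr).view(-1, 2, 8)
except AttributeError as exc:
    print(exc)  # 'tuple' object has no attribute 'view'

# Suggested fix: unpack the tuple first, then reshape the output tensor.
q, _ = layer(qr)
q = q.view(-1, 2, 8)
```

The same unpack-then-reshape fix applies to all three call sites flagged in this review.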
```python
def wq_b_kv_insert() -> torch.Tensor:
    q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
```
The self.wq_b(qr) call returns a tuple (output, bias) because wq_b is a ReplicatedLinear layer. Attempting to call .view() directly on the tuple will raise an AttributeError. The output tensor must be unpacked first.
Suggested change:

```diff
-q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
+q, _ = self.wq_b(qr)
+q = q.view(-1, self.n_local_heads, self.head_dim)
```
```python
    )
else:
    # SWA-only layer: no compressor, no overlap.
    q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
```
The self.wq_b(qr) call returns a tuple (output, bias) because wq_b is a ReplicatedLinear layer. Attempting to call .view() directly on the tuple will raise an AttributeError. The output tensor must be unpacked first.
Suggested change:

```diff
-q = self.wq_b(qr).view(-1, self.n_local_heads, self.head_dim)
+q, _ = self.wq_b(qr)
+q = q.view(-1, self.n_local_heads, self.head_dim)
```
```python
    quant_config.block_shape,
    is_fp4_scale_swizzled=False,  # delay swizzle to after comm
)
if defer_input_quant:
```
```diff
     per_act_token_quant,
     block_shape,
-    is_sf_swizzled_layout=is_fp4_scale_swizzled,
+    is_sf_swizzled_layout=False,
```
The is_sf_swizzled_layout parameter is hardcoded to False here, which effectively ignores the is_fp4_scale_swizzled argument passed to the moe_kernel_quantize_input function. This change might break other callers that rely on the swizzled layout for nvfp4 quantization. It should respect the function argument.
Suggested change:

```diff
-is_sf_swizzled_layout=False,
+is_sf_swizzled_layout=is_fp4_scale_swizzled,
```
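The bug class here is general: a hardcoded keyword argument silently shadowing the flag a caller passed in. A dependency-free sketch with hypothetical names (these are not vLLM's actual functions):

```python
def quantize_input_buggy(values, is_fp4_scale_swizzled: bool = True) -> dict:
    """Hardcodes the layout flag, silently ignoring what the caller asked for."""
    return {"values": values, "is_sf_swizzled_layout": False}


def quantize_input_fixed(values, is_fp4_scale_swizzled: bool = True) -> dict:
    """Threads the caller's flag through to the layout decision."""
    return {"values": values, "is_sf_swizzled_layout": is_fp4_scale_swizzled}
```

A caller requesting the swizzled layout gets it only from the fixed variant; the buggy variant type-checks and runs, which is why this class of regression is easy to miss without a test that asserts on the layout.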
Force-pushed 3af9d87 to 2b510f5
Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> (cherry picked from commit 2ce95a7)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> (cherry picked from commit 2c8b76c)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> (cherry picked from commit 5aa371d)
Signed-off-by: zixi-qi <zixi@inferact.ai> (cherry picked from commit 6fb3f7b)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> (cherry picked from commit 2ae73c7)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: Wei Zhao (Engrg-Hardware 1) <weizha@login-bia02.bia.clusters.nvidia.com> (cherry picked from commit 8b49cf3)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> (cherry picked from commit 9d8ad5b)
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> (cherry picked from commit 296741d)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> (cherry picked from commit a749a33)
…40960) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Zijing Liu <liuzijing2014@gmail.com> Co-authored-by: Zijing Liu <liuzijing2014@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> (cherry picked from commit b4806c8)
Force-pushed 2b510f5 to bf04f1f
Closed because the requested milestone cherry-picks were pushed directly to `releases/v0.20.1`.
Summary
Backport the current closed PR list from milestone 27 (`v0.20.1`) onto `releases/v0.20.1` as one PR. I refreshed the milestone page on May 1, 2026; it has 14 closed PRs.

The commits in this PR are cherry-picked with `-x` in chronological merge-to-main order, earliest `mergedAt` first, including:

- #41069 - [Core] Account for `num_gpu_blocks_override` in `max_model_len` checks
- #41015 - [DSv4] Use `cvt` PTX for FP32->FP4 conversion

Because `releases/v0.20.1` already contains #41442 (05ebca525009a4fe3dc89ff53de6469cb2ac0800), this PR is rebased on that tip and contributes the other 13 cherry-picks.
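Among the cherry-picks, #41069 changes how `max_model_len` is validated against KV cache capacity. A hypothetical sketch of the intended behavior — names and signature are illustrative, not vLLM's actual code:

```python
from typing import Optional


def check_max_model_len(max_model_len: int, block_size: int,
                        num_profiled_gpu_blocks: int,
                        num_gpu_blocks_override: Optional[int] = None) -> None:
    """Validate max_model_len against KV cache capacity, honoring overrides."""
    # The fix: when the user overrides the GPU block count, the capacity
    # check must use the override rather than the profiled block count.
    blocks = (num_gpu_blocks_override
              if num_gpu_blocks_override is not None
              else num_profiled_gpu_blocks)
    capacity_tokens = blocks * block_size
    if max_model_len > capacity_tokens:
        raise ValueError(
            f"max_model_len ({max_model_len}) exceeds KV cache capacity of "
            f"{capacity_tokens} tokens ({blocks} blocks x {block_size})."
        )
```

With an override of 256 blocks of 16 tokens, a `max_model_len` of 2048 passes even if profiling only found 64 blocks, which is the behavior the cherry-pick title describes.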
This is not duplicating an existing open PR. I checked open PRs with these searches:

- `v0.20.1 in:title,body`
- `releases/v0.20.1 in:title,body`
- `cherry pick v0.20.1 in:title,body`
- `40734 40812 40960 41006 41015 41061 41069 41090 41135 41148 41171 41189 41198 41442 in:body`

After the 14-item refresh, the full PR-number search returns only this draft PR (#41435). There is no single linked issue for `gh issue view`; this backport is sourced from milestone 27.

Tests
- `git diff --check origin/releases/v0.20.1...HEAD` - passed
- `git diff --name-only origin/releases/v0.20.1...HEAD | rg '\.py$' | xargs pre-commit run ruff-check --files` - passed
- `git diff --name-only origin/releases/v0.20.1...HEAD | rg '\.py$' | xargs pre-commit run ruff-format --files` - passed
- `.venv/bin/python -m pytest tests/tool_parsers/test_deepseekv32_tool_parser.py tests/v1/core/test_kv_cache_utils.py -q` - did not run successfully locally; collection stopped in `tests/conftest.py` because the local venv does not have the full test/runtime dependency set (`ModuleNotFoundError: No module named 'tblib'`).

AI assistance was used to prepare this backport PR; the submitting human should review every changed line and run the relevant release validation before marking ready.
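The tool-parser improvement exercised by `test_deepseekv32_tool_parser.py` converts tool-call arguments using the tool's declared JSON schema. A simplified, self-contained sketch of the idea (this is an illustration, not vLLM's actual parser code):

```python
import json


def convert_args_by_schema(raw_args: dict, schema: dict) -> dict:
    """Coerce string-valued tool arguments to the types the JSON schema declares."""
    props = schema.get("properties", {})
    converted = {}
    for name, value in raw_args.items():
        declared = props.get(name, {}).get("type")
        if isinstance(value, str):
            if declared == "integer":
                value = int(value)
            elif declared == "number":
                value = float(value)
            elif declared == "boolean":
                value = value.strip().lower() == "true"
            elif declared in ("object", "array"):
                value = json.loads(value)  # parse nested JSON text
        converted[name] = value
    return converted
```

The point of schema-based conversion is that a model emitting `"3"` for an integer parameter still yields a well-typed tool call instead of a string that the downstream tool must re-parse.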