[Misc] Main2Main 0605 #10250
/e2e tests/e2e/pull_request/one_card/model_runner_v2/test_basic.py::test_qwen3_dense_graph_mode
Summary of Changes
This pull request focuses on maintaining cross-version compatibility between vLLM v0.21.0 and the current main branch. It introduces conditional logic throughout the [...]
Code Review
Suggested PR Title:
[Misc][Feature] Support compatibility with vLLM 0.21.0 and newer versions
Suggested PR Summary:
### What this PR does / why we need it?
This pull request introduces comprehensive compatibility support for both vLLM v0.21.0 and newer versions (including vLLM main). It achieves this by adding version-conditional imports, backporting helpers that were removed in newer vLLM versions (such as `get_decode_context_model_parallel_world_size`, `get_decode_context_model_parallel_rank`, and `patch_tensor_parallel_group`), and adapting signatures to handle renamed or added arguments (e.g., `use_eagle` vs `drop_eagle_block`, and `scheduler_block_size`). Additionally, it introduces an NPU-compatible structured output bitmask for the v2 model runner and updates attention, speculator, and logprob implementations to align with upstream changes.
During the review, several critical issues were identified:
- In `vllm_ascend/core/single_type_kv_cache_manager.py` and `vllm_ascend/patch/platform/patch_mamba_manager.py`, the modified signature of `find_longest_cache_hit` breaks positional argument compatibility, which could lead to severe runtime errors or silent correctness bugs. A unified signature is suggested to maintain full backward and forward compatibility.
- In `vllm_ascend/worker/v2/sample/logprob.py`, the Triton kernel `_fill_logprob_token_ids_kernel` is called with an unexpected keyword argument `multibuffer=False`, which will raise a `TypeError` at runtime.
### Does this PR introduce _any_ user-facing change?
No, this PR focuses on internal compatibility and alignment with upstream vLLM versions.
### How was this patch tested?
The changes were tested using existing and updated end-to-end and unit tests, including model runner, guided decoding, and spec decode tests.
| alignment_tokens: int, | ||
| dcp_world_size: int = 1, | ||
| pcp_world_size: int = 1, | ||
| use_eagle: bool = False, | ||
| drop_eagle_block: bool = False, | ||
| ) -> tuple[list[KVCacheBlock], ...]: |
Similar to `single_type_kv_cache_manager.py`, the signature of `find_longest_cache_hit` in `AscendMambaManager` has been modified in a way that breaks positional argument compatibility. We should apply the same compatible signature here to prevent a `TypeError` or an argument mismatch when it is called positionally or via keyword arguments.
| alignment_tokens: int, | |
| dcp_world_size: int = 1, | |
| pcp_world_size: int = 1, | |
| use_eagle: bool = False, | |
| drop_eagle_block: bool = False, | |
| ) -> tuple[list[KVCacheBlock], ...]: | |
| use_eagle_or_drop_block: bool = False, | |
| alignment_tokens: int = 0, | |
| dcp_world_size: int = 1, | |
| pcp_world_size: int = 1, | |
| use_eagle: bool = False, | |
| drop_eagle_block: bool = False, | |
| ) -> tuple[list[KVCacheBlock], ...]: |
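The compatibility concern raised in the review comments above can be illustrated with a small sketch. It is not the signature used in the PR: `find_longest_cache_hit` here is a toy stand-in operating on plain integers, `resolve_eagle_flag` is a hypothetical helper, and the sketch assumes that `use_eagle` and `drop_eagle_block` carry the same boolean meaning. Making both flags keyword-only is one way to sidestep the positional ambiguity the reviewer describes.

```python
def resolve_eagle_flag(use_eagle: bool = False, drop_eagle_block: bool = False) -> bool:
    """Collapse the two spellings of the flag into one value (hypothetical helper)."""
    return bool(use_eagle or drop_eagle_block)


def find_longest_cache_hit(
    block_hashes: list[int],
    max_length: int,
    *,
    use_eagle: bool = False,         # keyword assumed to come from v0.21.0 callers
    drop_eagle_block: bool = False,  # keyword assumed to come from main callers
) -> list[int]:
    """Toy stand-in: return the cached prefix, minus its last block if requested."""
    hits = block_hashes[:max_length]
    if resolve_eagle_flag(use_eagle, drop_eagle_block) and hits:
        # Drop the final block, mirroring the "drop eagle block" behaviour.
        hits = hits[:-1]
    return hits
```

Because both flags are keyword-only, a caller written against either version cannot accidentally bind a value to the wrong positional slot.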
| NUM_TOPK=num_logprobs, | ||
| PADDED_COLS=triton.next_power_of_2(num_cols), | ||
| multibuffer=False, | ||
| ) |
The Triton kernel `_fill_logprob_token_ids_kernel` is called with `multibuffer=False`. However, `multibuffer` is not defined as a parameter in the kernel's signature. Passing an unexpected keyword argument to a Triton JIT function will raise a `TypeError` at runtime.
We should remove the `multibuffer=False` argument from the kernel call.
NUM_TOPK=num_logprobs,
PADDED_COLS=triton.next_power_of_2(num_cols),
)
c41d75f to e59a9f7
This pull request has conflicts, please resolve those before we can evaluate the pull request.
8efaf68 to 0647dcc
/e2e tests/e2e/pull_request/four_card/test_data_parallel_tp2.py::test_qwen3_inference_dp2_tp2
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: zhao-stack <2020265299@qq.com>
Signed-off-by: nofushanquan <1255959842@qq.com>
3f9ceb5 to d08f33e
vllm main2main adaption - vLLM version: v0.21.0 - vLLM main: vllm-project/vllm@9090368 --------- Signed-off-by: nofushanquan <1255959842@qq.com> Signed-off-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: zhao-stack <2020265299@qq.com> Co-authored-by: nofushanquan <1255959842@qq.com> Co-authored-by: liyishi <1252651434@qq.com> Co-authored-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: zhaorifa <865071616@qq.com>
vllm main2main adaption - vLLM version: v0.21.0 - vLLM main: vllm-project/vllm@9090368 --------- Signed-off-by: nofushanquan <1255959842@qq.com> Signed-off-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: zhao-stack <2020265299@qq.com> Co-authored-by: nofushanquan <1255959842@qq.com> Co-authored-by: liyishi <1252651434@qq.com> Co-authored-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: Fager10086 <865071616@qq.com>
1. vllm-ascend PR #10250: Per-File Change Log (37 files)
Branch: `Misc]-test-m2m-e2e` vs `main`
Base commit range: vLLM main `9090368b` → `efc347f1b` (also pinned in `.github/vllm-main-verified.commit`)
Dual-version guard: `if vllm_version_is("0.21.0"): ... else: ...` unless noted
Scope: this document lists 37 production/test files. Excluded from the numbered list: verified-commit pin, reverted/removed paths (`logprob.py`, `patch_structured_outputs.py`).
2. `tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_fused_qkvzba_split_reshape_cat.py`
Why: GatedDeltaNet import path moved on vLLM main.
Version guard: Yes. `0.21.0` imports `mamba.gdn_linear_attn`; else imports `mamba.gdn.base`.
Upstream PR: #43556, Mamba LINEAR attention module refactor (split GDN layout).
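The dual-version guard used throughout this change log can be sketched as follows. This is a minimal illustration only: the `vllm_version_is` defined here is a local stand-in that takes the installed version as a parameter (an assumption made so the sketch runs without vLLM), and the function returns the module suffix named in the entry above rather than performing a real import.

```python
def vllm_version_is(target: str, installed: str) -> bool:
    """Local stand-in for the version check; the real helper's signature may differ."""
    return installed == target


def gdn_module_suffix(installed: str) -> str:
    """Pick the GatedDeltaNet module suffix for the installed vLLM version."""
    if vllm_version_is("0.21.0", installed):
        # Layout reported for v0.21.0.
        return "mamba.gdn_linear_attn"
    # Layout reported for vLLM main after the module split.
    return "mamba.gdn.base"
```

In real code each branch would hold the corresponding `from ... import ...` statement, so only the import that exists for the installed version is ever executed.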
3. `tests/e2e/pull_request/four_card/test_data_parallel_tp2.py`
Why: Stabilize DP2+TP2 e2e under m2m CI (memory + graph capture flakiness).
Change: Add `@wait_until_npu_memory_free`; pass `--enforce-eager` to offline DP script.
Upstream trigger: Ascend CI hardening for dual-version m2m; not tied to one vLLM PR.
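To make the idea behind a decorator such as `@wait_until_npu_memory_free` concrete, here is a hedged sketch of a generic "wait until enough memory is free" decorator. It is not the project's implementation: the name `wait_until_memory_free`, the `probe` callable, and all parameters are hypothetical, and the probe is injected so the sketch runs without any NPU.

```python
import functools
import time
from typing import Callable


def wait_until_memory_free(
    probe: Callable[[], float],      # hypothetical: returns the free-memory fraction in [0, 1]
    min_free_fraction: float = 0.5,
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
):
    """Delay the wrapped test until `probe()` reports enough free memory."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + timeout_s
            while probe() < min_free_fraction:
                if time.monotonic() >= deadline:
                    raise TimeoutError("device memory did not free up in time")
                time.sleep(poll_s)
            return fn(*args, **kwargs)

        return wrapper

    return decorator
```

Polling before the test body starts is one way to reduce flakiness when a previous test has not yet released device memory.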
4. `tests/e2e/pull_request/four_card/test_qwen3_next.py`
Why: Qwen3-Next graph capture OOM at 0.7 utilization after main moved compile paths.
Change: Raise `gpu_memory_utilization` 0.7 → 0.8.
Upstream trigger: Compile/graph stack churn on main (Qwen3-Next); no single blocking PR. Related to torch.compile logging fixes on Ascend side.
5. `tests/e2e/pull_request/one_card/test_guided_decoding.py`
Why: v2 model runner is supported on 0.21.0 in this PR; old skip checked wrong tag `0.20.1`.
Change: Skip v2 on versions other than the intended matrix, using a `vllm_version_is("0.21.0")` check in place of the old `0.20.1` one. (Fixes wrong skip predicate.)
Upstream trigger: Align test matrix with #40559 MRV2 availability on 0.21+.
6. `tests/ut/patch/platform/test_patch_glm47_tool_call_parser.py`
Why: Parser surface changed on main.
Version guard: Yes.
- 0.21.0: `_WrappedParser` from upstream.
- else: `_WrappedParser` removed; use thin `DelegatingParser` subclass.
- `parse_delta`: main adds required kw-only `finished` (#44017 refactor); 0.21 has no such arg, so helper `_parse_delta` branches.
Upstream PRs: #44017 (parser refactor); `_WrappedParser` removal is part of main parser cleanup in that timeframe.
7. `tests/ut/patch/platform/test_patch_tool_choice_none_content.py`
Why: `OpenAIServing._parse_tool_calls_from_content` patch is 0.21-only; on main the fix lives in `DelegatingParser` only.
Change: `@pytest.mark.skipif(not vllm_version_is("0.21.0"))` on the OpenAIServing-specific test.
Upstream PR: #42752, honor `tool_choice="none"` in streaming (main routes through `DelegatingParser`; 0.21 still needs `OpenAIServing` hook).
8. `tests/ut/patch/platform/test_prefix_cache_cp_patches.py`
Why: `AscendMambaManager` on main requires `scheduler_block_size` in `__init__`.
Version guard: Pass `scheduler_block_size=mamba_spec.block_size` only when not `0.21.0`.
Upstream PR: #44165, thread `scheduler_block_size` into KV cache managers.
9. `vllm_ascend/attention/context_parallel/attention_cp.py`
Why: DCP helper symbols removed from `vllm.distributed` on main.
Change: Import `get_decode_context_model_parallel_{world_size,rank}` from `vllm_ascend.distributed.utils` instead of upstream.
Upstream PR: #41471, remove dead `get_decode_context_model_parallel_*` from `parallel_state`.
10. `vllm_ascend/attention/context_parallel/common_cp.py`
Why: Same as `attention_cp.py` (item 9): import relocation only.
Upstream PR: #41471.
11. `vllm_ascend/attention/context_parallel/mla_cp.py`
Why: Same as `attention_cp.py` (item 9): import relocation only.
Upstream PR: #41471.
12. `vllm_ascend/core/recompute_scheduler.py`
Why: `register_ascend_mla_spec_in_manager()` must not call main-only registry APIs on 0.21.
Version guard:
- 0.21.0: `spec_manager_map[AscendMLAAttentionSpec]` (exact key lookup; "[bugfix] restore pr-7029 and fix patch error" #7294 class).
- else: `register_all_kvcache_specs` + `KVCacheSpecRegistry`.
Upstream PRs:
- `MLAAttentionSpec` patch.
- `KVCacheSpecRegistry` replaces `spec_manager_map`.
13. `vllm_ascend/core/single_type_kv_cache_manager.py`
Why: Core KV manager dual-version adapter.
Version guards / API mapping (0.21.0 → else):
- `spec_manager_map[type]` → `KVCacheSpecRegistry.get_manager_class()`
- `use_eagle` → `drop_eagle_block`
- `cache_blocks` param: `retention_interval`
Why else branches exist: each entry is an upstream API rename/addition after 0.21.
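On the caller side, the `use_eagle` → `drop_eagle_block` rename listed above can be handled by choosing the keyword name at call time. The sketch below assumes the two keywords carry the same boolean meaning; `eagle_kwargs` and its `is_v0_21_0` parameter are hypothetical names introduced for illustration, not part of vLLM or vllm-ascend.

```python
def eagle_kwargs(is_v0_21_0: bool, drop_last_block: bool) -> dict[str, bool]:
    """Build the keyword argument under the name the target version expects."""
    key = "use_eagle" if is_v0_21_0 else "drop_eagle_block"
    return {key: drop_last_block}


def call_find_longest_cache_hit(find_fn, is_v0_21_0: bool, drop_last_block: bool, **common):
    """Forward shared arguments plus the version-specific flag to `find_fn`."""
    return find_fn(**common, **eagle_kwargs(is_v0_21_0, drop_last_block))
```

Passing the flag by keyword keeps the call site identical on both versions apart from the one renamed key.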
14. `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_connector.py`
Why: Import relocation for removed DCP helpers (see `attention_cp.py`, item 9).
Upstream PR: #41471.
Note: Mooncake coordinator on main also uses `KVCacheSpecRegistry` + `drop_eagle_block` (#37505, #44082); this file only changes imports, not coordinator logic.
14. `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py`
Why: Same DCP import relocation for `get_decode_context_model_parallel_rank`.
Upstream PR: #41471.
15. `vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_worker.py`
Why: Same DCP import relocation.
Upstream PR: #41471.
16. `vllm_ascend/distributed/kv_transfer/kv_pool/cpu_offload/cpu_kv_cache_manager.py`
Why: CPU offload KV manager must call ascend `get_manager_for_kv_cache_spec` and main allocator APIs.
Version guard / logic:
- `scheduler_block_size` to manager ctor (#44165).
- `find_longest_cache_hit` uses `drop_eagle_block` (#44082).
- `get_num_blocks_to_allocate(..., total_computed_tokens=..., num_tokens_main_model=...)` and `allocate_new_computed_blocks(...)`: main block-pool API (post-0.21 scheduler/coordinator refactor; same family as #44165).
- `use_eagle` and older allocate/save call shapes.
17. `vllm_ascend/distributed/utils.py`
Why: Call sites still need DCP rank/world-size after upstream deletion.
Change: Reintroduce `get_decode_context_model_parallel_world_size/rank()` wrapping `get_dcp_group()`.
Upstream PR: #41471 (removed on main; backported here for both versions).
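The backported helpers described for `vllm_ascend/distributed/utils.py` can be sketched as thin wrappers over `get_dcp_group()`. Everything below is a self-contained illustration: `_Group` and the local `get_dcp_group` are stand-ins, and the attribute names `world_size` and `rank_in_group` are assumptions about the group object rather than confirmed details of the PR.

```python
from dataclasses import dataclass


@dataclass
class _Group:
    """Stand-in for the object returned by get_dcp_group() (attribute names assumed)."""
    world_size: int
    rank_in_group: int


_DCP = _Group(world_size=2, rank_in_group=1)


def get_dcp_group() -> _Group:
    """Stand-in accessor; the real one lives in vLLM's distributed state."""
    return _DCP


def get_decode_context_model_parallel_world_size() -> int:
    """Return the decode-context-parallel world size."""
    return get_dcp_group().world_size


def get_decode_context_model_parallel_rank() -> int:
    """Return this worker's rank within the decode-context-parallel group."""
    return get_dcp_group().rank_in_group
```

Keeping the old function names in one local module means call sites only need their import path changed, which matches the import-relocation entries above.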
18. `vllm_ascend/ops/bailing_moe_linear_attn.py`
Why: Linear-attention helpers moved out of `mamba.linear_attn` on main.
Version guard:
- 0.21.0: `vllm.model_executor.layers.mamba.linear_attn`.
- else: `vllm.model_executor.layers.mamba.linear.minimax_linear_attn`.
Upstream PR: #43556, Mamba LINEAR module refactor.
19. `vllm_ascend/ops/triton/fla/fused_qkvzba_split_reshape.py`
Why: `logger.debug` with tensor shapes breaks / pollutes logs under `torch.compile` (Qwen3-Next graph capture).
Change: Guard debug logging with `if not torch.compiler.is_compiling():`.
Upstream trigger: torch.compile integration on main models; Ascend-specific fix (no vLLM PR; compile artifact on NPU).
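The logging guard described above can be sketched as follows. The function `log_shape` and the logger name are hypothetical; the compile check is injected as a parameter so the behaviour can be exercised without tracing, and the sketch falls back to a constant `False` predicate when PyTorch (or `torch.compiler.is_compiling`) is not available.

```python
import logging

try:
    import torch

    _is_compiling = torch.compiler.is_compiling
except (ImportError, AttributeError):
    # Fallback so the sketch still runs without PyTorch installed.
    def _is_compiling() -> bool:
        return False


logger = logging.getLogger("fused_qkvzba_sketch")  # hypothetical logger name


def log_shape(name: str, shape: tuple, is_compiling=_is_compiling) -> bool:
    """Emit a debug line unless a torch.compile trace is in progress.

    Returns True when the log call was made, False when it was skipped.
    """
    if is_compiling():
        return False
    logger.debug("%s shape: %s", name, shape)
    return True
```

Skipping the call entirely during compilation avoids putting logging side effects into the traced region.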
20. `vllm_ascend/patch/platform/patch_kv_cache_coordinator.py`
Why: Ascend hybrid prefix-cache coordinator must follow main KV coordinator constructor while staying 0.21-compatible.
Version guards:
- `scheduler_block_size` to managers/coordinators only on else (#44165).
- `use_eagle` vs `drop_eagle_block` in `find_longest_cache_hit` (#44082).
- `VLLM_PREFIX_CACHE_RETENTION_INTERVAL` validation (#43447).
- `envs` → `envs_ascend` to avoid clash with `vllm.envs` (retention env lives on main).
Logic (no guard): Forward `scheduler_block_size` through `get_kv_cache_coordinator` signature added on main.
21. `vllm_ascend/patch/platform/patch_mamba_manager.py`
Why: Signature compatibility + avoid breaking main registry init order.
Version guard:
- `use_eagle` and `drop_eagle_block` in `find_longest_cache_hit` (#44082).
- `AscendMambaManager` in `spec_manager_map` only on 0.21.0; on main use `KVCacheSpecRegistry` via `register_all_kvcache_specs` (#37505). Early `spec_manager_map` write had caused missing `FullAttentionSpec` registration.
22. `vllm_ascend/patch/platform/patch_tool_choice_none_content.py`
Why: Upstream moved streaming tool-call suppression into `DelegatingParser`; `OpenAIServing._parse_tool_calls_from_content` still exists on 0.21.
Version guard:
- `DelegatingParser._parse_tool_calls` (works on both lines post-#42752).
- `OpenAIServing._parse_tool_calls_from_content` patch (main no longer needs it).
Upstream PR: #44267.
23. `vllm_ascend/patch/worker/patch_mamba_utils.py`
Why: Document + preserve mamba state cleanup when main factored helper differently.
Change: Comment + inline cleanup for finished/preempted/resumed reqs (both versions).
Upstream context: Main worker mamba path refactors (#44539 KDA cache unification); 0.21 keeps inline cleanup, so the Ascend patch stays compatible with both.
24. `vllm_ascend/patch/worker/patch_minimax_m2.py`
Why: `MiniMaxText01RMSNormTP` moved out of `mamba.linear_attn` on main.
Version guard:
- 0.21.0: `mamba.linear_attn`.
- else: `minimax_rms_norm`.
Upstream PR: #43556 (module split).
25. `vllm_ascend/patch/worker/patch_minimax_m2_linear_attn.py`
Why: Same RMSNorm import move as `patch_minimax_m2.py` (item 24).
Upstream PR: #43556.
26. `vllm_ascend/platform.py`
Why: Main sleep-mode validation calls `Platform.is_cumem_allocator_available()` before NPU custom op init.
Change: Return `True` on `NPUPlatform` (NPU uses `CaMemAllocator`).
Upstream PR: #43838, add platform cumem probe; no 0.21 API (guard not needed: method absent on 0.21, harmless override).
27. `vllm_ascend/spec_decode/llm_base_proposer.py`
Why: Draft-model spec decode still needs temporary TP group swap; symbol removed on main.
Version guard:
- 0.21.0: `patch_tensor_parallel_group`.
- else: `_ps._TP`.
Upstream PR: #41471 removed `patch_tensor_parallel_group` from `parallel_state`.
28. `vllm_ascend/worker/utils.py`
Why: On main, `KVBlockZeroer.__init__` takes full metadata and runs init in ctor; on 0.21 ctor is `(device, pin_memory)` + separate `init_meta`. `AscendKVBlockZeroer` keeps 0.21-style split API for NPU Triton zeroer.
Change: Explicit `__init__` initializing `_meta` / `_ids_*` fields so subclass does not invoke main’s expanded base `__init__` signature incorrectly.
Upstream PR: #35219 introduced `KVBlockZeroer`; main later merged init paths into ctor (post-0.21). Ascend dual-version shim: no `vllm_version_is` guard, inheritance layout fix.
29. `vllm_ascend/worker/v2/model_runner.py`
Why: MRV2 input batch and PP sampling metadata diverged on main.
Version guard: `num_computed_tokens_np`, `prefill_len_np`, `num_computed_prefill_tokens_np`, optional `max_seq_len_np` into `AscendInputBatch` (#42187; PP bubble avoidance / extended batch fields).
Logic (both versions): Split prefill detection into two numpy reads; add `postprocess_sampled` override + `_copy_num_computed_tokens_to_cpu()` so NPU attention still sees CPU `seq_lens` mirror.
30. `vllm_ascend/worker/v2/spec_decode/eagle/aclgraph.py`
Why: Eagle CUDA graph managers moved under `autoregressive/` on main; 0.21 keeps monolithic `eagle/cudagraph.py`.
Version guard:
- 0.21.0: `DecodeEagleCudaGraphManager`, `PrefillEagleCudaGraphManager`, `CapturedAttentionState` from `eagle.cudagraph`.
- else: `DecodeSpeculatorCudaGraphManager`, `PrefillSpeculatorCudaGraphManager`, `AttentionStatePair` from autoregressive + `cudagraph_utils`.
Upstream PR: #43241, MRV2 speculator modularization (Eagle/MTP/Gemma4 split).
31. `vllm_ascend/worker/v2/spec_decode/eagle/speculator.py`
Why: Largest MRV2 dual-version adapter; upstream split Eagle vs autoregressive speculator modules.
Version guard: Import roots and helpers (`update_eagle_draft_inputs` → `update_draft_inputs`, `_BUILD_ATTN_METADATA_MODULE`, prefill cudagraph class) branch on `0.21.0`.
Logic (both): Ascend-specific `generate_draft` / attn metadata / aclgraph integration retained inside `AscendEagleSpeculator`.
Upstream PRs:
32. `vllm_ascend/worker/v2/states.py`
Why: `RequestState.add_request()` gained required `max_tokens` on main.
Version guard:
- 0.21.0: `super().add_request(...)`.
- else: `max_tokens=max_tokens` for PP/max-seq tracking.
Upstream PR: #42187.
33. `vllm_ascend/worker/worker.py`
Why: KV connector handshake dict keying changed for pipeline parallel.
Version guard:
- 0.21.0: `{tp_rank: metadata}` (legacy).
- else: `{(pp_rank, tp_rank): metadata}` with typed return `KVConnectorHandshakeMetadata`.
Upstream PR: #43720, PP-aware KV connector handshake aggregation.
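The difference between the two handshake keyings can be shown with plain dictionaries. This is a sketch under stated assumptions: `handshake_key`, `aggregate`, and the `(pp_rank, tp_rank, metadata)` tuples are hypothetical, and metadata is represented by strings instead of `KVConnectorHandshakeMetadata` objects.

```python
def handshake_key(is_v0_21_0: bool, pp_rank: int, tp_rank: int):
    """Return the dict key for one worker's handshake metadata."""
    return tp_rank if is_v0_21_0 else (pp_rank, tp_rank)


def aggregate(is_v0_21_0: bool, workers: list[tuple[int, int, str]]) -> dict:
    """Collect per-worker metadata; `workers` holds (pp_rank, tp_rank, metadata)."""
    return {
        handshake_key(is_v0_21_0, pp_rank, tp_rank): metadata
        for pp_rank, tp_rank, metadata in workers
    }
```

With two pipeline stages sharing the same `tp_rank`, the legacy keying keeps only one entry while the `(pp_rank, tp_rank)` keying keeps both, which illustrates why a PP-aware key is useful.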
34. `tests/ut/test_compressed_prefix_cache.py`
Why: The compressed prefix-cache UT directly instantiates `CompressAttentionManager`, bypassing the normal KV cache manager factory/coordinator path. On vLLM main, `SingleTypeKVCacheManager.__init__` now requires `scheduler_block_size`, so the direct UT construction failed with:
35. `vllm_ascend/_310p/kv_block_zeroer.py`
Why: On main, `KVBlockZeroer.__init__` takes full metadata and runs init in ctor; on 0.21 ctor is `(device, pin_memory)` + separate `init_meta`. `AscendKVBlockZeroer` keeps 0.21-style split API for NPU Triton zeroer.
Change: Explicit `__init__` initializing `_meta` / `_ids_*` fields so subclass does not invoke main’s expanded base `__init__` signature incorrectly.
Upstream PR: #35219 introduced `KVBlockZeroer`; main later merged init paths into ctor (post-0.21). Ascend dual-version shim: no `vllm_version_is` guard, inheritance layout fix.
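The inheritance fix described for the block zeroer can be illustrated with a toy class pair. The class names, fields, and metadata shape below are hypothetical stand-ins, not the real `KVBlockZeroer` API; the point is only that a subclass with its own explicit `__init__` never calls a base constructor whose signature differs between versions.

```python
class BaseZeroerMainStyle:
    """Stand-in for a base class whose ctor takes full metadata (main-style)."""

    def __init__(self, device: str, pin_memory: bool, metadata: dict):
        self.device = device
        self.pin_memory = pin_memory
        self._meta = metadata


class AscendZeroerSketch(BaseZeroerMainStyle):
    """Keeps the split API: cheap ctor first, metadata supplied later."""

    def __init__(self, device: str, pin_memory: bool):
        # Deliberately skip super().__init__(): its signature is version-dependent.
        self.device = device
        self.pin_memory = pin_memory
        self._meta = None

    def init_meta(self, metadata: dict) -> None:
        """Second initialization step, matching the 0.21-style split API."""
        self._meta = metadata
```

A possible usage is `z = AscendZeroerSketch("npu:0", False)` followed by `z.init_meta({...})` once the KV cache layout is known; the trade-off is that the subclass must initialize every field the base methods rely on.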