[Misc] Main2Main 0605 #10250

Merged
MengqingCao merged 57 commits into vllm-project:main from zhao-stack:Misc]-test-m2m-e2e
Jun 12, 2026

@zhao-stack zhao-stack commented Jun 9, 2026

1. vllm-ascend PR #10250: Per-File Change Log (35 files)

Branch: Misc]-test-m2m-e2e vs main
Base commit range: vLLM main 9090368b → efc347f1b (also pinned in .github/vllm-main-verified.commit)
Dual-version guard: if vllm_version_is("0.21.0"): ... else: ... unless noted
Scope: this document lists 35 production/test files. Excluded from the numbered list: verified-commit pin, reverted/removed paths (logprob.py, patch_structured_outputs.py).
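
The dual-version guard named above can be sketched as follows. This is a minimal illustration rather than the project's code: `vllm_version_is` is a local stand-in that does an exact string match, and `INSTALLED_VLLM_VERSION` and `pick_branch` are hypothetical names introduced only for the example.

```python
from typing import Optional

# Hypothetical stand-in for the installed vLLM version string.
INSTALLED_VLLM_VERSION = "0.21.0"


def vllm_version_is(target: str, installed: Optional[str] = None) -> bool:
    """Local stand-in for the guard helper: exact match on the version string."""
    current = installed if installed is not None else INSTALLED_VLLM_VERSION
    return current == target


def pick_branch(installed: str) -> str:
    """Return which code path a dual-version guard would take."""
    if vllm_version_is("0.21.0", installed):
        # Release branch: keep the 0.21.0 API shape.
        return "0.21.0 path"
    else:
        # Anything else is treated as vLLM main.
        return "main path"
```

Because the guard is an exact match, any version string other than `"0.21.0"` falls through to the main path, which is why each `else` branch in the list below tracks an upstream change made after 0.21.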


2. tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_fused_qkvzba_split_reshape_cat.py

Why: GatedDeltaNet import path moved on vLLM main.
Version guard: Yes — 0.21.0 imports mamba.gdn_linear_attn; else imports mamba.gdn.base.
Upstream PR: #43556 — Mamba LINEAR attention module refactor (split GDN layout).


3. tests/e2e/pull_request/four_card/test_data_parallel_tp2.py

Why: Stabilize DP2+TP2 e2e under m2m CI (memory + graph capture flakiness).
Change: Add @wait_until_npu_memory_free; pass --enforce-eager to offline DP script.
Upstream trigger: Ascend CI hardening for dual-version m2m — not tied to one vLLM PR.


4. tests/e2e/pull_request/four_card/test_qwen3_next.py

Why: Qwen3-Next graph capture OOM at 0.7 utilization after main moved compile paths.
Change: Raise gpu_memory_utilization 0.7 → 0.8.
Upstream trigger: Compile/graph stack churn on main (Qwen3-Next); no single blocking PR — related to torch.compile logging fixes on Ascend side.


5. tests/e2e/pull_request/one_card/test_guided_decoding.py

Why: v2 model runner is supported on 0.21.0 in this PR; old skip checked wrong tag 0.20.1.
Change: Fix the wrong skip predicate (it tested the stale 0.20.1 tag); v2 is now skipped only on versions outside the intended test matrix, keyed on vllm_version_is("0.21.0").
Upstream trigger: Align test matrix with #40559 MRV2 availability on 0.21+.


6. tests/ut/patch/platform/test_patch_glm47_tool_call_parser.py

Why: Parser surface changed on main.
Version guard: Yes

  • 0.21.0: import _WrappedParser from upstream.
  • else: _WrappedParser removed; use thin DelegatingParser subclass.
  • parse_delta: main adds required kw-only finished (#44017 refactor); 0.21 has no such arg — helper _parse_delta branches.

Upstream PRs: #44017 (parser refactor); _WrappedParser removal is part of main parser cleanup in that timeframe.

7. tests/ut/patch/platform/test_patch_tool_choice_none_content.py

Why: OpenAIServing._parse_tool_calls_from_content patch is 0.21-only; on main fix lives in DelegatingParser only.
Change: @pytest.mark.skipif(not vllm_version_is("0.21.0")) on the OpenAIServing-specific test.
Upstream PR: #42752 — honor tool_choice="none" in streaming (main routes through DelegatingParser; 0.21 still needs OpenAIServing hook).


8. tests/ut/patch/platform/test_prefix_cache_cp_patches.py

Why: AscendMambaManager on main requires scheduler_block_size in __init__.
Version guard: Pass scheduler_block_size=mamba_spec.block_size only when not 0.21.0.
Upstream PR: #44165 — thread scheduler_block_size into KV cache managers.
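
Passing `scheduler_block_size` only on main can be sketched by building the keyword arguments conditionally. The two manager classes are hypothetical stand-ins for the 0.21 and main constructors, and `build_manager` is not a function from this PR.

```python
class ManagerOn021:
    """Stand-in for a 0.21-style manager: no scheduler_block_size parameter."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.scheduler_block_size = None


class ManagerOnMain:
    """Stand-in for a main-style manager: scheduler_block_size is required."""

    def __init__(self, block_size, scheduler_block_size):
        self.block_size = block_size
        self.scheduler_block_size = scheduler_block_size


def build_manager(manager_cls, block_size, is_0_21):
    """Construct a manager, adding scheduler_block_size only off the 0.21 line."""
    kwargs = {"block_size": block_size}
    if not is_0_21:
        kwargs["scheduler_block_size"] = block_size
    return manager_cls(**kwargs)
```

Calling `ManagerOnMain(block_size=16)` without the extra argument raises `TypeError`, which mirrors the failure mode that the guard avoids.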


9. vllm_ascend/attention/context_parallel/attention_cp.py

Why: DCP helper symbols removed from vllm.distributed on main.
Change: Import get_decode_context_model_parallel_{world_size,rank} from vllm_ascend.distributed.utils instead of upstream.
Upstream PR: #41471 — remove dead get_decode_context_model_parallel_* from parallel_state.


10. vllm_ascend/attention/context_parallel/common_cp.py

Why: Same as attention_cp.py above: import relocation only.
Upstream PR: #41471.


11. vllm_ascend/attention/context_parallel/mla_cp.py

Why: Same as attention_cp.py above: import relocation only.
Upstream PR: #41471.

12. vllm_ascend/core/recompute_scheduler.py

Why: register_ascend_mla_spec_in_manager() must not call main-only registry APIs on 0.21.
Version guard:


13. vllm_ascend/core/single_type_kv_cache_manager.py

Why: Core KV manager dual-version adapter.
Version guards / API mapping:

| Change | 0.21.0 | main (else) | Upstream PR |
|--------|--------|-------------|-------------|
| Manager lookup | spec_manager_map[type] | KVCacheSpecRegistry.get_manager_class() | #37505 |
| EAGLE prefix arg | use_eagle | drop_eagle_block | #44082 |
| cache_blocks param | N/A (name absent on 0.21) | retention_interval | #43447 |

Why else branches exist: each column is an upstream API rename/addition after 0.21.


14. vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_connector.py

Why: Import relocation for removed DCP helpers (see attention_cp.py above).
Upstream PR: #41471.
Note: Mooncake coordinator on main also uses KVCacheSpecRegistry + drop_eagle_block (#37505, #44082) — this file only changes imports, not coordinator logic.


14. vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py

Why: Same DCP import relocation for get_decode_context_model_parallel_rank.
Upstream PR: #41471.


15. vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_worker.py

Why: Same DCP import relocation.
Upstream PR: #41471.


16. vllm_ascend/distributed/kv_transfer/kv_pool/cpu_offload/cpu_kv_cache_manager.py

Why: CPU offload KV manager must call ascend get_manager_for_kv_cache_spec and main allocator APIs.
Version guard / logic:

  • else: change get_manager_for_kv_cache_spec — #40946.
  • else: pass scheduler_block_size to manager ctor — #44165.
  • else: find_longest_cache_hit uses drop_eagle_block: #44082.
  • else: get_num_blocks_to_allocate(..., total_computed_tokens=..., num_tokens_main_model=...) and allocate_new_computed_blocks(...) — main block-pool API (post-0.21 scheduler/coordinator refactor; same family as #44165).
  • 0.21.0: keeps use_eagle and older allocate/save call shapes.

17. vllm_ascend/distributed/utils.py

Why: Call sites still need DCP rank/world-size after upstream deletion.
Change: Reintroduce get_decode_context_model_parallel_world_size/rank() wrapping get_dcp_group().
Upstream PR: #41471 (removed on main; backported here for both versions).
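
The reintroduced wrappers can be sketched as below. `FakeGroup` and the local `get_dcp_group` are stand-ins so the example runs without vLLM; the attribute names `world_size` and `rank_in_group` are assumed to match what the real group object exposes.

```python
from dataclasses import dataclass


@dataclass
class FakeGroup:
    """Stand-in for the object returned by get_dcp_group() (assumed fields)."""

    world_size: int
    rank_in_group: int


# Hypothetical process state: a DCP group of size 2 in which this rank is 1.
_DCP = FakeGroup(world_size=2, rank_in_group=1)


def get_dcp_group() -> FakeGroup:
    """Stand-in for the upstream accessor of the decode-context-parallel group."""
    return _DCP


def get_decode_context_model_parallel_world_size() -> int:
    """Thin wrapper kept so existing call sites need no changes."""
    return get_dcp_group().world_size


def get_decode_context_model_parallel_rank() -> int:
    """Thin wrapper returning this process's rank inside the DCP group."""
    return get_dcp_group().rank_in_group
```

Keeping the old function names in one local module means the call sites listed in this log only change their import line.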


18. vllm_ascend/ops/bailing_moe_linear_attn.py

Why: Linear-attention helpers moved out of mamba.linear_attn on main.
Version guard:

  • 0.21.0: vllm.model_executor.layers.mamba.linear_attn.
  • else: vllm.model_executor.layers.mamba.linear.minimax_linear_attn.

Upstream PR: #43556 (Mamba LINEAR module refactor).
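
For comparison, a module that moved can also be resolved by probing candidate paths in order. This is a different technique from the explicit `vllm_version_is` guard used in this PR, shown only as an alternative sketch; `import_first_available` is a hypothetical helper.

```python
import importlib
from types import ModuleType
from typing import Sequence

# Candidate locations taken from the item above: main first, then 0.21.0.
LINEAR_ATTN_CANDIDATES = (
    "vllm.model_executor.layers.mamba.linear.minimax_linear_attn",
    "vllm.model_executor.layers.mamba.linear_attn",
)


def import_first_available(candidates: Sequence[str]) -> ModuleType:
    """Import and return the first module in `candidates` that exists."""
    last_error = None
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass
            last_error = exc
    raise ImportError(f"none of {list(candidates)} could be imported") from last_error
```

Probing tolerates intermediate commits on main, but it can also mask a genuinely broken import, which is one reason an explicit version guard may be preferable in a pinned dual-version CI.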

19. vllm_ascend/ops/triton/fla/fused_qkvzba_split_reshape.py

Why: logger.debug with tensor shapes breaks / pollutes logs under torch.compile (Qwen3-Next graph capture).
Change: Guard debug logging with if not torch.compiler.is_compiling():.
Upstream trigger: torch.compile integration on main models; Ascend-specific fix (no vLLM PR — compile artifact on NPU).
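
The logging guard can be sketched as follows. `log_shapes` and the logger name are hypothetical, and the fallback for a missing or old PyTorch exists only so that the example runs anywhere; the real module imports torch directly.

```python
import logging

try:
    import torch

    _is_compiling = torch.compiler.is_compiling
except (ImportError, AttributeError):
    # Fallback for environments without a recent PyTorch: behave as eager mode.
    def _is_compiling() -> bool:
        return False


logger = logging.getLogger("fused_qkvzba_sketch")


def log_shapes(shapes, is_compiling=_is_compiling) -> bool:
    """Emit a debug record only in eager mode; return whether it was emitted."""
    if not is_compiling():
        logger.debug("input shapes: %s", shapes)
        return True
    return False
```

The `is_compiling` parameter is injectable purely so both branches can be exercised without starting a real compile.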


20. vllm_ascend/patch/platform/patch_kv_cache_coordinator.py

Why: Ascend hybrid prefix-cache coordinator must follow main KV coordinator constructor while staying 0.21-compatible.
Version guards:

  • Pass scheduler_block_size to managers/coordinators only on else: #44165.
  • use_eagle vs drop_eagle_block in find_longest_cache_hit: #44082.
  • VLLM_PREFIX_CACHE_RETENTION_INTERVAL validation: #43447.
  • Rename envs → envs_ascend to avoid clash with vllm.envs (retention env lives on main).

Logic (no guard): Forward scheduler_block_size through get_kv_cache_coordinator signature added on main.

21. vllm_ascend/patch/platform/patch_mamba_manager.py

Why: Signature compatibility + avoid breaking main registry init order.
Version guard:

  • Accept both use_eagle and drop_eagle_block in find_longest_cache_hit (#44082).
  • Register AscendMambaManager in spec_manager_map only on 0.21.0; on main use KVCacheSpecRegistry via register_all_kvcache_specs (#37505) — early spec_manager_map write had caused missing FullAttentionSpec registration.
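
Accepting both spellings of the flag can be sketched with keyword-only parameters. This toy `find_longest_cache_hit` is not the Ascend implementation: its body is simplified, `resolve_eagle_flag` is a hypothetical helper, and the sketch assumes both names carry the same meaning (drop the last matched block).

```python
def resolve_eagle_flag(use_eagle=None, drop_eagle_block=None) -> bool:
    """Collapse the 0.21 name and the main name into a single boolean."""
    if (
        use_eagle is not None
        and drop_eagle_block is not None
        and use_eagle != drop_eagle_block
    ):
        raise ValueError("use_eagle and drop_eagle_block disagree")
    if drop_eagle_block is not None:
        return bool(drop_eagle_block)
    return bool(use_eagle)


def find_longest_cache_hit(block_hashes, *, use_eagle=None, drop_eagle_block=None):
    """Toy version: every block hits; optionally drop the last matched block."""
    drop_last = resolve_eagle_flag(use_eagle, drop_eagle_block)
    hits = list(block_hashes)
    if drop_last and hits:
        hits.pop()
    return hits
```

Making the two flags keyword-only means a caller can never bind one of them by position, so adding the second name does not shift any existing positional argument.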

22. vllm_ascend/patch/platform/patch_tool_choice_none_content.py

Why: Upstream moved streaming tool-call suppression into DelegatingParser; OpenAIServing._parse_tool_calls_from_content still exists on 0.21.
Version guard:

  • Always: patch DelegatingParser._parse_tool_calls (works on both lines post-#42752).
  • 0.21.0 only: keep OpenAIServing._parse_tool_calls_from_content patch (main no longer needs it).

Upstream PR: #44267.

23. vllm_ascend/patch/worker/patch_mamba_utils.py

Why: Document + preserve mamba state cleanup when main factored helper differently.
Change: Comment + inline cleanup for finished/preempted/resumed reqs (both versions).
Upstream context: Main worker mamba path refactors (#44539 KDA cache unification); 0.21 keeps inline cleanup — Ascend patch stays compatible with both.


24. vllm_ascend/patch/worker/patch_minimax_m2.py

Why: MiniMaxText01RMSNormTP moved out of mamba.linear_attn on main.
Version guard:

  • 0.21.0: import from mamba.linear_attn.
  • else: import from minimax_rms_norm.

Upstream PR: #43556 (module split).

25. vllm_ascend/patch/worker/patch_minimax_m2_linear_attn.py

Why: Same RMSNorm import move as patch_minimax_m2.py above.
Upstream PR: #43556.


26. vllm_ascend/platform.py

Why: Main sleep-mode validation calls Platform.is_cumem_allocator_available() before NPU custom op init.
Change: Return True on NPUPlatform (NPU uses CaMemAllocator).
Upstream PR: #43838 — add platform cumem probe; no 0.21 API (guard not needed — method absent on 0.21, harmless override).


27. vllm_ascend/spec_decode/llm_base_proposer.py

Why: Draft-model spec decode still needs temporary TP group swap; symbol removed on main.
Version guard:

  • 0.21.0: import upstream patch_tensor_parallel_group.
  • else: local backport context manager mutating _ps._TP.

Upstream PR: #41471 removed patch_tensor_parallel_group from parallel_state.

28. vllm_ascend/worker/utils.py

Why: On main, KVBlockZeroer.__init__ takes full metadata and runs init in ctor; on 0.21 ctor is (device, pin_memory) + separate init_meta. AscendKVBlockZeroer keeps 0.21-style split API for NPU Triton zeroer.
Change: Explicit __init__ initializing _meta/_ids_* fields so subclass does not invoke main’s expanded base __init__ signature incorrectly.
Upstream PR: #35219 introduced KVBlockZeroer; main later merged init paths into ctor (post-0.21). Ascend dual-version shim — no vllm_version_is guard, inheritance layout fix.
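
The inheritance fix can be illustrated with toy classes. Both class names are hypothetical, only `_meta` is modelled (the `_ids_*` fields are omitted), and the base class imitates only the shape of a main-style constructor.

```python
class BaseZeroerMainStyle:
    """Stand-in for a main-style base class: the ctor needs full metadata."""

    def __init__(self, device, pin_memory, metadata):
        self.device = device
        self.pin_memory = pin_memory
        self._meta = metadata


class SplitInitZeroer(BaseZeroerMainStyle):
    """Subclass that keeps the 0.21-style two-step API: ctor, then init_meta()."""

    def __init__(self, device, pin_memory):
        # Deliberately does not call super().__init__(): the base signature
        # differs between versions, so the fields are initialised explicitly.
        self.device = device
        self.pin_memory = pin_memory
        self._meta = None

    def init_meta(self, metadata):
        """Second step of the split API: attach the metadata later."""
        self._meta = metadata
```

Since the subclass never forwards to the base constructor, the same class definition works whichever signature the installed base class has, which is why no version guard is needed here.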


29. vllm_ascend/worker/v2/model_runner.py

Why: MRV2 input batch and PP sampling metadata diverged on main.
Version guard:

  • else: pass num_computed_tokens_np, prefill_len_np, num_computed_prefill_tokens_np, optional max_seq_len_np into AscendInputBatch: #42187 (PP bubble avoidance / extended batch fields).

Logic (both versions): Split prefill detection into two numpy reads; add postprocess_sampled override + _copy_num_computed_tokens_to_cpu() so NPU attention still sees CPU seq_lens mirror.

30. vllm_ascend/worker/v2/spec_decode/eagle/aclgraph.py

Why: Eagle CUDA graph managers moved under autoregressive/ on main; 0.21 keeps monolithic eagle/cudagraph.py.
Version guard:

  • 0.21.0: DecodeEagleCudaGraphManager, PrefillEagleCudaGraphManager, CapturedAttentionState from eagle.cudagraph.
  • else: DecodeSpeculatorCudaGraphManager, PrefillSpeculatorCudaGraphManager, AttentionStatePair from autoregressive + cudagraph_utils.

Upstream PR: #43241 (MRV2 speculator modularization; Eagle/MTP/Gemma4 split).

31. vllm_ascend/worker/v2/spec_decode/eagle/speculator.py

Why: Largest MRV2 dual-version adapter — upstream split Eagle vs autoregressive speculator modules.
Version guard: Import roots and helpers (update_eagle_draft_inputs → update_draft_inputs, _BUILD_ATTN_METADATA_MODULE, prefill cudagraph class) branch on 0.21.0.
Logic (both): Ascend-specific generate_draft / attn metadata / aclgraph integration retained inside AscendEagleSpeculator.
Upstream PRs:

  • #43241 — module split.
  • #44253, #43991 — follow-up speculator/cudagraph fixes on main.

32. vllm_ascend/worker/v2/states.py

Why: RequestState.add_request() gained required max_tokens on main.
Version guard:

  • 0.21.0: call 4-arg super().add_request(...).
  • else: pass max_tokens=max_tokens for PP/max-seq tracking.

Upstream PR: #42187.

33. vllm_ascend/worker/worker.py

Why: KV connector handshake dict keying changed for pipeline parallel.
Version guard:

  • 0.21.0: {tp_rank: metadata} (legacy).
  • else: {(pp_rank, tp_rank): metadata} with typed return KVConnectorHandshakeMetadata.

Upstream PR: #43720 (PP-aware KV connector handshake aggregation).
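
The two keying schemes can be compared with a small sketch. `key_handshake` and the `(pp_rank, tp_rank, metadata)` tuple layout are hypothetical; the point is only that keying by `tp_rank` alone collides once more than one pipeline stage reports.

```python
def key_handshake(entries, pp_aware):
    """Build the handshake dict from (pp_rank, tp_rank, metadata) tuples."""
    if pp_aware:
        # main-style: one entry per (pipeline stage, tensor-parallel rank).
        return {(pp, tp): meta for pp, tp, meta in entries}
    # 0.21-style: keyed by tp_rank only; later stages overwrite earlier ones.
    return {tp: meta for _pp, tp, meta in entries}


# Four workers: two pipeline stages with two TP ranks each.
ENTRIES = [
    (0, 0, "meta-p0-t0"),
    (0, 1, "meta-p0-t1"),
    (1, 0, "meta-p1-t0"),
    (1, 1, "meta-p1-t1"),
]
```

With these four workers the legacy keying keeps two entries while the PP-aware keying keeps all four.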


34. tests/ut/test_compressed_prefix_cache.py

Why: The compressed prefix-cache UT directly instantiates CompressAttentionManager, bypassing the normal KV cache manager factory/coordinator path. On vLLM main, SingleTypeKVCacheManager.__init__ now requires scheduler_block_size, so the direct UT construction failed with:

TypeError: SingleTypeKVCacheManager.__init__() missing 1 required positional argument: 'scheduler_block_size'

Version guard: Yes.

  • 0.21.0: keep the old constructor call without scheduler_block_size.
  • main / else: pass scheduler_block_size=block_size when constructing CompressAttentionManager.

Upstream PR: vLLM #44165 (threads scheduler_block_size into KV cache manager initialization).

Note: Production code already goes through the manager factory/coordinator path where scheduler_block_size is handled. This fix only updates the UT helper that directly constructs the manager.

35. vllm_ascend/_310p/kv_block_zeroer.py

Why: On main, KVBlockZeroer.__init__ takes full metadata and runs init in ctor; on 0.21 ctor is (device, pin_memory) + separate init_meta. AscendKVBlockZeroer keeps 0.21-style split API for NPU Triton zeroer.
Change: Explicit __init__ initializing _meta/_ids_* fields so subclass does not invoke main’s expanded base __init__ signature incorrectly.
Upstream PR: #35219 introduced KVBlockZeroer; main later merged init paths into ctor (post-0.21). Ascend dual-version shim — no vllm_version_is guard, inheritance layout fix.

---

## Not in the 35 (reference only)

| Path | Note |
|------|------|
| `.github/vllm-main-verified.commit` | Pin moved `9090368b` → `efc347f1b` — defines version B for entire PR |
| `vllm_ascend/worker/model_runner_v1.py` | **Intentionally unchanged** in #10250 |
| `vllm_ascend/worker/v2/sample/logprob.py` | **Reverted / out of scope** — slow-path `max_per_req_token_ids` adapter removed from PR |
| `vllm_ascend/patch/worker/patch_v2/patch_structured_outputs.py` | **Deleted** — guided decoding uses existing `#8443` Triton path via `patch_v2/patch_triton.py` |
---

## Upstream PR quick index

| PR | Topic |
|----|--------|
| [#37505](https://github.com/vllm-project/vllm/pull/37505) | `KVCacheSpecRegistry` |
| [#41471](https://github.com/vllm-project/vllm/pull/41471) | Remove DCP helpers; remove `patch_tensor_parallel_group` |
| [#42187](https://github.com/vllm-project/vllm/pull/42187) | MRV2 PP fields; `max_tokens` in `RequestState` |
| [#42752](https://github.com/vllm-project/vllm/pull/42752) | `tool_choice="none"` streaming |
| [#43241](https://github.com/vllm-project/vllm/pull/43241) | MRV2 speculator modularization |
| [#43447](https://github.com/vllm-project/vllm/pull/43447) | Prefix retention interval |
| [#43556](https://github.com/vllm-project/vllm/pull/43556) | Mamba LINEAR / GDN module paths |
| [#43720](https://github.com/vllm-project/vllm/pull/43720) | PP-aware KV handshake |
| [#43838](https://github.com/vllm-project/vllm/pull/43838) | `is_cumem_allocator_available` |
| [#44017](https://github.com/vllm-project/vllm/pull/44017) | Parser `parse_delta(finished=...)` |
| [#44082](https://github.com/vllm-project/vllm/pull/44082) | `drop_eagle_block` rename |
| [#44165](https://github.com/vllm-project/vllm/pull/44165) | `scheduler_block_size` threading |
| [#40559](https://github.com/vllm-project/vllm/pull/40559) | MRV2 baseline (guided-decoding test matrix) |
| [#35219](https://github.com/vllm-project/vllm/pull/35219) | `KVBlockZeroer` introduction |

---

- vLLM version: v0.21.0
- vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5

zhao-stack commented Jun 9, 2026

/e2e tests/e2e/pull_request/one_card/model_runner_v2/test_basic.py::test_qwen3_dense_graph_mode
[Bot]: e2e command triggered. See workflow run for details.
[Bot]: e2e command failed.

mergify Bot commented Jun 9, 2026

⚠️ The sha of the head commit of this PR conflicts with #10233. Mergify cannot evaluate rules on this PR. Once #10233 is merged or closed, Mergify will resume processing this PR. ⚠️

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on maintaining cross-version compatibility between vLLM v0.21.0 and the current main branch. It introduces conditional logic throughout the vllm_ascend codebase to handle API changes, such as refactored parser classes and manager registrations. Additionally, it enhances NPU support by adding specific patches for structured outputs and synchronization, while also refining E2E testing to ensure numerical consistency across different execution modes.

Highlights

  • Version Compatibility: Implemented extensive version-gating logic using vllm_version_is to ensure the codebase remains compatible with both vLLM v0.21.0 and the current main branch.
  • Structured Outputs: Added a new patch for the v2 model runner to provide NPU-compatible structured output bitmasking, reusing the validated xgrammar in-place path from v1.
  • Testing Improvements: Updated E2E tests to utilize compare_logprobs for better numerical consistency verification and added necessary flags for graph-mode testing.
  • Platform Patches: Added a patch for torch.accelerator.synchronize to improve NPU compatibility and implemented is_cumem_allocator_available in NPUPlatform.

Ignored Files
  • Ignored by pattern: .github/workflows/** (5)
    • .github/workflows/dockerfiles/Dockerfile.lint
    • .github/workflows/pr_e2e_command.yml
    • .github/workflows/pr_test.yaml
    • .github/workflows/scripts/run_selected_tests.sh
    • .github/workflows/scripts/run_suite.py

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

Suggested PR Title:

[Misc][Feature] Support compatibility with vLLM 0.21.0 and newer versions

Suggested PR Summary:

### What this PR does / why we need it?

This pull request introduces comprehensive compatibility support for both vLLM v0.21.0 and newer versions (including vLLM main). It achieves this by adding version-conditional imports, backporting helpers that were removed in newer vLLM versions (such as `get_decode_context_model_parallel_world_size`, `get_decode_context_model_parallel_rank`, and `patch_tensor_parallel_group`), and adapting signatures to handle renamed or added arguments (e.g., `use_eagle` vs `drop_eagle_block`, and `scheduler_block_size`). Additionally, it introduces an NPU-compatible structured output bitmask for the v2 model runner and updates attention, speculator, and logprob implementations to align with upstream changes.

During the review, several critical issues were identified:
- In `vllm_ascend/core/single_type_kv_cache_manager.py` and `vllm_ascend/patch/platform/patch_mamba_manager.py`, the modified signature of `find_longest_cache_hit` breaks positional argument compatibility, which could lead to severe runtime errors or silent correctness bugs. A unified signature is suggested to maintain full backward and forward compatibility.
- In `vllm_ascend/worker/v2/sample/logprob.py`, the Triton kernel `_fill_logprob_token_ids_kernel` is called with an unexpected keyword argument `multibuffer=False`, which will raise a `TypeError` at runtime.

### Does this PR introduce _any_ user-facing change?

No, this PR focuses on internal compatibility and alignment with upstream vLLM versions.

### How was this patch tested?

The changes were tested using existing and updated end-to-end and unit tests, including model runner, guided decoding, and spec decode tests.

Comment thread vllm_ascend/core/single_type_kv_cache_manager.py
Comment on lines 34 to 39
alignment_tokens: int,
dcp_world_size: int = 1,
pcp_world_size: int = 1,
use_eagle: bool = False,
drop_eagle_block: bool = False,
) -> tuple[list[KVCacheBlock], ...]:

critical

Similar to single_type_kv_cache_manager.py, the signature of find_longest_cache_hit in AscendMambaManager has been modified in a way that breaks positional argument compatibility. We should apply the same compatible signature here to prevent TypeError or argument mismatch when called positionally or via keyword arguments.

Suggested change
- alignment_tokens: int,
- dcp_world_size: int = 1,
- pcp_world_size: int = 1,
- use_eagle: bool = False,
- drop_eagle_block: bool = False,
- ) -> tuple[list[KVCacheBlock], ...]:
+ use_eagle_or_drop_block: bool = False,
+ alignment_tokens: int = 0,
+ dcp_world_size: int = 1,
+ pcp_world_size: int = 1,
+ use_eagle: bool = False,
+ drop_eagle_block: bool = False,
+ ) -> tuple[list[KVCacheBlock], ...]:

Comment thread vllm_ascend/worker/v2/sample/logprob.py Outdated
Comment on lines +207 to +210
NUM_TOPK=num_logprobs,
PADDED_COLS=triton.next_power_of_2(num_cols),
multibuffer=False,
)

critical

The Triton kernel _fill_logprob_token_ids_kernel is called with multibuffer=False. However, multibuffer is not defined as a parameter in the kernel's signature. Passing an unexpected keyword argument to a Triton JIT function will raise a TypeError at runtime.

We should remove the multibuffer=False argument from the kernel call.

            NUM_TOPK=num_logprobs,
            PADDED_COLS=triton.next_power_of_2(num_cols),
        )

github-actions Bot commented Jun 9, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@zhao-stack zhao-stack added the ready enable e2e test for PR label Jun 9, 2026
@zhao-stack zhao-stack force-pushed the Misc]-test-m2m-e2e branch from c41d75f to e59a9f7 Compare June 10, 2026 04:39
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@zhao-stack zhao-stack force-pushed the Misc]-test-m2m-e2e branch from 8efaf68 to 0647dcc Compare June 10, 2026 11:44
@github-actions github-actions Bot removed the merge-conflicts and documentation labels Jun 10, 2026

zhao-stack commented Jun 10, 2026

/e2e tests/e2e/pull_request/four_card/test_data_parallel_tp2.py::test_qwen3_inference_dp2_tp2
[Bot]: e2e command triggered. See workflow run for details.
[Bot]: e2e command failed.

nofushanquan and others added 24 commits June 12, 2026 21:12
Signed-off-by: nofushanquan <1255959842@qq.com> (7 commits)
Signed-off-by: shenzhao <shenzhao9@huawei.com> (16 commits)
Signed-off-by: zhao-stack <2020265299@qq.com> (1 commit)

@MengqingCao MengqingCao left a comment

LGTM, thx!

@MengqingCao MengqingCao merged commit 72797cf into vllm-project:main Jun 12, 2026
16 of 18 checks passed
Fager10086 pushed a commit to Fager10086/vllm-ascend that referenced this pull request Jun 15, 2026
vllm main2main adaption

- vLLM version: v0.21.0
- vLLM main: vllm-project/vllm@9090368
---------
Signed-off-by: nofushanquan <1255959842@qq.com>
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: zhao-stack <2020265299@qq.com>
Co-authored-by: nofushanquan <1255959842@qq.com>
Co-authored-by: liyishi <1252651434@qq.com>
Co-authored-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: zhaorifa <865071616@qq.com>
Fager10086 pushed a commit to Fager10086/vllm-ascend that referenced this pull request Jun 15, 2026
vllm main2main adaption

- vLLM version: v0.21.0
- vLLM main: vllm-project/vllm@9090368
---------
Signed-off-by: nofushanquan <1255959842@qq.com>
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: zhao-stack <2020265299@qq.com>
Co-authored-by: nofushanquan <1255959842@qq.com>
Co-authored-by: liyishi <1252651434@qq.com>
Co-authored-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: Fager10086 <865071616@qq.com>