[Misc]main2main 0522 by zhao-stack · Pull Request #9399 · vllm-project/vllm-ascend

zhao-stack · 2026-05-21T03:05:00Z

This PR updates vllm-ascend main2main validation to:

Main upstream changes and vllm-ascend adaptations:

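vLLM PRs:
- vllm-project/vllm#43004
- vllm-project/vllm#43039
- vllm-project/vllm#43073
- vllm-project/vllm#43077
DeepSeek V4 model refactoring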

Upstream changes:
- Migrates DeepSeek V4 implementation from old vllm.model_executor.layers.* paths to vllm.models.deepseek_v4.*.
- Moves DeepSeek V4 attention / compressor related classes to the new model package.
vllm-ascend adaptation:
- Update vllm_ascend/models/deepseek_v4.py to import CompressorStateCache and DeepseekV4IndexerCache from the correct path.
- Update vllm_ascend/patch/worker/patch_deepseek_compressor.py to patch the correct module object.
- Keep compatibility with v0.20.2 by using the old import path when vllm_version_is("0.20.2").
vLLM PR: [Bugfix][MRV2] Fix KVCache tensor explicit kernel_block_size dim vllm#42766
[Bugfix][MRV2] Fix KVCache tensor explicit kernel_block_size dim

Upstream changes:
- Adds explicit kernel_block_sizes to V2 attention / KV cache initialization.
- Changes BlockTables construction and KV cache reshape logic to distinguish logical block size from kernel block size.
vllm-ascend adaptation:
- Update vllm_ascend/worker/v2/block_table.py to accept the new kernel_block_sizes argument.
- Keep old v0.20.2 constructor behavior with vllm_version_is("0.20.2").
- Update vllm_ascend/worker/v2/attn_utils.py to reshape KV cache with kernel block size while preserving storage block size handling.
vLLM PR: [Feature] Support manually enabling the cumem allocator vllm#33648
Support manually enabling the cumem allocator

Upstream changes:
- Adds CuMem allocator availability validation in ModelConfig.
- The validation runs before Ascend worker initialization.
vllm-ascend adaptation:
- Add vllm_ascend/patch/platform/patch_camem_allocator.py.
- Patch is_cumem_allocator_available so Ascend CaMem sleep-mode support satisfies the allocator check.
- Register the patch from vllm_ascend/patch/platform/__init__.py.
vLLM PRs:
- [Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing vllm#40172
- [Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba vllm#41873
Mamba state postprocess / is_prefilling changes

Upstream changes:
- Introduces MambaBuffers and fused GPU-side Mamba postprocess staging.
- Adds is_prefilling handling and clears padded rows to avoid stale metadata.
vllm-ascend adaptation:
- Update vllm_ascend/worker/model_runner_v1.py to support both old MambaCopyBuffers and new MambaBuffers.
- Stage Mamba postprocess inputs when the new upstream helper exists.
- Pass is_prefilling into common attention metadata and clear padded rows.
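To illustrate why padded rows need clearing, here is a small sketch that uses a plain Python list in place of the device buffer. The function name `clear_padded_rows` and the buffer layout are assumptions made for illustration; they are not the code in `model_runner_v1.py`.

```python
def clear_padded_rows(is_prefilling: list[bool], num_reqs: int) -> list[bool]:
    """Reset every row past the live requests so stale flags are never read."""
    for row in range(num_reqs, len(is_prefilling)):
        is_prefilling[row] = False
    return is_prefilling


# The buffer is sized for a padded batch of 6 rows. Only the first 3 rows
# belong to live requests; rows 3 and 5 still hold flags from an earlier step.
buf = [True, False, True, True, False, True]
print(clear_padded_rows(buf, num_reqs=3))
# [True, False, True, False, False, False]
```

Rows that belong to live requests are left untouched; only the padding region is reset.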
vLLM PR: [Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 vllm#36329
Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2

Upstream changes:
- Adds split_ba helper in GatedDeltaNet attention to correctly split / slice ba under TP.
vllm-ascend adaptation:
- Add _split_ba_for_tp in vllm_ascend/ops/gdn.py.
- Use upstream split_ba when available.
- Fall back to old ba.chunk(2, dim=-1) behavior for older vLLM versions.
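The "use upstream when available, otherwise fall back" pattern can be sketched as follows. This is a simplified illustration, not the body of `_split_ba_for_tp`: it assumes the helper is reachable as an attribute named `split_ba` on some object, and it uses plain lists with a hypothetical `chunk_last_dim` function standing in for `ba.chunk(2, dim=-1)`.

```python
from types import SimpleNamespace


def chunk_last_dim(ba: list[float]) -> tuple[list[float], list[float]]:
    """Stand-in for ba.chunk(2, dim=-1) on a 1-D list: split into two halves."""
    half = len(ba) // 2
    return ba[:half], ba[half:]


def split_ba_for_tp(layer, ba: list[float]):
    # Prefer the upstream helper when the installed vLLM provides it.
    upstream_split = getattr(layer, "split_ba", None)
    if callable(upstream_split):
        return upstream_split(ba)
    # Older vLLM: fall back to the even two-way split.
    return chunk_last_dim(ba)


old_layer = SimpleNamespace()  # no split_ba attribute
new_layer = SimpleNamespace(split_ba=lambda ba: (ba[:1], ba[1:]))  # toy helper

print(split_ba_for_tp(old_layer, [1.0, 2.0, 3.0, 4.0]))  # ([1.0, 2.0], [3.0, 4.0])
print(split_ba_for_tp(new_layer, [1.0, 2.0, 3.0, 4.0]))  # ([1.0], [2.0, 3.0, 4.0])
```

Probing for the attribute avoids hard-coding a version comparison, at the cost of silently taking the fallback path if the upstream helper is ever renamed.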
vLLM PR: [BugFix] Use correct logprobs for logprob_token_ids vllm#43125
Use correct logprobs for logprob_token_ids

Upstream changes:
- Propagates logprobs_mode into TopKTopPSampler.
vllm-ascend adaptation:
- Update vllm_ascend/sample/sampler.py to construct AscendTopKTopPSampler(logprobs_mode=logprobs_mode).

How was this patch tested?

Validation focus:
- DeepSeek V4 import / patch compatibility
- V2 KV cache block table / kernel block size
- V1 Mamba / GDN metadata compatibility


- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/1ac10f159a09897baada01b14b6a0dd6442aefd6

github-actions · 2026-05-21T03:05:20Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

- A PR should do only one thing; smaller PRs enable faster reviews.
- Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
- Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

gemini-code-assist · 2026-05-21T03:08:29Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces several improvements to the Ascend NPU backend, primarily focusing on memory management for KV caches and enhancing the Gumbel sampling kernel. These changes improve flexibility in block size handling and add support for capturing processed logits during sampling, while also hardening the implementation against unsupported data types like FP64.

Highlights

KV Cache Reshaping Improvements: Enhanced the KV cache reshaping logic to support kernel-specific block sizes, allowing for more flexible memory management.
Gumbel Sampling Enhancements: Updated the Gumbel sampling kernel to support outputting processed logits and added explicit checks to prevent unsupported FP64 operations on NPU.
Logging and Configuration Updates: Standardized logging using std::printf in error headers and updated the main vLLM commit hash in documentation.

Ignored Files

Ignored by pattern: .github/workflows/** (4)
- .github/workflows/_e2e_test.yaml
- .github/workflows/dockerfiles/Dockerfile.lint
- .github/workflows/pr_test_full.yaml
- .github/workflows/pr_test_light.yaml

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

Suggested PR Title:

[Attention][Misc] Refactor logging macros, KV cache reshaping, and sampling kernels

Suggested PR Summary:

### What this PR does / why we need it?
This PR updates logging macros in MoE tiling headers to use `std::printf`, refactors the KV cache reshaping logic in `attn_utils.py` to incorporate `kernel_block_sizes`, and enhances Gumbel sampling kernels to support processed logits output. It also adds explicit checks to prevent unsupported FP64 operations on NPU and updates the vLLM commit hash in documentation.

Feedback from the review highlights several issues:
1. The logging macros in `error_log.h` introduce performance overhead due to temporary string creation, risk compilation failure if `__VA_ARGS__` is empty, and lack atomicity due to multiple `printf` calls.
2. The `num_blocks` calculation in `attn_utils.py` is incorrect when the kernel block size is larger than the logical block size, which could result in zero blocks being allocated.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with existing tests.

gemini-code-assist · 2026-05-21T03:10:46Z

+        std::printf("[WARN][%s] ", std::string(opname).c_str()); \
+        std::printf(__VA_ARGS__);        \
+        std::printf("\n");              \


The updated logging macros introduce performance overhead and potential compilation issues.
1. Performance: std::string(opname).c_str() creates a temporary std::string object on every log call. If opname is already a std::string (like the result of GetNodeType()), calling .c_str() directly is preferred. If it is a const char*, it should be used directly.
2. Compilation Risk: std::printf(__VA_ARGS__) will fail to compile if __VA_ARGS__ is empty (e.g., OP_LOGW("opname")), as printf requires at least a format string argument.
3. Atomicity: Splitting the log into three printf calls increases the risk of interleaved output from different threads.

gemini-code-assist · 2026-05-21T03:10:46Z

+        std::printf("[WARN][%s] ", std::string(opname).c_str()); \
+        std::printf(__VA_ARGS__);        \
+        std::printf("\n");              \


The updated logging macros introduce performance overhead and potential compilation issues.
1. Performance: std::string(opname).c_str() creates a temporary std::string object on every log call. If opname is already a std::string, calling .c_str() directly is preferred.
2. Compilation Risk: std::printf(__VA_ARGS__) will fail to compile if __VA_ARGS__ is empty, as printf requires at least a format string argument.
3. Atomicity: Splitting the log into three printf calls increases the risk of interleaved output from different threads.

gemini-code-assist · 2026-05-21T03:10:46Z

+                if kv_cache_group_id < len(kernel_block_sizes):
+                    kernel_block_size = kernel_block_sizes[kv_cache_group_id]
+                    num_blocks *= kv_cache_spec.block_size // kernel_block_size


The calculation of num_blocks is incorrect when kernel_block_size is larger than kv_cache_spec.block_size. In Ascend, the kernel block size (e.g., 128) is often larger than the logical block size (e.g., 16). In such cases, kv_cache_spec.block_size // kernel_block_size evaluates to 0, which incorrectly sets num_blocks to 0. The logic should instead calculate the total number of tokens and then divide by the kernel block size.

                if kv_cache_group_id < len(kernel_block_sizes):
                    kernel_block_size = kernel_block_sizes[kv_cache_group_id]
                    num_blocks = (num_blocks * kv_cache_spec.block_size) // kernel_block_size
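The arithmetic behind this comment can be checked with a short worked example. The two function names below are hypothetical labels for the two formulas; the sizes 16 and 128 are the ones mentioned in the review comment, and the block count of 100 is an arbitrary illustration.

```python
def blocks_scaled_by_ratio(num_blocks: int, block_size: int, kernel_block_size: int) -> int:
    # Form used in the diff: multiply by the floor-divided size ratio.
    return num_blocks * (block_size // kernel_block_size)


def blocks_from_total_tokens(num_blocks: int, block_size: int, kernel_block_size: int) -> int:
    # Form suggested in the review: count tokens first, then divide.
    return (num_blocks * block_size) // kernel_block_size


# Kernel block smaller than the logical block: both forms agree.
print(blocks_scaled_by_ratio(100, 128, 16), blocks_from_total_tokens(100, 128, 16))  # 800 800

# Kernel block larger than the logical block: the ratio floors to 0.
print(blocks_scaled_by_ratio(100, 16, 128), blocks_from_total_tokens(100, 16, 128))  # 0 12
```

In the second case 100 blocks of 16 tokens hold 1600 tokens, which is 12 full kernel blocks of 128 tokens, whereas the ratio form reports none.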

github-actions · 2026-05-21T08:52:47Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

zhao-stack · 2026-05-25T01:54:37Z

/e2e tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_dflash_acceptance

zhao-stack · 2026-05-25T01:59:07Z

/e2e tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_dflash_acceptance

Signed-off-by: shenzhao <shenzhao9@huawei.com>

zhao-stack · 2026-05-27T16:56:35Z

310P e2e-light passed: https://github.com/vllm-project/vllm-ascend/actions/runs/26520570143/job/78112538595?pr=9399

Signed-off-by: zhao-stack <80399320+zhao-stack@users.noreply.github.com>

MengqingCao · 2026-05-28T02:10:36Z

+if typing.TYPE_CHECKING:
+    from vllm.models.deepseek_v4.attention import DeepseekV4IndexerCache
+    from vllm.models.deepseek_v4.compressor import CompressorStateCache
+else:
+    if vllm_version_is("0.20.2"):
+        _deepseek_compressor = typing.cast(
+            typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_compressor")
+        )
+        _deepseek_v4_attention = typing.cast(
+            typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_v4_attention")
+        )
+    else:
+        _deepseek_compressor = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.compressor"))
+        _deepseek_v4_attention = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.attention"))
+    CompressorStateCache = _deepseek_compressor.CompressorStateCache
+    DeepseekV4IndexerCache = _deepseek_v4_attention.DeepseekV4IndexerCache


Suggested change

-if typing.TYPE_CHECKING:
-    from vllm.models.deepseek_v4.attention import DeepseekV4IndexerCache
-    from vllm.models.deepseek_v4.compressor import CompressorStateCache
-else:
-    if vllm_version_is("0.20.2"):
-        _deepseek_compressor = typing.cast(
-            typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_compressor")
-        )
-        _deepseek_v4_attention = typing.cast(
-            typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_v4_attention")
-        )
-    else:
-        _deepseek_compressor = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.compressor"))
-        _deepseek_v4_attention = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.attention"))
-    CompressorStateCache = _deepseek_compressor.CompressorStateCache
-    DeepseekV4IndexerCache = _deepseek_v4_attention.DeepseekV4IndexerCache
+if not vllm_version_is("0.20.2"):
+    from vllm.models.deepseek_v4.attention import DeepseekV4IndexerCache # noqa
+    from vllm.models.deepseek_v4.compressor import CompressorStateCache # noqa
+else:
+    from vllm.model_executor.layers.deepseek_compressor ...

The import path has been changed.

MengqingCao · 2026-05-28T02:13:16Z

+    return True
+
+
+def _patched_is_cumem_allocator_available() -> bool:


let's double check if this is reasonable

Unnecessary code has been deleted.

MengqingCao · 2026-05-28T02:13:31Z

 from vllm_ascend.patch.platform.patch_kv_cache_interface import AscendMLAAttentionSpec
+from vllm_ascend.utils import vllm_version_is
+
+if TYPE_CHECKING:


The import path has been changed.

MengqingCao · 2026-05-28T02:15:51Z

                    if deferred_state_corrections_fn:
                        deferred_state_corrections_fn()
                        deferred_state_corrections_fn = None
+                    if hasattr(mamba_utils, "MambaBuffers"):


Suggested change

-                    if hasattr(mamba_utils, "MambaBuffers"):
+                    if not vllm_version_is("0.20.2") and hasattr(mamba_utils, "MambaBuffers"):

MengqingCao · 2026-05-28T02:16:24Z

                    self.num_accepted_tokens.copy_to_gpu(num_reqs)
+
+                    postprocess_bufs = getattr(mamba_bufs, "postprocess_align", None)
+                    if postprocess_bufs is not None and hasattr(


MengqingCao

LGTM

zhao-stack · 2026-05-28T11:15:52Z

e2e-full passed: https://github.com/vllm-project/vllm-ascend/actions/runs/26553149942/job/78219613699?pr=9399

Signed-off-by: shenzhao <shenzhao9@huawei.com>

This PR updates vllm-ascend main2main validation to: Main upstream changes and vllm-ascend adaptations: 1. vLLM PRs: - vllm-project/vllm#43004 - vllm-project/vllm#43039 - vllm-project/vllm#43073 - vllm-project/vllm#43077 `DeepSeek V4 model refactoring` Upstream changes: - Migrates DeepSeek V4 implementation from old `vllm.model_executor.layers.*` paths to `vllm.models.deepseek_v4.*`. - Moves DeepSeek V4 attention / compressor related classes to the new model package. vllm-ascend adaptation: - Update `vllm_ascend/models/deepseek_v4.py` to import `CompressorStateCache` and `DeepseekV4IndexerCache` from the correct path. - Update `vllm_ascend/patch/worker/patch_deepseek_compressor.py` to patch the correct module object. - Keep compatibility with `v0.20.2` by using the old import path when `vllm_version_is("0.20.2")`. 2. vLLM PR: vllm-project/vllm#42766 `[Bugfix][MRV2] Fix KVCache tensor explicit kernel_block_size dim` Upstream changes: - Adds explicit `kernel_block_sizes` to V2 attention / KV cache initialization. - Changes `BlockTables` construction and KV cache reshape logic to distinguish logical block size from kernel block size. vllm-ascend adaptation: - Update `vllm_ascend/worker/v2/block_table.py` to accept the new `kernel_block_sizes` argument. - Keep old `v0.20.2` constructor behavior with `vllm_version_is("0.20.2")`. - Update `vllm_ascend/worker/v2/attn_utils.py` to reshape KV cache with kernel block size while preserving storage block size handling. 3. vLLM PR: vllm-project/vllm#33648 `Support manually enabling the cumem allocator` Upstream changes: - Adds CuMem allocator availability validation in `ModelConfig`. - The validation runs before Ascend worker initialization. vllm-ascend adaptation: - Add `vllm_ascend/patch/platform/patch_camem_allocator.py`. - Patch `is_cumem_allocator_available` so Ascend CaMem sleep-mode support satisfies the allocator check. - Register the patch from `vllm_ascend/patch/platform/__init__.py`. 4. 
vLLM PRs: - vllm-project/vllm#40172 - vllm-project/vllm#41873 `Mamba state postprocess / is_prefilling changes` Upstream changes: - Introduces `MambaBuffers` and fused GPU-side Mamba postprocess staging. - Adds `is_prefilling` handling and clears padded rows to avoid stale metadata. vllm-ascend adaptation: - Update `vllm_ascend/worker/model_runner_v1.py` to support both old `MambaCopyBuffers` and new `MambaBuffers`. - Stage Mamba postprocess inputs when the new upstream helper exists. - Pass `is_prefilling` into common attention metadata and clear padded rows. 5. vLLM PR: vllm-project/vllm#36329 `Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2` Upstream changes: - Adds `split_ba` helper in GatedDeltaNet attention to correctly split / slice `ba` under TP. vllm-ascend adaptation: - Add `_split_ba_for_tp` in `vllm_ascend/ops/gdn.py`. - Use upstream `split_ba` when available. - Fall back to old `ba.chunk(2, dim=-1)` behavior for older vLLM versions. 6. vLLM PR: vllm-project/vllm#43125 `Use correct logprobs for logprob_token_ids` Upstream changes: - Propagates `logprobs_mode` into `TopKTopPSampler`. vllm-ascend adaptation: - Update `vllm_ascend/sample/sampler.py` to construct `AscendTopKTopPSampler(logprobs_mode=logprobs_mode)`. - vLLM version: v0.20.2 - vLLM main: vllm-project/vllm@1ac10f1 --------- Signed-off-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: zhao-stack <80399320+zhao-stack@users.noreply.github.com> Co-authored-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: XhgAtHuawei <guoxiaohui7@huawei.com>

zhao-stack requested review from LCAIZJ, MengqingCao, Yikun, wangxiyuan and zzzzwwjj as code owners May 21, 2026 03:05

github-actions Bot added the documentation and ci/build labels May 21, 2026

zhangxinyuehfad added the ready and ready-for-test labels May 21, 2026

gemini-code-assist Bot reviewed May 21, 2026

zhao-stack changed the title ~~m2m 0521~~ [Misc]m2m 0521 May 21, 2026

zhao-stack changed the title ~~[Misc]m2m 0521~~ [Misc]main2main 0521 May 21, 2026

github-actions Bot added the merge-conflicts label May 21, 2026

github-actions Bot removed the merge-conflicts label May 21, 2026

zhao-stack requested review from realliujiaxu and whx-sjtu as code owners May 22, 2026 13:16

github-actions Bot added the module:ops label May 22, 2026

zhao-stack force-pushed the m2m-0521 branch 2 times, most recently from 137d01b to 383a5ac Compare May 23, 2026 12:40

zhao-stack changed the title ~~[Misc]main2main 0521~~ [Misc]main2main 0522 May 23, 2026

zhao-stack force-pushed the m2m-0521 branch from 4ee8f87 to 383a5ac Compare May 24, 2026 15:10

zhangxinyuehfad removed the ready-for-test label May 25, 2026

zhangxinyuehfad added ready read for review e2e-test and removed ready read for review labels May 25, 2026

zhangxinyuehfad added the e2e-test label May 27, 2026

vllm-ascend-ci added e2e-test and removed e2e-test labels May 27, 2026

trigger CI

cf504d6

Signed-off-by: shenzhao <shenzhao9@huawei.com>

github-actions Bot added the module:tests label May 27, 2026

zhangxinyuehfad removed the ready and e2e-test labels May 27, 2026

trigger CI

f27bb3f

Signed-off-by: shenzhao <shenzhao9@huawei.com>

Update test_vl_model_multicard.py

7865963

Signed-off-by: zhao-stack <80399320+zhao-stack@users.noreply.github.com>

Tflowers-0129 approved these changes May 28, 2026

MengqingCao reviewed May 28, 2026

MengqingCao added the ready and ready-for-test labels May 28, 2026

shenzhao added 4 commits May 28, 2026 11:07

fix

3a8aca8

Signed-off-by: shenzhao <shenzhao9@huawei.com>

Merge branch 'm2m-0521' of github.com:zhao-stack/vllm-ascend into m2m…

76b5eb5

…-0521

fix import

31e6cd8

Signed-off-by: shenzhao <shenzhao9@huawei.com>

fix

c65189e

Signed-off-by: shenzhao <shenzhao9@huawei.com>

MengqingCao approved these changes May 28, 2026

shenzhao added 2 commits May 28, 2026 19:22

ignore import

41478f8

Signed-off-by: shenzhao <shenzhao9@huawei.com>

fix pre-commit

f6acdac

Signed-off-by: shenzhao <shenzhao9@huawei.com>

MengqingCao approved these changes May 28, 2026

MengqingCao merged commit 360d47c into vllm-project:main May 28, 2026
56 of 58 checks passed

6 participants
