
[Misc]main2main 0522 #9399

Merged
MengqingCao merged 27 commits into vllm-project:main from zhao-stack:m2m-0521 on May 28, 2026

@zhao-stack zhao-stack commented May 21, 2026

This PR updates vllm-ascend main2main validation to vLLM main commit 1ac10f159a09897baada01b14b6a0dd6442aefd6 (listed at the end of this description).

Main upstream changes and vllm-ascend adaptations:

  1. vLLM PRs: vllm#43004, vllm#43039, vllm#43073, vllm#43077

    DeepSeek V4 model refactoring

    Upstream changes:

    • Migrates DeepSeek V4 implementation from old vllm.model_executor.layers.* paths to vllm.models.deepseek_v4.*.
    • Moves DeepSeek V4 attention / compressor related classes to the new model package.

    vllm-ascend adaptation:

    • Update vllm_ascend/models/deepseek_v4.py to import CompressorStateCache and DeepseekV4IndexerCache from the correct path.
    • Update vllm_ascend/patch/worker/patch_deepseek_compressor.py to patch the correct module object.
    • Keep compatibility with v0.20.2 by using the old import path when vllm_version_is("0.20.2").
  2. vLLM PR: vllm#42766
    [Bugfix][MRV2] Fix KVCache tensor explicit kernel_block_size dim

    Upstream changes:

    • Adds explicit kernel_block_sizes to V2 attention / KV cache initialization.
    • Changes BlockTables construction and KV cache reshape logic to distinguish logical block size from kernel block size.

    vllm-ascend adaptation:

    • Update vllm_ascend/worker/v2/block_table.py to accept the new kernel_block_sizes argument.
    • Keep old v0.20.2 constructor behavior with vllm_version_is("0.20.2").
    • Update vllm_ascend/worker/v2/attn_utils.py to reshape KV cache with kernel block size while preserving storage block size handling.
  3. vLLM PR: vllm#33648
    Support manually enabling the cumem allocator

    Upstream changes:

    • Adds CuMem allocator availability validation in ModelConfig.
    • The validation runs before Ascend worker initialization.

    vllm-ascend adaptation:

    • Add vllm_ascend/patch/platform/patch_camem_allocator.py.
    • Patch is_cumem_allocator_available so Ascend CaMem sleep-mode support satisfies the allocator check.
    • Register the patch from vllm_ascend/patch/platform/__init__.py.
  4. vLLM PRs: vllm#40172, vllm#41873

    Mamba state postprocess / is_prefilling changes

    Upstream changes:

    • Introduces MambaBuffers and fused GPU-side Mamba postprocess staging.
    • Adds is_prefilling handling and clears padded rows to avoid stale metadata.

    vllm-ascend adaptation:

    • Update vllm_ascend/worker/model_runner_v1.py to support both old MambaCopyBuffers and new MambaBuffers.
    • Stage Mamba postprocess inputs when the new upstream helper exists.
    • Pass is_prefilling into common attention metadata and clear padded rows.
  5. vLLM PR: vllm#36329
    Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2

    Upstream changes:

    • Adds split_ba helper in GatedDeltaNet attention to correctly split / slice ba under TP.

    vllm-ascend adaptation:

    • Add _split_ba_for_tp in vllm_ascend/ops/gdn.py.
    • Use upstream split_ba when available.
    • Fall back to old ba.chunk(2, dim=-1) behavior for older vLLM versions.
  6. vLLM PR: vllm#43125
    Use correct logprobs for logprob_token_ids

    Upstream changes:

    • Propagates logprobs_mode into TopKTopPSampler.

    vllm-ascend adaptation:

    • Update vllm_ascend/sample/sampler.py to construct AscendTopKTopPSampler(logprobs_mode=logprobs_mode).
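
The allocator patch in item 3 can be sketched roughly as follows. This is only an illustration of rebinding a module-level function before a validation step runs: `upstream_allocator`, `validate_sleep_mode`, and `apply_patch` are hypothetical stand-ins, and the body of `_patched_is_cumem_allocator_available` is an assumption, not the code from the PR.

```python
from types import SimpleNamespace

# Stand-in for the upstream module that owns the availability check (hypothetical).
upstream_allocator = SimpleNamespace(is_cumem_allocator_available=lambda: False)


def validate_sleep_mode(enable_sleep_mode: bool) -> None:
    """Hypothetical config-time check; it resolves the function at call time."""
    if enable_sleep_mode and not upstream_allocator.is_cumem_allocator_available():
        raise ValueError("sleep mode requires an available allocator")


def _patched_is_cumem_allocator_available() -> bool:
    # Assumption: the platform's own allocator covers sleep mode, so report True.
    return True


def apply_patch() -> None:
    """Rebind the attribute on the owning module so later lookups see the patch."""
    upstream_allocator.is_cumem_allocator_available = _patched_is_cumem_allocator_available


apply_patch()              # must run before the validation below
validate_sleep_mode(True)  # passes only because the patch is already in place
```

Ordering matters with this pattern: any caller that bound the original function by name (for example via `from module import name`) before the patch was applied keeps calling the unpatched version.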

How was this patch tested?

  • Validation focus:
    • DeepSeek V4 import / patch compatibility
    • V2 KV cache block table / kernel block size
    • V1 Mamba / GDN metadata compatibility

- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/1ac10f159a09897baada01b14b6a0dd6442aefd6

@github-actions github-actions Bot added documentation ci/build labels May 21, 2026
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@zhangxinyuehfad zhangxinyuehfad added ready ready-for-test labels May 21, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces several improvements to the Ascend NPU backend, primarily focusing on memory management for KV caches and enhancing the Gumbel sampling kernel. These changes improve flexibility in block size handling and add support for capturing processed logits during sampling, while also hardening the implementation against unsupported data types like FP64.

Highlights

  • KV Cache Reshaping Improvements: Enhanced the KV cache reshaping logic to support kernel-specific block sizes, allowing for more flexible memory management.
  • Gumbel Sampling Enhancements: Updated the Gumbel sampling kernel to support outputting processed logits and added explicit checks to prevent unsupported FP64 operations on NPU.
  • Logging and Configuration Updates: Standardized logging using std::printf in error headers and updated the main vLLM commit hash in documentation.

Ignored Files
  • Ignored by pattern: .github/workflows/** (4)
    • .github/workflows/_e2e_test.yaml
    • .github/workflows/dockerfiles/Dockerfile.lint
    • .github/workflows/pr_test_full.yaml
    • .github/workflows/pr_test_light.yaml

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

Suggested PR Title:

[Attention][Misc] Refactor logging macros, KV cache reshaping, and sampling kernels

Suggested PR Summary:

### What this PR does / why we need it?
This PR updates logging macros in MoE tiling headers to use `std::printf`, refactors the KV cache reshaping logic in `attn_utils.py` to incorporate `kernel_block_sizes`, and enhances Gumbel sampling kernels to support processed logits output. It also adds explicit checks to prevent unsupported FP64 operations on NPU and updates the vLLM commit hash in documentation.

Feedback from the review highlights several issues:
1. The logging macros in `error_log.h` introduce performance overhead due to temporary string creation, risk compilation failure if `__VA_ARGS__` is empty, and lack atomicity due to multiple `printf` calls.
2. The `num_blocks` calculation in `attn_utils.py` is incorrect when the kernel block size is larger than the logical block size, which could result in zero blocks being allocated.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with existing tests.

Comment on lines +11 to +13
std::printf("[WARN][%s] ", std::string(opname).c_str()); \
std::printf(__VA_ARGS__); \
std::printf("\n"); \

high

The updated logging macros introduce performance overhead and potential compilation issues.

  1. Performance: std::string(opname).c_str() creates a temporary std::string object on every log call. If opname is already a std::string (like the result of GetNodeType()), calling .c_str() directly is preferred. If it is a const char*, it should be used directly.
  2. Compilation Risk: std::printf(__VA_ARGS__) will fail to compile if __VA_ARGS__ is empty (e.g., OP_LOGW("opname")), as printf requires at least a format string argument.
  3. Atomicity: Splitting the log into three printf calls increases the risk of interleaved output from different threads.

Comment on lines +11 to +13
std::printf("[WARN][%s] ", std::string(opname).c_str()); \
std::printf(__VA_ARGS__); \
std::printf("\n"); \

high

The updated logging macros introduce performance overhead and potential compilation issues.

  1. Performance: std::string(opname).c_str() creates a temporary std::string object on every log call. If opname is already a std::string, calling .c_str() directly is preferred.
  2. Compilation Risk: std::printf(__VA_ARGS__) will fail to compile if __VA_ARGS__ is empty, as printf requires at least a format string argument.
  3. Atomicity: Splitting the log into three printf calls increases the risk of interleaved output from different threads.

Comment on lines +342 to +344
if kv_cache_group_id < len(kernel_block_sizes):
    kernel_block_size = kernel_block_sizes[kv_cache_group_id]
    num_blocks *= kv_cache_spec.block_size // kernel_block_size

high

The calculation of num_blocks is incorrect when kernel_block_size is larger than kv_cache_spec.block_size. In Ascend, the kernel block size (e.g., 128) is often larger than the logical block size (e.g., 16). In such cases, kv_cache_spec.block_size // kernel_block_size evaluates to 0, which incorrectly sets num_blocks to 0. The logic should instead calculate the total number of tokens and then divide by the kernel block size.

if kv_cache_group_id < len(kernel_block_sizes):
    kernel_block_size = kernel_block_sizes[kv_cache_group_id]
    num_blocks = (num_blocks * kv_cache_spec.block_size) // kernel_block_size
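
A small worked example of the arithmetic in this comment, using the illustrative sizes quoted above (logical block size 16, kernel block size 128). The function names and the starting `num_blocks` value of 1024 are assumptions made for illustration, not code from the PR.

```python
def scale_blocks_ratio_first(num_blocks: int, block_size: int, kernel_block_size: int) -> int:
    """Scale by the integer ratio first, as in the reviewed lines."""
    return num_blocks * (block_size // kernel_block_size)


def scale_blocks_tokens_first(num_blocks: int, block_size: int, kernel_block_size: int) -> int:
    """Compute the total token count first, then divide, as the review suggests."""
    return (num_blocks * block_size) // kernel_block_size


# Kernel block larger than logical block: the ratio floors to 0.
print(scale_blocks_ratio_first(1024, 16, 128))   # 0
print(scale_blocks_tokens_first(1024, 16, 128))  # 128

# Kernel block smaller than logical block: both forms agree.
print(scale_blocks_ratio_first(1024, 128, 16))   # 8192
print(scale_blocks_tokens_first(1024, 128, 16))  # 8192
```

The two forms differ only when the logical block size is not a multiple of the kernel block size, which includes every case where the kernel block is the larger of the two.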

@zhao-stack zhao-stack changed the title m2m 0521 [Misc]m2m 0521 May 21, 2026
@zhao-stack zhao-stack changed the title [Misc]m2m 0521 [Misc]main2main 0521 May 21, 2026
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@zhao-stack zhao-stack force-pushed the m2m-0521 branch 2 times, most recently from 137d01b to 383a5ac Compare May 23, 2026 12:40
@zhao-stack zhao-stack changed the title [Misc]main2main 0521 [Misc]main2main 0522 May 23, 2026
@zhangxinyuehfad zhangxinyuehfad removed the ready-for-test label May 25, 2026
@zhao-stack

/e2e tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_dflash_acceptance

@zhangxinyuehfad zhangxinyuehfad added ready read for review e2e-test and removed ready read for review labels May 25, 2026
@zhao-stack

/e2e tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_dflash_acceptance

Signed-off-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: shenzhao <shenzhao9@huawei.com>
@zhao-stack

Signed-off-by: zhao-stack <80399320+zhao-stack@users.noreply.github.com>
Comment thread vllm_ascend/models/deepseek_v4.py Outdated
Comment on lines +86 to +101
if typing.TYPE_CHECKING:
    from vllm.models.deepseek_v4.attention import DeepseekV4IndexerCache
    from vllm.models.deepseek_v4.compressor import CompressorStateCache
else:
    if vllm_version_is("0.20.2"):
        _deepseek_compressor = typing.cast(
            typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_compressor")
        )
        _deepseek_v4_attention = typing.cast(
            typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_v4_attention")
        )
    else:
        _deepseek_compressor = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.compressor"))
        _deepseek_v4_attention = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.attention"))
    CompressorStateCache = _deepseek_compressor.CompressorStateCache
    DeepseekV4IndexerCache = _deepseek_v4_attention.DeepseekV4IndexerCache

Suggested change
- if typing.TYPE_CHECKING:
-     from vllm.models.deepseek_v4.attention import DeepseekV4IndexerCache
-     from vllm.models.deepseek_v4.compressor import CompressorStateCache
- else:
-     if vllm_version_is("0.20.2"):
-         _deepseek_compressor = typing.cast(
-             typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_compressor")
-         )
-         _deepseek_v4_attention = typing.cast(
-             typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_v4_attention")
-         )
-     else:
-         _deepseek_compressor = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.compressor"))
-         _deepseek_v4_attention = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.attention"))
-     CompressorStateCache = _deepseek_compressor.CompressorStateCache
-     DeepseekV4IndexerCache = _deepseek_v4_attention.DeepseekV4IndexerCache
+ if not vllm_version_is("0.20.2"):
+     from vllm.models.deepseek_v4.attention import DeepseekV4IndexerCache  # noqa
+     from vllm.models.deepseek_v4.compressor import CompressorStateCache  # noqa
+ else:
+     from vllm.model_executor.layers.deepseek_compressor ...


The import path has been changed.
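
For readers comparing the two styles in this thread, here is a minimal sketch of a related but different technique: instead of the explicit `vllm_version_is("0.20.2")` gate used in the PR, it tries candidate module paths in order and takes the first one that imports. The helper name `import_attr_from_first` is hypothetical, and the demo uses standard-library modules as stand-ins so it runs without vLLM installed.

```python
import importlib
from typing import Any, Sequence


def import_attr_from_first(module_paths: Sequence[str], attr: str) -> Any:
    """Return `attr` from the first module in `module_paths` that imports and defines it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # this layout does not exist in the installed version
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr!r} not found in any of {list(module_paths)}")


# Stand-in demo with stdlib modules: the first path is deliberately missing.
OrderedDict = import_attr_from_first(
    ["no_such_package.new_layout", "collections"], "OrderedDict"
)
print(OrderedDict.__name__)  # OrderedDict
```

An explicit version gate, as suggested in the review, keeps the imports visible to static analysis; the try-in-order lookup trades that for not having to name a version.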

return True


def _patched_is_cumem_allocator_available() -> bool:

let's double check if this is reasonable


Unnecessary code has been deleted.

from vllm_ascend.patch.platform.patch_kv_cache_interface import AscendMLAAttentionSpec
from vllm_ascend.utils import vllm_version_is

if TYPE_CHECKING:

ditto


The import path has been changed.

Comment thread vllm_ascend/worker/model_runner_v1.py Outdated
if deferred_state_corrections_fn:
deferred_state_corrections_fn()
deferred_state_corrections_fn = None
if hasattr(mamba_utils, "MambaBuffers"):

Suggested change
- if hasattr(mamba_utils, "MambaBuffers"):
+ if not version_is(0.20.2) and hasattr(mamba_utils, "MambaBuffers"):
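
The suggestion combines a version gate with feature detection. `version_is(0.20.2)` reads as shorthand; the helper imported elsewhere in this PR is `vllm_version_is("0.20.2")`. A minimal sketch of the combined check, assuming a boolean flag in place of the real version helper and using stand-in namespaces in place of `mamba_utils` (the function name and the stand-in classes are hypothetical):

```python
from types import SimpleNamespace


def pick_mamba_buffer_cls(mamba_utils, is_old_version: bool):
    """Prefer MambaBuffers on newer versions when the module exposes it."""
    if not is_old_version and hasattr(mamba_utils, "MambaBuffers"):
        return mamba_utils.MambaBuffers
    return mamba_utils.MambaCopyBuffers


class MambaCopyBuffers:
    """Stand-in for the older buffer type."""


class MambaBuffers:
    """Stand-in for the newer buffer type."""


old_utils = SimpleNamespace(MambaCopyBuffers=MambaCopyBuffers)
new_utils = SimpleNamespace(MambaBuffers=MambaBuffers)

print(pick_mamba_buffer_cls(old_utils, is_old_version=True).__name__)   # MambaCopyBuffers
print(pick_mamba_buffer_cls(new_utils, is_old_version=False).__name__)  # MambaBuffers
```

Checking the version first documents which release the fallback exists for, so the `hasattr` branch can be deleted once that release is no longer supported.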

Comment thread vllm_ascend/worker/model_runner_v1.py Outdated
self.num_accepted_tokens.copy_to_gpu(num_reqs)

postprocess_bufs = getattr(mamba_bufs, "postprocess_align", None)
if postprocess_bufs is not None and hasattr(

ditto


Modified

@MengqingCao MengqingCao added ready ready-for-test labels May 28, 2026
shenzhao added 4 commits May 28, 2026 11:07
fix
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: shenzhao <shenzhao9@huawei.com>
fix
Signed-off-by: shenzhao <shenzhao9@huawei.com>

@MengqingCao MengqingCao left a comment

LGTM

shenzhao added 2 commits May 28, 2026 19:22
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: shenzhao <shenzhao9@huawei.com>
@MengqingCao MengqingCao merged commit 360d47c into vllm-project:main May 28, 2026
56 of 58 checks passed
Biuapha pushed a commit to Biuapha/vllm-ascend that referenced this pull request May 30, 2026
This PR updates vllm-ascend main2main validation to:

Main upstream changes and vllm-ascend adaptations:

1. vLLM PRs:
   - vllm-project/vllm#43004
   - vllm-project/vllm#43039
   - vllm-project/vllm#43073
   - vllm-project/vllm#43077

   `DeepSeek V4 model refactoring`

   Upstream changes:
- Migrates DeepSeek V4 implementation from old
`vllm.model_executor.layers.*` paths to `vllm.models.deepseek_v4.*`.
- Moves DeepSeek V4 attention / compressor related classes to the new
model package.

   vllm-ascend adaptation:
- Update `vllm_ascend/models/deepseek_v4.py` to import
`CompressorStateCache` and `DeepseekV4IndexerCache` from the correct
path.
- Update `vllm_ascend/patch/worker/patch_deepseek_compressor.py` to
patch the correct module object.
- Keep compatibility with `v0.20.2` by using the old import path when
`vllm_version_is("0.20.2")`.

2. vLLM PR: vllm-project/vllm#42766
   `[Bugfix][MRV2] Fix KVCache tensor explicit kernel_block_size dim`

   Upstream changes:
- Adds explicit `kernel_block_sizes` to V2 attention / KV cache
initialization.
- Changes `BlockTables` construction and KV cache reshape logic to
distinguish logical block size from kernel block size.

   vllm-ascend adaptation:
- Update `vllm_ascend/worker/v2/block_table.py` to accept the new
`kernel_block_sizes` argument.
- Keep old `v0.20.2` constructor behavior with
`vllm_version_is("0.20.2")`.
- Update `vllm_ascend/worker/v2/attn_utils.py` to reshape KV cache with
kernel block size while preserving storage block size handling.

3. vLLM PR: vllm-project/vllm#33648
   `Support manually enabling the cumem allocator`

   Upstream changes:
   - Adds CuMem allocator availability validation in `ModelConfig`.
   - The validation runs before Ascend worker initialization.

   vllm-ascend adaptation:
   - Add `vllm_ascend/patch/platform/patch_camem_allocator.py`.
- Patch `is_cumem_allocator_available` so Ascend CaMem sleep-mode
support satisfies the allocator check.
   - Register the patch from `vllm_ascend/patch/platform/__init__.py`.

4. vLLM PRs:
   - vllm-project/vllm#40172
   - vllm-project/vllm#41873

   `Mamba state postprocess / is_prefilling changes`

   Upstream changes:
- Introduces `MambaBuffers` and fused GPU-side Mamba postprocess
staging.
- Adds `is_prefilling` handling and clears padded rows to avoid stale
metadata.

   vllm-ascend adaptation:
- Update `vllm_ascend/worker/model_runner_v1.py` to support both old
`MambaCopyBuffers` and new `MambaBuffers`.
   - Stage Mamba postprocess inputs when the new upstream helper exists.
- Pass `is_prefilling` into common attention metadata and clear padded
rows.

5. vLLM PR: vllm-project/vllm#36329
   `Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2`

   Upstream changes:
- Adds `split_ba` helper in GatedDeltaNet attention to correctly split /
slice `ba` under TP.

   vllm-ascend adaptation:
   - Add `_split_ba_for_tp` in `vllm_ascend/ops/gdn.py`.
   - Use upstream `split_ba` when available.
- Fall back to old `ba.chunk(2, dim=-1)` behavior for older vLLM
versions.

6. vLLM PR: vllm-project/vllm#43125
   `Use correct logprobs for logprob_token_ids`

   Upstream changes:
   - Propagates `logprobs_mode` into `TopKTopPSampler`.

   vllm-ascend adaptation:
- Update `vllm_ascend/sample/sampler.py` to construct
`AscendTopKTopPSampler(logprobs_mode=logprobs_mode)`.

- vLLM version: v0.20.2
- vLLM main: vllm-project/vllm@1ac10f1
---------
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: zhao-stack <80399320+zhao-stack@users.noreply.github.com>
Co-authored-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: XhgAtHuawei <guoxiaohui7@huawei.com>
yilunh998 pushed a commit to yilunh998/vllm-ascend that referenced this pull request Jun 2, 2026
Signed-off-by: yilunh <hanyilun1@huawei.com>
Labels

ci/build documentation Improvements or additions to documentation module:ops module:tests ready read for review ready-for-test start test by label for PR

6 participants