[Misc]main2main 0522#9399
Conversation
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces several improvements to the Ascend NPU backend, primarily focusing on memory management for KV caches and enhancing the Gumbel sampling kernel. These changes improve flexibility in block size handling and add support for capturing processed logits during sampling, while also hardening the implementation against unsupported data types like FP64. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
Suggested PR Title:
[Attention][Misc] Refactor logging macros, KV cache reshaping, and sampling kernelsSuggested PR Summary:
### What this PR does / why we need it?
This PR updates logging macros in MoE tiling headers to use `std::printf`, refactors the KV cache reshaping logic in `attn_utils.py` to incorporate `kernel_block_sizes`, and enhances Gumbel sampling kernels to support processed logits output. It also adds explicit checks to prevent unsupported FP64 operations on NPU and updates the vLLM commit hash in documentation.
Feedback from the review highlights several issues:
1. The logging macros in `error_log.h` introduce performance overhead due to temporary string creation, risk compilation failure if `__VA_ARGS__` is empty, and lack atomicity due to multiple `printf` calls.
2. The `num_blocks` calculation in `attn_utils.py` is incorrect when the kernel block size is larger than the logical block size, which could result in zero blocks being allocated.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with existing tests.| std::printf("[WARN][%s] ", std::string(opname).c_str()); \ | ||
| std::printf(__VA_ARGS__); \ | ||
| std::printf("\n"); \ |
There was a problem hiding this comment.
The updated logging macros introduce performance overhead and potential compilation issues. \n1. Performance: std::string(opname).c_str() creates a temporary std::string object on every log call. If opname is already a std::string (like the result of GetNodeType()), calling .c_str() directly is preferred. If it is a const char*, it should be used directly. \n2. Compilation Risk: std::printf(__VA_ARGS__) will fail to compile if __VA_ARGS__ is empty (e.g., OP_LOGW("opname")), as printf requires at least a format string argument. \n3. Atomicity: Splitting the log into three printf calls increases the risk of interleaved output from different threads.
| std::printf("[WARN][%s] ", std::string(opname).c_str()); \ | ||
| std::printf(__VA_ARGS__); \ | ||
| std::printf("\n"); \ |
There was a problem hiding this comment.
The updated logging macros introduce performance overhead and potential compilation issues. \n1. Performance: std::string(opname).c_str() creates a temporary std::string object on every log call. If opname is already a std::string, calling .c_str() directly is preferred. \n2. Compilation Risk: std::printf(__VA_ARGS__) will fail to compile if __VA_ARGS__ is empty, as printf requires at least a format string argument. \n3. Atomicity: Splitting the log into three printf calls increases the risk of interleaved output from different threads.
| if kv_cache_group_id < len(kernel_block_sizes): | ||
| kernel_block_size = kernel_block_sizes[kv_cache_group_id] | ||
| num_blocks *= kv_cache_spec.block_size // kernel_block_size |
There was a problem hiding this comment.
The calculation of num_blocks is incorrect when kernel_block_size is larger than kv_cache_spec.block_size. In Ascend, the kernel block size (e.g., 128) is often larger than the logical block size (e.g., 16). In such cases, kv_cache_spec.block_size // kernel_block_size evaluates to 0, which incorrectly sets num_blocks to 0. The logic should instead calculate the total number of tokens and then divide by the kernel block size.
if kv_cache_group_id < len(kernel_block_sizes):\n kernel_block_size = kernel_block_sizes[kv_cache_group_id]\n num_blocks = (num_blocks * kv_cache_spec.block_size) // kernel_block_size|
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
137d01b to
383a5ac
Compare
|
/e2e tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_dflash_acceptance |
|
/e2e tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_dflash_acceptance |
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: zhao-stack <80399320+zhao-stack@users.noreply.github.com>
| if typing.TYPE_CHECKING: | ||
| from vllm.models.deepseek_v4.attention import DeepseekV4IndexerCache | ||
| from vllm.models.deepseek_v4.compressor import CompressorStateCache | ||
| else: | ||
| if vllm_version_is("0.20.2"): | ||
| _deepseek_compressor = typing.cast( | ||
| typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_compressor") | ||
| ) | ||
| _deepseek_v4_attention = typing.cast( | ||
| typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_v4_attention") | ||
| ) | ||
| else: | ||
| _deepseek_compressor = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.compressor")) | ||
| _deepseek_v4_attention = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.attention")) | ||
| CompressorStateCache = _deepseek_compressor.CompressorStateCache | ||
| DeepseekV4IndexerCache = _deepseek_v4_attention.DeepseekV4IndexerCache |
There was a problem hiding this comment.
| if typing.TYPE_CHECKING: | |
| from vllm.models.deepseek_v4.attention import DeepseekV4IndexerCache | |
| from vllm.models.deepseek_v4.compressor import CompressorStateCache | |
| else: | |
| if vllm_version_is("0.20.2"): | |
| _deepseek_compressor = typing.cast( | |
| typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_compressor") | |
| ) | |
| _deepseek_v4_attention = typing.cast( | |
| typing.Any, importlib.import_module("vllm.model_executor.layers.deepseek_v4_attention") | |
| ) | |
| else: | |
| _deepseek_compressor = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.compressor")) | |
| _deepseek_v4_attention = typing.cast(typing.Any, importlib.import_module("vllm.models.deepseek_v4.attention")) | |
| CompressorStateCache = _deepseek_compressor.CompressorStateCache | |
| DeepseekV4IndexerCache = _deepseek_v4_attention.DeepseekV4IndexerCache | |
| if not vllm_version_is("0.20.2"): | |
| from vllm.models.deepseek_v4.attention import DeepseekV4IndexerCache # noqa | |
| from vllm.models.deepseek_v4.compressor import CompressorStateCache # noqa | |
| else: | |
| from vllm.model_executor.layers.deepseek_compressor ... |
There was a problem hiding this comment.
The import path has been changed.
| return True | ||
|
|
||
|
|
||
| def _patched_is_cumem_allocator_available() -> bool: |
There was a problem hiding this comment.
let's double check if this is reasonable
There was a problem hiding this comment.
Unnecessary code has been deleted.
| from vllm_ascend.patch.platform.patch_kv_cache_interface import AscendMLAAttentionSpec | ||
| from vllm_ascend.utils import vllm_version_is | ||
|
|
||
| if TYPE_CHECKING: |
There was a problem hiding this comment.
The import path has been changed.
| if deferred_state_corrections_fn: | ||
| deferred_state_corrections_fn() | ||
| deferred_state_corrections_fn = None | ||
| if hasattr(mamba_utils, "MambaBuffers"): |
There was a problem hiding this comment.
| if hasattr(mamba_utils, "MambaBuffers"): | |
| if not version_is(0.20.2) and hasattr(mamba_utils, "MambaBuffers"): |
| self.num_accepted_tokens.copy_to_gpu(num_reqs) | ||
|
|
||
| postprocess_bufs = getattr(mamba_bufs, "postprocess_align", None) | ||
| if postprocess_bufs is not None and hasattr( |
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: shenzhao <shenzhao9@huawei.com>
This PR updates vllm-ascend main2main validation to: Main upstream changes and vllm-ascend adaptations: 1. vLLM PRs: - vllm-project/vllm#43004 - vllm-project/vllm#43039 - vllm-project/vllm#43073 - vllm-project/vllm#43077 `DeepSeek V4 model refactoring` Upstream changes: - Migrates DeepSeek V4 implementation from old `vllm.model_executor.layers.*` paths to `vllm.models.deepseek_v4.*`. - Moves DeepSeek V4 attention / compressor related classes to the new model package. vllm-ascend adaptation: - Update `vllm_ascend/models/deepseek_v4.py` to import `CompressorStateCache` and `DeepseekV4IndexerCache` from the correct path. - Update `vllm_ascend/patch/worker/patch_deepseek_compressor.py` to patch the correct module object. - Keep compatibility with `v0.20.2` by using the old import path when `vllm_version_is("0.20.2")`. 2. vLLM PR: vllm-project/vllm#42766 `[Bugfix][MRV2] Fix KVCache tensor explicit kernel_block_size dim` Upstream changes: - Adds explicit `kernel_block_sizes` to V2 attention / KV cache initialization. - Changes `BlockTables` construction and KV cache reshape logic to distinguish logical block size from kernel block size. vllm-ascend adaptation: - Update `vllm_ascend/worker/v2/block_table.py` to accept the new `kernel_block_sizes` argument. - Keep old `v0.20.2` constructor behavior with `vllm_version_is("0.20.2")`. - Update `vllm_ascend/worker/v2/attn_utils.py` to reshape KV cache with kernel block size while preserving storage block size handling. 3. vLLM PR: vllm-project/vllm#33648 `Support manually enabling the cumem allocator` Upstream changes: - Adds CuMem allocator availability validation in `ModelConfig`. - The validation runs before Ascend worker initialization. vllm-ascend adaptation: - Add `vllm_ascend/patch/platform/patch_camem_allocator.py`. - Patch `is_cumem_allocator_available` so Ascend CaMem sleep-mode support satisfies the allocator check. - Register the patch from `vllm_ascend/patch/platform/__init__.py`. 4. vLLM PRs: - vllm-project/vllm#40172 - vllm-project/vllm#41873 `Mamba state postprocess / is_prefilling changes` Upstream changes: - Introduces `MambaBuffers` and fused GPU-side Mamba postprocess staging. - Adds `is_prefilling` handling and clears padded rows to avoid stale metadata. vllm-ascend adaptation: - Update `vllm_ascend/worker/model_runner_v1.py` to support both old `MambaCopyBuffers` and new `MambaBuffers`. - Stage Mamba postprocess inputs when the new upstream helper exists. - Pass `is_prefilling` into common attention metadata and clear padded rows. 5. vLLM PR: vllm-project/vllm#36329 `Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2` Upstream changes: - Adds `split_ba` helper in GatedDeltaNet attention to correctly split / slice `ba` under TP. vllm-ascend adaptation: - Add `_split_ba_for_tp` in `vllm_ascend/ops/gdn.py`. - Use upstream `split_ba` when available. - Fall back to old `ba.chunk(2, dim=-1)` behavior for older vLLM versions. 6. vLLM PR: vllm-project/vllm#43125 `Use correct logprobs for logprob_token_ids` Upstream changes: - Propagates `logprobs_mode` into `TopKTopPSampler`. vllm-ascend adaptation: - Update `vllm_ascend/sample/sampler.py` to construct `AscendTopKTopPSampler(logprobs_mode=logprobs_mode)`. - vLLM version: v0.20.2 - vLLM main: vllm-project/vllm@1ac10f1 --------- Signed-off-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: zhao-stack <80399320+zhao-stack@users.noreply.github.com> Co-authored-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: XhgAtHuawei <guoxiaohui7@huawei.com>
This PR updates vllm-ascend main2main validation to: Main upstream changes and vllm-ascend adaptations: 1. vLLM PRs: - vllm-project/vllm#43004 - vllm-project/vllm#43039 - vllm-project/vllm#43073 - vllm-project/vllm#43077 `DeepSeek V4 model refactoring` Upstream changes: - Migrates DeepSeek V4 implementation from old `vllm.model_executor.layers.*` paths to `vllm.models.deepseek_v4.*`. - Moves DeepSeek V4 attention / compressor related classes to the new model package. vllm-ascend adaptation: - Update `vllm_ascend/models/deepseek_v4.py` to import `CompressorStateCache` and `DeepseekV4IndexerCache` from the correct path. - Update `vllm_ascend/patch/worker/patch_deepseek_compressor.py` to patch the correct module object. - Keep compatibility with `v0.20.2` by using the old import path when `vllm_version_is("0.20.2")`. 2. vLLM PR: vllm-project/vllm#42766 `[Bugfix][MRV2] Fix KVCache tensor explicit kernel_block_size dim` Upstream changes: - Adds explicit `kernel_block_sizes` to V2 attention / KV cache initialization. - Changes `BlockTables` construction and KV cache reshape logic to distinguish logical block size from kernel block size. vllm-ascend adaptation: - Update `vllm_ascend/worker/v2/block_table.py` to accept the new `kernel_block_sizes` argument. - Keep old `v0.20.2` constructor behavior with `vllm_version_is("0.20.2")`. - Update `vllm_ascend/worker/v2/attn_utils.py` to reshape KV cache with kernel block size while preserving storage block size handling. 3. vLLM PR: vllm-project/vllm#33648 `Support manually enabling the cumem allocator` Upstream changes: - Adds CuMem allocator availability validation in `ModelConfig`. - The validation runs before Ascend worker initialization. vllm-ascend adaptation: - Add `vllm_ascend/patch/platform/patch_camem_allocator.py`. - Patch `is_cumem_allocator_available` so Ascend CaMem sleep-mode support satisfies the allocator check. - Register the patch from `vllm_ascend/patch/platform/__init__.py`. 4. vLLM PRs: - vllm-project/vllm#40172 - vllm-project/vllm#41873 `Mamba state postprocess / is_prefilling changes` Upstream changes: - Introduces `MambaBuffers` and fused GPU-side Mamba postprocess staging. - Adds `is_prefilling` handling and clears padded rows to avoid stale metadata. vllm-ascend adaptation: - Update `vllm_ascend/worker/model_runner_v1.py` to support both old `MambaCopyBuffers` and new `MambaBuffers`. - Stage Mamba postprocess inputs when the new upstream helper exists. - Pass `is_prefilling` into common attention metadata and clear padded rows. 5. vLLM PR: vllm-project/vllm#36329 `Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2` Upstream changes: - Adds `split_ba` helper in GatedDeltaNet attention to correctly split / slice `ba` under TP. vllm-ascend adaptation: - Add `_split_ba_for_tp` in `vllm_ascend/ops/gdn.py`. - Use upstream `split_ba` when available. - Fall back to old `ba.chunk(2, dim=-1)` behavior for older vLLM versions. 6. vLLM PR: vllm-project/vllm#43125 `Use correct logprobs for logprob_token_ids` Upstream changes: - Propagates `logprobs_mode` into `TopKTopPSampler`. vllm-ascend adaptation: - Update `vllm_ascend/sample/sampler.py` to construct `AscendTopKTopPSampler(logprobs_mode=logprobs_mode)`. - vLLM version: v0.20.2 - vLLM main: vllm-project/vllm@1ac10f1 --------- Signed-off-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: zhao-stack <80399320+zhao-stack@users.noreply.github.com> Co-authored-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: yilunh <hanyilun1@huawei.com>
This PR updates vllm-ascend main2main validation to:
Main upstream changes and vllm-ascend adaptations:
vLLM PRs:
models/deepseek_v4/[2/N] vllm#43039DeepSeek V4 model refactoringUpstream changes:
vllm.model_executor.layers.*paths tovllm.models.deepseek_v4.*.vllm-ascend adaptation:
vllm_ascend/models/deepseek_v4.pyto importCompressorStateCacheandDeepseekV4IndexerCachefrom the correct path.vllm_ascend/patch/worker/patch_deepseek_compressor.pyto patch the correct module object.v0.20.2by using the old import path whenvllm_version_is("0.20.2").vLLM PR: [Bugfix][MRV2] Fix KVCache tensor explicit
kernel_block_sizedim vllm#42766[Bugfix][MRV2] Fix KVCache tensor explicit kernel_block_size dimUpstream changes:
kernel_block_sizesto V2 attention / KV cache initialization.BlockTablesconstruction and KV cache reshape logic to distinguish logical block size from kernel block size.vllm-ascend adaptation:
vllm_ascend/worker/v2/block_table.pyto accept the newkernel_block_sizesargument.v0.20.2constructor behavior withvllm_version_is("0.20.2").vllm_ascend/worker/v2/attn_utils.pyto reshape KV cache with kernel block size while preserving storage block size handling.vLLM PR: [Feature] Support manually enabling the cumem allocator vllm#33648
Support manually enabling the cumem allocatorUpstream changes:
ModelConfig.vllm-ascend adaptation:
vllm_ascend/patch/platform/patch_camem_allocator.py.is_cumem_allocator_availableso Ascend CaMem sleep-mode support satisfies the allocator check.vllm_ascend/patch/platform/__init__.py.vLLM PRs:
Mamba state postprocess / is_prefilling changesUpstream changes:
MambaBuffersand fused GPU-side Mamba postprocess staging.is_prefillinghandling and clears padded rows to avoid stale metadata.vllm-ascend adaptation:
vllm_ascend/worker/model_runner_v1.pyto support both oldMambaCopyBuffersand newMambaBuffers.is_prefillinginto common attention metadata and clear padded rows.vLLM PR: [Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 vllm#36329
Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2Upstream changes:
split_bahelper in GatedDeltaNet attention to correctly split / slicebaunder TP.vllm-ascend adaptation:
_split_ba_for_tpinvllm_ascend/ops/gdn.py.split_bawhen available.ba.chunk(2, dim=-1)behavior for older vLLM versions.vLLM PR: [BugFix] Use correct logprobs for
logprob_token_idsvllm#43125Use correct logprobs for logprob_token_idsUpstream changes:
logprobs_modeintoTopKTopPSampler.vllm-ascend adaptation:
vllm_ascend/sample/sampler.pyto constructAscendTopKTopPSampler(logprobs_mode=logprobs_mode).How was this patch tested?