[feat] cross layer kvcache by HF-001 · Pull Request #4768 · vllm-project/vllm-ascend

HF-001 · 2025-12-08T02:15:09Z

What this PR does / why we need it?

this pr is for #4140 , and refer to: vllm-project/vllm#27743

Following this PR, connectors can turn-on and adapt to the new layout.

This PR enables the model_runner_v1 to allocate the KV cache tensors, so that the KV data for all layers will be contiguous per block. This can yield a significant speed up the transfer time of KV transfers (e.g. X4), such in the case of OffloadingConnector. Currently, this new layout is disabled by default, and will only be enabled when using a connector which explicitly prefers this new layout. Also, this new layout is currently only supported for uniform (non HMA) models.

How was this patch tested?

`export CUDA_VISIBLE_DEVICES=6
export TP=1
export MODEL_PATH=/model/Qwen3-14B
export MODEL_NAME=Qwen3-14B
export PORT=10113
#export ASCEND_LAUNCH_BLOCKING=1
#export ASCEND_SLOG_PRINT_TO_STDOUT=1

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port ${PORT} --dtype bfloat16 --model ${MODEL_PATH} --no-enable-prefix-caching --served-model-name ${MODEL_NAME} --tensor-parallel-size ${TP} --gpu-memory-utilization 0.65 --max-model-len 32768 --trust-remote-code --disable-log-requests
--block-size 128
--kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 128, "num_cpu_blocks": 1000, "spec_name":"NPUOffloadingSpec", "spec_module_path": "vllm_ascend.kv_offload.npu"}}'`

Performance testing on qwen3-14b， result is :

vLLM version: v0.12.0
vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

gemini-code-assist

Code Review

This pull request introduces a cross-layer KV cache layout to improve performance for KV transfers, especially for offloading. The changes are mainly in vllm_ascend/worker/model_runner_v1.py to conditionally allocate a contiguous KV cache tensor for all layers. The attention backends are also updated to support this new layout.

The overall approach is sound, but I've found two critical issues in the implementation of allocate_uniform_kv_caches that could lead to incorrect behavior or runtime errors. One issue is related to the assumption of equal K and V cache sizes for MLA models, and the other is an incorrect permutation logic for non-MLA models. Please see the detailed comments for suggestions on how to fix them.

gemini-code-assist · 2025-12-08T02:20:49Z

vllm_ascend/worker/model_runner_v1.py

+        else :
+            new_kv_cache_shape = tuple(kv_cache_shape[i] for i in kv_cache_stride_order)
+            new_kv_cache_shape[-1] = new_kv_cache_shape[-1]//2
+        logger.info("Allocating a cross layer KV cache of shape %s", new_kv_cache_shape)
+
+        # allocate one contiguous buffer for all layers
+        cross_layers_k_cache = (
+            torch.zeros(total_size//2, dtype=torch.int8, device=device)
+            .view(kv_cache_spec.dtype)
+            .view(new_kv_cache_shape)
+        )
+        cross_layers_v_cache = (
+            torch.zeros(total_size//2, dtype=torch.int8, device=device)
+            .view(kv_cache_spec.dtype)
+            .view(new_kv_cache_shape)
+        )


The logic for calculating new_kv_cache_shape for MLA models and the subsequent allocation of cross_layers_k_cache and cross_layers_v_cache assumes that the key and value caches have the same size. Specifically, it divides head_size by 2 and uses this for both K and V caches. However, for MLA models, K and V caches can have different sizes, determined by kv_lora_rank and qk_rope_head_dim respectively. This will lead to incorrect tensor shapes and memory corruption if kv_lora_rank != qk_rope_head_dim. The allocation size total_size // 2 is also based on this incorrect assumption.

The K and V caches should be handled with their respective shapes and sizes. You should calculate their sizes and shapes separately based on kv_lora_rank and qk_rope_head_dim.

gemini-code-assist · 2025-12-08T02:20:49Z

vllm_ascend/worker/model_runner_v1.py

+            inv_order = [
+                kv_cache_stride_order.index(i) for i in range(len(kv_cache_stride_order)) if kv_cache_shape[kv_cache_stride_order[i]] != 2
+            ]
+            if len(new_kv_cache_shape) != len(kv_cache_shape):
+                inv_order = [i - 1 for i in inv_order]


The calculation of inv_order for non-MLA models is incorrect. The list comprehension logic and the subsequent adjustment [i - 1 for i in inv_order] will produce a wrong permutation, which can contain negative indices. This will lead to incorrect memory access when permute(*inv_order) is called, causing data corruption or runtime errors. The logic needs to be revised to correctly compute the inverse permutation that brings the num_layers dimension to the front of the tensor.

github-actions · 2025-12-08T03:09:55Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

A PR should do only one thing, smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

vllm_ascend/worker/model_runner_v1.py

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

HF-001 · 2025-12-10T01:35:20Z

@Sparkheart @MengqingCao hi, this pr is ready，cloud you help review？

github-actions · 2025-12-10T08:17:28Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: kx <1670186653@qq.com>

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

zzzzwwjj · 2025-12-12T03:54:12Z

This PR cannot be merged yet. We need to align on the impact of KV cache layout on functionality and the proposed solutions before merging.
Functions that may be affected:

KV Connector related, such as Prefill-Decode Disaggregation, KV Pool, KV offload. This feature will have an impact on the KV transmission function.
KV calculation related, such as reshape_and_cache, paged_cache_load, and various attention ops. All of these ops need to support strided tensors.

cc @wangxiyuan @weijinqian0 @MengqingCao

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

HF-001 · 2025-12-12T06:55:28Z

This PR cannot be merged yet. We need to align on the impact of KV cache layout on functionality and the proposed solutions before merging. Functions that may be affected:

KV Connector related, such as Prefill-Decode Disaggregation, KV Pool, KV offload. This feature will have an impact on the KV transmission function.

KV calculation related, such as reshape_and_cache, paged_cache_load, and various attention ops. All of these ops need to support strided tensors.

cc @wangxiyuan @weijinqian0 @MengqingCao

@zzzzwwjj hi，1. this PR will only take effect when using an offloading connector and preempt_cross_layer_blocks is true, and will not affect other scenarios. If PD Disaggregation scenarios need to be used, simple adaptation is required.
2. Stride_order will only have an impact when the offloading connector and preempt_cross_layer_blocks are true, and will not affect other scenarios. i think Stride_order will not impact reshape_and_cache, paged_cache_load`.

cc @wangxiyuan @weijinqian0 @MengqingCao

For the convenience of understanding, the following is the design diagram of this PR project.In fact, the cross_layers_k/v_cache used only adjusted the layout through stride_order.

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

github-actions · 2025-12-12T09:30:44Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: kx <1670186653@qq.com>

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

MengqingCao · 2025-12-15T03:28:39Z

This PR cannot be merged yet. We need to align on the impact of KV cache layout on functionality and the proposed solutions before merging. Functions that may be affected:

KV Connector related, such as Prefill-Decode Disaggregation, KV Pool, KV offload. This feature will have an impact on the KV transmission function.

KV calculation related, such as reshape_and_cache, paged_cache_load, and various attention ops. All of these ops need to support strided tensors.

cc @wangxiyuan @weijinqian0 @MengqingCao

@zzzzwwjj hi，1. this PR will only take effect when using an offloading connector and preempt_cross_layer_blocks is true, and will not affect other scenarios. If PD Disaggregation scenarios need to be used, simple adaptation is required. 2. Stride_order will only have an impact when the offloading connector and preempt_cross_layer_blocks are true, and will not affect other scenarios. i think Stride_order will not impact reshape_and_cache, paged_cache_load`.

Actually the op reshape_and_cache doesn't support non-contigous tensor now, not sure how this pr works now when both offloading connector and preempt_cross_layer_blocks are enabled

HF-001 · 2025-12-15T03:48:08Z

This PR cannot be merged yet. We need to align on the impact of KV cache layout on functionality and the proposed solutions before merging. Functions that may be affected:

KV Connector related, such as Prefill-Decode Disaggregation, KV Pool, KV offload. This feature will have an impact on the KV transmission function.

KV calculation related, such as reshape_and_cache, paged_cache_load, and various attention ops. All of these ops need to support strided tensors.

cc @wangxiyuan @weijinqian0 @MengqingCao

@zzzzwwjj hi，1. this PR will only take effect when using an offloading connector and preempt_cross_layer_blocks is true, and will not affect other scenarios. If PD Disaggregation scenarios need to be used, simple adaptation is required. 2. Stride_order will only have an impact when the offloading connector and preempt_cross_layer_blocks are true, and will not affect other scenarios. i think Stride_order will not impact reshape_and_cache, paged_cache_load`.

Actually the op reshape_and_cache doesn't support non-contigous tensor now, not sure how this pr works now when both offloading connector and preempt_cross_layer_blocks are enabled

@MengqingCao In this pr，We are using cross_layers_k_cache and cross_layers_v_cache (continuous tensors) instead of kv_caches (not used), so 'reshape_and_cache' is support. The shape of cross_layers_k_cache/cross_layers_v_cache is (num_blocks, num_layers, block_size, num_heads, head_size).During each transmission, the transmitted block is (num_layers, block_size, num_ heads, head_size) instead of the previous block (block_size, num_ heads, head_size).

in this pr, bytes of per transmitted block is num_layers times larger than the original, similar to batch transmission, which can greatly improve transmission efficiency.

maybe you can refer to: vllm-project/vllm#27742

LCAIZJ · 2025-12-15T06:15:28Z

PD connector need also requires corresponding modifications to adapt to the new layout.

HF-001 · 2025-12-15T06:22:12Z

PD connector need also requires corresponding modifications to adapt to the new layout.

@LCAIZJ Do you mean I need to add code for PD separation adjustment in this PR? Although it can be implemented, it would compromise the consistency with VLLM. Currently, VLLM has only been implemented in the kvcache offloading scenario.

HF-001 · 2025-12-15T09:11:52Z

Actually the op reshape_and_cache doesn't support non-contigous tensor now, not sure how this pr works now when both offloading connector and preempt_cross_layer_blocks are enabled

@MengqingCao do you mean torch_npu._npu_reshape_and_cache()? I have tested the entire process and it can work normally.

MengqingCao · 2025-12-15T09:15:02Z

Actually the op reshape_and_cache doesn't support non-contigous tensor now, not sure how this pr works now when both offloading connector and preempt_cross_layer_blocks are enabled

@MengqingCao do you mean torch_npu._npu_reshape_and_cache()? I have tested the entire process and it can work normally.

Exactly, I'll debug locally with this pr later to check some details

HF-001 · 2025-12-15T10:22:33Z

Exactly, I'll debug locally with this pr later to check some details

@MengqingCao you are right. torch_npu._npu_reshape_and_cache() currently can not support kvcaches, which is non-contigous. Can torch_npu._npu_reshape_and_cache() be optimized to support discontinuous tensors? Like reshape_and_cache() in VLLM.

HF-001 · 2025-12-16T01:21:44Z

@MengqingCao @zzzzwwjj May I ask if there are any other operators that need to be optimized besides the torch_npu. _npu_reshape_and_cache() operator? If it's just torch_npu._npu_reshape_and_cache(), perhaps I can refer to the reshape_and_cache() operator in VLLM and optimize it in Ascend.

@zzzzwwjj

I'd like to nominate @zzzzwwjj @realliujiaxu @LCAIZJ to join vLLM Ascend committer team. @zzzzwwjj --- - Review Quality‌: He has completed 80+reviews since April. 2025, include #3232 (comment), #4822 (comment), #4768 (comment) high quality review. - Sustained Contributions 15+ Valuable bug fix and refactor is very good. https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Azzzzwwjj+is%3Aclosed+review%3Aapproved Continuous optimization of code architecture https://github.com/vllm-project/vllm-ascend/pulls?q=author%3Azzzzwwjj+is%3Amerged - Quality Contribution‌: #1229 #1979 #4359 #4878 - Community Involvement‌: He lead the #1147, to refactor AscendFusedMoE at the first time. He shared topics about large-scale distributed inference and reinforcement learning on vLLM-Ascend meetup on August 2nd. @realliujiaxu --- - Review Quality‌: He has completed about [40+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Arealliujiaxu+-author%3Arealliujiaxu+) since September, include #4868 (comment), #2275 (comment). - Sustained Contributions He has completed (17 commits)[https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged], continuously optimizing the performance of the MoE model. - Quality Contribution‌: Contributed the Flash Comm1 feature to the community, supporting both eager and aclgraph execution modes, while compatible with multiple MoE models including DeepSeek and GLM4.5. - #3334 - #3420 - #3015 co-author: - #3495 - #4868 - Community Involvement‌: 1. Completed two major refactors, enabling vllm-ascend to evolve more rapidly and robustly: [Linear module](#2867) and [rejection sampler](#4975) 2. [fixed 8 bugs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged+bugfix+) in graph mode, spec decoding and async scheduling. @LCAIZJ --- - Review Quality‌: He's been the go-to reviewer for virtually all PD disaggregation and KV Pool related PRs, having completed [30+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3ALCAIZJ+is%3Aopen+-author%3ALCAIZJ+) since May 2025. Notable examples include [discussion_r2553887360](#4345 (comment)), [issuecomment-3540994801](#4161 (comment)), and [discussion_r2492593988](#3981 (comment)), all demonstrating thorough and insightful feedback. - Sustained and Quality Contributions: His contributions reflect a strong grasp of both ‌vLLM‌ and ‌vLLM Ascend‌ codebases, particularly in prefill-decode disaggregation and KV pool areas ([7 PRs merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+)). Prefill-Decode Disaggregation: Delivered KV transfer functionality using Mooncake TransferEngine and enabled layerwise KV transfer #1568 #2602 KV Pool: Developed the foundational KV Pool infrastructure and migrated it to the latest ADXL stack #2913 #3350 - Quality Contribution‌: #1568 #2602 #2913 #3350 - Community Involvement‌: He actively responds to [community issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3ALCAIZJ%20is%3Aopen%20-author%3ALCAIZJ), continuously monitors functionality and accuracy issues related to PD disaggregation and KV Pool, and proactively delivers [bug fixes](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+bugfix). - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

github-actions · 2025-12-19T01:04:26Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

MengqingCao · 2025-12-19T03:07:44Z

@MengqingCao @zzzzwwjj May I ask if there are any other operators that need to be optimized besides the torch_npu. _npu_reshape_and_cache() operator? If it's just torch_npu._npu_reshape_and_cache(), perhaps I can refer to the reshape_and_cache() operator in VLLM and optimize it in Ascend.

Sorry for the delay, there does exsits other operators, mainly including all the attention ops. Thus I don't recomand you to do this now

HF-001 · 2025-12-19T03:11:42Z

@MengqingCao @zzzzwwjj May I ask if there are any other operators that need to be optimized besides the torch_npu. _npu_reshape_and_cache() operator? If it's just torch_npu._npu_reshape_and_cache(), perhaps I can refer to the reshape_and_cache() operator in VLLM and optimize it in Ascend.

Sorry for the delay, there does exsits other operators, mainly including all the attention ops. Thus I don't recomand you to do this now

thank you

@zzzzwwjj

…t#5152) I'd like to nominate @zzzzwwjj @realliujiaxu @LCAIZJ to join vLLM Ascend committer team. @zzzzwwjj --- - Review Quality‌: He has completed 80+reviews since April. 2025, include vllm-project#3232 (comment), vllm-project#4822 (comment), vllm-project#4768 (comment) high quality review. - Sustained Contributions 15+ Valuable bug fix and refactor is very good. https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Azzzzwwjj+is%3Aclosed+review%3Aapproved Continuous optimization of code architecture https://github.com/vllm-project/vllm-ascend/pulls?q=author%3Azzzzwwjj+is%3Amerged - Quality Contribution‌: vllm-project#1229 vllm-project#1979 vllm-project#4359 vllm-project#4878 - Community Involvement‌: He lead the vllm-project#1147, to refactor AscendFusedMoE at the first time. He shared topics about large-scale distributed inference and reinforcement learning on vLLM-Ascend meetup on August 2nd. @realliujiaxu --- - Review Quality‌: He has completed about [40+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Arealliujiaxu+-author%3Arealliujiaxu+) since September, include vllm-project#4868 (comment), vllm-project#2275 (comment). - Sustained Contributions He has completed (17 commits)[https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged], continuously optimizing the performance of the MoE model. - Quality Contribution‌: Contributed the Flash Comm1 feature to the community, supporting both eager and aclgraph execution modes, while compatible with multiple MoE models including DeepSeek and GLM4.5. - vllm-project#3334 - vllm-project#3420 - vllm-project#3015 co-author: - vllm-project#3495 - vllm-project#4868 - Community Involvement‌: 1. Completed two major refactors, enabling vllm-ascend to evolve more rapidly and robustly: [Linear module](vllm-project#2867) and [rejection sampler](vllm-project#4975) 2. [fixed 8 bugs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged+bugfix+) in graph mode, spec decoding and async scheduling. @LCAIZJ --- - Review Quality‌: He's been the go-to reviewer for virtually all PD disaggregation and KV Pool related PRs, having completed [30+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3ALCAIZJ+is%3Aopen+-author%3ALCAIZJ+) since May 2025. Notable examples include [discussion_r2553887360](vllm-project#4345 (comment)), [issuecomment-3540994801](vllm-project#4161 (comment)), and [discussion_r2492593988](vllm-project#3981 (comment)), all demonstrating thorough and insightful feedback. - Sustained and Quality Contributions: His contributions reflect a strong grasp of both ‌vLLM‌ and ‌vLLM Ascend‌ codebases, particularly in prefill-decode disaggregation and KV pool areas ([7 PRs merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+)). Prefill-Decode Disaggregation: Delivered KV transfer functionality using Mooncake TransferEngine and enabled layerwise KV transfer vllm-project#1568 vllm-project#2602 KV Pool: Developed the foundational KV Pool infrastructure and migrated it to the latest ADXL stack vllm-project#2913 vllm-project#3350 - Quality Contribution‌: vllm-project#1568 vllm-project#2602 vllm-project#2913 vllm-project#3350 - Community Involvement‌: He actively responds to [community issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3ALCAIZJ%20is%3Aopen%20-author%3ALCAIZJ), continuously monitors functionality and accuracy issues related to PD disaggregation and KV Pool, and proactively delivers [bug fixes](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+bugfix). - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

@zzzzwwjj

…t#5152) I'd like to nominate @zzzzwwjj @realliujiaxu @LCAIZJ to join vLLM Ascend committer team. @zzzzwwjj --- - Review Quality‌: He has completed 80+reviews since April. 2025, include vllm-project#3232 (comment), vllm-project#4822 (comment), vllm-project#4768 (comment) high quality review. - Sustained Contributions 15+ Valuable bug fix and refactor is very good. https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Azzzzwwjj+is%3Aclosed+review%3Aapproved Continuous optimization of code architecture https://github.com/vllm-project/vllm-ascend/pulls?q=author%3Azzzzwwjj+is%3Amerged - Quality Contribution‌: vllm-project#1229 vllm-project#1979 vllm-project#4359 vllm-project#4878 - Community Involvement‌: He lead the vllm-project#1147, to refactor AscendFusedMoE at the first time. He shared topics about large-scale distributed inference and reinforcement learning on vLLM-Ascend meetup on August 2nd. @realliujiaxu --- - Review Quality‌: He has completed about [40+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Arealliujiaxu+-author%3Arealliujiaxu+) since September, include vllm-project#4868 (comment), vllm-project#2275 (comment). - Sustained Contributions He has completed (17 commits)[https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged], continuously optimizing the performance of the MoE model. - Quality Contribution‌: Contributed the Flash Comm1 feature to the community, supporting both eager and aclgraph execution modes, while compatible with multiple MoE models including DeepSeek and GLM4.5. - vllm-project#3334 - vllm-project#3420 - vllm-project#3015 co-author: - vllm-project#3495 - vllm-project#4868 - Community Involvement‌: 1. Completed two major refactors, enabling vllm-ascend to evolve more rapidly and robustly: [Linear module](vllm-project#2867) and [rejection sampler](vllm-project#4975) 2. [fixed 8 bugs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged+bugfix+) in graph mode, spec decoding and async scheduling. @LCAIZJ --- - Review Quality‌: He's been the go-to reviewer for virtually all PD disaggregation and KV Pool related PRs, having completed [30+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3ALCAIZJ+is%3Aopen+-author%3ALCAIZJ+) since May 2025. Notable examples include [discussion_r2553887360](vllm-project#4345 (comment)), [issuecomment-3540994801](vllm-project#4161 (comment)), and [discussion_r2492593988](vllm-project#3981 (comment)), all demonstrating thorough and insightful feedback. - Sustained and Quality Contributions: His contributions reflect a strong grasp of both ‌vLLM‌ and ‌vLLM Ascend‌ codebases, particularly in prefill-decode disaggregation and KV pool areas ([7 PRs merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+)). Prefill-Decode Disaggregation: Delivered KV transfer functionality using Mooncake TransferEngine and enabled layerwise KV transfer vllm-project#1568 vllm-project#2602 KV Pool: Developed the foundational KV Pool infrastructure and migrated it to the latest ADXL stack vllm-project#2913 vllm-project#3350 - Quality Contribution‌: vllm-project#1568 vllm-project#2602 vllm-project#2913 vllm-project#3350 - Community Involvement‌: He actively responds to [community issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3ALCAIZJ%20is%3Aopen%20-author%3ALCAIZJ), continuously monitors functionality and accuracy issues related to PD disaggregation and KV Pool, and proactively delivers [bug fixes](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+bugfix). - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>

@zzzzwwjj

…t#5152) I'd like to nominate @zzzzwwjj @realliujiaxu @LCAIZJ to join vLLM Ascend committer team. @zzzzwwjj --- - Review Quality‌: He has completed 80+reviews since April. 2025, include vllm-project#3232 (comment), vllm-project#4822 (comment), vllm-project#4768 (comment) high quality review. - Sustained Contributions 15+ Valuable bug fix and refactor is very good. https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Azzzzwwjj+is%3Aclosed+review%3Aapproved Continuous optimization of code architecture https://github.com/vllm-project/vllm-ascend/pulls?q=author%3Azzzzwwjj+is%3Amerged - Quality Contribution‌: vllm-project#1229 vllm-project#1979 vllm-project#4359 vllm-project#4878 - Community Involvement‌: He lead the vllm-project#1147, to refactor AscendFusedMoE at the first time. He shared topics about large-scale distributed inference and reinforcement learning on vLLM-Ascend meetup on August 2nd. @realliujiaxu --- - Review Quality‌: He has completed about [40+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Arealliujiaxu+-author%3Arealliujiaxu+) since September, include vllm-project#4868 (comment), vllm-project#2275 (comment). - Sustained Contributions He has completed (17 commits)[https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged], continuously optimizing the performance of the MoE model. - Quality Contribution‌: Contributed the Flash Comm1 feature to the community, supporting both eager and aclgraph execution modes, while compatible with multiple MoE models including DeepSeek and GLM4.5. - vllm-project#3334 - vllm-project#3420 - vllm-project#3015 co-author: - vllm-project#3495 - vllm-project#4868 - Community Involvement‌: 1. Completed two major refactors, enabling vllm-ascend to evolve more rapidly and robustly: [Linear module](vllm-project#2867) and [rejection sampler](vllm-project#4975) 2. [fixed 8 bugs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged+bugfix+) in graph mode, spec decoding and async scheduling. @LCAIZJ --- - Review Quality‌: He's been the go-to reviewer for virtually all PD disaggregation and KV Pool related PRs, having completed [30+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3ALCAIZJ+is%3Aopen+-author%3ALCAIZJ+) since May 2025. Notable examples include [discussion_r2553887360](vllm-project#4345 (comment)), [issuecomment-3540994801](vllm-project#4161 (comment)), and [discussion_r2492593988](vllm-project#3981 (comment)), all demonstrating thorough and insightful feedback. - Sustained and Quality Contributions: His contributions reflect a strong grasp of both ‌vLLM‌ and ‌vLLM Ascend‌ codebases, particularly in prefill-decode disaggregation and KV pool areas ([7 PRs merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+)). Prefill-Decode Disaggregation: Delivered KV transfer functionality using Mooncake TransferEngine and enabled layerwise KV transfer vllm-project#1568 vllm-project#2602 KV Pool: Developed the foundational KV Pool infrastructure and migrated it to the latest ADXL stack vllm-project#2913 vllm-project#3350 - Quality Contribution‌: vllm-project#1568 vllm-project#2602 vllm-project#2913 vllm-project#3350 - Community Involvement‌: He actively responds to [community issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3ALCAIZJ%20is%3Aopen%20-author%3ALCAIZJ), continuously monitors functionality and accuracy issues related to PD disaggregation and KV Pool, and proactively delivers [bug fixes](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+bugfix). - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>

[feat] cross layer kvcache

c008442

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

gemini-code-assist bot reviewed Dec 8, 2025

View reviewed changes

MengqingCao self-assigned this Dec 8, 2025

01267596 added 4 commits December 8, 2025 03:41

[feat] cross layer kvcache

0a3e9a2

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

[feat] cross layer kvcache

7df38c8

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

[feat] cross layer kvcache

2686cef

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

[feat] cross layer kvcache

29b9edf

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

Sparkheart reviewed Dec 8, 2025

View reviewed changes

vllm_ascend/worker/model_runner_v1.py Outdated Show resolved Hide resolved

Sparkheart reviewed Dec 8, 2025

View reviewed changes

vllm_ascend/worker/model_runner_v1.py Show resolved Hide resolved

01267596 and others added 3 commits December 9, 2025 02:16

[feat] cross layer kvcache

21e0df3

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

Merge branch 'main' into cross_layer_kvcache_dev

4bef64f

Merge branch 'main' into cross_layer_kvcache_dev

67d5d9b

github-actions bot added the merge-conflicts label Dec 10, 2025

Merge branch 'main' into cross_layer_kvcache_dev

43e5424

Signed-off-by: kx <1670186653@qq.com>

github-actions bot removed the merge-conflicts label Dec 10, 2025

HF-001 and others added 3 commits December 11, 2025 09:40

Merge branch 'main' into cross_layer_kvcache_dev

af49d1b

[feat] cross layer kvcache

8ed9b76

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

[feat] cross layer kvcache

461145e

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

wangxiyuan approved these changes Dec 11, 2025

View reviewed changes

weijinqian0 approved these changes Dec 11, 2025

View reviewed changes

whx-sjtu approved these changes Dec 11, 2025

View reviewed changes

[feat] cross layer kvcache

f5a61d6

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

[feat] cross layer kvcache

d165d1a

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

github-actions bot added the merge-conflicts label Dec 12, 2025

Merge branch 'main' into cross_layer_kvcache_dev

ba4ccd8

Signed-off-by: kx <1670186653@qq.com>

github-actions bot removed the merge-conflicts label Dec 12, 2025

[feat] cross layer kvcache

5d6e85f

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>

wangxiyuan mentioned this pull request Dec 18, 2025

Nominate new maintainers @zzzzwwjj @realliujiaxu @LCAIZJ #5152

Merged

github-actions bot added the merge-conflicts label Dec 19, 2025

HF-001 closed this Dec 19, 2025

Conversation

HF-001 commented Dec 8, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it?

How was this patch tested?

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Dec 8, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 8, 2025

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Dec 8, 2025

Uh oh!

Uh oh!

Uh oh!

HF-001 commented Dec 10, 2025

Uh oh!

github-actions bot commented Dec 10, 2025

Uh oh!

zzzzwwjj commented Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HF-001 commented Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Dec 12, 2025

Uh oh!

MengqingCao commented Dec 15, 2025

Uh oh!

HF-001 commented Dec 15, 2025

Uh oh!

LCAIZJ commented Dec 15, 2025

Uh oh!

HF-001 commented Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HF-001 commented Dec 15, 2025

Uh oh!

MengqingCao commented Dec 15, 2025

Uh oh!

HF-001 commented Dec 15, 2025

Uh oh!

HF-001 commented Dec 16, 2025

Uh oh!

github-actions bot commented Dec 19, 2025

Uh oh!

MengqingCao commented Dec 19, 2025

Uh oh!

HF-001 commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

HF-001 commented Dec 8, 2025 •

edited by github-actions bot

Loading

zzzzwwjj commented Dec 12, 2025 •

edited

Loading

HF-001 commented Dec 12, 2025 •

edited

Loading

HF-001 commented Dec 15, 2025 •

edited

Loading