[CI] Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector#5441

Merged
wangxiyuan merged 2 commits into vllm-project:main from wjunLu:add_nightly
Feb 27, 2026

Conversation

@wjunLu
Collaborator

@wjunLu wjunLu commented Dec 27, 2025

What this PR does / why we need it?

Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector.

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR adds a nightly test for Qwen3-235B-A22B. My review found a few critical issues that need to be addressed before merging. The test configuration for the consumer node is missing data-parallel arguments, which will likely cause failures. The run.sh script contains a temporary hack to check out this specific PR, which will break future nightly runs if merged. Additionally, the Kubernetes configuration template hardcodes IP addresses, which harms maintainability. Please see the detailed comments for suggestions.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@wjunLu wjunLu force-pushed the add_nightly branch 2 times, most recently from 9b58838 to bf71a41 on December 29, 2025 02:17
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions
Contributor

github-actions bot commented Jan 7, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

--trust-remote-code
--no-enable-prefix-caching
--gpu-memory-utilization 0.9
--additional-config '{"torchair_graph_config":{"enabled":true}}'
Collaborator


Could you please confirm whether this is a torchair use case?

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

wjunLu and others added 2 commits February 27, 2026 15:24
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
@zhangxinyuehfad
Collaborator

Nightly test OK:
(screenshots of the passing nightly run)

@wangxiyuan wangxiyuan merged commit b60b991 into vllm-project:main Feb 27, 2026
25 checks passed
08mamba24 pushed a commit to 08mamba24/vllm-ascend that referenced this pull request Feb 27, 2026
[CI] Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector (vllm-project#5441)

### What this PR does / why we need it?

Add nightly test for Qwen3-235B-A22B with  mooncake layerwise connector.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@81786c8

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: 08mamba24 <864701928@qq.com>
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Feb 28, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend:
  clean 0.15.0 support (vllm-project#6852)
  [CI] Refactor to speedup image building and CI Installation (vllm-project#6708)
  [bugfix] Fixed an accuracy problem of gdn layer in graph (vllm-project#6822)
  [Doc][Misc] Update release notes for v0.15.0rc1 (vllm-project#6859)
  [CI] Add nightly test for Qwen3-235B-A22B with  mooncake layerwise connector (vllm-project#5441)
  [Refactor][EAGLE] 7/N Merged PCP and disable_padded interface  (vllm-project#6811)
  [Main2Main] Upgrade vLLM to 0226 (vllm-project#6813)
  [DOC] enable both flashcomm1 and cudagraph (vllm-project#6807)
  add release note for 0.15.0rc1 (vllm-project#6839)
  [Doc] fix the nit in docs (vllm-project#6826)
  [CI] Fix doc test fail when load model with error information: 'Stale file handle' (vllm-project#6832)
  [Feat]support sequence parallelism by pass for VL models (vllm-project#5632)
  [Doc][Release] Add release note skill (vllm-project#6824)
  [BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (vllm-project#5472)
  [Patch][Misc] Cleanup and update patches (vllm-project#6802)
  [Doc][Misc] Refactor skill documentation and add Claude support instructions (vllm-project#6817)
  [BugFix] [310p] Fix attention accuracy issue (vllm-project#6803)
  [Misc] Drop patch_rope.py (vllm-project#6291)
ZhuJiyang1 pushed a commit to ZhuJiyang1/vllm-ascend that referenced this pull request Feb 28, 2026
Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[CI] Add long and short prompt tests for DeepSeek-V3.2 (vllm-project#6536)

This version has no divisibility constraint between tp and mtp+1.
However, cudagraph_capture_sizes must be a common multiple of tp and
mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed
cudagraph_capture_sizes.
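The constraint above can be sketched with a small helper (hypothetical, not part of the repo): valid capture sizes are the common multiples of `tp` and `mtp + 1`, capped at `tp * (mtp + 1)`:

```python
import math

def valid_capture_sizes(tp: int, mtp: int):
    # Capture sizes must be common multiples of tp and mtp + 1,
    # with a maximum of tp * (mtp + 1).
    step = math.lcm(tp, mtp + 1)
    cap = tp * (mtp + 1)
    return list(range(step, cap + 1, step))

# tp=4, mtp=1: lcm(4, 2) = 4, cap = 8 -> [4, 8]
print(valid_capture_sizes(4, 1))
# tp=8, mtp=2: lcm(8, 3) = 24, cap = 24 -> only one legal size
print(valid_capture_sizes(8, 2))
```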

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

We also added an aime2025 test for the dual-node DeepSeek-V3.2 nightly test.

Tested in the nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Feature][Quant] Auto-detect quantization format from model files (vllm-project#6645)

- Add automatic quantization format detection, eliminating the need to
manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files
(`quant_model_description.json` and `config.json`) at engine
initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected;
auto-detection only applies when the flag is omitted.

**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"`
(ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` →
`quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior
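A minimal sketch of that priority order (assumed logic based on the description; the real `detect_quantization_method()` in `vllm_ascend/quantization/utils.py` may differ in detail):

```python
import json
from pathlib import Path

def detect_quantization_method(model_dir):
    """Return the quantization method implied by the model files, or None."""
    root = Path(model_dir)
    # 1. ModelSlim marker file wins outright.
    if (root / "quant_model_description.json").exists():
        return "ascend"
    # 2. Otherwise look for the LLM-Compressor marker in config.json
    #    (checked both at top level and under quantization_config).
    cfg_path = root / "config.json"
    if cfg_path.exists():
        cfg = json.loads(cfg_path.read_text())
        quant_cfg = cfg.get("quantization_config", {})
        if "compressed-tensors" in (cfg.get("quant_method"),
                                    quant_cfg.get("quant_method")):
            return "compressed-tensors"
    # 3. Neither marker: serve as plain float weights.
    return None
```

Only the two lightweight JSON files are read, matching the "no `.safetensors` reads" claim above.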

**Technical approach:**
Hooked into `NPUPlatform.check_and_update_config()` to run detection
after `VllmConfig.__post_init__`. Since `quant_config` is already `None`
at that point, we explicitly recreate it via
`VllmConfig._get_quantization_config()` to trigger the full quantization
initialization pipeline.

| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added
`detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in
`check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error
handling for weight loading |
- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@d7e17aa

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>

[Misc] Drop patch_rope.py (vllm-project#6291)

Part of vllm-project#5304.

We have aligned with vLLM's latest change to `RotaryEmbeddingBase`, so this patch is no longer needed.

- vLLM version: v0.14.1
- vLLM main:
vllm-project/vllm@dc917cc

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[BugFix] [310p] Fix attention accuracy issue (vllm-project#6803)

This pull request resolves an attention accuracy issue by enhancing the
AttentionMaskBuilder310 to correctly handle the maximum model length.
The change ensures that the attention mask generation process is
properly parameterized by the model's configuration, rather than relying
on a fixed internal value. This leads to more accurate attention mask
creation, which is crucial for the correct functioning of the attention
mechanism.
Update fused_moe to main branch.
No
Qwen3 dense mode & moe model e2e test
- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

[Doc][Misc] Refactor skill documentation and add Claude support instructions (vllm-project#6817)

This PR refactors the documentation for vLLM Ascend skills.
- It renames and moves the `vllm-ascend-model-adapter` skill's README to
serve as a new top-level README for the `.agents` directory.
- It adds instructions on how to use the Ascend skills with Claude,
including a new README in the `.claude` directory.
- It updates `.gitignore` to exclude skills copied for Claude's use.
- Add main2main skill

This improves the documentation structure, making it more organized and
providing clear instructions for developers using these skills with
different tools.

No, this PR contains only documentation and repository configuration
changes. It does not affect any user-facing code functionality.

These changes are documentation-only and do not require specific
testing. The correctness of the instructions is being verified through
this review.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[Patch][Misc] Cleanup and update patches (vllm-project#6802)

This PR performs a cleanup and update of the patch mechanism in
`vllm-ascend`.

- Removes the obsolete patch `patch_deepseek.py`.
- Updates the central patch documentation in
`vllm_ascend/patch/__init__.py` to reflect these removals and additions,
re-numbering and re-organizing the patch list for better clarity.

No. These are internal changes to the patching mechanism and should not
affect users.

CI passed with new added/existing test.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (vllm-project#5472)

**BUG**
When using prefill-decode disaggregation + MTP + full graph
+asynchronous scheduling, the KV cache pulled by decode nodes from
prefill decodes does not include spec tokens. As a result, the
total_num_scheduled_tokens obtained by decode nodes from the scheduler
lacks spec tokens. When determining whether to enqueue the full graph on
decode nodes, the condition for uniform_decode `
scheduler_output.total_num_scheduled_tokens == self.input_batch.num_reqs
* max_query_len` is not met, leading to the current instance not being
enqueued into the full graph.
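The `uniform_decode` check quoted above can be illustrated with a toy version of the condition (the numbers are hypothetical):

```python
def is_uniform_decode(total_num_scheduled_tokens, num_reqs, max_query_len):
    # Condition quoted from the PR description: every request must schedule
    # exactly max_query_len tokens for the batch to enter the full graph.
    return total_num_scheduled_tokens == num_reqs * max_query_len

# 4 decode requests, MTP with 1 spec token -> max_query_len = 2.
print(is_uniform_decode(8, 4, 2))  # all spec tokens present
print(is_uniform_decode(7, 4, 2))  # one request missing its spec token
```

With one spec token missing, the total falls short of `num_reqs * max_query_len` and the instance drops out of the full graph, which is exactly the failure the padding fix prevents.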

The above situation leads to both full graph and eagle mode instances
coexisting in the decode instances. Due to the synchronization wait of
MoeDispatch, the decode instances in full graph are significantly slowed
down by the instance in eagle mode.

**Solution**
The scenario is PD separation + MTP + Full Graph + asynchronous
scheduling.
On the decode nodes, the spec tokens of the request with KV cache from P
need be padded. Then, the padded spec tokens will be rejected by
sampling. This operation ensures that the uniform_decode condition is
satisfied when determining whether decode nodes are included in the full
graph, thereby guaranteeing that all decode instances are present in the
full graph and avoiding synchronous waiting for MoeDispatch.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@5326c89

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>

[Doc][Release] Add release note skill (vllm-project#6824)

This PR adds the release note skills:
- `SKILL.md`: vLLM Ascend Releasing Note Writer
- `references/ref-past-release-notes-highlight.md`
It also adds `output/v0.13.0` examples, which were used by
vllm-project@2da476d

Inspired by: https://github.com/simon-mo/release-notes-writing/

No

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Co-authored-by: esmeetu <jasonailu87@gmail.com>

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

[Feat]support sequence parallelism by pass for VL models (vllm-project#5632)

[CI] Fix doc test fail when load model with error information: 'Stale file handle' (vllm-project#6832)

This PR fixes a `Stale file handle` error that occurs during doctests in
the CI environment. The error appears when loading models from
ModelScope, likely due to issues with network file systems used in CI.

The fix involves setting the `MODELSCOPE_HUB_FILE_LOCK` environment
variable to `false` in the `run_doctests.sh` script. This disables file
locking in the ModelScope hub, which is a common workaround for this
type of file system error.
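The workaround is a single environment variable; `run_doctests.sh` sets it with `export`, and the Python equivalent (for local debugging) would be:

```python
import os

# Disable ModelScope hub file locking, which trips over stale handles
# on the networked file systems used in CI.
os.environ["MODELSCOPE_HUB_FILE_LOCK"] = "false"
print(os.environ["MODELSCOPE_HUB_FILE_LOCK"])
```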

No, this change only affects the CI test execution environment and has
no impact on users.

This change is validated by the CI pipeline. A successful run of the
doctests indicates that the fix is effective.

Signed-off-by: leo-pony <nengjunma@outlook.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[Doc] fix the nit in docs (vllm-project#6826)

Refresh the doc, fix the nit in the docs

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

add release note for 0.15.0rc1 (vllm-project#6839)

Add release note for 0.15.0rc1

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[DOC] enable both flashcomm1 and cudagraph (vllm-project#6807)

This PR updates the DeepSeek-V3.2 documentation to include the latest
performance optimizations and configuration improvements.

- **Enable FlashComm1**: Added `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`
environment variable across all deployment scenarios to enable
FlashComm1 for improved communication performance
- **Layer Sharding**: Added `--additional-config '{"layer_sharding":
["q_b_proj", "o_proj"]}'` configuration to enable layer sharding for
better memory distribution
- **CUDA Graph Optimization**: Updated cudagraph capture sizes from
`[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]` to `[8, 16, 24, 32, 40,
48]`
- **Speculative Decoding**: Increased `num_speculative_tokens` from 2 to
3
- **Documentation Links**: Fixed request forwarding documentation to use
proper GitHub repository links

Yes, users can now follow the updated documentation to enable FlashComm1
and layer sharding for improved DeepSeek-V3.2 performance.

Existing documentation examples have been validated to ensure
configuration consistency across all deployment scenarios.

---

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Main2Main] Upgrade vLLM to 0226 (vllm-project#6813)

Breaking:
1. vllm-project/vllm#33452
2. vllm-project/vllm#33451
3. vllm-project/vllm#32567
4. vllm-project/vllm#32344

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: MrZ20 <2609716663@qq.com>

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface  (vllm-project#6811)

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into
eagle_proposer.py

This pull request significantly refactors the speculative decoding
mechanism by merging Parallel Context Processing (PCP) and Multi-Token
Prediction (MTP) functionalities directly into the eagle_proposer.py.
The changes aim to enhance the efficiency and correctness of distributed
speculative decoding, particularly by enabling the Eagle feature to work
seamlessly with the disable_padded interface. This involves detailed
adjustments to attention metadata, input/output processing, and state
management to ensure proper operation in parallel environments.

1. The PCP and MTP features are migrated to the eagle_proposer.py
2. The Eagle and PCP features are integrated
3. Enable the eagle feature to use the disable_padded interface

No

Tests and UT

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>

[CI] Add nightly test for Qwen3-235B-A22B with  mooncake layerwise connector (vllm-project#5441)

Add nightly test for Qwen3-235B-A22B with  mooncake layerwise connector.

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@81786c8

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>

[Doc][Misc] Update release notes for v0.15.0rc1 (vllm-project#6859)

This PR updates the release notes for `v0.15.0rc1` to:
- Mark the `310P MoE and W8A8 Support` feature as experimental.
- Add a note for `Kimi-K2.5 Model Support` clarifying that it has known
issues in vLLM 0.15.0 and requires manual patching to work correctly.

No, this is a documentation-only update.

N/A (documentation change).

- vLLM version: v0.16.0
- vLLM main:
vllm-project/vllm@15d76f7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[bugfix] Fixed an accuracy problem of gdn layer in graph (vllm-project#6822)

There will be random outputs if we run a model with GDN attention in graph mode:

```python
from vllm import LLM, SamplingParams

prompts = [
    "1. Who are you?",
]
# The second assignment overrides the first: greedy decoding, 5 tokens.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=5)

llm = LLM(
    model="/home/model/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
    distributed_executor_backend="mp",
    gpu_memory_utilization=0.7,
    speculative_config={
        "method": "qwen3_next_mtp",
        "num_speculative_tokens": 3,
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [8],
    },
    max_model_len=4096,
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{output.prompt_token_ids=}")
    print(f"{output.outputs[0].token_ids=}")
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Before applying this change, the output was:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 323, 279, 1112, 279]
Prompt: '1. Who are you?', Generated text: ' What and the... the'
```

After applying this change, the output is:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 374, 697, 829, 30]
Prompt: '1. Who are you?', Generated text: ' What is your name?'
```

**Why does this change solve the problem?**

Now, `query_start_loc` is padded because of `fia`.

For `gdn-attention`, however, the padded version of `query_start_loc` causes an accuracy problem.

So we need an unpadded version of `query_start_loc`, named `gdn_query_start_loc`, and use it in `gdn-attention`; with that, the output is correct.
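The padded/unpadded distinction can be sketched with plain prefix sums (the batch size and padding layout here are illustrative assumptions, not the actual tensor shapes in vllm-ascend):

```python
from itertools import accumulate

# 3 decode requests scheduling [2, 2, 1] tokens each.
num_scheduled = [2, 2, 1]

# Unpadded prefix sum: what gdn-attention needs to slice per request.
gdn_query_start_loc = [0, *accumulate(num_scheduled)]
print(gdn_query_start_loc)  # [0, 2, 4, 5]

# Padded variant (graph captured for a batch of 6): trailing entries
# repeat the last offset, so slicing with it would hand gdn-attention
# zero-length or misaligned segments.
padded_batch = 6
query_start_loc = gdn_query_start_loc + \
    [gdn_query_start_loc[-1]] * (padded_batch - len(num_scheduled))
print(query_start_loc)  # [0, 2, 4, 5, 5, 5, 5]
```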

N/A

As described above.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Signed-off-by: drslark <slarksblood@qq.com>

[CI] Refactor to speedup image building and CI Installation (vllm-project#6708)

1. Refactor the image workflow using cache-from to speed up builds

![build](https://github.com/user-attachments/assets/02135c12-0069-44f8-a3ec-5c2b4282448a)

Simultaneously refactored all Dockerfiles by placing layers that rarely
change before those that change frequently, improving build cache hit
rate.

2. Refactor E2E tests to use vllm-ascend container images, skipping C compilation when no C code has changed

![e2e](https://github.com/user-attachments/assets/49f5b166-0df3-41e1-8f71-b3bbbed17cfd)

In this case, the job only replaces the vllm-ascend source code and installs `requirements-dev.txt`, saving about 10 minutes before tests.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@9562912

Signed-off-by: wjunLu <wjunlu217@gmail.com>

clean 0.15.0 support (vllm-project#6852)

Clean up vllm 0.15.0 related code

- vLLM version: v0.16.0
- vLLM main:
vllm-project/vllm@15d76f7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
ZhuJiyang1 added a commit to ZhuJiyang1/vllm-ascend that referenced this pull request Feb 28, 2026
Signed-off-by: ZhuJiyang1 <3048369099@qq.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[CI] Add long and short prompt tests for DeepSeek-V3.2 (vllm-project#6536)

This version has no divisibility constraint between tp and mtp+1.
However, cudagraph_capture_sizes must be a common multiple of tp and
mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed
cudagraph_capture_sizes.

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

and we also add aime2025 test for dual-node deepseek 3.2 nightly test.

test at nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Feature][Quant] Auto-detect quantization format from model files (vllm-project#6645)

- Add automatic quantization format detection, eliminating the need to
manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files
(`quant_model_description.json` and `config.json`) at engine
initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected;
auto-detection only applies when the flag is omitted.

**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"`
(ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` →
`quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior

**Technical approach:**
Hooked into `NPUPlatform.check_and_update_config()` to run detection
after `VllmConfig.__post_init__`. Since `quant_config` is already `None`
at that point, we explicitly recreate it via
`VllmConfig._get_quantization_config()` to trigger the full quantization
initialization pipeline.

| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added
`detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in
`check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error
handling for weight loading |
- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@d7e17aa

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>

[Misc] Drop patch_rope.py (vllm-project#6291)

Part of vllm-project#5304.

We have align with vLLM's latest change for `RotaryEmbeddingBase`. Don't
need this patch anymore.

- vLLM version: v0.14.1
- vLLM main:
vllm-project/vllm@dc917cc

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[BugFix] [310p] Fix attention accuracy issue (vllm-project#6803)

This pull request resolves an attention accuracy issue by enhancing the
AttentionMaskBuilder310 to correctly handle the maximum model length.
The change ensures that the attention mask generation process is
properly parameterized by the model's configuration, rather than relying
on a fixed internal value. This leads to more accurate attention mask
creation, which is crucial for the correct functioning of the attention
mechanism.
Update fused_moe to main branch.
No
Qwen3 dense mode & moe model e2e test
- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

[Doc][Misc] Refactor skill documentation and add Claude support instructions (vllm-project#6817)

This PR refactors the documentation for vLLM Ascend skills.
- It renames and moves the `vllm-ascend-model-adapter` skill's README to
serve as a new top-level README for the `.agents` directory.
- It adds instructions on how to use the Ascend skills with Claude,
including a new README in the `.claude` directory.
- It updates `.gitignore` to exclude skills copied for Claude's use.
- Add main2main skill

This improves the documentation structure, making it more organized and
providing clear instructions for developers using these skills with
different tools.

No, this PR contains only documentation and repository configuration
changes. It does not affect any user-facing code functionality.

These changes are documentation-only and do not require specific
testing. The correctness of the instructions is being verified through
this review.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[Patch][Misc] Cleanup and update patches (vllm-project#6802)

This PR performs a cleanup and update of the patch mechanism in
`vllm-ascend`.

- Removes several obsolete patches: `patch_deepseek.py`.
- Updates the central patch documentation in
`vllm_ascend/patch/__init__.py` to reflect these removals and additions,
re-numbering and re-organizing the patch list for better clarity.

No. These are internal changes to the patching mechanism and should not
affect users.

CI passed with new added/existing test.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (vllm-project#5472)

**BUG**
When using prefill-decode disaggregation + MTP + full graph
+asynchronous scheduling, the KV cache pulled by decode nodes from
prefill decodes does not include spec tokens. As a result, the
total_num_scheduled_tokens obtained by decode nodes from the scheduler
lacks spec tokens. When determining whether to enqueue the full graph on
decode nodes, the condition for uniform_decode `
scheduler_output.total_num_scheduled_tokens == self.input_batch.num_reqs
* max_query_len` is not met, leading to the current instance not being
enqueued into the full graph.

The above situation leads to both full graph and eagle mode instances
coexisting in the decode instances. Due to the synchronization wait of
MoeDispatch, the decode instances in full graph are significantly slowed
down by the instance in eagle mode.

**Solution**
The scenario is PD separation + MTP + Full Graph + asynchronous
scheduling.
On the decode nodes, the spec tokens of the request with KV cache from P
need be padded. Then, the padded spec tokens will be rejected by
sampling. This operation ensures that the uniform_decode condition is
satisfied when determining whether decode nodes are included in the full
graph, thereby guaranteeing that all decode instances are present in the
full graph and avoiding synchronous waiting for MoeDispatch.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@5326c89

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>

[Doc][Release] Add release note skill (vllm-project#6824)

This PR adds the releaseing note skills:
- `SKILL.md`: vLLM Ascend Releasing Note Writer
- `references/ref-past-release-notes-highlight.md`:
And also add a `output/v0.13.0` examples which was used by
vllm-project@2da476d

Inspired: https://github.com/simon-mo/release-notes-writing/

No

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Co-authored-by: esmeetu <jasonailu87@gmail.com>

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

[Feat]support sequence parallelism by pass for VL models (vllm-project#5632)

[CI] Fix doc test fail when load model with error information: 'Stale file handle' (vllm-project#6832)

This PR fixes a `Stale file handle` error that occurs during doctests in
the CI environment. The error appears when loading models from
ModelScope, likely due to issues with network file systems used in CI.

The fix involves setting the `MODELSCOPE_HUB_FILE_LOCK` environment
variable to `false` in the `run_doctests.sh` script. This disables file
locking in the ModelScope hub, which is a common workaround for this
type of file system error.

No, this change only affects the CI test execution environment and has
no impact on users.

This change is validated by the CI pipeline. A successful run of the
doctests indicates that the fix is effective.

Signed-off-by: leo-pony <nengjunma@outlook.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[CI] Add long and short prompt tests for DeepSeek-V3.2 (vllm-project#6536)

This version has no divisibility constraint between tp and mtp+1.
However, cudagraph_capture_sizes must be a common multiple of tp and
mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed
cudagraph_capture_sizes.

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

and we also add aime2025 test for dual-node deepseek 3.2 nightly test.

test at nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Feature][Quant] Auto-detect quantization format from model files (vllm-project#6645)

- Add automatic quantization format detection, eliminating the need to
manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files
(`quant_model_description.json` and `config.json`) at engine
initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected;
auto-detection only applies when the flag is omitted.

**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"`
(ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` →
`quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior

**Technical approach:**
Hooked into `NPUPlatform.check_and_update_config()` to run detection
after `VllmConfig.__post_init__`. Since `quant_config` is already `None`
at that point, we explicitly recreate it via
`VllmConfig._get_quantization_config()` to trigger the full quantization
initialization pipeline.
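The detection priority above can be sketched as follows (illustrative, not the PR's exact code; the nested `quantization_config` layout in `config.json` is an assumption):

```python
import json
import os

def detect_quantization_method(model_dir: str):
    """Inspect only lightweight JSON files; no .safetensors reads."""
    # Priority 1: ModelSlim marker file
    if os.path.isfile(os.path.join(model_dir, "quant_model_description.json")):
        return "ascend"
    # Priority 2: LLM-Compressor marker in config.json
    config_path = os.path.join(model_dir, "config.json")
    if os.path.isfile(config_path):
        with open(config_path) as f:
            config = json.load(f)
        quant = config.get("quantization_config") or {}
        if (quant.get("quant_method") == "compressed-tensors"
                or config.get("quant_method") == "compressed-tensors"):
            return "compressed-tensors"
    # Priority 3: default float behavior
    return None
```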

| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added
`detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in
`check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error
handling for weight loading |
- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@d7e17aa

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>

[Misc] Drop patch_rope.py (vllm-project#6291)

Part of vllm-project#5304.

We have aligned with vLLM's latest change for `RotaryEmbeddingBase`, so
this patch is no longer needed.

- vLLM version: v0.14.1
- vLLM main:
vllm-project/vllm@dc917cc

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[BugFix] [310p] Fix attention accuracy issue (vllm-project#6803)

This pull request resolves an attention accuracy issue by enhancing the
AttentionMaskBuilder310 to correctly handle the maximum model length.
The change ensures that the attention mask generation process is
properly parameterized by the model's configuration, rather than relying
on a fixed internal value. This leads to more accurate attention mask
creation, which is crucial for the correct functioning of the attention
mechanism.
Update fused_moe to main branch.
No
Qwen3 dense mode & moe model e2e test
- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

[Doc][Misc] Refactor skill documentation and add Claude support instructions (vllm-project#6817)

This PR refactors the documentation for vLLM Ascend skills.
- It renames and moves the `vllm-ascend-model-adapter` skill's README to
serve as a new top-level README for the `.agents` directory.
- It adds instructions on how to use the Ascend skills with Claude,
including a new README in the `.claude` directory.
- It updates `.gitignore` to exclude skills copied for Claude's use.
- Add main2main skill

This improves the documentation structure, making it more organized and
providing clear instructions for developers using these skills with
different tools.

No, this PR contains only documentation and repository configuration
changes. It does not affect any user-facing code functionality.

These changes are documentation-only and do not require specific
testing. The correctness of the instructions is being verified through
this review.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[Patch][Misc] Cleanup and update patches (vllm-project#6802)

This PR performs a cleanup and update of the patch mechanism in
`vllm-ascend`.

- Removes the obsolete patch `patch_deepseek.py`.
- Updates the central patch documentation in
`vllm_ascend/patch/__init__.py` to reflect these removals and additions,
re-numbering and re-organizing the patch list for better clarity.

No. These are internal changes to the patching mechanism and should not
affect users.

CI passed with new added/existing test.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (vllm-project#5472)

**BUG**
When using prefill-decode disaggregation + MTP + full graph +
asynchronous scheduling, the KV cache pulled by decode nodes from
prefill nodes does not include spec tokens. As a result, the
total_num_scheduled_tokens obtained by decode nodes from the scheduler
lacks spec tokens. When deciding whether to enqueue the full graph on
decode nodes, the uniform_decode condition
`scheduler_output.total_num_scheduled_tokens == self.input_batch.num_reqs * max_query_len`
is not met, so the current instance is not enqueued into the full graph.

The above situation leads to both full graph and eagle mode instances
coexisting in the decode instances. Due to the synchronization wait of
MoeDispatch, the decode instances in full graph are significantly slowed
down by the instance in eagle mode.

**Solution**
The scenario is PD disaggregation + MTP + full graph + asynchronous
scheduling. On the decode nodes, the spec tokens of requests whose KV
cache comes from P need to be padded; the padded spec tokens are then
rejected by sampling. This ensures the uniform_decode condition is
satisfied when deciding whether decode nodes enter the full graph,
guaranteeing that all decode instances are present in the full graph
and avoiding synchronous waits on MoeDispatch.
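A toy illustration of the uniform_decode condition and why padding restores it (illustrative numbers, not the PR's code):

```python
def is_uniform_decode(total_num_scheduled_tokens: int, num_reqs: int,
                      max_query_len: int) -> bool:
    # Full-graph eligibility check quoted above.
    return total_num_scheduled_tokens == num_reqs * max_query_len

# 4 decode requests, MTP proposing 2 spec tokens (max_query_len = 1 + 2 = 3).
# The KV cache pulled from P lacks spec tokens, so only 1 token per
# request was scheduled and the check fails:
assert not is_uniform_decode(4 * 1, num_reqs=4, max_query_len=3)
# After padding each request's spec tokens (later rejected by sampling),
# the check passes and the instance joins the full graph:
assert is_uniform_decode(4 * 3, num_reqs=4, max_query_len=3)
```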

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@5326c89

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>

[Doc][Release] Add release note skill (vllm-project#6824)

This PR adds the release note skills:
- `SKILL.md`: vLLM Ascend Releasing Note Writer
- `references/ref-past-release-notes-highlight.md`
It also adds an `output/v0.13.0` example, which was used by
vllm-project@2da476d

Inspired: https://github.com/simon-mo/release-notes-writing/

No

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Co-authored-by: esmeetu <jasonailu87@gmail.com>

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

[Feat]support sequence parallelism by pass for VL models (vllm-project#5632)

[CI] Fix doc test fail when load model with error information: 'Stale file handle' (vllm-project#6832)

This PR fixes a `Stale file handle` error that occurs during doctests in
the CI environment. The error appears when loading models from
ModelScope, likely due to issues with network file systems used in CI.

The fix involves setting the `MODELSCOPE_HUB_FILE_LOCK` environment
variable to `false` in the `run_doctests.sh` script. This disables file
locking in the ModelScope hub, which is a common workaround for this
type of file system error.

No, this change only affects the CI test execution environment and has
no impact on users.

This change is validated by the CI pipeline. A successful run of the
doctests indicates that the fix is effective.

Signed-off-by: leo-pony <nengjunma@outlook.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[Doc] fix the nit in docs (vllm-project#6826)

Refresh the doc, fix the nit in the docs

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

add release note for 0.15.0rc1 (vllm-project#6839)

Add release note for 0.15.0rc1

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[DOC] enable both flashcomm1 and cudagraph (vllm-project#6807)

This PR updates the DeepSeek-V3.2 documentation to include the latest
performance optimizations and configuration improvements.

- **Enable FlashComm1**: Added `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`
environment variable across all deployment scenarios to enable
FlashComm1 for improved communication performance
- **Layer Sharding**: Added `--additional-config '{"layer_sharding":
["q_b_proj", "o_proj"]}'` configuration to enable layer sharding for
better memory distribution
- **CUDA Graph Optimization**: Updated cudagraph capture sizes from
`[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]` to `[8, 16, 24, 32, 40,
48]`
- **Speculative Decoding**: Increased `num_speculative_tokens` from 2 to
3
- **Documentation Links**: Fixed request forwarding documentation to use
proper GitHub repository links

Yes, users can now follow the updated documentation to enable FlashComm1
and layer sharding for improved DeepSeek-V3.2 performance.

Existing documentation examples have been validated to ensure
configuration consistency across all deployment scenarios.

---

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Main2Main] Upgrade vLLM to 0226 (vllm-project#6813)

Breaking:
1. vllm-project/vllm#33452
2. vllm-project/vllm#33451
3. vllm-project/vllm#32567
4. vllm-project/vllm#32344

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: MrZ20 <2609716663@qq.com>

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface (vllm-project#6811)

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into
eagle_proposer.py

This pull request significantly refactors the speculative decoding
mechanism by merging Parallel Context Processing (PCP) and Multi-Token
Prediction (MTP) functionalities directly into the eagle_proposer.py.
The changes aim to enhance the efficiency and correctness of distributed
speculative decoding, particularly by enabling the Eagle feature to work
seamlessly with the disable_padded interface. This involves detailed
adjustments to attention metadata, input/output processing, and state
management to ensure proper operation in parallel environments.

1. The PCP and MTP features are migrated to the eagle_proposer.py
2. The Eagle and PCP features are integrated
3. Enable the eagle feature to use the disable_padded interface

No

Tests and UT

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>

[CI] Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector (vllm-project#5441)

Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector.

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@81786c8

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>

[Doc][Misc] Update release notes for v0.15.0rc1 (vllm-project#6859)

This PR updates the release notes for `v0.15.0rc1` to:
- Mark the `310P MoE and W8A8 Support` feature as experimental.
- Add a note for `Kimi-K2.5 Model Support` clarifying that it has known
issues in vLLM 0.15.0 and requires manual patching to work correctly.

No, this is a documentation-only update.

N/A (documentation change).

- vLLM version: v0.16.0
- vLLM main:
vllm-project/vllm@15d76f7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[bugfix] Fixed an accuracy problem of gdn layer in graph (vllm-project#6822)

There will be random outputs if we run the model with GDN attention in
graph mode:

```python
prompts = [
    "1. Who are you?",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=5)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
            tensor_parallel_size=4,

            distributed_executor_backend="mp",
            gpu_memory_utilization=0.7,
            speculative_config={
                "method": "qwen3_next_mtp",
                "num_speculative_tokens": 3,
            },

            compilation_config={
                "cudagraph_mode": "FULL_DECODE_ONLY",
                "cudagraph_capture_sizes": [8],
            },

            max_model_len=4096,
            enable_prefix_caching=False)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{output.prompt_token_ids=}")
    print(f"{output.outputs[0].token_ids=}")
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Before applying this change, the output was:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 323, 279, 1112, 279]
Prompt: '1. Who are you?', Generated text: ' What and the... the'
```

After applying this change, the output is:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 374, 697, 829, 30]
Prompt: '1. Who are you?', Generated text: ' What is your name?'
```

**Why does this change solve the problem?**

`query_start_loc` is now padded because of `fia`.

However, for `gdn-attention`, the padded version of `query_start_loc`
causes an accuracy problem.

So we keep an unpadded version named `gdn_query_start_loc` and use it
in `gdn-attention`, which works fine.
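A toy illustration of padded vs. unpadded offsets (hypothetical numbers; the assumption is that `fia` pads every request to the max query length):

```python
def cumsum_starts(query_lens):
    """Cumulative start offsets, the shape of query_start_loc."""
    starts = [0]
    for n in query_lens:
        starts.append(starts[-1] + n)
    return starts

# Three requests with 2, 3, and 1 scheduled tokens.
query_lens = [2, 3, 1]
gdn_query_start_loc = cumsum_starts(query_lens)  # unpadded: [0, 2, 5, 6]
padded_start_loc = cumsum_starts([3] * 3)        # padded:   [0, 3, 6, 9]
```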

N/A

As described above.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

Signed-off-by: drslark <slarksblood@qq.com>

[CI] Refactor to speedup image building and CI Installation (vllm-project#6708)

1. Refactor the image workflow using cache-from to speed up builds

![build](https://github.com/user-attachments/assets/02135c12-0069-44f8-a3ec-5c2b4282448a)

Simultaneously refactored all Dockerfiles by placing layers that rarely
change before those that change frequently, improving build cache hit
rate.

2. Refactor E2E tests to use vllm-ascend container images, skipping C
compilation when no C code is changed

![e2e](https://github.com/user-attachments/assets/49f5b166-0df3-41e1-8f71-b3bbbed17cfd)

In this case, the job only replaces the vllm-ascend source code and
installs `requirements-dev.txt`, saving about 10 minutes before tests.

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@9562912

Signed-off-by: wjunLu <wjunlu217@gmail.com>

clean 0.15.0 support (vllm-project#6852)

Clean up vllm 0.15.0 related code

- vLLM version: v0.16.0
- vLLM main:
vllm-project/vllm@15d76f7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>
ZhuJiyang1 pushed a commit to ZhuJiyang1/vllm-ascend that referenced this pull request Feb 28, 2026
The basic configs are extracted and reused for eplb UT. This is done so
that if the basic configs are changed later, eplb UT does not need to be
modified repeatedly.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: bigsir007 <xujiacheng12@huawei.com>
Co-authored-by: bigsir007 <xujiacheng12@huawei.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[CI]Fixed the spell check function in `typos.toml` (#6753)

The incorrect regular expression `.*[UE4M3|ue4m3].*` uses a character
class rather than an alternation, so it actually ignores every
identifier containing any of the characters `U, E, 4, M, 3, |, u, e, m`.

```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
===fix===>
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*(UE4M3|ue4m3).*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
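The difference is easy to demonstrate: a character class matches any single listed character, while a parenthesized group matches a whole alternative:

```python
import re

buggy = re.compile(r".*[UE4M3|ue4m3].*")  # class: any of U,E,4,M,3,|,u,e,m
fixed = re.compile(r".*(UE4M3|ue4m3).*")  # alternation: the literal tokens

# The buggy pattern ignores unrelated identifiers that merely contain
# one of the class characters (here 'e' and 'm'):
assert buggy.fullmatch("some_name")
assert not fixed.fullmatch("some_name")
# The fixed pattern only matches identifiers containing the full token:
assert fixed.fullmatch("scale_ue4m3_t")
```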

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: MrZ20 <2609716663@qq.com>

[Doc] modify glm doc (#6770)

1. add description of another version of glm5-w4a8 weight
2. update the introduction of  installation
3. introduce a script to enable bf16 MTP
N/A
N/A
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: yydyzr <liuyuncong1@huawei.com>

[CI] unlock when load model (#6771)

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: leo-pony <nengjunma@outlook.com>

Refactor the ops PyTorch adapter, cleanup for csrc/torch_binding.cpp (#6732)

Refactor the ops PyTorch adapter and clean up csrc/torch_binding.cpp;
more details in
https://github.com/vllm-project/vllm-ascend/issues/6486

No

install the new package to test the new modification, here is the
result:

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>

[EPLB][Bugfix] Bugfix for ineffective dynamic eplb (#6653)

Only end-to-end precision is monitored in the UT, and logs are not
printed at the key places. As a result, an EPLB that does not take
effect is not intercepted.
1. The forward_before function is added back.
2. Unnecessary logs are deleted and key logs are added.
3. Warm-up of algorithm 3 is added.

![Snipaste_2026-02-10_15-57-31](https://github.com/user-attachments/assets/03813e5f-3d19-42d8-8118-76223afe8298)

Okay, the user is asking, "What is deep learning?" I need to explain
this in a clear and concise way. Let me start by recalling what I know
about deep learning. It's a subset of machine learning, right? So first,
I should mention that it's part of machine learning, which itself is a
branch of AI. Then, the key aspect of deep learning is the use of neural
networks with multiple layers. These are called deep neural networks.

Wait, I should define neural networks first. Maybe start with the
basics. A neural network is inspired by the human brain, with layers of
nodes (neurons) that process data. But deep learning specifically refers
to networks with many layers—hence "deep." So the term "deep" comes from
the number of layers.

I should explain how deep learning works. It involves training these
networks on large datasets, allowing them to automatically learn
features from the data. Unlike traditional machine learning, where you
might have to manually extract features, deep learning models can do
this automatically. That's a key point. For example, in image
recognition, a deep learning model can learn to detect edges, shapes,
and then more complex patterns without human intervention.

Applications are important too. The user might want to know where deep
learning is used. Common examples include image and speech recognition,
natural language processing, autonomous vehicles, and recommendation
systems. Maybe mention specific technologies like self-driving cars
using computer vision or virtual assistants like Siri or Alexa

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/13397841ab469cecf1ed425c3f52a9ffc38139b5

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>

[Bugfix] Fix wrong computed_tokens when meeting an exception. (#6522)

Fix wrong computed_tokens when an exception occurs. This pull request
addresses a bug in the KV transfer mechanism where an exception during
token lookup operations could lead to an incorrect count of
computed_tokens. By modifying the exception handling in both the lookup
and lookup_scheduler functions to return 0 instead of the start index,
the system now correctly indicates that no tokens were successfully
processed when a remote connection failure occurs. This enhancement
improves the robustness and accuracy of token management within the
vllm_ascend distributed KV pool.
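The behavioral change can be sketched in a few lines of plain Python (the function and helper names here are illustrative, not the actual vllm_ascend API; the stub simulates a remote failure):

```python
def remote_kv_pool_lookup(token_ids):
    # Stand-in for the real remote lookup; here it simulates a remote failure.
    raise ConnectionError("remote KV pool unreachable")

def lookup(token_ids, start_index):
    """Count computed tokens found in the remote KV pool (hypothetical sketch)."""
    try:
        hits = remote_kv_pool_lookup(token_ids[start_index:])
        return start_index + hits
    except ConnectionError:
        # Returning 0 (instead of start_index) signals that no tokens were
        # successfully processed when the remote connection fails.
        return 0

print(lookup(list(range(16)), start_index=8))  # 0: the failed lookup counts no tokens
```

Before the fix, the except branch returned `start_index`, so the scheduler believed 8 tokens had been computed even though the remote pool was never reached.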

NO.

Signed-off-by: xleoken <xleoken@163.com>

[Lint]Style: Convert `test/` to ruff format(Batch #5) (#6747)

| File Path |
| :--- |
| `tests/e2e/singlecard/compile/backend.py` |
| `tests/e2e/singlecard/compile/test_graphex_norm_quant_fusion.py` |
| `tests/e2e/singlecard/compile/test_graphex_qknorm_rope_fusion.py` |
| `tests/e2e/singlecard/compile/test_norm_quant_fusion.py` |
| `tests/e2e/singlecard/model_runner_v2/test_basic.py` |
| `tests/e2e/singlecard/test_aclgraph_accuracy.py` |
| `tests/e2e/singlecard/test_aclgraph_batch_invariant.py` |
| `tests/e2e/singlecard/test_aclgraph_mem.py` |
| `tests/e2e/singlecard/test_async_scheduling.py` |
| `tests/e2e/singlecard/test_auto_fit_max_mode_len.py` |
| `tests/e2e/singlecard/test_batch_invariant.py` |
| `tests/e2e/singlecard/test_camem.py` |
| `tests/e2e/singlecard/test_completion_with_prompt_embeds.py` |
| `tests/e2e/singlecard/test_cpu_offloading.py` |
| `tests/e2e/singlecard/test_guided_decoding.py` |
| `tests/e2e/singlecard/test_ilama_lora.py` |
| `tests/e2e/singlecard/test_llama32_lora.py` |
| `tests/e2e/singlecard/test_models.py` |
| `tests/e2e/singlecard/test_multistream_overlap_shared_expert.py` |
| `tests/e2e/singlecard/test_quantization.py` |
| `tests/e2e/singlecard/test_qwen3_multi_loras.py` |
| `tests/e2e/singlecard/test_sampler.py` |
| `tests/e2e/singlecard/test_vlm.py` |
| `tests/e2e/singlecard/test_xlite.py` |
| `tests/e2e/singlecard/utils.py` |

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: MrZ20 <2609716663@qq.com>

[Feat] 310p supports PrefillCacheHit State (#6756)

This PR extends the Ascend 310P attention backend to support the
`PrefillCacheHit` state. Previously, only `PrefillNoCache`,
`DecodeOnly`, and `ChunkedPrefill` were supported.
This PR handles this state by routing it to the existing
`forward_chunked_prefill_310` implementation, which is suitable for this
scenario.
The changes also include refactoring the main `forward_impl` dispatch
method for better clarity and updating unit tests to cover the new state
and ensure correctness.
No
Accuracy test when chunked prefill is disabled.
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

[main]update release note & support matrix (#6759)

Update release note & support matrix to add experimental tag for
features and models.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

0.13.0 branch: https://github.com/vllm-project/vllm-ascend/pull/6751

Signed-off-by: zzzzwwjj <1183291235@qq.com>

[EPLB] Reduce the memory used for heat aggregation (#6729)

If dist.all_gather is used directly, 2 x HCCL_BUFFSIZE memory will be
consumed, but the actual memory required for hotspot aggregation is less
than 1 MB. Therefore, a separate small communication domain is created
for it.

Original:

![1](https://github.com/user-attachments/assets/8880b461-c26f-497c-9a05-2ca60cc46aa4)
Current:

![2](https://github.com/user-attachments/assets/c9da32b5-9200-4fa2-aff9-d8c4978ac602)

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>

upgrade main to 0212 (#6712)

Fixes `transformers_utils/processors/__init__` import error, due to
https://github.com/vllm-project/vllm/pull/33247
Fixes the Fused MoE breakage introduced by the `MoERunner` abstraction, due to
https://github.com/vllm-project/vllm/pull/32344

> Delete `AscendMoERunner` when
https://github.com/vllm-project/vllm/pull/35178 is merged

Fixes `Make Qwen3VL compatible with Transformers v5`, due to
https://github.com/vllm-project/vllm/pull/34262

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: wxsIcey <1790571317@qq.com>

[Feat]ds3.2 support pcp (#6733)

Adapt the ds3.2 model to support the PCP feature.

The solution is as follows: when saving the KV cache, first perform an
allgather operation on the KV tensors, and then each node saves its own
copy. When the attention or indexer performs calculations, it first
all-gathers the KV cache and then computes.

No
```
02/12 23:05:10 - AISBench - INFO - Running 1-th replica of evaluation
02/12 23:05:10 - AISBench - INFO - Task [vllm-api-general-chat/gsm8k]: {'accuracy': 96.35416666666667, 'type': 'GEN'}
02/12 23:05:10 - AISBench - INFO - time elapsed: 2.87s
02/12 23:05:12 - AISBench - INFO - Evaluation tasks completed.
02/12 23:05:12 - AISBench - INFO - Summarizing evaluation results...
dataset       version    metric    mode      vllm-api-general-chat
gsm8kdataset  -          accuracy  gen                       96.35
```

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

[Nightly] Increase VLLM_ENGINE_READY_TIMEOUT_S to avoid nightly failure (#6778)

After some observation, I found some cases failed for timeout, just like
https://github.com/vllm-project/vllm-ascend/actions/runs/22280996034/job/64487867977#step:9:921
and
https://github.com/vllm-project/vllm-ascend/actions/runs/22315540111/job/64574590762#step:9:1809,
this may caused by the excessively long model loading time (currently we
are still loading weights from network storage), it is necessary to
adjust the timeout seconds 600s -> 1800s

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: wangli <wangli858794774@gmail.com>

[Platform] Enable ARM-only CPU binding with NUMA-balanced A3 policy and update docs/tests (#6686)

- Keeps enable_cpu_binding default on, but skips binding on non-ARM CPUs
inside bind_cpus, with a clear log.
- Uses a table-driven binding policy: A3 uses NUMA-balanced binding;
other device types use NUMA-affinity binding.
- Updates docs to reflect the exact behavior and adds/updates unit tests
for the new logic.

- Yes. CPU binding is now enabled by default via additional_config, and
documented in the user guide.
- CPU binding behavior differs by device type (A3 vs. others).

Added/updated unit tests:

test_cpu_binding.py
1. test_binding_mode_table covers A2 vs A3 binding mode mapping.
2. test_build_cpu_pools_fallback_to_numa_balanced covers fallback when
affinity info is missing.
3. TestBindingSwitch.test_is_arm_cpu covers ARM/x86/unknown arch
detection.
4. test_bind_cpus_skip_non_arm covers the non-ARM skip path in bind_cpus.

test_worker_v1.py
1. Updated mocks for enable_cpu_binding default True to align with new
config default.

- vLLM version: v0.14.1
- vLLM main: d7de043

---------

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>

[KVPool][BugFix] Correctly initialize head_or_tp_rank for mooncake backend (#6498)

Fixes a problem where local priority was not used in the A2 environment
on the Mooncake node.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>

[Refactor][Bugfix] Use upstream `mem_utils` for profiling and correct non-torch memory recorded during profiling (#6625)

1. Following https://github.com/vllm-project/vllm/pull/32322, use the
`memory_profiling` context manager from vllm for profiling.
2. Fix wrong non-torch memory value recorded during profiling, which is
not its peak during inference.

---
**More details about point 2:**

After profiling, the non-torch memory value we recorded is lower than
the value during real inference. This is mainly because of the different
memory management behaviour between `torch.cuda.empty_cache()` and
`torch.npu.empty_cache()`.

With regard to `torch.cuda.empty_cache()`, it only recycles the unused
memory in the PyTorch memory pool (i.e., memory managed by the PyTorch
caching allocator), **with no effect on non-torch memory**. However,
`torch.npu.empty_cache()` has a totally different memory management
mechanism: it may call `aclrtSynchronize` and **enable the Ascend
runtime to free up non-torch memory**.

Thus, the non-torch memory value we recorded after
`torch.npu.empty_cache()` is much lower than its peak during profiling.

Resolution:

We record the peak non-torch memory value
(`non_torch_memory_before_empty_cache`) after profiling, but before
`torch.npu.empty_cache()`. Then, we add the diff
(`non_torch_memory_cleared_by_empty_cache =
non_torch_memory_before_empty_cache - self.non_torch_memory`) to
non-torch memory when calculating available KV cache memory, which will
lead to less KV cache memory (i.e., it's safer to avoid OOM issues).
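As a toy arithmetic sketch of this adjustment (the total-memory and weight figures are made-up numbers; the 0.90 G and 1.08 G non-torch values are the ones reported for this PR):

```python
# Illustrative figures in GiB; only the two non-torch values come from this PR.
total_device_memory = 60.0
weights_and_torch = 50.0

non_torch_after_empty_cache = 0.90   # recorded after torch.npu.empty_cache()
non_torch_before_empty_cache = 1.08  # peak recorded before empty_cache()

# Memory that empty_cache() silently freed; added back so the KV budget is safe.
cleared_by_empty_cache = non_torch_before_empty_cache - non_torch_after_empty_cache

kv_cache_memory = total_device_memory - weights_and_torch - (
    non_torch_after_empty_cache + cleared_by_empty_cache
)
print(round(kv_cache_memory, 2))  # 8.92 with these illustrative numbers
```

Using the peak (1.08 G) rather than the post-`empty_cache()` value (0.90 G) shrinks the KV cache budget slightly, which is the safer direction for avoiding OOM.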

---
> [!NOTE]
> This PR needs to wait for main2main to align with the latest vllm
commit before merging.

no.

Before this PR, the non-torch memory we used to calculate available KV
cache memory is **0.90 G**, whereas its peak during real inference is
**1.08 G**, diff: **182.00 M**.

After this PR, we add this diff to non-torch memory after profiling and
thus make the profiling results more accurate.
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d7e17aaacd5ed1b4b4be6bcfef3a1b7cbc84fc9a

---------

Signed-off-by: shen-shanshan <467638484@qq.com>

[Bugfix] Add the missing parentheses to @torch.inference_mode (#6757)

This PR fixes a bug in `vllm_ascend/worker/model_runner_v1.py` where the
`@torch.inference_mode` decorator was used without parentheses. Using
the decorator without instantiation is deprecated and may not correctly
disable gradient calculations, leading to performance degradation and
increased memory usage during inference. This change adds the required
parentheses to ensure `torch.inference_mode` is applied correctly.

No.

The change is a minor syntax correction. Existing CI tests should cover
this.
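The pitfall can be sketched in pure Python (this toy class mimics, rather than reproduces, the internals of a context-manager decorator like `torch.inference_mode`):

```python
class inference_mode:
    """Toy stand-in for a context-manager decorator such as torch.inference_mode."""
    def __init__(self, mode=True):
        self.mode = mode

    def __call__(self, func):
        def wrapper(*args, **kwargs):
            # A real implementation would enter/exit the mode around the call.
            return func(*args, **kwargs)
        return wrapper

@inference_mode()   # correct: instantiate first, then decorate
def good():
    return "ok"

@inference_mode     # wrong: `bad` itself is passed as the `mode` argument
def bad():
    return "ok"

print(good())                           # the decorated function still works
print(isinstance(bad, inference_mode))  # True: the function was swallowed
```

Without parentheses, the function object is consumed as the constructor argument, so nothing enters the mode at call time, which is exactly why the missing `()` could leave gradient tracking enabled.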

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[DOC] add request forwarding (#6780)

- New section: "Request Forwarding" documentation in
docs/source/tutorials/models/DeepSeek-V3.2.md
- Environment fix: Changed VLLM_ASCEND_ENABLE_FLASHCOMM1 from 0 to 1 in
the DeepSeek-V3 configuration examples

Documentation update only - provides new configuration guidance for
request forwarding setups

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[fix]change num_commmon_tokens to num_common_tokens (#6792)

Change num_commmon_tokens to num_common_tokens in
vllm_ascend/_310p/model_runner_310p.py; the typo caused a CI test failure.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Co-authored-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Bugfix] Support Kimi-K2.5 models (#6755)

This PR supports the Kimi-K2.5 models on NPU with bf16 and w4a8
weights.
The corresponding PR in the vllm community has been merged:
https://github.com/vllm-project/vllm/pull/34501

- No.

We test the Kimi-K2.5 weights. The weights path:
https://modelscope.cn/models/Eco-Tech/Kimi-K2.5-W4A8
Successfully ran on 910B NPU using vllm-ascend with the w4a8 weights.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: LoganJane <LoganJane73@hotmail.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Bugfix] fix bug for mtp (#6514)

fix(mtp): resolve MTP core bugs and enhance eager mode test cases
1. Resolved critical issues in eager mode MTP core execution logic;
2. Fixed functional bugs in the _update_states_after_model_execute
function;
3. Updated and released test_mtp_qwen3_next.py to validate eager mode
acceptance rate.
None

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: Bowen-Leee <caoshankuangren@gmail.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Bugfix] Fix DeepseekV3.1 Accuracy issue (#6805)

In order to adapt to the GLM model, logits were passed into the sampler,
which can cause accuracy issues in version 0.15.0.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: GDzhu01 <809721801@qq.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Doc][Feature] Add vLLM Ascend development guidelines AGENTS.md (#6797)

This PR adds a new document, `AGENTS.md`, which provides detailed
development guidelines for contributors to the vLLM Ascend project.
These guidelines cover code style, testing, NPU-specific considerations,
and the contribution process to ensure code quality and consistency.

No, this is a documentation-only update for developers.

This is a documentation change and does not require testing.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Doc][Skill] Introduce AI-assisted model-adaptation workflow for vllm-ascend (#6731)

This PR introduces the **first AI-assisted model-adaptation skill
package** for `vllm-ascend`.

The goal is to make model adaptation work (especially for recurring
feature-request issues) **repeatable, auditable, and easier to hand
off**.

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
-
`.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
-
`.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
-
`.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
-
`.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
- implementation roots: `/vllm-workspace/vllm` and
`/vllm-workspace/vllm-ascend`
- serving root: `/workspace`
- model path convention: `/models/<model-name>`

2. **Validation strategy**
- Stage A: fast `--load-format dummy` gate
- Stage B: mandatory real-weight gate before sign-off
- avoid false-ready by requiring request-level checks (not startup log
only)

3. **Feature-first verification checklist**
- ACLGraph / EP / flashcomm1 / MTP / multimodal
- explicit `supported / unsupported / not-applicable /
checkpoint-missing` outcomes

4. **Delivery contract**
- minimal scoped code changes
- required artifacts (Chinese report + runbook, e2e config YAML,
tutorial doc)
- one signed commit in delivery repo

- No runtime/kernel/model patch is included in this PR.
- No direct model support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up
PRs using this skill as the workflow baseline.

This gives the repo a shared, explicit AI-assistance protocol, so future
model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[MM][Perf] Use `seq_lens` CPU cache to avoid frequent d2h copy for better performance (#6448)

Currently, the performance of multi-modal encoding (i.e.,
`AscendMMEncoderAttention` forward) is considerably bounded by heavy
host-side preprocessing operations.

As the profiling results below show, before the real attention
computation there are long idle periods on the device, which leads to
extremely low NPU utilization.

<img width="2264" height="1398" alt="iShot_2026-01-23_16 26 39"
src="https://github.com/user-attachments/assets/37f21d06-e526-4f28-82fe-005746cf13bd"
/>

---
**To optimize this, this PR proposes four changes:**

1. Use `seq_lens` CPU cache to avoid frequent d2h copy. Before this PR,
`AscendMMEncoderAttention` will copy the `cu_seqlens` from NPU to CPU in
every forward, since the op `_npu_flash_attention_unpad()` requires CPU
`cu_seqlens` (otherwise it will crash). Thus, we use
`seq_lens_cpu_cache` to cache this tensor, since it's shared between all
layers but may change between forward steps. When the current
`layer_index` is `0`, we update the cache; otherwise we directly use the
cache to avoid frequent `diff` and `copy` operations, which are costly.
2. Pre-compute the scale value to avoid calculating it in every forward.
3. Move the judgment of `enable_pad` from forward to the `__init__`
method.
4. Revert https://github.com/vllm-project/vllm-ascend/pull/6204.
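A minimal sketch of the layer-0 caching pattern from change 1 (the class and field names are illustrative, and the device-to-host tensor copy is simulated with a plain list):

```python
class SeqLensCache:
    """Cache the host copy of cu_seqlens; refresh it only on layer 0."""
    def __init__(self):
        self._cpu_cache = None
        self.d2h_copies = 0  # track how often the expensive copy runs

    def get(self, layer_index, cu_seqlens_device):
        if layer_index == 0 or self._cpu_cache is None:
            # Simulated device-to-host copy: paid once per forward step,
            # then shared by every subsequent layer.
            self._cpu_cache = list(cu_seqlens_device)
            self.d2h_copies += 1
        return self._cpu_cache

cache = SeqLensCache()
device_tensor = [0, 4, 9, 15]  # stands in for the NPU cu_seqlens tensor
for layer in range(32):        # one forward step over 32 layers
    cache.get(layer, device_tensor)
print(cache.d2h_copies)  # 1: only layer 0 paid for the copy
```

Without the cache, every layer would perform its own copy (32 per forward step here), which is the d2h traffic the PR eliminates.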

**Performance after these optimizations:**

- **TTFT** has been reduced by **7.43%** ⬇️.
- **Throughput** has been increased by **1.23%** ⬆️.

---
> [!NOTE]
> This PR requires https://github.com/vllm-project/vllm/pull/33674 to be
merged.

---
No.

Launch the server:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--no-async-scheduling
```

Run benchmark:

```bash
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 500 \
--request-rate 10 \
--burstiness 5 \
--no-stream
```

Before this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500
Failed requests:                         0
Request rate configured (RPS):           10.00
Benchmark duration (s):                  82.23
Total input tokens:                      33418
Total generated tokens:                  61543
Request throughput (req/s):              6.08
Output token throughput (tok/s):         748.45
Peak output token throughput (tok/s):    3203.00
Peak concurrent requests:                402.00
Total token throughput (tok/s):          1154.86
---------------Time to First Token----------------
Mean TTFT (ms):                          10275.37
Median TTFT (ms):                        6297.88
P99 TTFT (ms):                           22918.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          263.02
Median TPOT (ms):                        277.61
P99 TPOT (ms):                           483.56
---------------Inter-token Latency----------------
Mean ITL (ms):                           257.31
Median ITL (ms):                         94.83
P99 ITL (ms):                            1773.90
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500
Failed requests:                         0
Request rate configured (RPS):           10.00
Benchmark duration (s):                  81.20
Total input tokens:                      33418
Total generated tokens:                  61509
Request throughput (req/s):              6.16
Output token throughput (tok/s):         757.54
Peak output token throughput (tok/s):    2562.00
Peak concurrent requests:                395.00
Total token throughput (tok/s):          1169.11
---------------Time to First Token----------------
Mean TTFT (ms):                          9511.91
Median TTFT (ms):                        5479.78
P99 TTFT (ms):                           21427.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          261.12
Median TPOT (ms):                        276.03
P99 TPOT (ms):                           446.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           254.04
Median ITL (ms):                         97.71
P99 ITL (ms):                            1516.67
==================================================
```

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/dc917cceb877dfd13f98c538c4c96158047d98bd

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Refactor] Modify the binding logic, added memory migration and interrupt core binding functions. (#6785)

[Refactor] Modify the binding logic; add memory migration and
interrupt core binding functions.

Controls the use of memory on a closer NUMA node to achieve lower
memory access latency, while binding interrupts to different CPU cores
to prevent them from interrupting the inference process.

No

https://github.com/vllm-project/vllm-ascend/pull/6785/changes/b8eaaa073bc99e3a25e31c16e87bbd4acd6377eb

Signed-off-by: rowzwel_dx <1392851715@qq.com>

Signed-off-by: Rozwel-dx <1392851715@qq.com>
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: Rozwel-dx <1392851715@qq.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Feat] Support routing replay (#6696)

[Feat] Support routing replay
Same as https://github.com/vllm-project/vllm-ascend/pull/6666;
resubmitted because of a DOC failure.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: liyongwen <1310439159@qq.com>
Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[CI] Fix EAGLE CI problems (#6702)

The new FIA operator requires queryT to equal the last element of
actualSequenceLengthQ.

No.

Passed existing test (test_mtp_eagle_correctness.py).

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: Wangbingjie <wangbj1207@126.com>
Signed-off-by: Wangbingjie <w30061490@china.huawei.com>
Co-authored-by: Wangbingjie <w30061490@china.huawei.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

Fix glm4.7 hidden_states and positions shape mismatch

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)

This version has no divisibility constraint between tp and mtp+1.
However, cudagraph_capture_sizes must be a common multiple of tp and
mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed
cudagraph_capture_sizes.
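The constraint on capture sizes can be checked in a few lines (the tp and mtp values are illustrative, not the test's actual configuration):

```python
import math

tp, mtp = 4, 1
step = math.lcm(tp, mtp + 1)   # each capture size must be a common multiple
limit = tp * (mtp + 1)         # maximum allowed capture size

capture_sizes = list(range(step, limit + 1, step))
print(capture_sizes)  # [4, 8] for tp=4, mtp=1
```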

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

We also add an aime2025 test to the dual-node DeepSeek 3.2 nightly test.

Tested in the nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Feature][Quant] Auto-detect quantization format from model files (#6645)

- Add automatic quantization format detection, eliminating the need to
manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files
(`quant_model_description.json` and `config.json`) at engine
initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected;
auto-detection only applies when the flag is omitted.

**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"`
(ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` →
`quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior

**Technical approach:**
Hooked into `NPUPlatform.check_and_update_config()` to run detection
after `VllmConfig.__post_init__`. Since `quant_config` is already `None`
at that point, we explicitly recreate it via
`VllmConfig._get_quantization_config()` to trigger the full quantization
initialization pipeline.

| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added
`detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in
`check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error
handling for weight loading |
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d7e17aaacd5ed1b4b4be6bcfef3a1b7cbc84fc9a

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>

[Misc] Drop patch_rope.py (#6291)

Part of #5304.

We have aligned with vLLM's latest change to `RotaryEmbeddingBase`; this
patch is no longer needed.

- vLLM version: v0.14.1
- vLLM main:
https://github.com/vllm-project/vllm/commit/dc917cceb877dfd13f98c538c4c96158047d98bd

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[BugFix] [310p] Fix attention accuracy issue (#6803)

This pull request resolves an attention accuracy issue by enhancing the
AttentionMaskBuilder310 to correctly handle the maximum model length.
The change ensures that the attention mask generation process is
properly parameterized by the model's configuration, rather than relying
on a fixed internal value. This leads to more accurate attention mask
creation, which is crucial for the correct functioning of the attention
mechanism.
Update fused_moe to main branch.
No
Qwen3 dense mode & moe model e2e test
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

[Doc][Misc] Refactor skill documentation and add Claude support instructions (#6817)

This PR refactors the documentation for vLLM Ascend skills.
- It renames and moves the `vllm-ascend-model-adapter` skill's README to
serve as a new top-level README for the `.agents` directory.
- It adds instructions on how to use the Ascend skills with Claude,
including a new README in the `.claude` directory.
- It updates `.gitignore` to exclude skills copied for Claude's use.
- Add main2main skill

This improves the documentation structure, making it more organized and
providing clear instructions for developers using these skills with
different tools.

No, this PR contains only documentation and repository configuration
changes. It does not affect any user-facing code functionality.

These changes are documentation-only and do not require specific
testing. The correctness of the instructions is being verified through
this review.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[Patch][Misc] Cleanup and update patches (#6802)

This PR performs a cleanup and update of the patch mechanism in
`vllm-ascend`.

- Removes several obsolete patches: `patch_deepseek.py`.
- Updates the central patch documentation in
`vllm_ascend/patch/__init__.py` to reflect these removals and additions,
re-numbering and re-organizing the patch list for better clarity.

No. These are internal changes to the patching mechanism and should not
affect users.

CI passed with new added/existing test.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (#5472)

**BUG**
When using prefill-decode disaggregation + MTP + full graph
+ asynchronous scheduling, the KV cache pulled by decode nodes from
prefill nodes does not include spec tokens. As a result, the
total_num_scheduled_tokens obtained by decode nodes from the scheduler
lacks spec tokens. When determining whether to enqueue the full graph on
decode nodes, the uniform_decode condition `
scheduler_output.total_num_scheduled_tokens == self.input_batch.num_reqs
* max_query_len` is not met, so the current instance is not
enqueued into the full graph.

The above situation leads to both full graph and eagle mode instances
coexisting in the decode instances. Due to the synchronization wait of
MoeDispatch, the decode instances in full graph are significantly slowed
down by the instance in eagle mode.

**Solution**
The scenario is PD separation + MTP + Full Graph + asynchronous
scheduling.
On the decode nodes, the spec tokens of requests with KV cache from P
need to be padded. Then, the padded spec tokens will be rejected by
sampling. This operation ensures that the uniform_decode condition is
satisfied when determining whether decode nodes are included in the full
graph, thereby guaranteeing that all decode instances are present in the
full graph and avoiding synchronous waiting for MoeDispatch.
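The uniform_decode condition described above can be sketched with a toy example (a minimal illustration with simplified stand-in names, not the actual vLLM internals):

```python
# Illustration of the uniform_decode condition from the description above.
# All names here are simplified stand-ins for the real vLLM internals.

def is_uniform_decode(total_num_scheduled_tokens: int,
                      num_reqs: int,
                      max_query_len: int) -> bool:
    # A batch is "uniform decode" (and thus eligible for the full graph)
    # only when every request contributes exactly max_query_len tokens.
    return total_num_scheduled_tokens == num_reqs * max_query_len

# Decode node with MTP: max_query_len = 1 + num_spec_tokens = 3.
num_reqs, max_query_len = 4, 3

# KV cache pulled from the prefill node lacks the spec tokens,
# so each request schedules only 1 token instead of 3.
without_padding = num_reqs * 1
print(is_uniform_decode(without_padding, num_reqs, max_query_len))  # False

# After padding spec tokens (later rejected by sampling), the condition
# holds and the instance enters the full graph.
with_padding = num_reqs * max_query_len
print(is_uniform_decode(with_padding, num_reqs, max_query_len))  # True
```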

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/5326c89803566a131c928f7fdd2100b75c981a42

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>

[Doc][Release] Add release note skill (#6824)

This PR adds the release note skills:
- `SKILL.md`: vLLM Ascend Releasing Note Writer
- `references/ref-past-release-notes-highlight.md`:
It also adds an `output/v0.13.0` example, which was used by
https://github.com/vllm-project/vllm-ascend/commit/2da476d82f048816095794a9c0ac45126dc251af

Inspired: https://github.com/simon-mo/release-notes-writing/

No

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Co-authored-by: esmeetu <jasonailu87@gmail.com>

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

[Feat]support sequence parallelism by pass for VL models (#5632)

[CI] Fix doc test fail when load model with error information: 'Stale file handle' (#6832)

This PR fixes a `Stale file handle` error that occurs during doctests in
the CI environment. The error appears when loading models from
ModelScope, likely due to issues with network file systems used in CI.

The fix involves setting the `MODELSCOPE_HUB_FILE_LOCK` environment
variable to `false` in the `run_doctests.sh` script. This disables file
locking in the ModelScope hub, which is a common workaround for this
type of file system error.

No, this change only affects the CI test execution environment and has
no impact on users.

This change is validated by the CI pipeline. A successful run of the
doctests indicates that the fix is effective.

Signed-off-by: leo-pony <nengjunma@outlook.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)

This version has no divisibility constraint between tp and mtp+1.
However, cudagraph_capture_sizes must be a common multiple of tp and
mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed
cudagraph_capture_sizes.
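The constraint above can be sketched as follows (a hypothetical helper for illustration, not the actual test code): valid capture sizes are common multiples of tp and mtp+1, capped at tp * (mtp + 1).

```python
import math

def valid_capture_sizes(tp: int, mtp: int) -> list[int]:
    # Sizes must be common multiples of tp and (mtp + 1),
    # with a maximum of tp * (mtp + 1).
    step = math.lcm(tp, mtp + 1)
    upper = tp * (mtp + 1)
    return list(range(step, upper + 1, step))

# tp=8 with 1 speculative token (mtp=1): lcm(8, 2) = 8, cap 16
print(valid_capture_sizes(8, 1))  # [8, 16]
# tp=4, mtp=2: lcm(4, 3) = 12, cap 12
print(valid_capture_sizes(4, 2))  # [12]
```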

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

and we also add aime2025 test for dual-node deepseek 3.2 nightly test.

test at nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Feature][Quant] Auto-detect quantization format from model files (#6645)

- Add automatic quantization format detection, eliminating the need to
manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files
(`quant_model_description.json` and `config.json`) at engine
initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected;
auto-detection only applies when the flag is omitted.

**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"`
(ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` →
`quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior

**Technical approach:**
Hooked into `NPUPlatform.check_and_update_config()` to run detection
after `VllmConfig.__post_init__`. Since `quant_config` is already `None`
at that point, we explicitly recreate it via
`VllmConfig._get_quantization_config()` to trigger the full quantization
initialization pipeline.

| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added
`detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in
`check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error
handling for weight loading |
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d7e17aaacd5ed1b4b4be6bcfef3a1b7cbc84fc9a

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>

[Misc] Drop patch_rope.py (#6291)

Part of #5304.

We have aligned with vLLM's latest change for `RotaryEmbeddingBase`, so
this patch is no longer needed.

- vLLM version: v0.14.1
- vLLM main:
https://github.com/vllm-project/vllm/commit/dc917cceb877dfd13f98c538c4c96158047d98bd

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[Doc] fix the nit in docs (#6826)

Refresh the doc, fix the nit in the docs

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

add release note for 0.15.0rc1 (#6839)

Add release note for 0.15.0rc1

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[DOC] enable both flashcomm1 and cudagraph (#6807)

This PR updates the DeepSeek-V3.2 documentation to include the latest
performance optimizations and configuration improvements.

- **Enable FlashComm1**: Added `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`
environment variable across all deployment scenarios to enable
FlashComm1 for improved communication performance
- **Layer Sharding**: Added `--additional-config '{"layer_sharding":
["q_b_proj", "o_proj"]}'` configuration to enable layer sharding for
better memory distribution
- **CUDA Graph Optimization**: Updated cudagraph capture sizes from
`[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]` to `[8, 16, 24, 32, 40,
48]`
- **Speculative Decoding**: Increased `num_speculative_tokens` from 2 to
3
- **Documentation Links**: Fixed request forwarding documentation to use
proper GitHub repository links

Yes, users can now follow the updated documentation to enable FlashComm1
and layer sharding for improved DeepSeek-V3.2 performance.

Existing documentation examples have been validated to ensure
configuration consistency across all deployment scenarios.

---

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Main2Main] Upgrade vLLM to 0226 (#6813)

Breaking:
1. https://github.com/vllm-project/vllm/pull/33452
2. https://github.com/vllm-project/vllm/pull/33451
3. https://github.com/vllm-project/vllm/pull/32567
4. https://github.com/vllm-project/vllm/pull/32344

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: MrZ20 <2609716663@qq.com>

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface  (#6811)

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into
eagle_proposer.py

This pull request significantly refactors the speculative decoding
mechanism by merging Parallel Context Processing (PCP) and Multi-Token
Prediction (MTP) functionalities directly into the eagle_proposer.py.
The changes aim to enhance the efficiency and correctness of distributed
speculative decoding, particularly by enabling the Eagle feature to work
seamlessly with the disable_padded interface. This involves detailed
adjustments to attention metadata, input/output processing, and state
management to ensure proper operation in parallel environments.

1. The PCP and MTP features are migrated into eagle_proposer.py
2. The Eagle and PCP features are integrated
3. The Eagle feature is enabled to use the disable_padded interface

No

Tests and UT

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>

[CI] Add nightly test for Qwen3-235B-A22B with  mooncake layerwise connector (#5441)

Add nightly test for Qwen3-235B-A22B with  mooncake layerwise connector.

- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/81786c87748b0177111dfdc07af5351d8389baa1

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>

[Doc][Misc] Update release notes for v0.15.0rc1 (#6859)

This PR updates the release notes for `v0.15.0rc1` to:
- Mark the `310P MoE and W8A8 Support` feature as experimental.
- Add a note for `Kimi-K2.5 Model Support` clarifying that it has known
issues in vLLM 0.15.0 and requires manual patching to work correctly.

No, this is a documentation-only update.

N/A (documentation change).

- vLLM version: v0.16.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[bugfix] Fixed an accuracy problem of gdn layer in graph (#6822)

There will be random outputs if we run a model with GDN attention in
graph mode:

```python
prompts = [
    "1. Who are you?",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=5)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
            tensor_parallel_size=4,

            distributed_executor_backend="mp",
            gpu_memory_utilization=0.7,
            speculative_config={
                "method": "qwen3_next_mtp",
                "num_speculative_tokens": 3,
            },

            compilation_config={
                "cudagraph_mode": "FULL_DECODE_ONLY",
                "cudagraph_capture_sizes": [8],
            },

            max_model_len=4096,
            enable_prefix_caching=False)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{output.prompt_token_ids=}")
    print(f"{output.outputs[0].token_ids=}")
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Before applying this change, the output was:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 323, 279, 1112, 279]
Prompt: '1. Who are you?', Generated text: ' What and the... the'
```

After applying this change, the output is:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 374, 697, 829, 30]
Prompt: '1. Who are you?', Generated text: ' What is your name?'
```

**Why does this change solve the problem?**

Currently, `query_start_loc` is padded because of `fia`.

For `gdn-attention`, however, the padded version of `query_start_loc`
causes an accuracy problem.

So we introduce an unpadded version of `query_start_loc` named
`gdn_query_start_loc` and use it in `gdn-attention`, which works fine.
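The difference between the padded and unpadded `query_start_loc` can be illustrated with a toy example (simplified; the actual values are device tensors, and the helper name here is hypothetical):

```python
import itertools

def cumulative_starts(tokens_per_req: list[int]) -> list[int]:
    # query_start_loc is the exclusive prefix sum of per-request query
    # lengths: starts[i] is the offset where request i's tokens begin.
    return [0] + list(itertools.accumulate(tokens_per_req))

# Two decode requests, each with 1 real token scheduled.
actual = [1, 1]
# Padded for fused attention: each slot padded up to max_query_len = 2.
padded = [2, 2]

print(cumulative_starts(actual))  # [0, 1, 2]  -> gdn_query_start_loc
print(cumulative_starts(padded))  # [0, 2, 4]  -> padded query_start_loc

# GDN attention must index real tokens only, so it needs the unpadded
# version; using the padded offsets reads padding positions instead.
```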

N/A

As described above.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: drslark <slarksblood@qq.com>

[CI] Refactor to speedup image building and CI Installation (#6708)

1. Refactor image workflow using cache-from to speed up builds

![build](https://github.com/user-attachments/assets/02135c12-0069-44f8-a3ec-5c2b4282448a)

Simultaneously refactored all Dockerfiles by placing layers that rarely
change before those that change frequently, improving build cache hit
rate.

2. Refactor E2E tests to use vllm-ascend container images, skipping C
compilation when no C code has changed

![e2e](https://github.com/user-attachments/assets/49f5b166-0df3-41e1-8f71-b3bbbed17cfd)

In this case, the job will only replace the source code of vllm-ascend
and install `requirements-dev.txt`, saving about 10min before tests

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: wjunLu <wjunlu217@gmail.com>

clean 0.15.0 support (#6852)

Clean up vllm 0.15.0 related code

- vLLM version: v0.16.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)

This version has no divisibility constraint between tp and mtp+1.
However, cudagraph_capture_sizes must be a common multiple of tp and
mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed
cudagraph_capture_sizes.

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

and we also add aime2025 test for dual-node deepseek 3.2 nightly test.

test at nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Misc] Drop patch_rope.py (#6291)

Part of #5304.

We have align with vLLM's latest change for `RotaryEmbeddingBase`. Don't
need this patch anymore.

- vLLM version: v0.14.1
- vLLM main:
https://github.com/vllm-project/vllm/commit/dc917cceb877dfd13f98c538c4c96158047d98bd

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[BugFix] [310p] Fix attention accuracy issue (#6803)

This pull request resolves an attention accuracy issue by enhancing the
AttentionMaskBuilder310 to correctly handle the maximum model length.
The change ensures that the attention mask generation process is
properly parameterized by the model's configuration, rather than relying
on a fixed internal value. This leads to more accurate attention mask
creation, which is crucial for the correct functioning of the attention
mechanism.
Update fused_moe to main branch.
No
Qwen3 dense mode & moe model e2e test
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

[Doc][Misc] Refactor skill documentation and add Claude support instructions (#6817)

This PR refactors the documentation for vLLM Ascend skills.
- It renames and moves the `vllm-ascend-model-adapter` skill's README to
serve as a new top-level README for the `.agents` directory.
- It adds instructions on how to use the Ascend skills with Claude,
including a new README in the `.claude` directory.
- It updates `.gitignore` to exclude skills copied for Claude's use.
- Add main2main skill

This improves the documentation structure, making it more organized and
providing clear instructions for developers using these skills with
different tools.

No, this PR contains only documentation and repository configuration
changes. It does not affect any user-facing code functionality.

These changes are documentation-only and do not require specific
testing. The correctness of the instructions is being verified through
this review.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[Patch][Misc] Cleanup and update patches (#6802)

This PR performs a cleanup and update of the patch mechanism in
`vllm-ascend`.

- Removes several obsolete patches: `patch_deepseek.py`.
- Updates the central patch documentation in
`vllm_ascend/patch/__init__.py` to reflect these removals and additions,
re-numbering and re-organizing the patch list for better clarity.

No. These are internal changes to the patching mechanism and should not
affect users.

CI passed with new added/existing test.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (#5472)

**BUG**
When using prefill-decode disaggregation + MTP + full graph
+asynchronous scheduling, the KV cache pulled by decode nodes from
prefill decodes does not include spec tokens. As a result, the
total_num_scheduled_tokens obtained by decode nodes from the scheduler
lacks spec tokens. When determining whether to enqueue the full graph on
decode nodes, the condition for uniform_decode `
scheduler_output.total_num_scheduled_tokens == self.input_batch.num_reqs
* max_query_len` is not met, leading to the current instance not being
enqueued into the full graph.

The above situation leads to both full graph and eagle mode instances
coexisting in the decode instances. Due to the synchronization wait of
MoeDispatch, the decode instances in full graph are significantly slowed
down by the instance in eagle mode.

**Solution**
The scenario is PD separation + MTP + Full Graph + asynchronous
scheduling.
On the decode nodes, the spec tokens of the request with KV cache from P
need be padded. Then, the padded spec tokens will be rejected by
sampling. This operation ensures that the uniform_decode condition is
satisfied when determining whether decode nodes are included in the full
graph, thereby guaranteeing that all decode instances are present in the
full graph and avoiding synchronous waiting for MoeDispatch.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/5326c89803566a131c928f7fdd2100b75c981a42

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>

[Doc][Release] Add release note skill (#6824)

This PR adds the releaseing note skills:
- `SKILL.md`: vLLM Ascend Releasing Note Writer
- `references/ref-past-release-notes-highlight.md`:
And also add a `output/v0.13.0` examples which was used by
https://github.com/vllm-project/vllm-ascend/commit/2da476d82f048816095794a9c0ac45126dc251af

Inspired: https://github.com/simon-mo/release-notes-writing/

No

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Co-authored-by: esmeetu <jasonailu87@gmail.com>

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

[Feat]support sequence parallelism by pass for VL models (#5632)

[CI] Fix doc test fail when load model with error information: 'Stale file handle' (#6832)

This PR fixes a `Stale file handle` error that occurs during doctests in
the CI environment. The error appears when loading models from
ModelScope, likely due to issues with network file systems used in CI.

The fix involves setting the `MODELSCOPE_HUB_FILE_LOCK` environment
variable to `false` in the `run_doctests.sh` script. This disables file
locking in the ModelScope hub, which is a common workaround for this
type of file system error.

No, this change only affects the CI test execution environment and has
no impact on users.

This change is validated by the CI pipeline. A successful run of the
doctests indicates that the fix is effective.

Signed-off-by: leo-pony <nengjunma@outlook.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[Doc] fix the nit in docs (#6826)

Refresh the doc, fix the nit in the docs

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

add release note for 0.15.0rc1 (#6839)

Add release note for 0.15.0rc1

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[DOC] enable both flashcomm1 and cudagraph (#6807)

This PR updates the DeepSeek-V3.2 documentation to include the latest
performance optimizations and configuration improvements.

- **Enable FlashComm1**: Added `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`
environment variable across all deployment scenarios to enable
FlashComm1 for improved communication performance
- **Layer Sharding**: Added `--additional-config '{"layer_sharding":
["q_b_proj", "o_proj"]}'` configuration to enable layer sharding for
better memory distribution
- **CUDA Graph Optimization**: Updated cudagraph capture sizes from
`[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]` to `[8, 16, 24, 32, 40,
48]`
- **Speculative Decoding**: Increased `num_speculative_tokens` from 2 to
3
- **Documentation Links**: Fixed request forwarding documentation to use
proper GitHub repository links

Yes, users can now follow the updated documentation to enable FlashComm1
and layer sharding for improved DeepSeek-V3.2 performance.

Existing documentation examples have been validated to ensure
configuration consistency across all deployment scenarios.

---

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Main2Main] Upgrade vLLM to 0226 (#6813)

Breaking:
1. https://github.com/vllm-project/vllm/pull/33452
2. https://github.com/vllm-project/vllm/pull/33451
3. https://github.com/vllm-project/vllm/pull/32567
4. https://github.com/vllm-project/vllm/pull/32344

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: MrZ20 <2609716663@qq.com>

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface  (#6811)

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into
eagle_proposer.py

This pull request significantly refactors the speculative decoding
mechanism by merging Parallel Context Processing (PCP) and Multi-Token
Prediction (MTP) functionalities directly into the eagle_proposer.py.
The changes aim to enhance the efficiency and correctness of distributed
speculative decoding, particularly by enabling the Eagle feature to work
seamlessly with the disable_padded interface. This involves detailed
adjustments to attention metadata, input/output processing, and state
management to ensure proper operation in parallel environments.

1. The PCP and MTP features are migrated to the eagle_proposer.py
2. The Eagle and PCP features are integrated
3. Enable the eagle feature to use the disable_padded interface

No

Tests and UT

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>

[CI] Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector (#5441)

Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector.

- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/81786c87748b0177…
The basic configs are extracted and reused for eplb UT. This is done so
that if the basic configs are changed later, eplb UT does not need to be
modified repeatedly.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: bigsir007 <xujiacheng12@huawei.com>
Co-authored-by: bigsir007 <xujiacheng12@huawei.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[CI]Fixed the spell check function in `typos.toml` (#6753)

The incorrect regular expression `.*[UE4M3|ue4m3].*` uses a character
class, so it actually ignores every word containing any single one of the
characters `U`, `E`, `4`, `M`, `3`, `|`, `u`, `e`, or `m`.

```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
===fix===>
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*(UE4M3|ue4m3).*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
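The difference can be checked directly with Python's `re` module: a bracketed character class matches any single listed character, while a parenthesized alternation matches the whole token (note the corrected group must be `(UE4M3|ue4m3)` with no stray bracket):

```python
import re

# Broken: [UE4M3|ue4m3] is a character class, matching any identifier that
# contains a single 'U', 'E', '4', 'M', '3', '|', 'u', 'e', or 'm'.
broken = re.compile(r".*[UE4M3|ue4m3].*")
# Fixed: (UE4M3|ue4m3) is an alternation group, matching only identifiers
# that contain the whole token.
fixed = re.compile(r".*(UE4M3|ue4m3).*")

assert broken.fullmatch("number")         # wrongly ignored by the old pattern
assert not fixed.fullmatch("number")      # spell-checked again after the fix
assert fixed.fullmatch("scale_ue4m3_fn")  # the intended identifiers still pass
```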

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: MrZ20 <2609716663@qq.com>

[Doc] modify glm doc (#6770)

1. add a description of another version of the glm5-w4a8 weight
2. update the installation introduction
3. introduce a script to enable bf16 MTP
N/A
N/A
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: yydyzr <liuyuncong1@huawei.com>

[CI] unlock when load model (#6771)

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: leo-pony <nengjunma@outlook.com>

Refactor the ops PyTorch adapter,cleanup for csrc/torch_binding.cpp (#6732)

Refactor the ops PyTorch adapter,cleanup for csrc/torch_binding.cpp,
more details see
https://github.com/vllm-project/vllm-ascend/issues/6486

No

install the new package to test the new modification, here is the
result:

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>

[EPLB][Bugfix] Bugfix for ineffective dynamic eplb (#6653)

Only end-to-end precision is monitored in the UT, and no log is printed
at the key places, so an ineffective eplb is neither detected nor
intercepted.
1. The forward_before function is added back.
2. Unnecessary logs are deleted and key logs are added.
3. Warm-up for algorithm 3 is added.

![Snipaste_2026-02-10_15-57-31](https://github.com/user-attachments/assets/03813e5f-3d19-42d8-8118-76223afe8298)

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/13397841ab469cecf1ed425c3f52a9ffc38139b5

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>

[Bugfix] Fix wrong computed_tokens when meet exception. (#6522)

Fix wrong computed_tokens when meet exception. This pull request
addresses a bug in the KV transfer mechanism where an exception during
token lookup operations could lead to an incorrect count of
computed_tokens. By modifying the exception handling in both the lookup
and lookup_scheduler functions to return 0 instead of the start index,
the system now correctly indicates that no tokens were successfully
processed when a remote connection failure occurs. This enhancement
improves the robustness and accuracy of token management within the
vllm_ascend distributed KV pool.
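A minimal sketch of the corrected behavior (`lookup`, `count_matched_tokens`, and `FlakyPool` are illustrative names, not the actual vllm-ascend signatures): on a remote connection failure, returning 0 reports that no tokens were fetched rather than the stale start index.

```python
class RemoteConnectionError(Exception):
    """Illustrative stand-in for a KV-pool connection failure."""

class FlakyPool:
    def count_matched_tokens(self, block_hashes, start_index):
        raise RemoteConnectionError

def lookup(kv_pool, block_hashes, start_index):
    """Return how many tokens have KV blocks available in the remote pool."""
    try:
        hit = kv_pool.count_matched_tokens(block_hashes, start_index)
        return start_index + hit
    except RemoteConnectionError:
        # Before the fix this returned start_index, over-reporting
        # computed_tokens; returning 0 says "nothing was fetched".
        return 0

assert lookup(FlakyPool(), ["h0", "h1"], start_index=128) == 0
```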

NO.

Signed-off-by: xleoken <xleoken@163.com>

[Lint]Style: Convert `test/` to ruff format(Batch #5) (#6747)

| File Path |
| :--- |
| `tests/e2e/singlecard/compile/backend.py` |
| `tests/e2e/singlecard/compile/test_graphex_norm_quant_fusion.py` |
| `tests/e2e/singlecard/compile/test_graphex_qknorm_rope_fusion.py` |
| `tests/e2e/singlecard/compile/test_norm_quant_fusion.py` |
| `tests/e2e/singlecard/model_runner_v2/test_basic.py` |
| `tests/e2e/singlecard/test_aclgraph_accuracy.py` |
| `tests/e2e/singlecard/test_aclgraph_batch_invariant.py` |
| `tests/e2e/singlecard/test_aclgraph_mem.py` |
| `tests/e2e/singlecard/test_async_scheduling.py` |
| `tests/e2e/singlecard/test_auto_fit_max_mode_len.py` |
| `tests/e2e/singlecard/test_batch_invariant.py` |
| `tests/e2e/singlecard/test_camem.py` |
| `tests/e2e/singlecard/test_completion_with_prompt_embeds.py` |
| `tests/e2e/singlecard/test_cpu_offloading.py` |
| `tests/e2e/singlecard/test_guided_decoding.py` |
| `tests/e2e/singlecard/test_ilama_lora.py` |
| `tests/e2e/singlecard/test_llama32_lora.py` |
| `tests/e2e/singlecard/test_models.py` |
| `tests/e2e/singlecard/test_multistream_overlap_shared_expert.py` |
| `tests/e2e/singlecard/test_quantization.py` |
| `tests/e2e/singlecard/test_qwen3_multi_loras.py` |
| `tests/e2e/singlecard/test_sampler.py` |
| `tests/e2e/singlecard/test_vlm.py` |
| `tests/e2e/singlecard/test_xlite.py` |
| `tests/e2e/singlecard/utils.py` |

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: MrZ20 <2609716663@qq.com>

[Feat] 310p supports PrefillCacheHit State (#6756)

This PR extends the Ascend 310P attention backend to support the
`PrefillCacheHit` state. Previously, only `PrefillNoCache`,
`DecodeOnly`, and `ChunkedPrefill` were supported.
This PR handles this state by routing it to the existing
`forward_chunked_prefill_310` implementation, which is suitable for this
scenario.
The changes also include refactoring the main `forward_impl` dispatch
method for better clarity and updating unit tests to cover the new state
and ensure correctness.
No
Accuracy test when chunked prefill is disabled.
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

[main]update release note & support matrix (#6759)

Update release note & support matrix to add experimental tag for
features and models.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

0.13.0 branch: https://github.com/vllm-project/vllm-ascend/pull/6751

Signed-off-by: zzzzwwjj <1183291235@qq.com>

[EPLB] Reduce the memory used for heat aggregation (#6729)

If dist.all_gather is used directly, 2 x HCCL_BUFFSIZE memory will be
consumed, but the actual memory required for hotspot aggregation is less
than 1 MB. Therefore, a separate small communication domain is created
for it.

Original:

![1](https://github.com/user-attachments/assets/8880b461-c26f-497c-9a05-2ca60cc46aa4)
Current:

![2](https://github.com/user-attachments/assets/c9da32b5-9200-4fa2-aff9-d8c4978ac602)

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>

upgrade main to 0212 (#6712)

Fixes `transformers_utils/processors/__init__` import error, due to
https://github.com/vllm-project/vllm/pull/33247
Fixes Fused MoE break introduced by `MoERunner abstraction,` due to
https://github.com/vllm-project/vllm/pull/32344

> delete AscendMoERunner when
https://github.com/vllm-project/vllm/pull/35178 is merged

Fixes `Make Qwen3VL compatible with Transformers v5`, due to
https://github.com/vllm-project/vllm/pull/34262

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: wxsIcey <1790571317@qq.com>

[Feat]ds3.2 support pcp (#6733)

The ds3.2 model adaptation supports the PCP feature.

The solution is as follows: when saving the KV cache, first perform an
allgather operation on the KVs, and then each node saves its own copy.
When the attention or indexer performs calculations, it allgathers the
KV cache first and then computes.

No
02/12 23:05:10 - AISBench - INFO - Running 1-th replica of evaluation
02/12 23:05:10 - AISBench - INFO - Task [vllm-api-general-chat/gsm8k]:
{'accuracy': 96.35416666666667, 'type': 'GEN'}
02/12 23:05:10 - AISBench - INFO - time elapsed: 2.87s
02/12 23:05:12 - AISBench - INFO - Evaluation tasks completed.
02/12 23:05:12 - AISBench - INFO - Summarizing evaluation results...
dataset       version    metric    mode      vllm-api-general-chat
gsm8kdataset  -          accuracy  gen                       96.35

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

[Nightly] Increase VLLM_ENGINE_READY_TIMEOUT_S to avoid nightly failure (#6778)

After some observation, I found some cases failed due to timeouts, such as
https://github.com/vllm-project/vllm-ascend/actions/runs/22280996034/job/64487867977#step:9:921
and
https://github.com/vllm-project/vllm-ascend/actions/runs/22315540111/job/64574590762#step:9:1809.
This may be caused by the excessively long model loading time (currently we
are still loading weights from network storage), so it is necessary to
adjust the timeout from 600s to 1800s.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: wangli <wangli858794774@gmail.com>

[Platform] Enable ARM-only CPU binding with NUMA-balanced A3 policy and update docs/tests (#6686)

- Keeps enable_cpu_binding default on, but skips binding on non‑ARM CPUs
inside bind_cpus, with a clear log.
- Uses a table-driven binding policy: A3 uses NUMA‑balanced binding;
other device types use NUMA‑affinity binding.
- Updates docs to reflect the exact behavior and adds/updates unit tests
for the new logic.
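The table-driven policy above can be sketched as a plain mapping; `BINDING_MODE_TABLE`, `select_binding_mode`, and the mode names are illustrative stand-ins, not the actual helpers:

```python
import platform

# Device type -> CPU binding policy: A3 balances bindings across NUMA
# nodes; every other device type falls back to NUMA-affinity binding.
BINDING_MODE_TABLE = {"A3": "numa_balanced"}
DEFAULT_BINDING_MODE = "numa_affinity"

def is_arm_cpu() -> bool:
    return platform.machine().lower() in ("aarch64", "arm64")

def select_binding_mode(device_type: str):
    # enable_cpu_binding stays on by default, but binding is skipped
    # (None, with a log in the real code) on non-ARM hosts.
    if not is_arm_cpu():
        return None
    return BINDING_MODE_TABLE.get(device_type, DEFAULT_BINDING_MODE)
```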

- Yes. CPU binding is now enabled by default via additional_config, and
documented in the user guide.
- CPU binding behavior differs by device type (A3 vs. others).

Added/updated unit tests:

test_cpu_binding.py
1.   test_binding_mode_table covers A2 vs A3 binding mode mapping.
2. test_build_cpu_pools_fallback_to_numa_balanced covers fallback when
affinity info is missing.
3. TestBindingSwitch.test_is_arm_cpu covers ARM/x86/unknown arch
detection.
4.   test_bind_cpus_skip_non_arm covers non‑ARM skip path in bind_cpus.

test_worker_v1.py
1. Updated mocks for enable_cpu_binding default True to align with new
config default.

- vLLM version: v0.14.1
- vLLM main: d7de043

---------

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>

[KVPool][BugFix] Correctly initialize head_or_tp_rank for mooncake backend (#6498)

Fixes a problem where the local priority was not used on Mooncake nodes
in the A2 environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>

[Refactor][Bugfix] Use upstream `mem_utils` for profiling and correct non-torch memory recorded during profiling (#6625)

1. Following https://github.com/vllm-project/vllm/pull/32322, use the
`memory_profiling` context manager from vllm for profiling.
2. Fix wrong non-torch memory value recorded during profiling, which is
not its peak during inference.

---
**More details about point 2:**

After profiling, the non-torch memory value we recorded is lower than
that in real inference. This is mainly because of the different memory
management behaviour between `torch.cuda.empty_cache()` and
`torch.npu.empty_cache()`.

`torch.cuda.empty_cache()` only recycles the unused
memory in the PyTorch memory pool (i.e., memory managed by the PyTorch
caching allocator), **with no effect on non-torch memory**. However,
`torch.npu.empty_cache()` has a totally different memory management
mechanism: it may call `aclrtSynchronize` and **enable the Ascend
runtime to free up non-torch memory**.

Thus, the non-torch memory value we recorded after
`torch.npu.empty_cache()` is much lower than its peak during profiling.

Resolution:

We record the peak non-torch memory value
(`non_torch_memory_before_empty_cache`) after profiling, but before
`torch.npu.empty_cache()`. Then, we add the diff
(`non_torch_memory_cleared_by_empty_cache =
non_torch_memory_before_empty_cache - self.non_torch_memory`) to
non-torch memory when calculating available KV cache memory, which will
lead to less KV cache memory (i.e., it's safer to avoid OOM issues).
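Using the numbers quoted below (0.90 G recorded vs. a 1.08 G peak), the correction amounts to simple arithmetic; the device capacity and torch-managed peak here are made-up placeholders:

```python
GiB = 1024 ** 3

# Non-torch memory recorded after torch.npu.empty_cache() (post-profiling),
# and its peak recorded *before* empty_cache() let the runtime free it.
non_torch_memory = 0.90 * GiB
non_torch_memory_before_empty_cache = 1.08 * GiB

# The amount the Ascend runtime freed; adding it back makes the KV cache
# budget reflect the true inference-time peak (safer against OOM).
non_torch_memory_cleared_by_empty_cache = (
    non_torch_memory_before_empty_cache - non_torch_memory
)

total_memory = 60 * GiB  # illustrative device capacity, not a real value
torch_peak = 40 * GiB    # illustrative torch-managed memory peak
available_kv_cache = total_memory - torch_peak - (
    non_torch_memory + non_torch_memory_cleared_by_empty_cache
)

print(round(non_torch_memory_cleared_by_empty_cache / 1024 ** 2))  # ~184 MiB
```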

---
> [!NOTE]
> This PR needs to wait for main2main aligning to latest vllm commit
before merging.

no.

Before this PR, the non-torch memory we used to calculate available KV
cache memory is **0.90 G**, whereas its peak during real inference is
**1.08 G**, diff: **182.00 M**.

After this PR, we add this diff to non-torch memory after profiling and
thus make the profiling results more accurate.
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d7e17aaacd5ed1b4b4be6bcfef3a1b7cbc84fc9a

---------

Signed-off-by: shen-shanshan <467638484@qq.com>

[Bugfix] Add the missing parentheses to @torch.inference_mode (#6757)

This PR fixes a bug in `vllm_ascend/worker/model_runner_v1.py` where the
`@torch.inference_mode` decorator was used without parentheses. Using
the decorator without instantiation is deprecated and may not correctly
disable gradient calculations, leading to performance degradation and
increased memory usage during inference. This change adds the required
parentheses to ensure `torch.inference_mode` is applied correctly.
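The failure mode is generic to decorator factories and can be reproduced without torch; the `mode` class below is a hypothetical stand-in for `torch.inference_mode`:

```python
class mode:
    """A decorator factory, like torch.inference_mode: instantiate, then apply."""

    def __init__(self, enabled=True):
        self.enabled = enabled

    def __call__(self, fn):
        def wrapper(*args, **kwargs):
            # Stand-in for "run fn with the mode enabled".
            return ("mode-on", fn(*args, **kwargs))
        return wrapper

@mode()  # correct: instantiate the factory, then decorate
def good():
    return 42

@mode    # wrong: `bad` becomes a mode instance whose `enabled` is the function
def bad():
    return 42

assert good() == ("mode-on", 42)
assert isinstance(bad, mode)  # calling bad() raises TypeError; body never runs
```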

No.

The change is a minor syntax correction. Existing CI tests should cover
this.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[DOC] add request forwarding (#6780)

- New section: "Request Forwarding" documentation in
docs/source/tutorials/models/DeepSeek-V3.2.md
- Environment fix: Changed VLLM_ASCEND_ENABLE_FLASHCOMM1 from 0 to 1 in
the DeepSeek-V3 configuration examples

Documentation update only - provides new configuration guidance for
request forwarding setups

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[fix]change num_commmon_tokens to num_common_tokens (#6792)

change num_commmon_tokens to num_common_tokens in
vllm_ascend/_310p/model_runner_310p.py, which caused a CI test failure

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Co-authored-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Bugfix] Support Kimi-K2.5 models (#6755)

This PR supports the Kimi-K2.5 models on NPU with bf16 and w4a8
weights.
The corresponding PR in the vllm community has been merged:
https://github.com/vllm-project/vllm/pull/34501

- No.

We test the Kimi-K2.5 weights. The weights path:
https://modelscope.cn/models/Eco-Tech/Kimi-K2.5-W4A8
Successfully ran on a 910B NPU using vllm-ascend with the w4a8 weights.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: LoganJane <LoganJane73@hotmail.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Bugfix] fix bug for mtp (#6514)

fix(mtp): resolve MTP core bugs and enhance eager mode test cases
1. Resolved critical issues in eager mode MTP core execution logic;
2. Fixed functional bugs in the _update_states_after_model_execute
function;
3. Updated and released test_mtp_qwen3_next.py to validate eager mode
acceptance rate.
None

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: Bowen-Leee <caoshankuangren@gmail.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Bugfix] Fix DeepseekV3.1 Accuracy issue (#6805)

In order to adapt to the GLM model, logits were passed into the sampler,
which can cause accuracy issues in version 0.15.0.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: GDzhu01 <809721801@qq.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Doc][Feature] Add vLLM Ascend development guidelines AGENTS.md (#6797)

This PR adds a new document, `AGENTS.md`, which provides detailed
development guidelines for contributors to the vLLM Ascend project.
These guidelines cover code style, testing, NPU-specific considerations,
and the contribution process to ensure code quality and consistency.

No, this is a documentation-only update for developers.

This is a documentation change and does not require testing.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Doc][Skill] Introduce AI-assisted model-adaptation workflow for vllm-ascend (#6731)

This PR introduces the **first AI-assisted model-adaptation skill
package** for `vllm-ascend`.

The goal is to make model adaptation work (especially for recurring
feature-request issues) **repeatable, auditable, and easier to hand
off**.

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
-
`.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
-
`.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
-
`.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
-
`.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
- implementation roots: `/vllm-workspace/vllm` and
`/vllm-workspace/vllm-ascend`
- serving root: `/workspace`
- model path convention: `/models/<model-name>`

2. **Validation strategy**
- Stage A: fast `--load-format dummy` gate
- Stage B: mandatory real-weight gate before sign-off
- avoid false-ready by requiring request-level checks (not startup log
only)

3. **Feature-first verification checklist**
- ACLGraph / EP / flashcomm1 / MTP / multimodal
- explicit `supported / unsupported / not-applicable /
checkpoint-missing` outcomes

4. **Delivery contract**
- minimal scoped code changes
- required artifacts (Chinese report + runbook, e2e config YAML,
tutorial doc)
- one signed commit in delivery repo

- No runtime/kernel/model patch is included in this PR.
- No direct model support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up
PRs using this skill as the workflow baseline.

This gives the repo a shared, explicit AI-assistance protocol, so future
model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[MM][Perf] Use `seq_lens` CPU cache to avoid frequent d2h copy for better performance (#6448)

Currently, the performance of multi-modal encoding (i.e., the
`AscendMMEncoderAttention` forward) is considerably bounded by heavy
host pre-processing operations.

We can see from the profiling results below that, before the real
computation of Attention, there is a long idle period on the device,
which leads to extremely low NPU utilization.

<img width="2264" height="1398" alt="iShot_2026-01-23_16 26 39"
src="https://github.com/user-attachments/assets/37f21d06-e526-4f28-82fe-005746cf13bd"
/>

---
**To optimize this, this PR proposes four changes:**

1. Use a `seq_lens` CPU cache to avoid frequent d2h copies. Before this PR,
`AscendMMEncoderAttention` copied `cu_seqlens` from NPU to CPU on
every forward, since the op `_npu_flash_attention_unpad()` requires a CPU
`cu_seqlens` (otherwise it crashes). Thus, we use a
`seq_lens_cpu_cache` to cache this tensor, since it is shared between all
layers but may change across forward steps. When the current
`layer_index` is `0`, we update the cache; otherwise we directly use the
cache to avoid frequent `diff` and `copy` operations, which are costly.
2. Pre-compute the scale value to avoid calculating it in every forward.
3. Move the judgment of `enable_pad` from forward to the `__init__`
method.
4. Revert https://github.com/vllm-project/vllm-ascend/pull/6204.
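The cache-refresh rule in change 1 can be sketched without torch; `SeqLensCache`, `get_cpu_seq_lens`, and the copy counter are illustrative stand-ins, not the actual vllm-ascend API:

```python
class SeqLensCache:
    """Hypothetical sketch of the seq_lens CPU cache (not the real class)."""

    def __init__(self):
        self.seq_lens_cpu_cache = None
        self.d2h_copies = 0  # counter standing in for the NPU -> CPU copy

    def get_cpu_seq_lens(self, layer_index, cu_seqlens_device):
        # All layers in one forward step share cu_seqlens, so only layer 0
        # (or a cold cache) pays for the device-to-host copy.
        if layer_index == 0 or self.seq_lens_cpu_cache is None:
            self.d2h_copies += 1
            self.seq_lens_cpu_cache = list(cu_seqlens_device)
        return self.seq_lens_cpu_cache

cache = SeqLensCache()
cu_seqlens = [0, 7, 19]        # "device" tensor for this forward step
for layer_index in range(32):  # 32 attention layers, one forward step
    cache.get_cpu_seq_lens(layer_index, cu_seqlens)

assert cache.d2h_copies == 1   # one copy instead of 32
```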

**Performance after these optimizations:**

- **TTFT** has been reduced by **7.43%** ⬇️.
- **Throughput** has been increased by **1.23%** ⬆️.

---
> [!NOTE]
> This PR requires https://github.com/vllm-project/vllm/pull/33674 be
merged.

---
No.

Launch the server:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--no-async-scheduling
```

Run benchmark:

```bash
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 500 \
--request-rate 10 \
--burstiness 5 \
--no-stream
```

Before this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500
Failed requests:                         0
Request rate configured (RPS):           10.00
Benchmark duration (s):                  82.23
Total input tokens:                      33418
Total generated tokens:                  61543
Request throughput (req/s):              6.08
Output token throughput (tok/s):         748.45
Peak output token throughput (tok/s):    3203.00
Peak concurrent requests:                402.00
Total token throughput (tok/s):          1154.86
---------------Time to First Token----------------
Mean TTFT (ms):                          10275.37
Median TTFT (ms):                        6297.88
P99 TTFT (ms):                           22918.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          263.02
Median TPOT (ms):                        277.61
P99 TPOT (ms):                           483.56
---------------Inter-token Latency----------------
Mean ITL (ms):                           257.31
Median ITL (ms):                         94.83
P99 ITL (ms):                            1773.90
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500
Failed requests:                         0
Request rate configured (RPS):           10.00
Benchmark duration (s):                  81.20
Total input tokens:                      33418
Total generated tokens:                  61509
Request throughput (req/s):              6.16
Output token throughput (tok/s):         757.54
Peak output token throughput (tok/s):    2562.00
Peak concurrent requests:                395.00
Total token throughput (tok/s):          1169.11
---------------Time to First Token----------------
Mean TTFT (ms):                          9511.91
Median TTFT (ms):                        5479.78
P99 TTFT (ms):                           21427.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          261.12
Median TPOT (ms):                        276.03
P99 TPOT (ms):                           446.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           254.04
Median ITL (ms):                         97.71
P99 ITL (ms):                            1516.67
==================================================
```

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/dc917cceb877dfd13f98c538c4c96158047d98bd

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Refactor] Modify the binding logic, added memory migration and interrupt core binding functions. (#6785)

[Refactor] Modify the binding logic, added memory migration and
interrupt core binding functions.

Controls the use of memory on a closer NUMA node to achieve lower
memory access latency, while binding interrupts to different CPU cores
to prevent them from interrupting the inference process.

No

https://github.com/vllm-project/vllm-ascend/pull/6785/changes/b8eaaa073bc99e3a25e31c16e87bbd4acd6377eb

Signed-off-by: rowzwel_dx <1392851715@qq.com>

Signed-off-by: Rozwel-dx <1392851715@qq.com>
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: Rozwel-dx <1392851715@qq.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[Feat] Support routing replay (#6696)

[Feat] Support routing replay
Same as https://github.com/vllm-project/vllm-ascend/pull/6666;
resubmitted because of a DOC CI failure.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: liyongwen <1310439159@qq.com>
Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

[CI] Fix EAGLE CI problems (#6702)

The new FIA operator requires queryT to equal the last element of
actualSequenceLengthQ.

No.

Passed existing test (test_mtp_eagle_correctness.py).

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

---------

Signed-off-by: Wangbingjie <wangbj1207@126.com>
Signed-off-by: Wangbingjie <w30061490@china.huawei.com>
Co-authored-by: Wangbingjie <w30061490@china.huawei.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

fix glm4.7 hidden_states and positions shape mismatch

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)

This version has no divisibility constraint between tp and mtp+1.
However, cudagraph_capture_sizes must be a common multiple of tp and
mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed
cudagraph_capture_sizes.
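Under these constraints, the admissible sizes are the multiples of `lcm(tp, mtp + 1)` up to `tp * (mtp + 1)`; a small sketch (the parameter values are illustrative, not the nightly config):

```python
from math import lcm

def valid_capture_sizes(tp: int, mtp: int):
    """Capture sizes: common multiples of tp and mtp+1, capped at tp*(mtp+1)."""
    step = lcm(tp, mtp + 1)
    cap = tp * (mtp + 1)
    return list(range(step, cap + 1, step))

# tp=8, one MTP layer: sizes must be multiples of lcm(8, 2) = 8, up to 16.
assert valid_capture_sizes(8, 1) == [8, 16]
# tp=4, two MTP layers: multiples of lcm(4, 3) = 12, up to 12.
assert valid_capture_sizes(4, 2) == [12]
```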

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

and we also add aime2025 test for dual-node deepseek 3.2 nightly test.

test at nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Feature][Quant] Auto-detect quantization format from model files (#6645)

- Add automatic quantization format detection, eliminating the need to
manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files
(`quant_model_description.json` and `config.json`) at engine
initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected;
auto-detection only applies when the flag is omitted.

**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"`
(ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` →
`quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior
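
A minimal sketch of the detection priority above, assuming only the file layout described; the body is illustrative, not the PR's actual implementation in `vllm_ascend/quantization/utils.py`:

```python
import json
import os

def detect_quantization_method(model_dir: str):
    # 1. ModelSlim models ship a quant_model_description.json
    if os.path.exists(os.path.join(model_dir, "quant_model_description.json")):
        return "ascend"
    # 2. LLM-Compressor models record quant_method in config.json
    config_path = os.path.join(model_dir, "config.json")
    if os.path.exists(config_path):
        with open(config_path) as f:
            config = json.load(f)
        # quant_method may sit under quantization_config or at top level
        quant_cfg = config.get("quantization_config", config)
        if quant_cfg.get("quant_method") == "compressed-tensors":
            return "compressed-tensors"
    # 3. Neither marker found: default float behavior
    return None
```

Only the two small JSON files are read, so the check adds negligible overhead at engine start.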

**Technical approach:**
Hooked into `NPUPlatform.check_and_update_config()` to run detection
after `VllmConfig.__post_init__`. Since `quant_config` is already `None`
at that point, we explicitly recreate it via
`VllmConfig._get_quantization_config()` to trigger the full quantization
initialization pipeline.

| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added
`detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in
`check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error
handling for weight loading |
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d7e17aaacd5ed1b4b4be6bcfef3a1b7cbc84fc9a

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>

[Misc] Drop patch_rope.py (#6291)

Part of #5304.

We have aligned with vLLM's latest change for `RotaryEmbeddingBase`, so
this patch is no longer needed.

- vLLM version: v0.14.1
- vLLM main:
https://github.com/vllm-project/vllm/commit/dc917cceb877dfd13f98c538c4c96158047d98bd

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[BugFix] [310p] Fix attention accuracy issue (#6803)

This pull request resolves an attention accuracy issue by enhancing the
AttentionMaskBuilder310 to correctly handle the maximum model length.
The change ensures that the attention mask generation process is
properly parameterized by the model's configuration, rather than relying
on a fixed internal value. This leads to more accurate attention mask
creation, which is crucial for the correct functioning of the attention
mechanism.
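
The fix described above can be sketched as follows; the class shape and method names here are assumptions for illustration, not the actual AttentionMaskBuilder310 code:

```python
class AttentionMaskBuilder310:
    # Sketch (assumption): the mask builder is parameterized by the model
    # config's max length instead of a fixed internal constant.
    def __init__(self, max_model_len: int):
        self.max_model_len = max_model_len

    def causal_mask(self, seq_len: int):
        if seq_len > self.max_model_len:
            raise ValueError("sequence longer than max_model_len")
        # Lower-triangular causal mask: position i may attend to j <= i.
        return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
```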
Update fused_moe to main branch.
No
Qwen3 dense mode & moe model e2e test
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

[Doc][Misc] Refactor skill documentation and add Claude support instructions (#6817)

This PR refactors the documentation for vLLM Ascend skills.
- It renames and moves the `vllm-ascend-model-adapter` skill's README to
serve as a new top-level README for the `.agents` directory.
- It adds instructions on how to use the Ascend skills with Claude,
including a new README in the `.claude` directory.
- It updates `.gitignore` to exclude skills copied for Claude's use.
- Add main2main skill

This improves the documentation structure, making it more organized and
providing clear instructions for developers using these skills with
different tools.

No, this PR contains only documentation and repository configuration
changes. It does not affect any user-facing code functionality.

These changes are documentation-only and do not require specific
testing. The correctness of the instructions is being verified through
this review.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[Patch][Misc] Cleanup and update patches (#6802)

This PR performs a cleanup and update of the patch mechanism in
`vllm-ascend`.

- Removes an obsolete patch: `patch_deepseek.py`.
- Updates the central patch documentation in
`vllm_ascend/patch/__init__.py` to reflect these removals and additions,
re-numbering and re-organizing the patch list for better clarity.

No. These are internal changes to the patching mechanism and should not
affect users.

CI passed with new added/existing test.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (#5472)

**BUG**
When using prefill-decode disaggregation + MTP + full graph +
asynchronous scheduling, the KV cache pulled by decode nodes from
prefill nodes does not include spec tokens. As a result, the
total_num_scheduled_tokens obtained by decode nodes from the scheduler
lacks spec tokens. When determining whether to enqueue the full graph on
decode nodes, the uniform_decode condition
`scheduler_output.total_num_scheduled_tokens == self.input_batch.num_reqs * max_query_len`
is not met, so the current instance is not enqueued into the full graph.

The above situation leads to both full graph and eagle mode instances
coexisting in the decode instances. Due to the synchronization wait of
MoeDispatch, the decode instances in full graph are significantly slowed
down by the instance in eagle mode.

**Solution**
The scenario is PD disaggregation + MTP + full graph + asynchronous
scheduling. On the decode nodes, the spec tokens of requests whose KV
cache comes from P need to be padded; the padded spec tokens are then
rejected by sampling. This ensures the uniform_decode condition is
satisfied when deciding whether decode nodes join the full graph,
guaranteeing that all decode instances run in the full graph and
avoiding synchronous waiting on MoeDispatch.
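
The uniform_decode check and the padding fix can be sketched as follows (names besides the quoted condition are illustrative):

```python
def is_uniform_decode(total_num_scheduled_tokens: int, num_reqs: int,
                      max_query_len: int) -> bool:
    # The condition quoted above: every request schedules exactly
    # max_query_len tokens (1 decode token + spec tokens).
    return total_num_scheduled_tokens == num_reqs * max_query_len

def pad_spec_tokens(tokens_per_req, max_query_len):
    # D-node requests whose KV cache came from P arrive without spec
    # tokens; pad each to max_query_len so the batch stays uniform.
    # The padded spec tokens are later rejected by sampling.
    return [max_query_len for _ in tokens_per_req]

# Without padding, one request pulled from P schedules fewer tokens,
# breaking uniformity for the whole batch:
tokens = [4, 4, 1]
print(is_uniform_decode(sum(tokens), len(tokens), 4))   # not uniform
padded = pad_spec_tokens(tokens, 4)
print(is_uniform_decode(sum(padded), len(padded), 4))   # uniform again
```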

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/5326c89803566a131c928f7fdd2100b75c981a42

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>

[Doc][Release] Add release note skill (#6824)

This PR adds the release note skills:
- `SKILL.md`: vLLM Ascend Releasing Note Writer
- `references/ref-past-release-notes-highlight.md`
It also adds an `output/v0.13.0` example, which was used by
https://github.com/vllm-project/vllm-ascend/commit/2da476d82f048816095794a9c0ac45126dc251af

Inspired: https://github.com/simon-mo/release-notes-writing/

No

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Co-authored-by: esmeetu <jasonailu87@gmail.com>

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

[Feat]support sequence parallelism by pass for VL models (#5632)

[CI] Fix doc test fail when load model with error information: 'Stale file handle' (#6832)

This PR fixes a `Stale file handle` error that occurs during doctests in
the CI environment. The error appears when loading models from
ModelScope, likely due to issues with network file systems used in CI.

The fix involves setting the `MODELSCOPE_HUB_FILE_LOCK` environment
variable to `false` in the `run_doctests.sh` script. This disables file
locking in the ModelScope hub, which is a common workaround for this
type of file system error.
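
The workaround amounts to one line in `run_doctests.sh`, roughly:

```shell
#!/bin/bash
# Disable ModelScope hub file locking to avoid 'Stale file handle'
# errors on the networked filesystems used in CI.
export MODELSCOPE_HUB_FILE_LOCK=false
echo "MODELSCOPE_HUB_FILE_LOCK=${MODELSCOPE_HUB_FILE_LOCK}"
```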

No, this change only affects the CI test execution environment and has
no impact on users.

This change is validated by the CI pipeline. A successful run of the
doctests indicates that the fix is effective.

Signed-off-by: leo-pony <nengjunma@outlook.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[Doc] fix the nit in docs (#6826)

Refresh the doc, fix the nit in the docs

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

add release note for 0.15.0rc1 (#6839)

Add release note for 0.15.0rc1

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[DOC] enable both flashcomm1 and cudagraph (#6807)

This PR updates the DeepSeek-V3.2 documentation to include the latest
performance optimizations and configuration improvements.

- **Enable FlashComm1**: Added `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`
environment variable across all deployment scenarios to enable
FlashComm1 for improved communication performance
- **Layer Sharding**: Added `--additional-config '{"layer_sharding":
["q_b_proj", "o_proj"]}'` configuration to enable layer sharding for
better memory distribution
- **CUDA Graph Optimization**: Updated cudagraph capture sizes from
`[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]` to `[8, 16, 24, 32, 40,
48]`
- **Speculative Decoding**: Increased `num_speculative_tokens` from 2 to
3
- **Documentation Links**: Fixed request forwarding documentation to use
proper GitHub repository links

Yes, users can now follow the updated documentation to enable FlashComm1
and layer sharding for improved DeepSeek-V3.2 performance.

Existing documentation examples have been validated to ensure
configuration consistency across all deployment scenarios.

---

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Main2Main] Upgrade vLLM to 0226 (#6813)

Breaking:
1. https://github.com/vllm-project/vllm/pull/33452
2. https://github.com/vllm-project/vllm/pull/33451
3. https://github.com/vllm-project/vllm/pull/32567
4. https://github.com/vllm-project/vllm/pull/32344

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: MrZ20 <2609716663@qq.com>

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface (#6811)

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into
eagle_proposer.py

This pull request significantly refactors the speculative decoding
mechanism by merging Parallel Context Processing (PCP) and Multi-Token
Prediction (MTP) functionalities directly into the eagle_proposer.py.
The changes aim to enhance the efficiency and correctness of distributed
speculative decoding, particularly by enabling the Eagle feature to work
seamlessly with the disable_padded interface. This involves detailed
adjustments to attention metadata, input/output processing, and state
management to ensure proper operation in parallel environments.

1. The PCP and MTP features are migrated to the eagle_proposer.py
2. The Eagle and PCP features are integrated
3. Enable the eagle feature to use the disable_padded interface

No

Tests and UT

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>

[CI] Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector (#5441)

Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector.

- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/81786c87748b0177111dfdc07af5351d8389baa1

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>

[Doc][Misc] Update release notes for v0.15.0rc1 (#6859)

This PR updates the release notes for `v0.15.0rc1` to:
- Mark the `310P MoE and W8A8 Support` feature as experimental.
- Add a note for `Kimi-K2.5 Model Support` clarifying that it has known
issues in vLLM 0.15.0 and requires manual patching to work correctly.

No, this is a documentation-only update.

N/A (documentation change).

- vLLM version: v0.16.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[bugfix] Fixed an accuracy problem of gdn layer in graph (#6822)

There will be random outputs if we run a model with GDN attention in
graph mode:

```python
from vllm import LLM, SamplingParams

prompts = [
    "1. Who are you?",
]
# temperature=0.0 keeps the repro deterministic
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=5)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
          tensor_parallel_size=4,
          distributed_executor_backend="mp",
          gpu_memory_utilization=0.7,
          speculative_config={
              "method": "qwen3_next_mtp",
              "num_speculative_tokens": 3,
          },
          compilation_config={
              "cudagraph_mode": "FULL_DECODE_ONLY",
              "cudagraph_capture_sizes": [8],
          },
          max_model_len=4096,
          enable_prefix_caching=False)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{output.prompt_token_ids=}")
    print(f"{output.outputs[0].token_ids=}")
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Before applying this change, the output was:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 323, 279, 1112, 279]
Prompt: '1. Who are you?', Generated text: ' What and the... the'
```

After applying this change, the output is:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 374, 697, 829, 30]
Prompt: '1. Who are you?', Generated text: ' What is your name?'
```

**Why does this change solve the problem?**

Now, `query_start_loc` is padded because of `fia`.

However, for `gdn-attention`, the padded version of `query_start_loc`
causes an accuracy problem.

So we need an unpadded version of `query_start_loc`, named
`gdn_query_start_loc`, and use it in `gdn-attention`.

N/A

As described above.
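
The difference between the padded and unpadded layouts can be sketched as follows (array contents and the padding scheme are illustrative assumptions):

```python
from itertools import accumulate

def build_query_start_loc(query_lens, pad_to_reqs=None):
    # Cumulative start offsets: loc[i] is where request i's tokens begin.
    loc = [0] + list(accumulate(query_lens))
    if pad_to_reqs is not None:
        # Graph capture wants a fixed-size tensor, so the last offset is
        # repeated; GDN attention must not see these repeated entries.
        loc += [loc[-1]] * (pad_to_reqs + 1 - len(loc))
    return loc

query_start_loc = build_query_start_loc([4, 4], pad_to_reqs=8)  # padded, for fia
gdn_query_start_loc = build_query_start_loc([4, 4])             # unpadded, for gdn
```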

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: drslark <slarksblood@qq.com>

[CI] Refactor to speedup image building and CI Installation (#6708)

1. Refactor the image workflow using cache-from to speed up builds

![build](https://github.com/user-attachments/assets/02135c12-0069-44f8-a3ec-5c2b4282448a)

Simultaneously refactored all Dockerfiles by placing layers that rarely
change before those that change frequently, improving build cache hit
rate.

2. Refactor the E2E tests to use vllm-ascend container images, skipping
C compilation when no C code has changed

![e2e](https://github.com/user-attachments/assets/49f5b166-0df3-41e1-8f71-b3bbbed17cfd)

In this case, the job only replaces the vllm-ascend source code and
installs `requirements-dev.txt`, saving about 10 minutes before tests.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: wjunLu <wjunlu217@gmail.com>

clean 0.15.0 support (#6852)

Clean up vllm 0.15.0 related code

- vLLM version: v0.16.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)

This version has no divisibility constraint between tp and mtp+1.
However, cudagraph_capture_sizes must be a common multiple of tp and
mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed
cudagraph_capture_sizes.

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

and we also add aime2025 test for dual-node deepseek 3.2 nightly test.

test at nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Misc] Drop patch_rope.py (#6291)

Part of #5304.

We have align with vLLM's latest change for `RotaryEmbeddingBase`. Don't
need this patch anymore.

- vLLM version: v0.14.1
- vLLM main:
https://github.com/vllm-project/vllm/commit/dc917cceb877dfd13f98c538c4c96158047d98bd

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[BugFix] [310p] Fix attention accuracy issue (#6803)

This pull request resolves an attention accuracy issue by enhancing the
AttentionMaskBuilder310 to correctly handle the maximum model length.
The change ensures that the attention mask generation process is
properly parameterized by the model's configuration, rather than relying
on a fixed internal value. This leads to more accurate attention mask
creation, which is crucial for the correct functioning of the attention
mechanism.
Update fused_moe to main branch.
No
Qwen3 dense mode & moe model e2e test
- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

[Doc][Misc] Refactor skill documentation and add Claude support instructions (#6817)

This PR refactors the documentation for vLLM Ascend skills.
- It renames and moves the `vllm-ascend-model-adapter` skill's README to
serve as a new top-level README for the `.agents` directory.
- It adds instructions on how to use the Ascend skills with Claude,
including a new README in the `.claude` directory.
- It updates `.gitignore` to exclude skills copied for Claude's use.
- Add main2main skill

This improves the documentation structure, making it more organized and
providing clear instructions for developers using these skills with
different tools.

No, this PR contains only documentation and repository configuration
changes. It does not affect any user-facing code functionality.

These changes are documentation-only and do not require specific
testing. The correctness of the instructions is being verified through
this review.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[Patch][Misc] Cleanup and update patches (#6802)

This PR performs a cleanup and update of the patch mechanism in
`vllm-ascend`.

- Removes several obsolete patches: `patch_deepseek.py`.
- Updates the central patch documentation in
`vllm_ascend/patch/__init__.py` to reflect these removals and additions,
re-numbering and re-organizing the patch list for better clarity.

No. These are internal changes to the patching mechanism and should not
affect users.

CI passed with new added/existing test.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (#5472)

**BUG**
When using prefill-decode disaggregation + MTP + full graph
+asynchronous scheduling, the KV cache pulled by decode nodes from
prefill decodes does not include spec tokens. As a result, the
total_num_scheduled_tokens obtained by decode nodes from the scheduler
lacks spec tokens. When determining whether to enqueue the full graph on
decode nodes, the condition for uniform_decode `
scheduler_output.total_num_scheduled_tokens == self.input_batch.num_reqs
* max_query_len` is not met, leading to the current instance not being
enqueued into the full graph.

The above situation leads to both full graph and eagle mode instances
coexisting in the decode instances. Due to the synchronization wait of
MoeDispatch, the decode instances in full graph are significantly slowed
down by the instance in eagle mode.

**Solution**
The scenario is PD separation + MTP + Full Graph + asynchronous
scheduling.
On the decode nodes, the spec tokens of the request with KV cache from P
need be padded. Then, the padded spec tokens will be rejected by
sampling. This operation ensures that the uniform_decode condition is
satisfied when determining whether decode nodes are included in the full
graph, thereby guaranteeing that all decode instances are present in the
full graph and avoiding synchronous waiting for MoeDispatch.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/5326c89803566a131c928f7fdd2100b75c981a42

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>

[Doc][Release] Add release note skill (#6824)

This PR adds the releaseing note skills:
- `SKILL.md`: vLLM Ascend Releasing Note Writer
- `references/ref-past-release-notes-highlight.md`:
And also add a `output/v0.13.0` examples which was used by
https://github.com/vllm-project/vllm-ascend/commit/2da476d82f048816095794a9c0ac45126dc251af

Inspired: https://github.com/simon-mo/release-notes-writing/

No

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Co-authored-by: esmeetu <jasonailu87@gmail.com>

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

[Feat]support sequence parallelism by pass for VL models (#5632)

[CI] Fix doc test fail when load model with error information: 'Stale file handle' (#6832)

This PR fixes a `Stale file handle` error that occurs during doctests in
the CI environment. The error appears when loading models from
ModelScope, likely due to issues with network file systems used in CI.

The fix involves setting the `MODELSCOPE_HUB_FILE_LOCK` environment
variable to `false` in the `run_doctests.sh` script. This disables file
locking in the ModelScope hub, which is a common workaround for this
type of file system error.

No, this change only affects the CI test execution environment and has
no impact on users.

This change is validated by the CI pipeline. A successful run of the
doctests indicates that the fix is effective.

Signed-off-by: leo-pony <nengjunma@outlook.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[Doc] Fix nits in the docs (#6826)

Refresh the doc and fix nits in the docs.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

add release note for 0.15.0rc1 (#6839)

Add release note for 0.15.0rc1

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[DOC] enable both flashcomm1 and cudagraph (#6807)

This PR updates the DeepSeek-V3.2 documentation to include the latest
performance optimizations and configuration improvements.

- **Enable FlashComm1**: Added `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`
environment variable across all deployment scenarios to enable
FlashComm1 for improved communication performance
- **Layer Sharding**: Added `--additional-config '{"layer_sharding":
["q_b_proj", "o_proj"]}'` configuration to enable layer sharding for
better memory distribution
- **CUDA Graph Optimization**: Updated cudagraph capture sizes from
`[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]` to `[8, 16, 24, 32, 40,
48]`
- **Speculative Decoding**: Increased `num_speculative_tokens` from 2 to
3
- **Documentation Links**: Fixed request forwarding documentation to use
proper GitHub repository links
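Taken together, these options could be combined in a launch command along the following lines. This is a sketch only: the exact flag spellings, in particular how cudagraph capture sizes and speculative settings are passed, are assumptions and may differ across vLLM versions.

```shell
# Hypothetical DeepSeek-V3.2 launch combining the documented settings.
# Flag names below are illustrative assumptions, not verified CLI syntax.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1   # enable FlashComm1

vllm serve deepseek-ai/DeepSeek-V3.2 \
  --additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
  --compilation-config '{"cudagraph_capture_sizes": [8, 16, 24, 32, 40, 48]}' \
  --speculative-config '{"num_speculative_tokens": 3}'
```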

Yes, users can now follow the updated documentation to enable FlashComm1
and layer sharding for improved DeepSeek-V3.2 performance.

Existing documentation examples have been validated to ensure
configuration consistency across all deployment scenarios.

---

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Main2Main] Upgrade vLLM to 0226 (#6813)

Breaking:
1. https://github.com/vllm-project/vllm/pull/33452
2. https://github.com/vllm-project/vllm/pull/33451
3. https://github.com/vllm-project/vllm/pull/32567
4. https://github.com/vllm-project/vllm/pull/32344

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: MrZ20 <2609716663@qq.com>

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface (#6811)

[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into
eagle_proposer.py

This pull request significantly refactors the speculative decoding
mechanism by merging Parallel Context Processing (PCP) and Multi-Token
Prediction (MTP) functionalities directly into eagle_proposer.py.
The changes aim to enhance the efficiency and correctness of distributed
speculative decoding, particularly by enabling the Eagle feature to work
seamlessly with the disable_padded interface. This involves detailed
adjustments to attention metadata, input/output processing, and state
management to ensure proper operation in parallel environments.

1. Migrated the PCP and MTP features to eagle_proposer.py
2. Integrated the Eagle and PCP features
3. Enabled the Eagle feature to use the disable_padded interface

No

Tested with unit tests and end-to-end tests.

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/83b47f67b1dfad505606070ae4d9f83e50ad4ebd

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>

[CI] Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector (#5441)

Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector.
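As a rough sketch of what such a nightly test exercises, a disaggregated prefill/decode setup launches instances with a KV-transfer config along these lines. The connector name, role string, and flags here are illustrative assumptions, not the exact CI configuration:

```shell
# Hypothetical producer-side launch for a layerwise Mooncake KV connector.
# All names below are illustrative; the actual nightly test config differs.
vllm serve Qwen/Qwen3-235B-A22B \
  --tensor-parallel-size 8 \
  --kv-transfer-config \
    '{"kv_connector": "MooncakeLayerwiseConnector", "kv_role": "kv_producer"}'
```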

- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/81786c87748b0177…
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
…nector (vllm-project#5441)

### What this PR does / why we need it?

Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@81786c8

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026