[Main2Main] Upgrade vllm commit to 0102 #5546
wjunLu wants to merge 16 commits into vllm-project:main
Conversation
Code Review
This pull request updates the compatible vLLM commit hash for the main branch in the versioning policy documentation. The change is straightforward, but to improve long-term maintainability, I've suggested abstracting the hardcoded commit hash into a substitution variable. This aligns with the project's own documented policy for managing version-specific information in the documentation.
```diff
 | vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu |
 |-------------|------|--------|-------------|--------------------|
-| main | 7157596103666ee7ccb7008acee8bff8a8ff1731, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
+| main | ecd49ce7e69a50892be7f9841941ca2d7e3b12ea, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
```
For better maintainability, this commit hash should be defined as a substitution variable in docs/source/conf.py. This aligns with the project's documentation policy stated in this same file on line 138: 'To reduce maintenance costs, all branch documentation content should remain consistent, and version differences can be controlled via variables in docs/source/conf.py'.
You would need to add a new key to the myst_substitutions dictionary in docs/source/conf.py, for example:
```python
myst_substitutions = {
    # ... existing substitutions
    'vllm_main_commit': 'ecd49ce7e69a50892be7f9841941ca2d7e3b12ea',
}
```

Then you can use the substitution here as suggested.
```diff
-| main | ecd49ce7e69a50892be7f9841941ca2d7e3b12ea, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
+| main | {{ vllm_main_commit }}, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
```
Can we use GitHub Actions to automatically sync the latest commits in vllm?

We have this workflow https://github.com/vllm-project/vllm-ascend/actions/workflows/schedule_test_vllm_main.yaml to automatically sync and verify, but there are always some errors, so we have to handle them manually.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: wjunLu <wjunlu217@gmail.com>
…efill scenario (vllm-project#3072) By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in disagg-prefill scenario. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@83f478b --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com> Co-authored-by: ghphotoframe <854746559@qq.com> Co-authored-by: alex101-ops <alex1015718386@gmail.com> Signed-off-by: wjunLu <wjunlu217@gmail.com>
) Currently in the Fused MoE module, functions of classes like MoECommMethod and MoETokenDispatcher output data in dictionary or tuple format, which hampers code maintainability, readability, and extensibility. This PR introduces dataclasses for these key output types to address these issues. - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@5326c89 --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com> Signed-off-by: wjunLu <wjunlu217@gmail.com>
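The dataclass refactor described above can be sketched roughly as follows. This is a minimal illustration only: the class and field names below are invented for the example and are not the actual vllm-ascend types.

```python
from dataclasses import dataclass

# Hypothetical illustration of the refactor; names are invented, not the
# actual MoECommMethod / MoETokenDispatcher output types.
@dataclass
class DispatchOutput:
    hidden_states: list   # per-token activations routed to experts
    expert_ids: list      # which expert each token was assigned to
    num_tokens: int       # total number of tokens dispatched

def dispatch(tokens: list) -> DispatchOutput:
    # Before the refactor, a function like this would return a bare tuple
    # (hidden, ids, n), forcing every caller to remember field positions.
    # A dataclass names the fields at the call site instead.
    return DispatchOutput(
        hidden_states=tokens,
        expert_ids=[0] * len(tokens),
        num_tokens=len(tokens),
    )
```

Callers then read `out.num_tokens` instead of `out[2]`, which is the maintainability and readability gain the commit describes.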
### What this PR does / why we need it?
Improve the performance of the Layerwise Connector, mainly through the following points:
1. Use event synchronization to replace stream synchronization.
2. Access the metaserver when scheduling.
3. Transfer the KV cache for each chunked-prefill segment.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@5fbfa8d

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
… reuse of the workspace in certain scenarios (vllm-project#5522)

### What this PR does / why we need it?
In the current process of implementing attention updates, the FIA operator shares a single workspace among different layers within the same computation graph. To enable memory reuse, we adopt the weak_ref_tensor mechanism. However, this approach may lead to precision anomalies in certain scenarios. To address this issue, different layers in the same computation graph are assigned independent workspaces.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@45c1ca1

Signed-off-by: WithHades <244036962@qq.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it? Add LongCat-Flash support. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@ad32e3e --------- Signed-off-by: chuyuelin <923822139@qq.com> Co-authored-by: chuyuelin <chuyuelin1@huawei.com> Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
This PR builds upon PR vllm-project#5011 and aims to further enhance the npu_graph_ex_passes module. Based on prior work, we have added graph optimization support for the add_rms_quant fused operator in scenarios where a bias term is present, ensuring the fusion pattern is correctly registered and matched into the computation graph. For validation, we switched to the Qwen3-235B-A22B-W8A8 model. Benchmark results show that, compared to the unfused baseline, enabling this fusion pass significantly improves inference throughput for W8A8 quantized models. For more details, refer to the RFC: vllm-project#4715

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
```
llm = LLM(
    model=model,
    tensor_parallel_size=GPUs_per_dp_rank,
    enforce_eager=False,
    enable_expert_parallel=enable_expert_parallel,
    trust_remote_code=trust_remote_code,
    gpu_memory_utilization=0.98,
    max_num_batched_tokens=512,
    # load_format="dummy",
    max_model_len=2048,
    max_num_seqs=16,
    quantization="ascend",
    additional_config={
        "refresh": True,
        "enable_npugraph_ex": True
    },
    compilation_config={
        "cudagraph_capture_sizes": [8, 16],
        "cudagraph_mode": "FULL_DECODE_ONLY",
    },
)
if profile_dir:
    llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
if profile_dir:
    llm.stop_profile()
for i, output in enumerate(outputs):
    if i >= 5:
        break
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(
        f"DP rank {global_dp_rank}, Prompt: {prompt!r}, "
        f"Generated text: {generated_text!r}"
    )
```

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@5326c89

Signed-off-by: cjian <2318164299@qq.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Currently, when the MooncakeConnector interacts via ZeroMQ, it throws the following exception upon send/receive failure:

**Issue 1:** The currently used `zmq.REQ` socket follows a strict request-reply pattern, requiring an alternating sequence of send → receive → send → receive... If either a send() or receive() operation fails, the ZeroMQ socket becomes unusable.
**Solution:** When a send() or receive() exception occurs, close and delete the ZeroMQ socket, and recreate it upon next use.

**Issue 2:** In `_handle_request`, if `_send_done_recv_signal` raises an exception, the exception is thrown immediately and subsequent code is not executed, causing the decode logic to fail to properly release the request.
**Solution:** Move the call to `_send_done_recv_signal` to the end of the function.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@45c1ca1

Signed-off-by: LCAIZJ <leichao139636@163.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
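The recovery strategy for the REQ-socket issue can be sketched generically. This is a hypothetical helper, not the actual MooncakeConnector code; `socket_factory` stands in for creating a `zmq.REQ` socket (e.g. `lambda: ctx.socket(zmq.REQ)`).

```python
class ReconnectingChannel:
    """Sketch of the close-and-recreate fix: a strict request-reply socket
    becomes unusable after any failed send()/recv(), so on error we close
    and drop it, then lazily recreate it on the next use."""

    def __init__(self, socket_factory):
        self._factory = socket_factory  # hypothetical stand-in for zmq socket creation
        self._sock = None

    def request(self, payload):
        if self._sock is None:
            # First use, or recreation after a previous failure.
            self._sock = self._factory()
        try:
            self._sock.send(payload)
            return self._sock.recv()
        except OSError:
            # The socket is now stuck mid request-reply cycle: discard it so
            # the next call starts with a fresh one, and let the caller retry.
            self._sock.close()
            self._sock = None
            raise
```

The key design choice matches the PR's description: the error is still propagated to the caller, but the broken socket is never reused, so the next request transparently gets a working connection.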
### What this PR does / why we need it? We should also trigger image build when nightly test related files are changed to ensure the image is valid for nightly tests. Please note that this only applies to image with the tag `main*`(which means build triggered by PR). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@7157596 Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: wjunLu <wjunlu217@gmail.com>
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 6. - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@5326c89 Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: wjunLu <wjunlu217@gmail.com>
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 7. - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@5326c89 Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: wjunLu <wjunlu217@gmail.com>
…replay.py Signed-off-by: wjunLu <wjunlu217@gmail.com>
…-W8A8 (vllm-project#5381) ### What this PR does / why we need it? add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs in multi-nodes and longseq (PCP&DCP) scenario - vLLM version: release/v0.13.0 - vLLM main: vllm-project/vllm@bc0a5a0 --------- Signed-off-by: daishixun <dsxsteven@sina.com> Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Upgrade vllm commit to 0102.
- Removed the `maybe_padded_num_tokens` arg in `model_runner_v1.py` due to [Core] Remove unused `num_tokens` parameter from `_init_model_kwargs` (vllm#31517).
- Removed `Qwen/Qwen3-0.6B` in `tests/e2e/multicard/test_aclgraph_capture_replay.py` because offline data parallel mode will not be supported/useful for dense models.

### Does this PR introduce any user-facing change?

### How was this patch tested?