
[Main2Main] Upgrade vllm commit to 0102#5546

Closed
wjunLu wants to merge 16 commits intovllm-project:mainfrom
wjunLu:main_upgrade

Conversation


@wjunLu wjunLu commented Dec 31, 2025

What this PR does / why we need it?

Upgrade vllm commit to 0102

  1. Remove the maybe_padded_num_tokens arg in model_runner_v1.py, following [Core] Remove unused num_tokens parameter from _init_model_kwargs vllm#31517
  2. Remove Qwen/Qwen3-0.6B from tests/e2e/multicard/test_aclgraph_capture_replay.py, because offline data parallel mode is not supported/useful for dense models

Does this PR introduce any user-facing change?

How was this patch tested?


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the compatible vLLM commit hash for the main branch in the versioning policy documentation. The change is straightforward, but to improve long-term maintainability, I've suggested abstracting the hardcoded commit hash into a substitution variable. This aligns with the project's own documented policy for managing version-specific information in the documentation.

| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu |
|-------------|--------------|------------------|-------------|--------------------|
| main | 7157596103666ee7ccb7008acee8bff8a8ff1731, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
| main | ecd49ce7e69a50892be7f9841941ca2d7e3b12ea, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |

Severity: high

For better maintainability, this commit hash should be defined as a substitution variable in docs/source/conf.py. This aligns with the project's documentation policy stated in this same file on line 138: 'To reduce maintenance costs, all branch documentation content should remain consistent, and version differences can be controlled via variables in docs/source/conf.py'.

You would need to add a new key to the myst_substitutions dictionary in docs/source/conf.py, for example:

```python
myst_substitutions = {
    # ... existing substitutions
    'vllm_main_commit': 'ecd49ce7e69a50892be7f9841941ca2d7e3b12ea',
}
```

Then you can use the substitution here as suggested.

Suggested change
| main | ecd49ce7e69a50892be7f9841941ca2d7e3b12ea, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
| main | {{ vllm_main_commit }}, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |


jianzs commented Dec 31, 2025

Can we use GitHub Actions to automatically sync the latest commits in vllm?

@vllm-ascend-ci vllm-ascend-ci added ready read for review ready-for-test start test by label for PR labels Dec 31, 2025

wjunLu commented Dec 31, 2025

Can we use GitHub Actions to automatically sync the latest commits in vllm?

We have this workflow https://github.com/vllm-project/vllm-ascend/actions/workflows/schedule_test_vllm_main.yaml to sync and verify automatically, but there are always some errors, so we have to handle them manually.

@github-actions github-actions bot added documentation Improvements or additions to documentation ci/build labels Dec 31, 2025
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@wjunLu wjunLu changed the title [Main2Main] Upgrade vllm commit to 1231 [Main2Main] Upgrade vllm commit to 0102 Jan 4, 2026
wjunLu and others added 16 commits January 4, 2026 11:43
Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
…efill scenario (vllm-project#3072)

By converting the KV cache from ND to NZ format when the decode node
receives it, this PR ensures that the KV NZ feature works correctly
during the decoding phase in disagg-prefill scenario.
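The fix can be pictured as a post-receive hook on the decode side. A minimal sketch with hypothetical names (on Ascend the real conversion would be a format cast such as torch_npu.npu_format_cast, and the blocks would be real tensors):

```python
# Hypothetical sketch: after the decode node receives KV blocks from the
# prefill node, cast each block from ND to NZ so the attention kernels
# that expect the NZ layout see a consistent format.
def on_kv_received(kv_blocks, cast_to_nz):
    # cast_to_nz stands in for the real ND -> NZ format cast
    return [cast_to_nz(block) for block in kv_blocks]

# Toy demo: tag strings instead of real tensors.
converted = on_kv_received(["nd:layer0", "nd:layer1"],
                           lambda b: b.replace("nd:", "nz:"))
print(converted)  # ['nz:layer0', 'nz:layer1']
```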

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@83f478b

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Co-authored-by: ghphotoframe <854746559@qq.com>
Co-authored-by: alex101-ops <alex1015718386@gmail.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
)

Currently in the Fused MoE module, functions of classes like
MoECommMethod and MoETokenDispatcher output data in dictionary or tuple
format, which hampers code maintainability, readability, and
extensibility. This PR introduces dataclasses for these key output types
to address these issues.
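The shape of that change can be sketched as follows; the class and field names here are illustrative, not the PR's actual ones:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical dataclass replacing an anonymous tuple/dict return value
# from a token dispatcher in the fused MoE module.
@dataclass
class DispatchOutput:
    hidden_states: Any                  # tokens routed to local experts
    expert_ids: Any                     # which expert each token goes to
    recv_counts: Optional[Any] = None   # per-rank receive counts (EP only)

def dispatch_tokens(hidden_states, expert_ids):
    # Returning a named dataclass instead of (hidden_states, expert_ids, None)
    # lets callers use field names rather than positional indices.
    return DispatchOutput(hidden_states=hidden_states, expert_ids=expert_ids)

out = dispatch_tokens([0.1, 0.2], [3, 7])
print(out.expert_ids)  # field access replaces out[1]
```

Named fields make call sites self-documenting and let new outputs be added without breaking existing unpacking.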

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@5326c89

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Improve the performance of the Layerwise Connector, mainly in the following ways:
1. Use event synchronization instead of stream synchronization.
2. Access the metaserver during scheduling.
3. Transfer the KV cache for each chunked-prefill segment.
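Point 3 can be sketched schematically: instead of shipping the whole KV cache after prefill finishes, each chunk's blocks are sent as soon as that chunk completes, overlapping transfer with compute. A pure-Python stand-in (the real code moves KV blocks over the connector):

```python
def chunk_ranges(seq_len: int, chunk_size: int):
    """Yield (start, end) token ranges, one per chunked-prefill segment."""
    for start in range(0, seq_len, chunk_size):
        yield start, min(start + chunk_size, seq_len)

def transfer_kv_per_chunk(seq_len, chunk_size, send):
    # `send` stands in for the connector's per-chunk KV transfer call;
    # each segment is shipped as soon as it is computed.
    for start, end in chunk_ranges(seq_len, chunk_size):
        send(start, end)

sent = []
transfer_kv_per_chunk(10, 4, lambda s, e: sent.append((s, e)))
print(sent)  # [(0, 4), (4, 8), (8, 10)]
```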

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.
- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@5fbfa8d

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
… reuse of the workspace in certain scenarios (vllm-project#5522)

### What this PR does / why we need it?

In the current process of implementing attention updates, the FIA
operator shares a single workspace among different layers within the
same computation graph. To enable memory reuse, we adopt the
weak_ref_tensor mechanism. However, this approach may lead to precision
anomalies in certain scenarios. To address this issue, different layers
in the same computation graph are assigned independent workspaces.
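A hedged sketch of the fix, with hypothetical names and plain buffers standing in for NPU workspace tensors:

```python
# Each attention layer gets its own workspace instead of weak-referencing
# one shared buffer, so a later layer's FIA call inside the same captured
# graph can never clobber an earlier layer's data.
class WorkspaceManager:
    def __init__(self, workspace_bytes: int):
        self.workspace_bytes = workspace_bytes
        self._per_layer = {}

    def get(self, layer_idx: int) -> bytearray:
        # One independent buffer per layer, allocated on first use and
        # reused on every subsequent replay of the graph.
        if layer_idx not in self._per_layer:
            self._per_layer[layer_idx] = bytearray(self.workspace_bytes)
        return self._per_layer[layer_idx]

mgr = WorkspaceManager(1024)
assert mgr.get(0) is not mgr.get(1)   # distinct workspaces per layer
assert mgr.get(0) is mgr.get(0)       # stable across calls
```

This trades some extra memory for correctness, which is the trade-off the PR describes.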

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@45c1ca1

Signed-off-by: WithHades <244036962@qq.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Add LongCat-Flash support.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: chuyuelin <923822139@qq.com>
Co-authored-by: chuyuelin <chuyuelin1@huawei.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
This PR builds upon PR vllm-project#5011 and aims to further enhance the
npu_graph_ex_passes module. Based on prior work, we have added graph
optimization support for the add_rms_quant fused operator in scenarios
where a bias term is present, ensuring the fusion pattern is correctly
registered and matched in the computation graph.

For validation, we switched to the Qwen3-235B-A22B-W8A8 model. Benchmark
results show that, compared to the unfused baseline, enabling this
fusion pass significantly improves inference throughput for W8A8
quantized models. For more details, refer to the
RFC: vllm-project#4715
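The core idea of the fusion pass can be illustrated with a toy matcher over a flat op list (the real registration uses torch.fx pattern matching in npu_graph_ex_passes; all names here are illustrative):

```python
# Match an (add, rms_norm, quant) chain and replace it with one fused
# node; with this PR the pattern also matches when a bias feeds the add.
def fuse_add_rms_quant(ops):
    fused, i = [], 0
    while i < len(ops):
        if ops[i:i + 3] == ["add", "rms_norm", "quant"]:
            fused.append("add_rms_quant")  # single fused operator
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused

print(fuse_add_rms_quant(["bias_add", "add", "rms_norm", "quant", "matmul"]))
# ['bias_add', 'add_rms_quant', 'matmul']
```

Fusing the chain saves two kernel launches and two intermediate tensors per matched site, which is where the throughput gain comes from.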

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
```
llm = LLM(
    model=model,
    tensor_parallel_size=GPUs_per_dp_rank,
    enforce_eager=False,
    enable_expert_parallel=enable_expert_parallel,
    trust_remote_code=trust_remote_code,
    gpu_memory_utilization=0.98,
    max_num_batched_tokens=512,
    # load_format="dummy",
    max_model_len=2048,
    max_num_seqs=16,
    quantization="ascend",
    additional_config={
        "refresh": True,
        "enable_npugraph_ex": True
    },
    compilation_config={
        "cudagraph_capture_sizes": [8, 16],
        "cudagraph_mode": "FULL_DECODE_ONLY",
    },
)
if profile_dir:
    llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
if profile_dir:
    llm.stop_profile()
for i, output in enumerate(outputs):
    if i >= 5:
        break
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(
        f"DP rank {global_dp_rank}, Prompt: {prompt!r}, "
        f"Generated text: {generated_text!r}"
    )
```
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@5326c89

Signed-off-by: cjian <2318164299@qq.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Currently, when the MooncakeConnector interacts via ZeroMQ, send/receive failures cause the following problems:
**Issue 1:** The currently used `zmq.REQ` socket follows a strict
request-reply pattern, requiring an alternating sequence of send →
receive → send → receive... If either a send() or receive() operation
fails, the ZeroMQ socket becomes unusable.
**Solution:** When a send() or receive() exception occurs, close and
delete the ZeroMQ socket, and recreate it upon next use.

**Issue 2:** In `_handle_request`, if `_send_done_recv_signal` raises an
exception, the exception is thrown immediately and subsequent code is
not executed, causing the decode logic to fail to properly release the
request.
**Solution:** Move the call to `_send_done_recv_signal` to the end of
the function.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@45c1ca1

Signed-off-by: LCAIZJ <leichao139636@163.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
We should also trigger an image build when nightly-test-related files are
changed, to ensure the image is valid for nightly tests. Please note that
this only applies to images with the tag `main*` (which means builds
triggered by PR).
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@7157596

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
Bumps
[actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 6.

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@5326c89

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
Bumps
[actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 7.

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@5326c89

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
…replay.py

Signed-off-by: wjunLu <wjunlu217@gmail.com>
…-W8A8 (vllm-project#5381)

### What this PR does / why we need it?
Add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs for multi-node and
long-sequence (PCP & DCP) scenarios.

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@bc0a5a0
---------
Signed-off-by: daishixun <dsxsteven@sina.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>

Labels

ci/build documentation Improvements or additions to documentation ready read for review ready-for-test start test by label for PR


10 participants