[Core] Init vllm-ascend#2
Closed
wangxiyuan wants to merge 59 commits into vllm-project:main from wangxiyuan:main
Conversation
Signed-off-by: MengqingCao <cmq0113@163.com>
* add requirements.txt for npu
* update setup.py

Signed-off-by: MengqingCao <cmq0113@163.com>
Add npu worker and model_runner
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
* add block_size arg to inference script
* add get_current_memory_usage to platform

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
[bugfix]: attention alibi bias shape is wrong
[Docs] Update docs and code format
[Format] Update docs and add format.sh
Signed-off-by: MengqingCao <cmq0113@163.com>
This was referenced Dec 11, 2025
wuhang2014 pushed a commit to wuhang2014/vllm-ascend that referenced this pull request on Dec 19, 2025
dsxsteven pushed a commit to dsxsteven/vllm-ascend_dsx that referenced this pull request on Jan 9, 2026
wangxiyuan pushed a commit that referenced this pull request on Jan 19, 2026
### What this PR does / why we need it?

**Scope of Changes**:

| File Path |
| :--- |
| `vllm_ascend/attention/attention_mask.py` |
| `vllm_ascend/attention/attention_v1.py` |
| `vllm_ascend/attention/context_parallel/attention_cp.py` |
| `vllm_ascend/attention/context_parallel/common_cp.py` |
| `vllm_ascend/attention/context_parallel/mla_cp.py` |
| `vllm_ascend/attention/utils.py` |
| `vllm_ascend/batch_invariant.py` |
| `vllm_ascend/device/device_op.py` |
| `vllm_ascend/device_allocator/camem.py` |
| `vllm_ascend/envs.py` |

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2c24bc6

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request on Jan 19, 2026
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (110 commits)
  [Performance] Remove index opetation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 (vllm-project#5936)
  [main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (vllm-project#5960)
  [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)
  [Refactor] Move AttentionSpec initialization to Attention module (vllm-project#5834)
  [EPLB][Bugfix] policy_swift_balancer bugfix and renaming (vllm-project#5897)
  [CI]fix for lint CI (vllm-project#5982)
  [Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (vllm-project#5034)
  [Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (vllm-project#5928)
  [EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (vllm-project#5933)
  [EPLB][Nightly][Bugfix] Get expert from moe layer only (vllm-project#5908)
  [Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (vllm-project#5855)
  [doc]Table split (vllm-project#5929)
  [Doc] Upgrade outdated ut doc (vllm-project#5937)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#2) (vllm-project#5977)
  Eagle3 mm support, enablement on qwen3vl (vllm-project#4848)
  [Doc] Remove Chinese characters from the icons in the doc. (vllm-project#5959)
  [P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (vllm-project#5968)
  [Feature] Support fine-grained shared expert overlap (vllm-project#5482)
  [Bugfix] fix cpu offload hang with tp=1 (vllm-project#5963)
  [Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (vllm-project#5776)
  ...
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request on Jan 21, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (637 commits)
  [Performance] Remove index opetation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 (vllm-project#5936)
  [main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (vllm-project#5960)
  [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)
  [Refactor] Move AttentionSpec initialization to Attention module (vllm-project#5834)
  [EPLB][Bugfix] policy_swift_balancer bugfix and renaming (vllm-project#5897)
  [CI]fix for lint CI (vllm-project#5982)
  [Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (vllm-project#5034)
  [Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (vllm-project#5928)
  [EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (vllm-project#5933)
  [EPLB][Nightly][Bugfix] Get expert from moe layer only (vllm-project#5908)
  [Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (vllm-project#5855)
  [doc]Table split (vllm-project#5929)
  [Doc] Upgrade outdated ut doc (vllm-project#5937)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#2) (vllm-project#5977)
  Eagle3 mm support, enablement on qwen3vl (vllm-project#4848)
  [Doc] Remove Chinese characters from the icons in the doc. (vllm-project#5959)
  [P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (vllm-project#5968)
  [Feature] Support fine-grained shared expert overlap (vllm-project#5482)
  [Bugfix] fix cpu offload hang with tp=1 (vllm-project#5963)
  [Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (vllm-project#5776)
  ...
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request on Jan 31, 2026
…) (vllm-project#5977)
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request on Jan 31, 2026
…) (vllm-project#5977)
linsheng1 added a commit to leideng/vllm-ascend that referenced this pull request on Feb 11, 2026
we can get kvcomp_metadata from common_attn_metadata now
ZhuJiyang1 added a commit to ZhuJiyang1/vllm-ascend that referenced this pull request on Feb 28, 2026
# This is the 1st commit message:

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

Update eagle_proposer.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>

[CI] Add long and short prompt tests for DeepSeek-V3.2 (vllm-project#6536)

This version has no divisibility constraint between tp and mtp+1. However, cudagraph_capture_sizes must be a common multiple of tp and mtp+1, with a maximum of tp * (mtp+1), so we fixed cudagraph_capture_sizes. We added a long-sequence test (64k input, 3k output) for the two-node mixed deployment scenario. Due to the excessive time required for performance benchmarking, we only verify functionality. The single-node scenario is skipped because VRAM limitations prevent launching the model with a max-model-len of 68,000. We also added an aime2025 test to the dual-node DeepSeek-V3.2 nightly test, verified in the nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>

[Feature][Quant] Auto-detect quantization format from model files (vllm-project#6645)

- Add automatic quantization format detection, eliminating the need to manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files (`quant_model_description.json` and `config.json`) at engine initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected; auto-detection only applies when the flag is omitted.

**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"` (ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` → `quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior

**Technical approach:** Hooked into `NPUPlatform.check_and_update_config()` to run detection after `VllmConfig.__post_init__`. Since `quant_config` is already `None` at that point, we explicitly recreate it via `VllmConfig._get_quantization_config()` to trigger the full quantization initialization pipeline.

| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added `detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in `check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error handling for weight loading |

- vLLM version: v0.15.0
- vLLM main: vllm-project/vllm@d7e17aa

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>

[Misc] Drop patch_rope.py (vllm-project#6291)

Part of vllm-project#5304. We have aligned with vLLM's latest change to `RotaryEmbeddingBase`, so this patch is no longer needed.

- vLLM version: v0.14.1
- vLLM main: vllm-project/vllm@dc917cc

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

[BugFix] [310p] Fix attention accuracy issue (vllm-project#6803)

This pull request resolves an attention accuracy issue by enhancing AttentionMaskBuilder310 to correctly handle the maximum model length. The change ensures that attention mask generation is properly parameterized by the model's configuration rather than relying on a fixed internal value. This leads to more accurate attention mask creation, which is crucial for the correct functioning of the attention mechanism.

Update fused_moe to main branch.

No Qwen3 dense mode & moe model e2e test.

- vLLM version: v0.15.0
- vLLM main: vllm-project/vllm@83b47f6

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

[Doc][Misc] Refactor skill documentation and add Claude support instructions (vllm-project#6817)

This PR refactors the documentation for vLLM Ascend skills:
- It renames and moves the `vllm-ascend-model-adapter` skill's README to serve as a new top-level README for the `.agents` directory.
- It adds instructions on how to use the Ascend skills with Claude, including a new README in the `.claude` directory.
- It updates `.gitignore` to exclude skills copied for Claude's use.
- It adds the main2main skill.

This improves the documentation structure, making it more organized and providing clear instructions for developers using these skills with different tools. This PR contains only documentation and repository configuration changes; it does not affect any user-facing code functionality and does not require specific testing. The correctness of the instructions is being verified through this review.

- vLLM version: v0.15.0
- vLLM main: vllm-project/vllm@83b47f6

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[Patch][Misc] Cleanup and update patches (vllm-project#6802)

This PR performs a cleanup and update of the patch mechanism in `vllm-ascend`:
- Removes the obsolete patch `patch_deepseek.py`.
- Updates the central patch documentation in `vllm_ascend/patch/__init__.py` to reflect these removals and additions, re-numbering and re-organizing the patch list for better clarity.

These are internal changes to the patching mechanism and should not affect users. CI passed with newly added and existing tests.

- vLLM version: v0.15.0
- vLLM main: vllm-project/vllm@83b47f6

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

[BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (vllm-project#5472)

**Bug:** When using prefill-decode disaggregation + MTP + full graph + asynchronous scheduling, the KV cache pulled by decode nodes from prefill nodes does not include spec tokens. As a result, the total_num_scheduled_tokens obtained by decode nodes from the scheduler lacks spec tokens. When determining whether to enqueue the full graph on decode nodes, the uniform_decode condition `scheduler_output.total_num_scheduled_tokens == self.input_batch.num_reqs * max_query_len` is not met, so the current instance is not enqueued into the full graph. This leads to full-graph and eagle-mode instances coexisting among the decode instances; due to the synchronization wait of MoeDispatch, the full-graph decode instances are significantly slowed down by the eagle-mode instance.

**Solution:** In the PD-disaggregation + MTP + full graph + asynchronous scheduling scenario, the spec tokens of a request whose KV cache comes from P are padded on the decode nodes, and the padded spec tokens are then rejected by sampling. This ensures the uniform_decode condition is satisfied when determining whether decode nodes are included in the full graph, guaranteeing that all decode instances are present in the full graph and avoiding synchronous waiting on MoeDispatch.

- vLLM version: v0.15.0
- vLLM main: vllm-project/vllm@5326c89

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>

[Doc][Release] Add release note skill (vllm-project#6824)

This PR adds the release-note skills:
- `SKILL.md`: vLLM Ascend Releasing Note Writer
- `references/ref-past-release-notes-highlight.md`

It also adds an `output/v0.13.0` example, which was used by vllm-project@2da476d. Inspired by https://github.com/simon-mo/release-notes-writing/

- vLLM version: v0.15.0
- vLLM main: vllm-project/vllm@83b47f6

Co-authored-by: esmeetu <jasonailu87@gmail.com>

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

[Feat] Support sequence parallelism bypass for VL models (vllm-project#5632)

# This is the commit message #2:

[CI] Fix doc test fail when load model with error information: 'Stale file handle' (vllm-project#6832)

### What this PR does / why we need it?
This PR fixes a `Stale file handle` error that occurs during doctests in the CI environment. The error appears when loading models from ModelScope, likely due to issues with network file systems used in CI. The fix sets the `MODELSCOPE_HUB_FILE_LOCK` environment variable to `false` in the `run_doctests.sh` script, disabling file locking in the ModelScope hub, which is a common workaround for this type of file system error.

### Does this PR introduce _any_ user-facing change?
No, this change only affects the CI test execution environment and has no impact on users.

### How was this patch tested?
This change is validated by the CI pipeline. A successful run of the doctests indicates that the fix is effective.

Signed-off-by: leo-pony <nengjunma@outlook.com>

# This is the commit message #3:

Update rotary_embedding.py

Signed-off-by: ZhuJiyang1 <3048369099@qq.com>
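The detection flow described in the vllm-project#6645 message above can be sketched roughly as follows. The function name `detect_quantization_method` comes from the commit message's file table, but the body is an illustrative reconstruction under the stated priority order, not the actual vllm-ascend implementation (the commit does not say exactly where `"quant_method"` sits inside `config.json`, so this sketch checks both the top level and the usual nested `quantization_config` location):

```python
import json
import os


def detect_quantization_method(model_path: str):
    """Guess the quantization format from lightweight JSON files only.

    Priority order, per the commit message:
      1. quant_model_description.json present   -> "ascend" (ModelSlim)
      2. config.json says "compressed-tensors"  -> "compressed-tensors"
      3. neither                                -> None (default float behavior)
    """
    # ModelSlim exports ship a quant_model_description.json next to the weights.
    if os.path.isfile(os.path.join(model_path, "quant_model_description.json")):
        return "ascend"

    config_path = os.path.join(model_path, "config.json")
    if os.path.isfile(config_path):
        with open(config_path) as f:
            config = json.load(f)
        # Assumption: the key may appear nested (HF convention) or top-level.
        nested = config.get("quantization_config") or {}
        method = nested.get("quant_method") or config.get("quant_method")
        if method == "compressed-tensors":
            return "compressed-tensors"

    return None  # fall back to unquantized (float) weights
```

Because only two small JSON files are ever opened, this check is cheap enough to run unconditionally at engine initialization, which is why an explicit `--quantization` flag can simply short-circuit it.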
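The `uniform_decode` condition quoted in the vllm-project#5472 message above is easiest to see with numbers. A minimal sketch follows: the predicate mirrors the quoted expression, while the scenario values (4 requests, MTP with 1 spec token) are hypothetical illustrations, not taken from vllm-ascend:

```python
def is_uniform_decode(total_num_scheduled_tokens: int,
                      num_reqs: int,
                      max_query_len: int) -> bool:
    # Full-graph capture on a decode node requires every request to
    # contribute exactly max_query_len tokens (1 real token + spec tokens).
    return total_num_scheduled_tokens == num_reqs * max_query_len


# Hypothetical scenario: 4 requests, MTP with 1 spec token => max_query_len = 2.
# One request's KV cache was just pulled from a prefill node and is missing
# its spec token, so only 7 of the expected 8 tokens are scheduled:
print(is_uniform_decode(7, num_reqs=4, max_query_len=2))  # False: falls out of the full graph

# After padding the missing spec token (later rejected by sampling),
# the condition holds and the instance stays in the full graph:
print(is_uniform_decode(8, num_reqs=4, max_query_len=2))  # True
```

This is why padding (rather than special-casing the check) is the fix: it keeps every decode instance on the same full-graph path, so none of them stalls the others at the MoeDispatch synchronization point.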
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request on Feb 28, 2026
…) (vllm-project#5977)
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request on Mar 2, 2026
…) (vllm-project#5977)
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request on Mar 4, 2026
…) (vllm-project#5977)
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
liuchenbing2026 pushed a commit to liuchenbing2026/vllm-ascend that referenced this pull request on Mar 5, 2026
Revert "GLM-5 gemm weight nd to nz"
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request on Mar 7, 2026
…) (vllm-project#5977)
No description provided.