4 changes: 2 additions & 2 deletions README.md
@@ -63,7 +63,7 @@ Please use the following recommended versions to get started quickly:

| Version | Release type | Doc |
|------------|--------------|--------------------------------------|
| v0.14.0rc1 | Latest release candidate | See [QuickStart](https://docs.vllm.ai/projects/ascend/en/latest/quick_start.html) and [Installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) for more details |
| v0.16.0rc1 | Latest release candidate | See [QuickStart](https://docs.vllm.ai/projects/ascend/en/latest/quick_start.html) and [Installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) for more details |
| v0.13.0 | Latest stable version | See [QuickStart](https://docs.vllm.ai/projects/ascend/en/v0.13.0/quick_start.html) and [Installation](https://docs.vllm.ai/projects/ascend/en/v0.13.0/installation.html) for more details |

## Contributing
@@ -86,7 +86,7 @@ Below are the maintained branches:

| Branch | Status | Note |
|------------|--------------|--------------------------------------|
| main | Maintained | CI commitment for vLLM main branch and vLLM v0.13.0 tag |
| main | Maintained | CI commitment for vLLM main branch and vLLM v0.16.0 tag |
| v0.7.1-dev | Unmaintained | Only doc fixes are allowed |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version, only bug fixes are allowed, and no new release tags anymore. |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
4 changes: 2 additions & 2 deletions README.zh.md
@@ -57,7 +57,7 @@ The vLLM Ascend plugin (`vllm-ascend`) is a community-maintained plugin for running vLLM seamlessly on Ascend NP

| Version | Release type | Doc |
|------------|--------------|--------------------------------------|
|v0.14.0rc1| Latest RC version |See [QuickStart](https://docs.vllm.ai/projects/ascend/en/latest/quick_start.html) and [Installation Guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) for more details|
|v0.16.0rc1| Latest RC version |See [QuickStart](https://docs.vllm.ai/projects/ascend/en/latest/quick_start.html) and [Installation Guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) for more details|
|v0.13.0| Latest official/stable version |See [QuickStart](https://docs.vllm.ai/projects/ascend/en/v0.13.0/quick_start.html) and [Installation Guide](https://docs.vllm.ai/projects/ascend/en/v0.13.0/installation.html) for more details|

## Contributing
@@ -80,7 +80,7 @@ vllm-ascend has a main branch and dev branches.

| Branch | Status | Note |
|------------|------------|---------------------|
| main | Maintained | CI commitment for vLLM main branch and the latest vLLM version (v0.13.0) |
| main | Maintained | CI commitment for vLLM main branch and the latest vLLM version (v0.16.0) |
| v0.7.1-dev | Unmaintained | Only doc fixes are allowed |
| v0.7.3-dev | Maintained | CI commitment for vLLM v0.7.3, only bug fixes are allowed, and no new release tags anymore |
| v0.9.1-dev | Maintained | CI commitment for vLLM v0.9.1 |
4 changes: 3 additions & 1 deletion docs/source/community/versioning_policy.md
@@ -23,6 +23,7 @@ The table below is the release compatibility matrix for vLLM Ascend releases.

| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu | Triton Ascend |
|-------------|-------------------|-----------------|-------------|---------------------------------|---------------|
| v0.16.0rc1 | v0.16.0 | >= 3.10, < 3.12 | 8.5.0 | 2.9.0 / 2.9.0 | 3.2.0 |
| v0.15.0rc1 | v0.15.0 | >= 3.10, < 3.12 | 8.5.0 | 2.9.0 / 2.9.0 | 3.2.0 |
| v0.14.0rc1 | v0.14.1 | >= 3.10, < 3.12 | 8.5.0 | 2.9.0 / 2.9.0 | 3.2.0 |
| v0.13.0 | v0.13.0 | >= 3.10, < 3.12 | 8.5.0 | 2.9.0 / 2.8.0.post2 | 3.2.0 |
@@ -65,6 +66,7 @@ For the main branch of vLLM Ascend, we usually make it compatible with the latest vL

| Date | Event |
|------------|-------------------------------------------|
| 2026.03.10 | Release candidates, v0.16.0rc1 |
| 2026.02.27 | Release candidates, v0.15.0rc1 |
| 2026.02.06 | v0.13.0 Final release, v0.13.0 |
| 2026.01.26 | Release candidates, v0.14.0rc1 |
@@ -122,7 +124,7 @@ Usually, each minor version of vLLM (such as 0.7) corresponds to a vLLM Ascend v

| Branch | State | Note |
| ---------- | ------------ | -------------------------------------------------------- |
| main | Maintained | CI commitment for vLLM main branch and vLLM 0.13.0 tag |
| main | Maintained | CI commitment for vLLM main branch and vLLM 0.16.0 tag |
| releases/v0.13.0 | Maintained | CI commitment for vLLM 0.13.0 version |
| v0.11.0-dev| Maintained | CI commitment for vLLM 0.11.0 version |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
8 changes: 4 additions & 4 deletions docs/source/conf.py
@@ -65,15 +65,15 @@
# the branch of vllm, used in vllm clone
# - main branch: 'main'
# - vX.Y.Z branch: 'vX.Y.Z'
"vllm_version": "v0.15.0",
"vllm_version": "v0.16.0",
# the branch of vllm-ascend, used in vllm-ascend clone and image tag
# - main branch: 'main'
# - vX.Y.Z branch: latest vllm-ascend release tag
"vllm_ascend_version": "v0.15.0rc1",
"vllm_ascend_version": "v0.16.0rc1",
# the newest release version of vllm-ascend and matched vLLM, used in pip install.
# This value should be updated when a release is cut.
"pip_vllm_ascend_version": "0.15.0rc1",
"pip_vllm_version": "0.15.0",
"pip_vllm_ascend_version": "0.16.0rc1",
"pip_vllm_version": "0.16.0",
# CANN image tag
"cann_image_tag": "8.5.0-910b-ubuntu22.04-py3.11",
# vllm version in ci
1 change: 1 addition & 0 deletions docs/source/faqs.md
@@ -2,6 +2,7 @@

## Version Specific FAQs

- [[v0.16.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/6969)
- [[v0.15.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/6838)
- [[v0.13.0] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/6583)

91 changes: 91 additions & 0 deletions docs/source/user_guide/release_notes.md
@@ -1,5 +1,96 @@
# Release Notes

## v0.16.0rc1 - 2026.03.09

This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest) to get started.

### Highlights

- Qwen3-Omni quantization adaptation and optimization are now available. [#6828](https://github.com/vllm-project/vllm-ascend/pull/6828)
- GLM5-W8A8 quantization is now supported by parameterizing hardcoded MLA dimensions. [#6902](https://github.com/vllm-project/vllm-ascend/pull/6902)

### Features

- [Experimental] Support FabricMem Mode for ADXL/HIXL interconnect. [#6806](https://github.com/vllm-project/vllm-ascend/pull/6806)
- Qwen3-Next now supports FlashComm1. [#6830](https://github.com/vllm-project/vllm-ascend/pull/6830)
- NPUWorker profiler now supports `profile_prefix` for a better profiling experience. [#6968](https://github.com/vllm-project/vllm-ascend/pull/6968)
- EPLB profiling now displays the expert hotness comparison and the time required for EPLB adjustment. [#6877](https://github.com/vllm-project/vllm-ascend/pull/6877) [#7001](https://github.com/vllm-project/vllm-ascend/pull/7001)
- Xlite Qwen3 MoE now supports Data Parallel. [#6715](https://github.com/vllm-project/vllm-ascend/pull/6715)
- Mooncake Layerwise Connector now supports `kv_pool`. [#7032](https://github.com/vllm-project/vllm-ascend/pull/7032)
- Eagle3 now supports QuaRot quantization without embedding. [#7038](https://github.com/vllm-project/vllm-ascend/pull/7038)

### Hardware and Operator Support

- 310P now supports w8a8sc quantization method. [#7075](https://github.com/vllm-project/vllm-ascend/pull/7075)
> **Collaborator:** I think we'd better not mention 310P progress here.

- Added AscendC `causal_conv1d_fn` operator for Qwen3-Next. [#6661](https://github.com/vllm-project/vllm-ascend/pull/6661)
- Added Ascend Ops recurrent_gated_delta_rule operator. [#6725](https://github.com/vllm-project/vllm-ascend/pull/6725)
- Added GMM custom operator for MoE models. [#7010](https://github.com/vllm-project/vllm-ascend/pull/7010)

### Performance

- Faster convolution computation improves TTFT by 0.95% and throughput by 0.59% for Qwen3-VL models. [#7017](https://github.com/vllm-project/vllm-ascend/pull/7017)
- Optimize split_qkv_rmsnorm_rope operator. [#6827](https://github.com/vllm-project/vllm-ascend/pull/6827)
- Implement global CPU slicing and improve IRQ binding for Ascend NPUs, ensuring non-overlapping CPU partitions and better resource management. [#6945](https://github.com/vllm-project/vllm-ascend/pull/6945)
- Optimize MTP execution by reordering state update operation. [#6844](https://github.com/vllm-project/vllm-ascend/pull/6844)
- Avoid CPU sync in mrope_positions copy by using full tensor copy. [#7014](https://github.com/vllm-project/vllm-ascend/pull/7014)
- Remove H2D synchronization for expert_map in MoE models. [#7000](https://github.com/vllm-project/vllm-ascend/pull/7000)

### Dependencies

- CANN is upgraded to 8.5.1; please remember to upgrade it by hand if you're not using the official image. [#6897](https://github.com/vllm-project/vllm-ascend/pull/6897)

### Deprecation & Breaking Changes

- The `enable_flash_comm_v1` config option has been renamed back to `enable_sp`. [#6883](https://github.com/vllm-project/vllm-ascend/pull/6883)
- Auto-detection of the quantization format from model files has been reverted; in v0.16.0rc1, you still need to add `--quantization ascend` to serve a model quantized by ModelSlim (see the sketch below). It will be added back in the next version, after the bug with remote model IDs is fixed. [#6873](https://github.com/vllm-project/vllm-ascend/pull/6873)
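
A minimal offline-serving sketch of the explicit flag, assuming vLLM's Python `LLM` API; the checkpoint path and prompt are illustrative only, not from this release:

```python
from vllm import LLM, SamplingParams

# Assumption: a local ModelSlim-quantized (w8a8) checkpoint. In v0.16.0rc1 the
# quantization method must still be set explicitly, mirroring `--quantization ascend`.
llm = LLM(model="/path/to/qwen3-w8a8", quantization="ascend")

outputs = llm.generate(["Hello, Ascend!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```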

### Documentation

- Added user/developer guide for CPU binding. [#7045](https://github.com/vllm-project/vllm-ascend/pull/7045)
- Added metrics usage documentation and example. [#6962](https://github.com/vllm-project/vllm-ascend/pull/6962)
- Added llms.txt for LLM discovery. [#6886](https://github.com/vllm-project/vllm-ascend/pull/6886)
- Added GLM4.x multi-node deploy tutorial. [#6872](https://github.com/vllm-project/vllm-ascend/pull/6872)
- Added an explanation of the 310P-specific parameter `max-model-len`. [#7065](https://github.com/vllm-project/vllm-ascend/pull/7065)

### Others

- Fix openEuler Dockerfile error. [#6871](https://github.com/vllm-project/vllm-ascend/pull/6871)
- Many bug fixes, including:
- Fix Eagle speculative decoding with Context Parallel enabled. [#6981](https://github.com/vllm-project/vllm-ascend/pull/6981) [#7079](https://github.com/vllm-project/vllm-ascend/pull/7079)
- Fix LoRA accuracy issue introduced by upstream vLLM changes. [#6958](https://github.com/vllm-project/vllm-ascend/pull/6958)
- Fix streaming content-type in load balance proxy server. [#6985](https://github.com/vllm-project/vllm-ascend/pull/6985)
- Fix metadata execute error: integer modulo by zero. [#6521](https://github.com/vllm-project/vllm-ascend/pull/6521)
- Fix triton rope_siso implementation bug. [#7082](https://github.com/vllm-project/vllm-ascend/pull/7082)
- Fix incorrect layer count for MTP models in update_aclgraph_sizes. [#7064](https://github.com/vllm-project/vllm-ascend/pull/7064)
- Fix compilation errors for CANN versions newer than b020. [#7059](https://github.com/vllm-project/vllm-ascend/pull/7059)
- Fix quant config support in GLM4.6V. [#7062](https://github.com/vllm-project/vllm-ascend/pull/7062)
- Fix parameter ordering bug in _merge_multimodal_embeddings. [#7068](https://github.com/vllm-project/vllm-ascend/pull/7068)
- Fix fused mc2 bug in EPLB. [#6794](https://github.com/vllm-project/vllm-ascend/pull/6794)
- Fix kernel block size for computing slot mapping. [#7019](https://github.com/vllm-project/vllm-ascend/pull/7019)
- Fix layerwise stacking MTP error in P/D disaggregation. [#7036](https://github.com/vllm-project/vllm-ascend/pull/7036)
- Fix RoPE dimension for npu_rotary_embedding. [#6880](https://github.com/vllm-project/vllm-ascend/pull/6880)
- Fix Qwen-Omni quantization bugs. [#7042](https://github.com/vllm-project/vllm-ascend/pull/7042) [#7007](https://github.com/vllm-project/vllm-ascend/pull/7007)
- Fix GDN layer accuracy in graph mode. [#6822](https://github.com/vllm-project/vllm-ascend/pull/6822)
- Fix precision bugs for PCP/DCP in PD disaggregate. [#6876](https://github.com/vllm-project/vllm-ascend/pull/6876)
- Fix MTP in PD disaggregation with fullgraph support for all D-Nodes. [#6948](https://github.com/vllm-project/vllm-ascend/pull/6948)
- Fix GQA model error when enabling both DP and DCP. [#7012](https://github.com/vllm-project/vllm-ascend/pull/7012)
- Fix an edge case where MTP prefill was misclassified as decode. [#6835](https://github.com/vllm-project/vllm-ascend/pull/6835)
- Fix Eagle3 acceptance rate for QuaRot quantized models. [#6914](https://github.com/vllm-project/vllm-ascend/pull/6914)
- Fix RoPE shape mismatch for MTP models with FlashComm V1 enabled. [#6939](https://github.com/vllm-project/vllm-ascend/pull/6939)
- Fix Qwen2.5VL accuracy issue. [#6975](https://github.com/vllm-project/vllm-ascend/pull/6975)
- Fix MoE forward error with static kernel enabled. [#6964](https://github.com/vllm-project/vllm-ascend/pull/6964)
- Fix muls_add fusion for GLM5 models. [#6928](https://github.com/vllm-project/vllm-ascend/pull/6928)
- Fix GDN layer detection for multimodal models. [#6941](https://github.com/vllm-project/vllm-ascend/pull/6941)
- Fix 300I unquantized model weight nd2nz error. [#6851](https://github.com/vllm-project/vllm-ascend/pull/6851)
- Fix CPU binding logic. [#6889](https://github.com/vllm-project/vllm-ascend/pull/6889)
- Fix Eagle fullgraph shape capture. [#6846](https://github.com/vllm-project/vllm-ascend/pull/6846)

### Known Issues

- Currently, for DeepSeek V3.2, PCP and DCP do not yet work with the FlashComm1 feature, which may cause serving errors or other unknown failures.
- In a 4-node A3 PD disaggregation deployment with DeepSeek V3.2, the P-node may hang when benchmarking in high-concurrency scenarios, e.g., 2K/2K tokens with 512 concurrent requests.
- MTP with large EP configurations may cause graph capture buffer overflow. This is a bug that needs to be fixed in vLLM; for now there is a workaround: explicitly set `--compilation-config '{"max_cudagraph_capture_size": N}'` where `N = max_concurrency × (1 + num_speculative_tokens)`. See the sizing sketch below.
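
A minimal sketch of the sizing arithmetic for this workaround, with illustrative values for `max_concurrency` and `num_speculative_tokens` (both are deployment-specific assumptions, not recommendations):

```python
# Hypothetical sizing of max_cudagraph_capture_size for MTP with large EP.
max_concurrency = 256        # assumed peak number of concurrent requests
num_speculative_tokens = 2   # assumed MTP speculative tokens per step

n = max_concurrency * (1 + num_speculative_tokens)
print(n)  # 768

# Launch with the computed value, e.g.:
#   vllm serve <model> --compilation-config '{"max_cudagraph_capture_size": 768}'
```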

## v0.15.0rc1 - 2026.02.27

This is the first release candidate of v0.15.0 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest) to get started.