From 49faf9e5504db1b4e21cacfed049f31aa4f52c33 Mon Sep 17 00:00:00 2001 From: zzzzwwjj <1183291235@qq.com> Date: Fri, 13 Feb 2026 18:07:25 +0800 Subject: [PATCH] [0.13.0] modify release note & supported matrix Signed-off-by: zzzzwwjj <1183291235@qq.com> --- docs/source/user_guide/release_notes.md | 32 ++++----- .../support_matrix/supported_features.md | 42 +++++------ .../support_matrix/supported_models.md | 72 +++++++++---------- 3 files changed, 73 insertions(+), 73 deletions(-) diff --git a/docs/source/user_guide/release_notes.md b/docs/source/user_guide/release_notes.md index f077230f607..80167ff3f88 100644 --- a/docs/source/user_guide/release_notes.md +++ b/docs/source/user_guide/release_notes.md @@ -7,31 +7,31 @@ This is the final release of v0.13.0 for vLLM Ascend. Please follow the [officia ### Highlights **Model Support** -- **DeepSeek-R1 & DeepSeek-V3.2**: Performance optimizations, and async scheduling enhancements. [#3631](https://github.com/vllm-project/vllm-ascend/pull/3631) [#3900](https://github.com/vllm-project/vllm-ascend/pull/3900) [#3908](https://github.com/vllm-project/vllm-ascend/pull/3908) [#4191](https://github.com/vllm-project/vllm-ascend/pull/4191) [#4805](https://github.com/vllm-project/vllm-ascend/pull/4805) -- **Qwen3-Next**: Full support for Qwen3-Next series including 80B-A3B-Instruct with full graph mode, MTP, quantization (W8A8), NZ optimization, and chunked prefill. Fixed multiple accuracy and stability issues. [#3450](https://github.com/vllm-project/vllm-ascend/pull/3450) [#3572](https://github.com/vllm-project/vllm-ascend/pull/3572) [#3428](https://github.com/vllm-project/vllm-ascend/pull/3428) [#3918](https://github.com/vllm-project/vllm-ascend/pull/3918) [#4058](https://github.com/vllm-project/vllm-ascend/pull/4058) [#4245](https://github.com/vllm-project/vllm-ascend/pull/4245) [#4070](https://github.com/vllm-project/vllm-ascend/pull/4070) [#4477](https://github.com/vllm-project/vllm-ascend/pull/4477) [#4770](https://github.com/vllm-project/vllm-ascend/pull/4770) +- **DeepSeek-R1 & DeepSeek-V3.2**: [Experimental]Performance optimizations, and async scheduling enhancements. [#3631](https://github.com/vllm-project/vllm-ascend/pull/3631) [#3900](https://github.com/vllm-project/vllm-ascend/pull/3900) [#3908](https://github.com/vllm-project/vllm-ascend/pull/3908) [#4191](https://github.com/vllm-project/vllm-ascend/pull/4191) [#4805](https://github.com/vllm-project/vllm-ascend/pull/4805) +- **Qwen3-Next**: [Experimental]Full support for Qwen3-Next series including 80B-A3B-Instruct with full graph mode, MTP, quantization (W8A8), NZ optimization, and chunked prefill. Fixed multiple accuracy and stability issues. [#3450](https://github.com/vllm-project/vllm-ascend/pull/3450) [#3572](https://github.com/vllm-project/vllm-ascend/pull/3572) [#3428](https://github.com/vllm-project/vllm-ascend/pull/3428) [#3918](https://github.com/vllm-project/vllm-ascend/pull/3918) [#4058](https://github.com/vllm-project/vllm-ascend/pull/4058) [#4245](https://github.com/vllm-project/vllm-ascend/pull/4245) [#4070](https://github.com/vllm-project/vllm-ascend/pull/4070) [#4477](https://github.com/vllm-project/vllm-ascend/pull/4477) [#4770](https://github.com/vllm-project/vllm-ascend/pull/4770) - **InternVL**: Added support for InternVL models with comprehensive e2e tests and accuracy evaluation. [#3796](https://github.com/vllm-project/vllm-ascend/pull/3796) [#3964](https://github.com/vllm-project/vllm-ascend/pull/3964) -- **LongCat-Flash**: Added support for LongCat-Flash model. [#3833](https://github.com/vllm-project/vllm-ascend/pull/3833) -- **minimax_m2**: Added support for minimax_m2 model. [#5624](https://github.com/vllm-project/vllm-ascend/pull/5624) -- **Whisper & Cross-Attention**: Added support for cross-attention and Whisper models. [#5592](https://github.com/vllm-project/vllm-ascend/pull/5592) -- **Pooling Models**: Added support for pooling models with PCP adaptation and fixed multiple pooling-related bugs. [#3122](https://github.com/vllm-project/vllm-ascend/pull/3122) [#4143](https://github.com/vllm-project/vllm-ascend/pull/4143) [#6056](https://github.com/vllm-project/vllm-ascend/pull/6056) [#6057](https://github.com/vllm-project/vllm-ascend/pull/6057) [#6146](https://github.com/vllm-project/vllm-ascend/pull/6146) -- **PanguUltraMoE**: Added support for PanguUltraMoE model. [#4615](https://github.com/vllm-project/vllm-ascend/pull/4615) +- **LongCat-Flash**: [Experimental]Added support for LongCat-Flash model. [#3833](https://github.com/vllm-project/vllm-ascend/pull/3833) +- **minimax_m2**: [Experimental]Added support for minimax_m2 model. [#5624](https://github.com/vllm-project/vllm-ascend/pull/5624) +- **Whisper & Cross-Attention**: [Experimental]Added support for cross-attention and Whisper models. [#5592](https://github.com/vllm-project/vllm-ascend/pull/5592) +- **Pooling Models**: [Experimental]Added support for pooling models with PCP adaptation and fixed multiple pooling-related bugs. [#3122](https://github.com/vllm-project/vllm-ascend/pull/3122) [#4143](https://github.com/vllm-project/vllm-ascend/pull/4143) [#6056](https://github.com/vllm-project/vllm-ascend/pull/6056) [#6057](https://github.com/vllm-project/vllm-ascend/pull/6057) [#6146](https://github.com/vllm-project/vllm-ascend/pull/6146) +- **PanguUltraMoE**: [Experimental]Added support for PanguUltraMoE model. [#4615](https://github.com/vllm-project/vllm-ascend/pull/4615) **Core Features** - **Context Parallel (PCP/DCP)**: [Experimental] Added comprehensive support for Prefill Context Parallel (PCP) and Decode Context Parallel (DCP) with ACLGraph, MTP, chunked prefill, MLAPO, and Mooncake connector integration. This is an experimental feature - feedback welcome. [#3260](https://github.com/vllm-project/vllm-ascend/pull/3260) [#3731](https://github.com/vllm-project/vllm-ascend/pull/3731) [#3801](https://github.com/vllm-project/vllm-ascend/pull/3801) [#3980](https://github.com/vllm-project/vllm-ascend/pull/3980) [#4066](https://github.com/vllm-project/vllm-ascend/pull/4066) [#4098](https://github.com/vllm-project/vllm-ascend/pull/4098) [#4183](https://github.com/vllm-project/vllm-ascend/pull/4183) [#5672](https://github.com/vllm-project/vllm-ascend/pull/5672) -- **Full Graph Mode (ACLGraph)**: Enhanced full graph mode with GQA support, memory optimizations, unified logic between ACLGraph and Torchair, and improved stability. [#3560](https://github.com/vllm-project/vllm-ascend/pull/3560) [#3970](https://github.com/vllm-project/vllm-ascend/pull/3970) [#3812](https://github.com/vllm-project/vllm-ascend/pull/3812) [#3879](https://github.com/vllm-project/vllm-ascend/pull/3879) [#3888](https://github.com/vllm-project/vllm-ascend/pull/3888) [#3894](https://github.com/vllm-project/vllm-ascend/pull/3894) [#5118](https://github.com/vllm-project/vllm-ascend/pull/5118) +- **Full Graph Mode (ACLGraph)**: [Experimental]Enhanced full graph mode with GQA support, memory optimizations, unified logic between ACLGraph and Torchair, and improved stability. [#3560](https://github.com/vllm-project/vllm-ascend/pull/3560) [#3970](https://github.com/vllm-project/vllm-ascend/pull/3970) [#3812](https://github.com/vllm-project/vllm-ascend/pull/3812) [#3879](https://github.com/vllm-project/vllm-ascend/pull/3879) [#3888](https://github.com/vllm-project/vllm-ascend/pull/3888) [#3894](https://github.com/vllm-project/vllm-ascend/pull/3894) [#5118](https://github.com/vllm-project/vllm-ascend/pull/5118) - **Multi-Token Prediction (MTP)**: Significantly improved MTP support with chunked prefill for DeepSeek, quantization support, full graph mode, PCP/DCP integration, and async scheduling. MTP now works in most cases and is recommended for use. [#2711](https://github.com/vllm-project/vllm-ascend/pull/2711) [#2713](https://github.com/vllm-project/vllm-ascend/pull/2713) [#3620](https://github.com/vllm-project/vllm-ascend/pull/3620) [#3845](https://github.com/vllm-project/vllm-ascend/pull/3845) [#3910](https://github.com/vllm-project/vllm-ascend/pull/3910) [#3915](https://github.com/vllm-project/vllm-ascend/pull/3915) [#4102](https://github.com/vllm-project/vllm-ascend/pull/4102) [#4111](https://github.com/vllm-project/vllm-ascend/pull/4111) [#4770](https://github.com/vllm-project/vllm-ascend/pull/4770) [#5477](https://github.com/vllm-project/vllm-ascend/pull/5477) - **Eagle Speculative Decoding**: Eagle spec decode now works with full graph mode and is more stable. [#5118](https://github.com/vllm-project/vllm-ascend/pull/5118) [#4893](https://github.com/vllm-project/vllm-ascend/pull/4893) [#5804](https://github.com/vllm-project/vllm-ascend/pull/5804) - **PD Disaggregation**: Set ADXL engine as default backend for disaggregated prefill with improved performance and stability. Added support for KV NZ feature for DeepSeek decode node. [#3761](https://github.com/vllm-project/vllm-ascend/pull/3761) [#3950](https://github.com/vllm-project/vllm-ascend/pull/3950) [#5008](https://github.com/vllm-project/vllm-ascend/pull/5008) [#3072](https://github.com/vllm-project/vllm-ascend/pull/3072) - **KV Pool & Mooncake**: Enhanced KV pool with Mooncake connector support for PCP/DCP, multiple input suffixes, and improved performance of Layerwise Connector. [#3690](https://github.com/vllm-project/vllm-ascend/pull/3690) [#3752](https://github.com/vllm-project/vllm-ascend/pull/3752) [#3849](https://github.com/vllm-project/vllm-ascend/pull/3849) [#4183](https://github.com/vllm-project/vllm-ascend/pull/4183) [#5303](https://github.com/vllm-project/vllm-ascend/pull/5303) -- **EPLB (Elastic Prefill Load Balancing)**: EPLB is now more stable with many bug fixes. Mix placement now works. [#6086](https://github.com/vllm-project/vllm-ascend/pull/6086) +- **EPLB (Elastic Prefill Load Balancing)**: [Experimental]EPLB is now more stable with many bug fixes. Mix placement now works. [#6086](https://github.com/vllm-project/vllm-ascend/pull/6086) - **Full Decode Only Mode**: Added support for Qwen3-Next and DeepSeekv32 in full_decode_only mode with bug fixes. [#3949](https://github.com/vllm-project/vllm-ascend/pull/3949) [#3986](https://github.com/vllm-project/vllm-ascend/pull/3986) [#3763](https://github.com/vllm-project/vllm-ascend/pull/3763) -- **Model Runner V2**: Added basic support for Model Runner V2, the next generation of vLLM. It will be used by default in future releases. [#5210](https://github.com/vllm-project/vllm-ascend/pull/5210) +- **Model Runner V2**: [Experimental]Added basic support for Model Runner V2, the next generation of vLLM. It will be used by default in future releases. [#5210](https://github.com/vllm-project/vllm-ascend/pull/5210) ### Features -- **W8A16 Quantization**: Added new W8A16 quantization method support. [#4541](https://github.com/vllm-project/vllm-ascend/pull/4541) -- **UCM Connector**: Added UCMConnector for KV Cache Offloading. [#4411](https://github.com/vllm-project/vllm-ascend/pull/4411) -- **Batch Invariant**: Implemented basic framework for batch invariant feature. [#5517](https://github.com/vllm-project/vllm-ascend/pull/5517) +- **W8A16 Quantization**: [Experimental]Added new W8A16 quantization method support. [#4541](https://github.com/vllm-project/vllm-ascend/pull/4541) +- **UCM Connector**: [Experimental]Added UCMConnector for KV Cache Offloading. [#4411](https://github.com/vllm-project/vllm-ascend/pull/4411) +- **Batch Invariant**: [Experimental]Implemented basic framework for batch invariant feature. [#5517](https://github.com/vllm-project/vllm-ascend/pull/5517) - **Sampling**: Enhanced sampling with async_scheduler and disable_padded_drafter_batch support in Eagle. [#4893](https://github.com/vllm-project/vllm-ascend/pull/4893) ### Hardware and Operator Support @@ -51,13 +51,13 @@ This is the final release of v0.13.0 for vLLM Ascend. Please follow the [officia Many custom ops and triton kernels were added in this release to speed up model performance: -- **DeepSeek Performance**: Improved performance for DeepSeek V3.2 by eliminating HD synchronization in async scheduling and optimizing memory usage for MTP. [#4805](https://github.com/vllm-project/vllm-ascend/pull/4805) [#2713](https://github.com/vllm-project/vllm-ascend/pull/2713) -- **Qwen3-Next Performance**: Improved performance with Triton ops and optimizations. [#5664](https://github.com/vllm-project/vllm-ascend/pull/5664) [#5984](https://github.com/vllm-project/vllm-ascend/pull/5984) [#5765](https://github.com/vllm-project/vllm-ascend/pull/5765) +- **DeepSeek Performance**: [Experimental]Improved performance for DeepSeek V3.2 by eliminating HD synchronization in async scheduling and optimizing memory usage for MTP. [#4805](https://github.com/vllm-project/vllm-ascend/pull/4805) [#2713](https://github.com/vllm-project/vllm-ascend/pull/2713) +- **Qwen3-Next Performance**: [Experimental]Improved performance with Triton ops and optimizations. [#5664](https://github.com/vllm-project/vllm-ascend/pull/5664) [#5984](https://github.com/vllm-project/vllm-ascend/pull/5984) [#5765](https://github.com/vllm-project/vllm-ascend/pull/5765) - **FlashComm**: Enhanced FlashComm v2 optimization with o_shared linear and communication domain fixes. [#3232](https://github.com/vllm-project/vllm-ascend/pull/3232) [#4188](https://github.com/vllm-project/vllm-ascend/pull/4188) [#4458](https://github.com/vllm-project/vllm-ascend/pull/4458) [#5848](https://github.com/vllm-project/vllm-ascend/pull/5848) - **MoE Optimization**: Optimized all2allv for MoE models and enhanced all-reduce skipping logic. [#3738](https://github.com/vllm-project/vllm-ascend/pull/3738) [#5329](https://github.com/vllm-project/vllm-ascend/pull/5329) - **Attention Optimization**: Moved attention update stream out of loop, converted BSND to TND format for long sequence optimization, and removed transpose step after attention switching to transpose_batchmatmul. [#3848](https://github.com/vllm-project/vllm-ascend/pull/3848) [#3778](https://github.com/vllm-project/vllm-ascend/pull/3778) [#5390](https://github.com/vllm-project/vllm-ascend/pull/5390) - **Quantization Performance**: Moved quantization before allgather in Allgather EP. [#3420](https://github.com/vllm-project/vllm-ascend/pull/3420) -- **Layerwise Connector**: Improved performance of Layerwise Connector. [#5303](https://github.com/vllm-project/vllm-ascend/pull/5303) +- **Layerwise Connector**: [Experimental]Improved performance of Layerwise Connector. [#5303](https://github.com/vllm-project/vllm-ascend/pull/5303) - **Prefix Cache**: Improved performance of prefix cache features. [#4022](https://github.com/vllm-project/vllm-ascend/pull/4022) - **Async Scheduling**: Fixed async copy and eliminated hangs in async scheduling. [#4113](https://github.com/vllm-project/vllm-ascend/pull/4113) [#4233](https://github.com/vllm-project/vllm-ascend/pull/4233) - **Memory Operations**: Removed redundant D2H operations and deleted redundant operations in model_runner. [#4063](https://github.com/vllm-project/vllm-ascend/pull/4063) [#3677](https://github.com/vllm-project/vllm-ascend/pull/3677) diff --git a/docs/source/user_guide/support_matrix/supported_features.md b/docs/source/user_guide/support_matrix/supported_features.md index a7245e4d336..ecf44c84a48 100644 --- a/docs/source/user_guide/support_matrix/supported_features.md +++ b/docs/source/user_guide/support_matrix/supported_features.md @@ -8,27 +8,27 @@ You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is th | Feature | Status | Next Step | |-------------------------------|----------------|------------------------------------------------------------------------| -| Chunked Prefill | 🟢 Functional | Functional, see detailed note: [Chunked Prefill][cp] | -| Automatic Prefix Caching | 🟢 Functional | Functional, see detailed note: [vllm-ascend#732][apc] | -| LoRA | 🟢 Functional | Functional, see detailed note: [LoRA][LoRA] | -| Speculative decoding | 🟢 Functional | Basic support | -| Pooling | 🟢 Functional | CI needed to adapt to more models; V1 support relies on vLLM support. | -| Enc-dec | 🟡 Planned | vLLM should support this feature first. | -| Multi Modality | 🟢 Functional | [Multi Modality][multimodal], optimizing and adapting more models | -| LogProbs | 🟢 Functional | CI needed | -| Prompt logProbs | 🟢 Functional | CI needed | -| Async output | 🟢 Functional | CI needed | -| Beam search | 🟢 Functional | CI needed | -| Guided Decoding | 🟢 Functional | [vllm-ascend#177][guided_decoding] | -| Tensor Parallel | 🟢 Functional | Make TP >4 work with graph mode. | -| Pipeline Parallel | 🟢 Functional | Write official guide and tutorial. | -| Expert Parallel | 🟢 Functional | Support dynamic EPLB. | -| Data Parallel | 🟢 Functional | Data Parallel support for Qwen3 MoE. | -| Prefill Decode Disaggregation | 🟢 Functional | Functional, xPyD is supported. | -| Quantization | 🟢 Functional | W8A8 available; working on more quantization method support (W4A8, etc) | -| Graph Mode | 🟢 Functional | Functional, see detailed note: [Graph Mode][graph_mode] | -| Sleep Mode | 🟢 Functional | Functional, see detailed note: [Sleep Mode][sleep_mode] | -| Context Parallel | 🟢 Functional | Functional, see detailed note: [Context Parallel][context_parallel] | +| Chunked Prefill | 🟢 Functional | Functional, see detailed note: [Chunked Prefill][cp] | +| Automatic Prefix Caching | 🟢 Functional | Functional, see detailed note: [vllm-ascend#732][apc] | +| LoRA | 🔵 Experimental | Functional, see detailed note: [LoRA][LoRA] | +| Speculative decoding | 🟢 Functional | Basic support | +| Pooling | 🔵 Experimental | CI needed to adapt to more models; V1 support relies on vLLM support. | +| Enc-dec | 🟡 Planned | vLLM should support this feature first. | +| Multi Modality | 🟢 Functional | [Multi Modality][multimodal], optimizing and adapting more models | +| LogProbs | 🟢 Functional | CI needed | +| Prompt logProbs | 🟢 Functional | CI needed | +| Async output | 🟢 Functional | CI needed | +| Beam search | 🔵 Experimental | CI needed | +| Guided Decoding | 🟢 Functional | [vllm-ascend#177][guided_decoding] | +| Tensor Parallel | 🟢 Functional | Make TP >4 work with graph mode. | +| Pipeline Parallel | 🟢 Functional | Write official guide and tutorial. | +| Expert Parallel | 🟢 Functional | Support dynamic EPLB. | +| Data Parallel | 🟢 Functional | Data Parallel support for Qwen3 MoE. | +| Prefill Decode Disaggregation | 🟢 Functional | Functional, xPyD is supported. | +| Quantization | 🟢 Functional | W8A8 available; working on more quantization method support (W4A8, etc) | +| Graph Mode | 🟢 Functional | Functional, see detailed note: [Graph Mode][graph_mode] | +| Sleep Mode | 🟢 Functional | Functional, see detailed note: [Sleep Mode][sleep_mode] | +| Context Parallel | 🟢 Functional | Functional, see detailed note: [Context Parallel][context_parallel] | - 🟢 Functional: Fully operational, with ongoing optimizations. - 🔵 Experimental: Experimental support, interfaces and functions may change. diff --git a/docs/source/user_guide/support_matrix/supported_models.md b/docs/source/user_guide/support_matrix/supported_models.md index 050f1fa463b..446976c5cca 100644 --- a/docs/source/user_guide/support_matrix/supported_models.md +++ b/docs/source/user_guide/support_matrix/supported_models.md @@ -9,33 +9,33 @@ Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/160 | Model | Support | Note | BF16 | Supported Hardware | W8A8 | Chunked Prefill | Automatic Prefix Cache | LoRA | Speculative Decoding | Async Scheduling | Tensor Parallel | Pipeline Parallel | Expert Parallel | Data Parallel | Prefill-decode Disaggregation | Piecewise AclGraph | Fullgraph AclGraph | max-model-len | MLP Weight Prefetch | Doc | |-------------------------------|-----------|----------------------------------------------------------------------|------|--------------------|------|-----------------|------------------------|------|----------------------|------------------|-----------------|-------------------|-----------------|---------------|-------------------------------|--------------------|--------------------|---------------|---------------------|-----| | DeepSeek V3/3.1 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ || ✅ || ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 240k || [DeepSeek-V3.1](../../tutorials/DeepSeek-V3.1.md) | -| DeepSeek V3.2 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 160k | ✅ | [DeepSeek-V3.2](../../tutorials/DeepSeek-V3.2.md) | +| DeepSeek V3.2 | 🔵 | Experimental | ✅ | A2/A3 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 160k | ✅ | [DeepSeek-V3.2](../../tutorials/DeepSeek-V3.2.md) | | DeepSeek R1 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ || ✅ || ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 128k || [DeepSeek-R1](../../tutorials/DeepSeek-R1.md) | | DeepSeek Distill (Qwen/Llama) | ✅ | || A2/A3 ||||||||||||||||| | Qwen3 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ ||| ✅ | ✅ ||| ✅ || ✅ | ✅ | 128k | ✅ | [Qwen3-Dense](../../tutorials/Qwen3-Dense.md) | -| Qwen3-based | ✅ | || A2/A3 ||||||||||||||||| +| Qwen3-based | 🔵 | Experimental || A2/A3 ||||||||||||||||| | Qwen3-Coder | ✅ | | ✅ | A2/A3 ||✅|✅|✅|||✅|✅|✅|✅||||||[Qwen3-Coder-30B-A3B tutorial](../../tutorials/Qwen3-Coder-30B-A3B.md)| | Qwen3-Moe | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ ||| ✅ | ✅ || ✅ | ✅ | ✅ | ✅ | ✅ | 256k || [Qwen3-235B-A22B](../../tutorials/Qwen3-235B-A22B.md) | -| Qwen3-Next | ✅ | | ✅ | A2/A3 | ✅ |||||| ✅ ||| ✅ || ✅ | ✅ ||| [Qwen3-Next](../../tutorials/Qwen3-Next.md) | +| Qwen3-Next | 🔵 | Experimental | ✅ | A2/A3 | ✅ |||||| ✅ ||| ✅ || ✅ | ✅ ||| [Qwen3-Next](../../tutorials/Qwen3-Next.md) | | Qwen2.5 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ |||| ✅ ||| ✅ |||||| [Qwen2.5-7B](../../tutorials/Qwen2.5-7B.md) | | Qwen2 | ✅ | || A2/A3 ||||||||||||||||| | Qwen2-based | ✅ | || A2/A3 ||||||||||||||||| | QwQ-32B | ✅ | || A2/A3 ||||||||||||||||| | Llama2/3/3.1/3.2 | ✅ | || A2/A3 ||||||||||||||||| -| Internlm | ✅ | [#1962](https://github.com/vllm-project/vllm-ascend/issues/1962) || A2/A3 ||||||||||||||||| -| Baichuan | ✅ | || A2/A3 ||||||||||||||||| -| Baichuan2 | ✅ | || A2/A3 ||||||||||||||||| -| Phi-4-mini | ✅ | || A2/A3 ||||||||||||||||| -| MiniCPM | ✅ | || A2/A3 ||||||||||||||||| -| MiniCPM3 | ✅ | || A2/A3 ||||||||||||||||| -| Ernie4.5 | ✅ | || A2/A3 ||||||||||||||||| -| Ernie4.5-Moe | ✅ | || A2/A3 ||||||||||||||||| -| Gemma-2 | ✅ | || A2/A3 ||||||||||||||||| -| Gemma-3 | ✅ | || A2/A3 ||||||||||||||||| -| Phi-3/4 | ✅ | || A2/A3 ||||||||||||||||| -| Mistral/Mistral-Instruct | ✅ | || A2/A3 ||||||||||||||||| -| GLM-4.x | ✅ | || A2/A3 |✅|✅|✅||✅|✅|✅|||✅||✅|✅|128k||../../tutorials/GLM4.x.md| -| Kimi-K2-Thinking | ✅ | || A2/A3 |||||||||||||||| [Kimi-K2-Thinking](../../tutorials/Kimi-K2-Thinking.md) | +| Internlm | 🔵 | [#1962](https://github.com/vllm-project/vllm-ascend/issues/1962) || A2/A3 ||||||||||||||||| +| Baichuan | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Baichuan2 | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Phi-4-mini | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| MiniCPM | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| MiniCPM3 | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Ernie4.5 | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Ernie4.5-Moe | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Gemma-2 | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Gemma-3 | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Phi-3/4 | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Mistral/Mistral-Instruct | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| GLM-4.x | 🔵 | Experimental || A2/A3 |✅|✅|✅||✅|✅|✅|||✅||✅|✅|128k||../../tutorials/GLM4.x.md| +| Kimi-K2-Thinking | 🔵 | Experimental || A2/A3 |||||||||||||||| [Kimi-K2-Thinking](../../tutorials/Kimi-K2-Thinking.md) | | GLM-4 | ❌ | [#2255](https://github.com/vllm-project/vllm-ascend/issues/2255) ||||||||||||||||||| | GLM-4-0414 | ❌ | [#2258](https://github.com/vllm-project/vllm-ascend/issues/2258) ||||||||||||||||||| | ChatGLM | ❌ | [#554](https://github.com/vllm-project/vllm-ascend/issues/554) ||||||||||||||||||| @@ -47,11 +47,11 @@ Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/160 | Model | Support | Note | BF16 | Supported Hardware | W8A8 | Chunked Prefill | Automatic Prefix Cache | LoRA | Speculative Decoding | Async Scheduling | Tensor Parallel | Pipeline Parallel | Expert Parallel | Data Parallel | Prefill-decode Disaggregation | Piecewise AclGraph | Fullgraph AclGraph | max-model-len | MLP Weight Prefetch | Doc | |-------------------------------|-----------|----------------------------------------------------------------------|------|--------------------|------|-----------------|------------------------|------|----------------------|------------------|-----------------|-------------------|-----------------|---------------|-------------------------------|--------------------|--------------------|---------------|---------------------|-----| -| Qwen3-Embedding | ✅ | || A2/A3 |||||||||||||||| [Qwen3_embedding](../../tutorials/Qwen3_embedding.md)| -| Qwen3-Reranker | ✅ | || A2/A3 |||||||||||||||| [Qwen3_reranker](../../tutorials/Qwen3_reranker.md)| -| Molmo | ✅ | [1942](https://github.com/vllm-project/vllm-ascend/issues/1942) || A2/A3 ||||||||||||||||| -| XLM-RoBERTa-based | ✅ | || A2/A3 ||||||||||||||||| -| Bert | ✅ | || A2/A3 ||||||||||||||||| +| Qwen3-Embedding | 🔵 | Experimental || A2/A3 |||||||||||||||| [Qwen3_embedding](../../tutorials/Qwen3_embedding.md)| +| Qwen3-Reranker | 🔵 | Experimental || A2/A3 |||||||||||||||| [Qwen3_reranker](../../tutorials/Qwen3_reranker.md)| +| Molmo | 🔵 | [1942](https://github.com/vllm-project/vllm-ascend/issues/1942) || A2/A3 ||||||||||||||||| +| XLM-RoBERTa-based | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Bert | 🔵 | Experimental || A2/A3 ||||||||||||||||| ## Multimodal Language Models @@ -63,20 +63,20 @@ Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/160 | Qwen2.5-VL | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ ||| ✅ | ✅ |||| ✅ | ✅ | ✅ | 30k || [Qwen-VL-Dense](../../tutorials/Qwen-VL-Dense.md) | | Qwen3-VL | ✅ | ||A2/A3|||||||✅|||||✅|✅||| [Qwen-VL-Dense](../../tutorials/Qwen-VL-Dense.md) | | Qwen3-VL-MOE | ✅ | | ✅ | A2/A3||✅|✅|||✅|✅|✅|✅|✅|✅|✅|✅|256k||[Qwen3-VL-MOE](../../tutorials/Qwen3-VL-235B-A22B-Instruct.md)| -| Qwen3-Omni-30B-A3B-Thinking | ✅ | ||A2/A3|||||||✅||✅|||||||[Qwen3-Omni-30B-A3B-Thinking](../../tutorials/Qwen3-Omni-30B-A3B-Thinking.md)| -| Qwen2.5-Omni | ✅ | || A2/A3 |||||||||||||||| [Qwen2.5-Omni](../../tutorials/Qwen2.5-Omni.md) | -| Qwen3-Omni | ✅ | || A2/A3 ||||||||||||||||| -| QVQ | ✅ | || A2/A3 ||||||||||||||||| -| Qwen2-Audio | ✅ | || A2/A3 ||||||||||||||||| -| Aria | ✅ | || A2/A3 ||||||||||||||||| -| LLaVA-Next | ✅ | || A2/A3 ||||||||||||||||| -| LLaVA-Next-Video | ✅ | || A2/A3 ||||||||||||||||| -| MiniCPM-V | ✅ | || A2/A3 ||||||||||||||||| -| Mistral3 | ✅ | || A2/A3 ||||||||||||||||| -| Phi-3-Vision/Phi-3.5-Vision | ✅ | || A2/A3 ||||||||||||||||| -| Gemma3 | ✅ | || A2/A3 ||||||||||||||||| -| Llama3.2 | ✅ | || A2/A3 ||||||||||||||||| -| PaddleOCR-VL | ✅ | || A2/A3 ||||||||||||||||| +| Qwen3-Omni-30B-A3B-Thinking | 🔵 | Experimental ||A2/A3|||||||✅||✅|||||||[Qwen3-Omni-30B-A3B-Thinking](../../tutorials/Qwen3-Omni-30B-A3B-Thinking.md)| +| Qwen2.5-Omni | 🔵 | Experimental || A2/A3 |||||||||||||||| [Qwen2.5-Omni](../../tutorials/Qwen2.5-Omni.md) | +| Qwen3-Omni | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| QVQ | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Qwen2-Audio | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Aria | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| LLaVA-Next | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| LLaVA-Next-Video | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| MiniCPM-V | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Mistral3 | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Phi-3-Vision/Phi-3.5-Vision | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Gemma3 | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| Llama3.2 | 🔵 | Experimental || A2/A3 ||||||||||||||||| +| PaddleOCR-VL | 🔵 | Experimental || A2/A3 ||||||||||||||||| | Llama4 | ❌ | [1972](https://github.com/vllm-project/vllm-ascend/issues/1972) ||||||||||||||||||| | Keye-VL-8B-Preview | ❌ | [1963](https://github.com/vllm-project/vllm-ascend/issues/1963) ||||||||||||||||||| | Florence-2 | ❌ | [2259](https://github.com/vllm-project/vllm-ascend/issues/2259) |||||||||||||||||||