[Performance] Use Triton mrope for Qwen3-VL #5827
Merged
zzzzwwjj merged 2 commits into vllm-project:releases/v0.13.0 on Jan 17, 2026
Conversation
Contributor
Code Review
This pull request introduces a performance optimization for Qwen3-VL models by enabling a Triton-based multimodal rotary position embedding (mrope) implementation. The change correctly identifies the relevant model configuration and routes it to the more efficient code path. My review includes one suggestion to improve maintainability by replacing a hardcoded value with a named constant.
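As a rough illustration of the dispatch described above (a sketch only — `use_triton_mrope`, `ModelConfig`, and the config fields are hypothetical names, not the actual vllm-ascend API), the routing might look like:

```python
# Hypothetical sketch of routing a model config to the Triton mrope path.
# All names here are illustrative, not the real vllm-ascend code.
from dataclasses import dataclass

# A named constant instead of a hardcoded value, as the review suggests.
QWEN3_VL_MROPE_SECTIONS = 3  # temporal/height/width sections of the rotary dims

@dataclass
class ModelConfig:
    model_type: str
    mrope_section: tuple  # per-section rotary dims, e.g. (16, 24, 24)

def use_triton_mrope(cfg: ModelConfig) -> bool:
    """Route Qwen3-VL-style configs to the Triton mrope kernel."""
    return (
        cfg.model_type.startswith("qwen3_vl")
        and len(cfg.mrope_section) == QWEN3_VL_MROPE_SECTIONS
    )

print(use_triton_mrope(ModelConfig("qwen3_vl_moe", (16, 24, 24))))  # True
print(use_triton_mrope(ModelConfig("llama", ())))                   # False
```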
gcanlin
reviewed
Jan 13, 2026
d578193 to f0e25a2 (compare)
Signed-off-by: panther-zhu <nonamefly@petalmail.com>
f0e25a2 to f46d72d (compare)
Signed-off-by: panther-zhu <nonamefly@petalmail.com>
wangxiyuan
approved these changes
Jan 15, 2026
zzzzwwjj
approved these changes
Jan 17, 2026
yiz-liu
pushed a commit
that referenced
this pull request
Jan 20, 2026
### What this PR does / why we need it?
Fixes PR [5827](#5827) to keep it consistent with the main branch.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: ichaoren <fengjian5500@163.com>
starmountain1997
pushed a commit
to starmountain1997/vllm-ascend
that referenced
this pull request
Jan 31, 2026
### What this PR does / why we need it?
Performance can be optimized by using the Triton mrope kernel for Qwen3-VL.

1. Qwen3-VL-235B-A22B-Instruct-W8A8

|      | accuracy | TTFT    | TPOT   | TPS |
| --   | --       | --      | --     | --  |
| base | 83.76    | 4.8771s | 0.1472 | 49  |
| test | 83.59    | 4.3273  | 0.0615 | 105 |

2. Qwen3-VL-8B-Instruct

|      | accuracy | TTFT   | TPOT   | TPS |
| --   | --       | --     | --     | --  |
| base | 80.65    | 4.1744 | 0.0499 | 125 |
| test | 80.86    | 3.1858 | 0.0245 | 227 |

### Does this PR introduce _any_ user-facing change?
Name: triton-ascend
Version: 3.2.0.dev20260105

### How was this patch tested?

```shell
vllm serve $MODEL_DIR \
  --served-model-name $MODEL_NAME \
  --quantization ascend \
  --tensor-parallel-size 8 \
  --data-parallel-size 1 \
  --enable-expert-parallel \
  --max-model-len 20000 \
  --allowed-local-media-path /vllm-workspace/vllm/tests/multimodal/assets/ \
  --enforce-eager \
  --max-num-batched-tokens 5000 \
  --gpu-memory-utilization 0.8 \
  --trust-remote-code \
  --port 8800 \
  --no-enable-prefix-caching \
  --mm_processor_cache_type="shm" \
  --async-scheduling
```

- Performance test

```shell
evalscope perf --parallel 8 --model qwen3vl --url http://ip:port/v1/chat/completions --api openai --dataset random_vl --min-tokens 300 --max-tokens 300 --prefix-length 0 --min-prompt-length 1024 --max-prompt-length 1024 --image-width 2048 --image-height 2048 --image-format RGB --image-num 1 --number 32 --tokenizer-path /path/Qwen3-VL-235B-A22B-Instruct-W8A8/
```

- Accuracy test

```shell
ais_bench --models vllm_api_stream_chat --datasets textvqa_gen --debug
```

---------
Signed-off-by: panther-zhu <nonamefly@petalmail.com>
Co-authored-by: panther-zhu <nonamefly@petalmail.com>
starmountain1997
pushed a commit
to starmountain1997/vllm-ascend
that referenced
this pull request
Jan 31, 2026
### What this PR does / why we need it?
Fixes PR [5827](vllm-project#5827) to keep it consistent with the main branch.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: ichaoren <fengjian5500@163.com>
tangtiangu
pushed a commit
to tangtiangu/jiusi-vllm-ascend
that referenced
this pull request
Feb 24, 2026
### What this PR does / why we need it?
Fixes PR [5827](vllm-project#5827) to keep it consistent with the main branch.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: ichaoren <fengjian5500@163.com>
What this PR does / why we need it?
Performance can be optimized by using the Triton mrope kernel for Qwen3-VL.
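For context, multimodal RoPE (mrope) assigns each vision token separate temporal/height/width positions, which the rotary kernel applies to distinct head-dim sections. A minimal illustrative sketch of the position-id layout (not the vLLM or Triton kernel itself) is:

```python
# Illustrative sketch of mrope position ids, not the vLLM implementation:
# each vision token in a t x h x w grid gets a (t, h, w) position triple.
from itertools import product

def mrope_position_ids(t: int, h: int, w: int):
    """Return three parallel lists of temporal/height/width positions,
    one entry per vision token in a t x h x w grid."""
    ids = list(product(range(t), range(h), range(w)))
    ts, hs, ws = zip(*ids)
    return list(ts), list(hs), list(ws)

ts, hs, ws = mrope_position_ids(1, 2, 2)
print(hs)  # [0, 0, 1, 1]  (height index of each token)
print(ws)  # [0, 1, 0, 1]  (width index of each token)
```

The Triton kernel fuses the rotation for all three sections into one pass, which is where the speedup reported below comes from.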
Does this PR introduce any user-facing change?
Name: triton-ascend
Version: 3.2.0.dev20260105
How was this patch tested?
```shell
vllm serve $MODEL_DIR \
  --served-model-name $MODEL_NAME \
  --quantization ascend \
  --tensor-parallel-size 8 \
  --data-parallel-size 1 \
  --enable-expert-parallel \
  --max-model-len 20000 \
  --allowed-local-media-path /vllm-workspace/vllm/tests/multimodal/assets/ \
  --enforce-eager \
  --max-num-batched-tokens 5000 \
  --gpu-memory-utilization 0.8 \
  --trust-remote-code \
  --port 8800 \
  --no-enable-prefix-caching \
  --mm_processor_cache_type="shm" \
  --async-scheduling
```
Performance test:

```shell
evalscope perf --parallel 8 --model qwen3vl --url http://ip:port/v1/chat/completions --api openai --dataset random_vl --min-tokens 300 --max-tokens 300 --prefix-length 0 --min-prompt-length 1024 --max-prompt-length 1024 --image-width 2048 --image-height 2048 --image-format RGB --image-num 1 --number 32 --tokenizer-path /path/Qwen3-VL-235B-A22B-Instruct-W8A8/
```
Accuracy test:

```shell
ais_bench --models vllm_api_stream_chat --datasets textvqa_gen --debug
```
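For reference, the TPS numbers reported in the PR's benchmark tables (base vs. test) work out to the following speedups; a quick back-of-the-envelope check:

```python
# Speedups computed from the throughput (TPS) numbers reported in this PR.
results = {
    "Qwen3-VL-235B-A22B-Instruct-W8A8": (49, 105),  # (base TPS, test TPS)
    "Qwen3-VL-8B-Instruct": (125, 227),
}
for model, (base, test) in results.items():
    print(f"{model}: {test / base:.2f}x")
# Qwen3-VL-235B-A22B-Instruct-W8A8: 2.14x
# Qwen3-VL-8B-Instruct: 1.82x
```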