
[Performance] use triton mrope for Qwen3-VL #5827

Merged

zzzzwwjj merged 2 commits into vllm-project:releases/v0.13.0 from ichaoren:releases/v0.13.0 on Jan 17, 2026

Conversation

Contributor

@ichaoren ichaoren commented Jan 13, 2026

What this PR does / why we need it?

Performance can be improved by using the Triton mrope kernel for Qwen3-VL.

1. Qwen3-VL-235B-A22B-Instruct-W8A8

   |      | accuracy | TTFT (s) | TPOT (s) | TPS |
   | ---- | -------- | -------- | -------- | --- |
   | base | 83.76    | 4.8771   | 0.1472   | 49  |
   | test | 83.59    | 4.3273   | 0.0615   | 105 |

2. Qwen3-VL-8B-Instruct

   |      | accuracy | TTFT (s) | TPOT (s) | TPS |
   | ---- | -------- | -------- | -------- | --- |
   | base | 80.65    | 4.1744   | 0.0499   | 125 |
   | test | 80.86    | 3.1858   | 0.0245   | 227 |
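For background on what the accelerated kernel computes: mrope (multimodal RoPE) splits the rotary frequency bands into sections and drives each section with a different position id (temporal, height, width), reducing to ordinary 1-D RoPE when the three ids coincide, as they do for plain text tokens. A minimal pure-Python sketch; the section widths and function names here are illustrative and do not reflect the actual Qwen3-VL layout or the Triton kernel's code:

```python
import math

def rope_cos_sin(pos, dim, base=10000.0):
    # Standard 1-D RoPE tables: one angle per frequency band (dim // 2 bands).
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    return ([math.cos(pos * f) for f in inv_freq],
            [math.sin(pos * f) for f in inv_freq])

def mrope_cos_sin(pos_thw, sections, base=10000.0):
    # Multimodal RoPE tables: band i uses the position id of whichever
    # (temporal, height, width) section it falls in. `sections` gives the
    # number of bands per section, so the rotary dim is 2 * sum(sections).
    dim = 2 * sum(sections)
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    cos, sin, i = [], [], 0
    for pos, width in zip(pos_thw, sections):
        for _ in range(width):
            angle = pos * inv_freq[i]
            cos.append(math.cos(angle))
            sin.append(math.sin(angle))
            i += 1
    return cos, sin

def apply_rotary(x, cos, sin):
    # "Rotate-half" application: pair element i with element i + dim // 2.
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        out[i] = x[i] * cos[i] - x[i + half] * sin[i]
        out[i + half] = x[i + half] * cos[i] + x[i] * sin[i]
    return out
```

For a text-only token the three position ids are equal, so `mrope_cos_sin((p, p, p), sections)` yields exactly the tables of `rope_cos_sin(p, 2 * sum(sections))`; the table construction and rotation per head is the kind of work a fused Triton kernel can do in one pass instead of materializing intermediates in eager mode.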

Does this PR introduce any user-facing change?

The Triton mrope path depends on the triton-ascend package:

Name: triton-ascend
Version: 3.2.0.dev20260105

How was this patch tested?

vllm serve $MODEL_DIR \
  --served-model-name $MODEL_NAME \
  --quantization ascend \
  --tensor-parallel-size 8 \
  --data-parallel-size 1 \
  --enable-expert-parallel \
  --max-model-len 20000 \
  --allowed-local-media-path /vllm-workspace/vllm/tests/multimodal/assets/ \
  --enforce-eager \
  --max-num-batched-tokens 5000 \
  --gpu-memory-utilization 0.8 \
  --trust-remote-code \
  --port 8800 \
  --no-enable-prefix-caching \
  --mm_processor_cache_type="shm" \
  --async-scheduling

  • Performance test
    evalscope perf --parallel 8 --model qwen3vl --url http://ip:port/v1/chat/completions --api openai --dataset random_vl --min-tokens 300 --max-tokens 300 --prefix-length 0 --min-prompt-length 1024 --max-prompt-length 1024 --image-width 2048 --image-height 2048 --image-format RGB --image-num 1 --number 32 --tokenizer-path /path/Qwen3-VL-235B-A22B-Instruct-W8A8/

  • Accuracy test
    ais_bench --models vllm_api_stream_chat --datasets textvqa_gen --debug
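As a sanity check against the endpoint launched above, a request can be built with the OpenAI-compatible chat-completions schema that both benchmark tools exercise. A hedged sketch: the model name matches the benchmark command, the image path is a hypothetical file under the configured `--allowed-local-media-path`, and nothing here actually contacts a server:

```python
import json

def build_vl_chat_request(model, image_url, prompt, max_tokens=300):
    # OpenAI-compatible chat-completions body with one image part + one text part.
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

body = build_vl_chat_request(
    "qwen3vl",
    # Hypothetical asset under --allowed-local-media-path; substitute a real file.
    "file:///vllm-workspace/vllm/tests/multimodal/assets/example.jpg",
    "Describe the image.",
)
payload = json.dumps(body)  # POST this to http://ip:port/v1/chat/completions
```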

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a performance optimization for Qwen3-VL models by enabling a Triton-based multimodal rotary position embedding (mrope) implementation. The change identifies the relevant model configuration and routes it to the more efficient code path. My review includes one suggestion to improve maintainability by replacing a hardcoded value with a named constant.
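The named-constant suggestion can be illustrated with a rough dispatch sketch. The constant, function name, and condition below are all illustrative, not the actual code in vllm_ascend/ops/rotary_embedding.py:

```python
# Illustrative only: the PR's real routing condition is not shown in this thread.
MROPE_SECTION_COUNT = 3  # t/h/w rotary sections -> named constant, not a bare literal

def should_use_triton_mrope(mrope_section):
    """Route configs with the three-section multimodal rope (Qwen3-VL style)
    to the Triton kernel; fall back to the eager path otherwise."""
    return mrope_section is not None and len(mrope_section) == MROPE_SECTION_COUNT
```

Naming the value documents why `3` is special (one section per temporal/height/width axis) and keeps the check in one place if another section layout is ever added.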

Comment thread vllm_ascend/ops/rotary_embedding.py Outdated
@ichaoren ichaoren changed the title from "Releases/v0.13.0" to "use triton mrope for Qwen3-VL" Jan 13, 2026
Comment thread vllm_ascend/ops/rotary_embedding.py Outdated
@ichaoren ichaoren requested a review from gcanlin January 13, 2026 10:59
Signed-off-by: panther-zhu <nonamefly@petalmail.com>
Signed-off-by: panther-zhu <nonamefly@petalmail.com>
@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Jan 15, 2026
@wangxiyuan wangxiyuan changed the title from "use triton mrope for Qwen3-VL" to "[Performance] use triton mrope for Qwen3-VL" Jan 15, 2026
@zzzzwwjj zzzzwwjj merged commit d17370b into vllm-project:releases/v0.13.0 Jan 17, 2026
20 checks passed
yiz-liu pushed a commit that referenced this pull request Jan 20, 2026
### What this PR does / why we need it?
Follow-up fix for PR [5827](#5827) to keep this branch in sync with main.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: ichaoren <fengjian5500@163.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
Signed-off-by: panther-zhu <nonamefly@petalmail.com>
Co-authored-by: panther-zhu <nonamefly@petalmail.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026
tangtiangu pushed a commit to tangtiangu/jiusi-vllm-ascend that referenced this pull request Feb 24, 2026
Labels

ready (read for review), ready-for-test (start test by label for PR)

5 participants