[Performance] Use Triton mrope for Qwen3-VL #5827
Merged
zzzzwwjj merged 2 commits into vllm-project:releases/v0.13.0 on Jan 17, 2026
Conversation
Contributor
Code Review
This pull request introduces a performance optimization for Qwen3-VL models by enabling a Triton-based multimodal rotary position embedding (mrope) implementation. The change correctly identifies the relevant model configuration and routes it to the more efficient code path. My review includes one suggestion to improve maintainability by replacing a hardcoded value with a named constant.
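As a rough illustration of the dispatch described above (a sketch only — `use_triton_mrope`, `ModelConfig`, and the config fields are hypothetical names, not the actual vllm-ascend API), the routing might look like:

```python
# Hypothetical sketch of routing a model config to the Triton mrope path.
# All names here are illustrative, not the real vllm-ascend code.
from dataclasses import dataclass

# A named constant instead of a hardcoded value, as the review suggests.
QWEN3_VL_MROPE_SECTIONS = 3  # temporal/height/width sections of the rotary dims

@dataclass
class ModelConfig:
    model_type: str
    mrope_section: tuple  # per-section rotary dims, e.g. (16, 24, 24)

def use_triton_mrope(cfg: ModelConfig) -> bool:
    """Route Qwen3-VL-style configs to the Triton mrope kernel."""
    return (
        cfg.model_type.startswith("qwen3_vl")
        and len(cfg.mrope_section) == QWEN3_VL_MROPE_SECTIONS
    )

print(use_triton_mrope(ModelConfig("qwen3_vl_moe", (16, 24, 24))))  # True
print(use_triton_mrope(ModelConfig("llama", ())))                   # False
```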
gcanlin
reviewed
Jan 13, 2026
d578193 to f0e25a2 (compare)
Signed-off-by: panther-zhu <nonamefly@petalmail.com>
f0e25a2 to f46d72d (compare)
Signed-off-by: panther-zhu <nonamefly@petalmail.com>
wangxiyuan
approved these changes
Jan 15, 2026
zzzzwwjj
approved these changes
Jan 17, 2026
yiz-liu
pushed a commit
that referenced
this pull request
Jan 20, 2026
### What this PR does / why we need it?
Fixes PR [5827](#5827) to keep it consistent with the main branch.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: ichaoren <fengjian5500@163.com>
starmountain1997
pushed a commit
to starmountain1997/vllm-ascend
that referenced
this pull request
Jan 31, 2026
### What this PR does / why we need it?
Performance can be optimized by using the Triton mrope kernel for Qwen3-VL.

1. Qwen3-VL-235B-A22B-Instruct-W8A8

|      | accuracy | TTFT    | TPOT   | TPS |
| --   | --       | --      | --     | --  |
| base | 83.76    | 4.8771s | 0.1472 | 49  |
| test | 83.59    | 4.3273  | 0.0615 | 105 |

2. Qwen3-VL-8B-Instruct

|      | accuracy | TTFT   | TPOT   | TPS |
| --   | --       | --     | --     | --  |
| base | 80.65    | 4.1744 | 0.0499 | 125 |
| test | 80.86    | 3.1858 | 0.0245 | 227 |

### Does this PR introduce _any_ user-facing change?
Name: triton-ascend
Version: 3.2.0.dev20260105

### How was this patch tested?

```shell
vllm serve $MODEL_DIR \
  --served-model-name $MODEL_NAME \
  --quantization ascend \
  --tensor-parallel-size 8 \
  --data-parallel-size 1 \
  --enable-expert-parallel \
  --max-model-len 20000 \
  --allowed-local-media-path /vllm-workspace/vllm/tests/multimodal/assets/ \
  --enforce-eager \
  --max-num-batched-tokens 5000 \
  --gpu-memory-utilization 0.8 \
  --trust-remote-code \
  --port 8800 \
  --no-enable-prefix-caching \
  --mm_processor_cache_type="shm" \
  --async-scheduling
```

- Performance test

```shell
evalscope perf --parallel 8 --model qwen3vl --url http://ip:port/v1/chat/completions --api openai --dataset random_vl --min-tokens 300 --max-tokens 300 --prefix-length 0 --min-prompt-length 1024 --max-prompt-length 1024 --image-width 2048 --image-height 2048 --image-format RGB --image-num 1 --number 32 --tokenizer-path /path/Qwen3-VL-235B-A22B-Instruct-W8A8/
```

- Accuracy test

```shell
ais_bench --models vllm_api_stream_chat --datasets textvqa_gen --debug
```

---------
Signed-off-by: panther-zhu <nonamefly@petalmail.com>
Co-authored-by: panther-zhu <nonamefly@petalmail.com>
starmountain1997
pushed a commit
to starmountain1997/vllm-ascend
that referenced
this pull request
Jan 31, 2026
### What this PR does / why we need it?
Fixes PR [5827](vllm-project#5827) to keep it consistent with the main branch.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: ichaoren <fengjian5500@163.com>
tangtiangu
pushed a commit
to tangtiangu/jiusi-vllm-ascend
that referenced
this pull request
Feb 24, 2026
### What this PR does / why we need it?
Fixes PR [5827](vllm-project#5827) to keep it consistent with the main branch.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: ichaoren <fengjian5500@163.com>
What this PR does / why we need it?
Performance can be optimized by using the Triton mrope kernel for Qwen3-VL.
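For context, multimodal RoPE (mrope) assigns each vision token separate temporal/height/width positions, which the rotary kernel applies to distinct head-dim sections. A minimal illustrative sketch of the position-id layout (not the vLLM or Triton kernel itself) is:

```python
# Illustrative sketch of mrope position ids, not the vLLM implementation:
# each vision token in a t x h x w grid gets a (t, h, w) position triple.
from itertools import product

def mrope_position_ids(t: int, h: int, w: int):
    """Return three parallel lists of temporal/height/width positions,
    one entry per vision token in a t x h x w grid."""
    ids = list(product(range(t), range(h), range(w)))
    ts, hs, ws = zip(*ids)
    return list(ts), list(hs), list(ws)

ts, hs, ws = mrope_position_ids(1, 2, 2)
print(hs)  # [0, 0, 1, 1]  (height index of each token)
print(ws)  # [0, 1, 0, 1]  (width index of each token)
```

The Triton kernel fuses the rotation for all three sections into one pass, which is where the speedup reported below comes from.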
Does this PR introduce any user-facing change?
Name: triton-ascend
Version: 3.2.0.dev20260105
How was this patch tested?
```shell
vllm serve $MODEL_DIR \
  --served-model-name $MODEL_NAME \
  --quantization ascend \
  --tensor-parallel-size 8 \
  --data-parallel-size 1 \
  --enable-expert-parallel \
  --max-model-len 20000 \
  --allowed-local-media-path /vllm-workspace/vllm/tests/multimodal/assets/ \
  --enforce-eager \
  --max-num-batched-tokens 5000 \
  --gpu-memory-utilization 0.8 \
  --trust-remote-code \
  --port 8800 \
  --no-enable-prefix-caching \
  --mm_processor_cache_type="shm" \
  --async-scheduling
```
Performance test:

```shell
evalscope perf --parallel 8 --model qwen3vl --url http://ip:port/v1/chat/completions --api openai --dataset random_vl --min-tokens 300 --max-tokens 300 --prefix-length 0 --min-prompt-length 1024 --max-prompt-length 1024 --image-width 2048 --image-height 2048 --image-format RGB --image-num 1 --number 32 --tokenizer-path /path/Qwen3-VL-235B-A22B-Instruct-W8A8/
```
Accuracy test:

```shell
ais_bench --models vllm_api_stream_chat --datasets textvqa_gen --debug
```
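For reference, the TPS numbers reported in the PR's benchmark tables (base vs. test) work out to the following speedups; a quick back-of-the-envelope check:

```python
# Speedups computed from the throughput (TPS) numbers reported in this PR.
results = {
    "Qwen3-VL-235B-A22B-Instruct-W8A8": (49, 105),  # (base TPS, test TPS)
    "Qwen3-VL-8B-Instruct": (125, 227),
}
for model, (base, test) in results.items():
    print(f"{model}: {test / base:.2f}x")
# Qwen3-VL-235B-A22B-Instruct-W8A8: 2.14x
# Qwen3-VL-8B-Instruct: 1.82x
```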