Add fused multimodal ROPE RMS for Qwen vision language models#1406

Merged
xytpai merged 24 commits into main from xyt/qknorm_mrope on Nov 29, 2025
Conversation

@xytpai (Contributor) commented Nov 13, 2025

Motivation

Add a fused mROPE + QK-norm kernel for Qwen-VL models in inference mode.

Technical Details

For the norm implementation, we use a warp-level reduction for better performance; overall the fused kernel is roughly 10x faster than the equivalent native torch ops.
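A plain-Python reference of what the fused path computes per attention head may help: RMS-normalize q and k, then apply the rotary rotation in the same pass. This is a sketch, not the kernel itself; the function names are illustrative, a NeoX-style half-split rotation is assumed, and the mROPE-specific part (gathering cos/sin per dim section from the text/height/width position components) is only noted in a comment.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # Root-mean-square normalization over the head dimension.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def apply_rope(x, cos, sin):
    # NeoX-style rotary embedding: rotate the first half of the head
    # against the second half. For mROPE, cos/sin would be gathered
    # per index from whichever (t, h, w) position component owns that
    # dim section; here they are passed in precomputed.
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        out[i] = x[i] * cos[i] - x[i + half] * sin[i]
        out[i + half] = x[i + half] * cos[i] + x[i] * sin[i]
    return out

def fused_qk_norm_rope(q, k, wq, wk, cos, sin):
    # The fused kernel does both steps in one pass per head (one warp
    # per head on the GPU); this is just the per-head math.
    return (apply_rope(rms_norm(q, wq), cos, sin),
            apply_rope(rms_norm(k, wk), cos, sin))
```

Fusing avoids a round trip to global memory between the norm and the rotation, which is where most of the win over separate torch ops comes from.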

Tests

Fused mROPE + QK-norm test on 308 (Qwen3-VL-235B-A22B-Instruct-FP8-dynamic)

before:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     128       
Benchmark duration (s):                  119.44    
Total input tokens:                      148715    
Total input text tokens:                 68459     
Total input vision tokens:               80256     
Total generated tokens:                  129952    
Total generated tokens (retokenized):    77974     
Request throughput (req/s):              1.07      
Input token throughput (tok/s):          1245.07   
Output token throughput (tok/s):         1087.98   
Total token throughput (tok/s):          2333.05   
Concurrency:                             48.37     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   45137.45  
Median E2E Latency (ms):                 45453.06  
---------------Time to First Token----------------
Mean TTFT (ms):                          3902.20   
Median TTFT (ms):                        1605.80   
P99 TTFT (ms):                           9031.68   
--------------Time per Output Token---------------
Mean TPOT (ms):                          42.72     
Median TPOT (ms):                        42.93     
P99 TPOT (ms):                           76.06     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           53.58     
Median ITL (ms):                         37.15     
P95 ITL (ms):                            111.88    
P99 ITL (ms):                            220.84    
Max ITL (ms):                            6631.77   
==================================================

after:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     128       
Benchmark duration (s):                  111.89    
Total input tokens:                      148823    
Total input text tokens:                 68567     
Total input vision tokens:               80256     
Total generated tokens:                  129952    
Total generated tokens (retokenized):    81784     
Request throughput (req/s):              1.14      
Input token throughput (tok/s):          1330.08   
Output token throughput (tok/s):         1161.42   
Total token throughput (tok/s):          2491.51   
Concurrency:                             48.33     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   42245.86  
Median E2E Latency (ms):                 42871.80  
---------------Time to First Token----------------
Mean TTFT (ms):                          3458.57   
Median TTFT (ms):                        1652.74   
P99 TTFT (ms):                           8223.01   
--------------Time per Output Token---------------
Mean TPOT (ms):                          40.00     
Median TPOT (ms):                        40.37     
P99 TPOT (ms):                           70.99     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           48.53     
Median ITL (ms):                         35.46     
P95 ITL (ms):                            106.41    
P99 ITL (ms):                            182.92    
Max ITL (ms):                            5496.99   
==================================================
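The end-to-end uplift can be computed directly from the two runs above (numbers taken verbatim from the before/after tables):

```python
# Total token throughput (tok/s) and mean TPOT (ms) from the benchmark tables.
before_tput, after_tput = 2333.05, 2491.51
before_tpot, after_tpot = 42.72, 40.00

tput_gain = (after_tput / before_tput - 1) * 100  # ~6.8% more tokens/s
tpot_gain = (before_tpot / after_tpot - 1) * 100  # ~6.8% faster per token
print(f"throughput +{tput_gain:.1f}%, mean TPOT {tpot_gain:.1f}% faster")
```

So the ~10x kernel-level speedup translates to roughly a 6.8% end-to-end gain at this concurrency, as the fused op is only one part of the model's per-token work.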

@xytpai xytpai marked this pull request as draft November 13, 2025 10:15
@xytpai xytpai marked this pull request as ready for review November 13, 2025 19:20
@valarLip (Collaborator) commented:

Let me know once it's ready for review.

@xytpai (Contributor, Author) commented Nov 26, 2025

@valarLip Ready for review

@valarLip (Collaborator) reviewed:

LGTM, nice job

@xytpai xytpai merged commit 6382873 into main Nov 29, 2025
22 checks passed
@xytpai xytpai deleted the xyt/qknorm_mrope branch November 29, 2025 13:00
farlukas pushed a commit that referenced this pull request Dec 4, 2025
nsusanto pushed a commit that referenced this pull request Dec 4, 2025
zhuyuhua-v pushed a commit that referenced this pull request Dec 17, 2025
valarLip pushed a commit that referenced this pull request Mar 18, 2026
valarLip pushed a commit that referenced this pull request Mar 18, 2026

Labels: None yet
Projects: None yet
4 participants