Add fused multimodal ROPE RMS for Qwen vision language models#1406

Merged
xytpai merged 24 commits into main from xyt/qknorm_mrope on Nov 29, 2025
Conversation

@xytpai (Contributor) commented Nov 13, 2025

Motivation

Add a fused mROPE + QK-norm kernel for Qwen-VL models in inference mode.

Technical Details

For the norm implementation, we use a warp-level reduction for better performance; overall the fused kernel is roughly 10x faster than the equivalent native torch ops.
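A plain-Python reference of what the fused path computes per attention head may help: RMS-normalize q and k, then apply the rotary rotation in the same pass. This is a sketch, not the kernel itself; the function names are illustrative, a NeoX-style half-split rotation is assumed, and the mROPE-specific part (gathering cos/sin per dim section from the text/height/width position components) is only noted in a comment.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # Root-mean-square normalization over the head dimension.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def apply_rope(x, cos, sin):
    # NeoX-style rotary embedding: rotate the first half of the head
    # against the second half. For mROPE, cos/sin would be gathered
    # per index from whichever (t, h, w) position component owns that
    # dim section; here they are passed in precomputed.
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        out[i] = x[i] * cos[i] - x[i + half] * sin[i]
        out[i + half] = x[i + half] * cos[i] + x[i] * sin[i]
    return out

def fused_qk_norm_rope(q, k, wq, wk, cos, sin):
    # The fused kernel does both steps in one pass per head (one warp
    # per head on the GPU); this is just the per-head math.
    return (apply_rope(rms_norm(q, wq), cos, sin),
            apply_rope(rms_norm(k, wk), cos, sin))
```

Fusing avoids a round trip to global memory between the norm and the rotation, which is where most of the win over separate torch ops comes from.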

Tests

Fused mROPE + QK-norm test on 308 (Qwen3-VL-235B-A22B-Instruct-FP8-dynamic)

before:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     128       
Benchmark duration (s):                  119.44    
Total input tokens:                      148715    
Total input text tokens:                 68459     
Total input vision tokens:               80256     
Total generated tokens:                  129952    
Total generated tokens (retokenized):    77974     
Request throughput (req/s):              1.07      
Input token throughput (tok/s):          1245.07   
Output token throughput (tok/s):         1087.98   
Total token throughput (tok/s):          2333.05   
Concurrency:                             48.37     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   45137.45  
Median E2E Latency (ms):                 45453.06  
---------------Time to First Token----------------
Mean TTFT (ms):                          3902.20   
Median TTFT (ms):                        1605.80   
P99 TTFT (ms):                           9031.68   
--------------Time per Output Token---------------
Mean TPOT (ms):                          42.72     
Median TPOT (ms):                        42.93     
P99 TPOT (ms):                           76.06     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           53.58     
Median ITL (ms):                         37.15     
P95 ITL (ms):                            111.88    
P99 ITL (ms):                            220.84    
Max ITL (ms):                            6631.77   
==================================================

after:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     128       
Benchmark duration (s):                  111.89    
Total input tokens:                      148823    
Total input text tokens:                 68567     
Total input vision tokens:               80256     
Total generated tokens:                  129952    
Total generated tokens (retokenized):    81784     
Request throughput (req/s):              1.14      
Input token throughput (tok/s):          1330.08   
Output token throughput (tok/s):         1161.42   
Total token throughput (tok/s):          2491.51   
Concurrency:                             48.33     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   42245.86  
Median E2E Latency (ms):                 42871.80  
---------------Time to First Token----------------
Mean TTFT (ms):                          3458.57   
Median TTFT (ms):                        1652.74   
P99 TTFT (ms):                           8223.01   
--------------Time per Output Token---------------
Mean TPOT (ms):                          40.00     
Median TPOT (ms):                        40.37     
P99 TPOT (ms):                           70.99     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           48.53     
Median ITL (ms):                         35.46     
P95 ITL (ms):                            106.41    
P99 ITL (ms):                            182.92    
Max ITL (ms):                            5496.99   
==================================================
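The end-to-end uplift can be computed directly from the two runs above (numbers taken verbatim from the before/after tables):

```python
# Total token throughput (tok/s) and mean TPOT (ms) from the benchmark tables.
before_tput, after_tput = 2333.05, 2491.51
before_tpot, after_tpot = 42.72, 40.00

tput_gain = (after_tput / before_tput - 1) * 100  # ~6.8% more tokens/s
tpot_gain = (before_tpot / after_tpot - 1) * 100  # ~6.8% faster per token
print(f"throughput +{tput_gain:.1f}%, mean TPOT {tpot_gain:.1f}% faster")
```

So the ~10x kernel-level speedup translates to roughly a 6.8% end-to-end gain at this concurrency, as the fused op is only one part of the model's per-token work.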

@xytpai xytpai marked this pull request as draft November 13, 2025 10:15
@xytpai xytpai marked this pull request as ready for review November 13, 2025 19:20
@valarLip (Collaborator) commented:

Let me know once it's ready for review.

@xytpai (Contributor, Author) commented Nov 26, 2025

@valarLip Ready for review

@valarLip (Collaborator) reviewed:

LGTM, nice job

@xytpai xytpai merged commit 6382873 into main Nov 29, 2025
22 checks passed
@xytpai xytpai deleted the xyt/qknorm_mrope branch November 29, 2025 13:00
farlukas pushed a commit that referenced this pull request Dec 4, 2025
nsusanto pushed a commit that referenced this pull request Dec 4, 2025
zhuyuhua-v pushed a commit that referenced this pull request Dec 17, 2025
valarLip pushed a commit that referenced this pull request Mar 18, 2026
valarLip pushed a commit that referenced this pull request Mar 18, 2026

Labels: None yet
Projects: None yet
4 participants