
[MM][CG] Support ViT CG for Qwen2-VL#41736

Open

johncalesp wants to merge 5 commits into vllm-project:main from CentML:jcalderon/enable-cg-qwen2-vl

Conversation

@johncalesp (Contributor) commented May 5, 2026

Purpose

Enable CUDA Graphs for the ViT encoder of Qwen2-VL, following the precedent set in #35963.
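
For orientation, here is a minimal sketch of the padded-bucket CUDA Graph pattern this feature relies on (illustrative only; the helper names are hypothetical and this is not the actual vLLM implementation). Variable-length ViT inputs are padded up to a fixed token budget, one graph is captured per budget, and later batches replay the captured graph instead of re-launching kernels:

import torch

def capture_encoder_graph(encoder: torch.nn.Module, token_budget: int,
                          hidden_size: int, device: str = "cuda"):
    # Static buffers: CUDA Graph replay reuses fixed memory addresses,
    # so real inputs must be copied into this buffer before each replay.
    static_input = torch.zeros(token_budget, hidden_size, device=device)
    with torch.no_grad():
        encoder(static_input)  # warm-up so lazy init happens outside capture
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = encoder(static_input)
    return graph, static_input, static_output

def run_with_graph(graph, static_input, static_output, patches):
    n = patches.shape[0]
    static_input[:n].copy_(patches)  # real patch embeddings
    static_input[n:].zero_()         # padding up to the token budget
    graph.replay()                   # replay the captured kernels
    return static_output[:n]         # slice out the valid rows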

Test Plan

Added a test entry for this model in tests/models/multimodal/generation/test_vit_cudagraph.py.

Test Result

E2E test on H100.

Engine command:

vllm serve Qwen/Qwen2-VL-7B-Instruct \
    --max-model-len 8192 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.85 \
    --distributed-executor-backend uni \
    --limit-mm-per-prompt '{"image": 8, "video": 0}' \
    --compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [512, 768, 1024], "encoder_cudagraph_max_vision_items_per_batch": 8}'
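
As a rough illustration of how the encoder_cudagraph_token_budgets values relate to image sizes (the helpers below are hypothetical; how vLLM actually groups vision items into encoder batches is defined by #35963 and not sketched here): a 224x224 image at Qwen2-VL's default ViT patch size of 14 yields 16 x 16 = 256 patch tokens, and a batch runs under the smallest captured budget that fits it, falling back to eager execution otherwise:

def vit_tokens(height: int, width: int, patch_size: int = 14) -> int:
    # Patch tokens seen by the ViT, before any spatial merging.
    return (height // patch_size) * (width // patch_size)

def pick_budget(total_tokens: int, budgets=(512, 768, 1024)):
    # Smallest captured budget that fits; None means run eagerly.
    for b in sorted(budgets):
        if total_tokens <= b:
            return b
    return None

print(vit_tokens(224, 224))   # 256
print(pick_budget(256))       # 512
print(pick_budget(4 * 256))   # 1024
print(pick_budget(5 * 256))   # None -> eager fallback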

Benchmark command:

vllm bench serve \
    --endpoint /v1/chat/completions \
    --backend openai-chat \
    --dataset-name random-mm \
    --input-len 32 \
    --output-len 1 \
    --random-mm-base-items-per-request 8 \
    --random-mm-num-mm-items-range-ratio 0 \
    --random-mm-bucket-config "{(224,224,1): 1.0}" \
    --random-mm-limit-mm-per-prompt '{"image": 8}' \
    --num-prompts 1200 \
    --num-warmups 120 \
    --request-rate 36

Result without CUDA Graphs:

============ Serving Benchmark Result ============
Successful requests:                     1200
Failed requests:                         0
Request rate configured (RPS):           36.00
Benchmark duration (s):                  43.93
Total input tokens:                      694799
Total generated tokens:                  1200
Request throughput (req/s):              27.31
Output token throughput (tok/s):         27.31
Peak output token throughput (tok/s):    100.00
Peak concurrent requests:                493.00
Total token throughput (tok/s):          15841.72
---------------Time to First Token----------------
Mean TTFT (ms):                          9910.92
Median TTFT (ms):                        11011.92
P99 TTFT (ms):                           12938.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.01
Median ITL (ms):                         0.01
P99 ITL (ms):                            0.02
==================================================

Result with CUDA Graphs:

============ Serving Benchmark Result ============
Successful requests:                     1200
Failed requests:                         0
Request rate configured (RPS):           36.00
Benchmark duration (s):                  40.07
Total input tokens:                      694799
Total generated tokens:                  1200
Request throughput (req/s):              29.95
Output token throughput (tok/s):         29.95
Peak output token throughput (tok/s):    98.00
Peak concurrent requests:                311.00
Total token throughput (tok/s):          17371.63
---------------Time to First Token----------------
Mean TTFT (ms):                          4768.54
Median TTFT (ms):                        4793.35
P99 TTFT (ms):                           8473.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.01
Median ITL (ms):                         0.00
P99 ITL (ms):                            0.03
==================================================
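
Reading the two runs side by side (arithmetic straight from the tables above):

no_cg_ttft, cg_ttft = 9910.92, 4768.54   # mean TTFT (ms)
no_cg_rps,  cg_rps  = 27.31, 29.95       # request throughput (req/s)
print(f"TTFT reduction:  {1 - cg_ttft / no_cg_ttft:.1%}")  # ~51.9%
print(f"Throughput gain: {cg_rps / no_cg_rps - 1:.1%}")    # ~9.7%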


@claude (Bot) left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify (Bot, Contributor) commented May 5, 2026

Documentation preview: https://vllm--41736.org.readthedocs.build/en/41736/

mergify Bot added labels on May 5, 2026: documentation (Improvements or additions to documentation), multi-modality (Related to multi-modality (#4194)), qwen (Related to Qwen models), nvidia
@mergify (Bot, Contributor) commented May 5, 2026

Hi @johncalesp, the pre-commit checks have failed. Please run:

uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request enables CUDA Graph support for the Qwen2-VL model by implementing the SupportsEncoderCudaGraph protocol and adding the necessary metadata preparation logic. It also updates the documentation and includes a new test configuration for the model. A potential IndexError was identified in the prepare_encoder_metadata method when handling empty inputs in multi-GPU environments, which can be resolved by ensuring the input array is correctly reshaped.
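
A minimal sketch of the reshape fix the review suggests (variable names are illustrative, not the exact vLLM code): with no multimodal items on a rank, the grid tensor can arrive as an empty 1-D tensor, and indexing its columns then raises IndexError; reshaping to (-1, 3) preserves the column dimension even with zero rows:

import torch

grid_thw = torch.empty(0, dtype=torch.long)  # empty input on this rank
grid_thw = grid_thw.reshape(-1, 3)           # shape (0, 3) instead of (0,)
t, h, w = grid_thw[:, 0], grid_thw[:, 1], grid_thw[:, 2]  # safe: all empty
print(t.shape, h.shape, w.shape)             # torch.Size([0]) each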

Review comment thread: vllm/model_executor/models/qwen2_vl.py
@johncalesp (Contributor, Author) commented:

@b-mu can you review this PR when you get a chance? Thanks!
cc @wangshangsam

@shen-shanshan (Contributor) commented:

LGTM.

@b-mu (Contributor) commented May 7, 2026

LGTM

@johncalesp (Contributor, Author) commented:

@shen-shanshan can we set the ready tag to run the CI?

@shen-shanshan (Contributor) commented:

> @shen-shanshan can we set the ready tag to run the CI?

I don't have the authority to add labels...

CC @DarkLight1337 @Isotr0py

Isotr0py added the ready label (ONLY add when PR is ready to merge/full CI is needed) on May 8, 2026