
[MM][CG] Support ViT CG for Qwen2-VL#41736

Open

johncalesp wants to merge 5 commits into vllm-project:main from CentML:jcalderon/enable-cg-qwen2-vl

Conversation

@johncalesp (Contributor) commented May 5, 2026

Purpose

Enable CUDA Graphs for the ViT encoder of Qwen2-VL, following the precedent set in #35963.
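
For orientation, here is a minimal sketch of the padded-bucket CUDA Graph pattern this feature relies on (illustrative only; the helper names are hypothetical and this is not the actual vLLM implementation). Variable-length ViT inputs are padded up to a fixed token budget, one graph is captured per budget, and later batches replay the captured graph instead of re-launching kernels:

import torch

def capture_encoder_graph(encoder: torch.nn.Module, token_budget: int,
                          hidden_size: int, device: str = "cuda"):
    # Static buffers: CUDA Graph replay reuses fixed memory addresses,
    # so real inputs must be copied into this buffer before each replay.
    static_input = torch.zeros(token_budget, hidden_size, device=device)
    with torch.no_grad():
        encoder(static_input)  # warm-up so lazy init happens outside capture
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = encoder(static_input)
    return graph, static_input, static_output

def run_with_graph(graph, static_input, static_output, patches):
    n = patches.shape[0]
    static_input[:n].copy_(patches)  # real patch embeddings
    static_input[n:].zero_()         # padding up to the token budget
    graph.replay()                   # replay the captured kernels
    return static_output[:n]         # slice out the valid rows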

Test Plan

Added a test entry for this model in tests/models/multimodal/generation/test_vit_cudagraph.py.

Test Result

E2E test on H100.

Engine command:

vllm serve Qwen/Qwen2-VL-7B-Instruct \
    --max-model-len 8192 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.85 \
    --distributed-executor-backend uni \
    --limit-mm-per-prompt '{"image": 8, "video": 0}' \
    --compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [512, 768, 1024], "encoder_cudagraph_max_vision_items_per_batch": 8}'
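
As a rough illustration of how the encoder_cudagraph_token_budgets values relate to image sizes (the helpers below are hypothetical; how vLLM actually groups vision items into encoder batches is defined by #35963 and not sketched here): a 224x224 image at Qwen2-VL's default ViT patch size of 14 yields 16 x 16 = 256 patch tokens, and a batch runs under the smallest captured budget that fits it, falling back to eager execution otherwise:

def vit_tokens(height: int, width: int, patch_size: int = 14) -> int:
    # Patch tokens seen by the ViT, before any spatial merging.
    return (height // patch_size) * (width // patch_size)

def pick_budget(total_tokens: int, budgets=(512, 768, 1024)):
    # Smallest captured budget that fits; None means run eagerly.
    for b in sorted(budgets):
        if total_tokens <= b:
            return b
    return None

print(vit_tokens(224, 224))   # 256
print(pick_budget(256))       # 512
print(pick_budget(4 * 256))   # 1024
print(pick_budget(5 * 256))   # None -> eager fallback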

Benchmark command:

vllm bench serve \
    --endpoint /v1/chat/completions \
    --backend openai-chat \
    --dataset-name random-mm \
    --input-len 32 \
    --output-len 1 \
    --random-mm-base-items-per-request 8 \
    --random-mm-num-mm-items-range-ratio 0 \
    --random-mm-bucket-config "{(224,224,1): 1.0}" \
    --random-mm-limit-mm-per-prompt '{"image": 8}' \
    --num-prompts 1200 \
    --num-warmups 120 \
    --request-rate 36

Result without CUDA Graphs:

============ Serving Benchmark Result ============
Successful requests:                     1200
Failed requests:                         0
Request rate configured (RPS):           36.00
Benchmark duration (s):                  43.93
Total input tokens:                      694799
Total generated tokens:                  1200
Request throughput (req/s):              27.31
Output token throughput (tok/s):         27.31
Peak output token throughput (tok/s):    100.00
Peak concurrent requests:                493.00
Total token throughput (tok/s):          15841.72
---------------Time to First Token----------------
Mean TTFT (ms):                          9910.92
Median TTFT (ms):                        11011.92
P99 TTFT (ms):                           12938.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.01
Median ITL (ms):                         0.01
P99 ITL (ms):                            0.02
==================================================

Result with CUDA Graphs:

============ Serving Benchmark Result ============
Successful requests:                     1200
Failed requests:                         0
Request rate configured (RPS):           36.00
Benchmark duration (s):                  40.07
Total input tokens:                      694799
Total generated tokens:                  1200
Request throughput (req/s):              29.95
Output token throughput (tok/s):         29.95
Peak output token throughput (tok/s):    98.00
Peak concurrent requests:                311.00
Total token throughput (tok/s):          17371.63
---------------Time to First Token----------------
Mean TTFT (ms):                          4768.54
Median TTFT (ms):                        4793.35
P99 TTFT (ms):                           8473.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.01
Median ITL (ms):                         0.00
P99 ITL (ms):                            0.03
==================================================
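
Reading the two runs side by side (arithmetic straight from the tables above):

no_cg_ttft, cg_ttft = 9910.92, 4768.54   # mean TTFT (ms)
no_cg_rps,  cg_rps  = 27.31, 29.95       # request throughput (req/s)
print(f"TTFT reduction:  {1 - cg_ttft / no_cg_ttft:.1%}")  # ~51.9%
print(f"Throughput gain: {cg_rps / no_cg_rps - 1:.1%}")    # ~9.7%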


@claude (Bot) left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify (Bot, Contributor) commented May 5, 2026

Documentation preview: https://vllm--41736.org.readthedocs.build/en/41736/

mergify Bot added labels on May 5, 2026: documentation (Improvements or additions to documentation), multi-modality (Related to multi-modality (#4194)), qwen (Related to Qwen models), nvidia
@mergify (Bot, Contributor) commented May 5, 2026

Hi @johncalesp, the pre-commit checks have failed. Please run:

uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request enables CUDA Graph support for the Qwen2-VL model by implementing the SupportsEncoderCudaGraph protocol and adding the necessary metadata preparation logic. It also updates the documentation and includes a new test configuration for the model. A potential IndexError was identified in the prepare_encoder_metadata method when handling empty inputs in multi-GPU environments, which can be resolved by ensuring the input array is correctly reshaped.
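
A minimal sketch of the reshape fix the review suggests (variable names are illustrative, not the exact vLLM code): with no multimodal items on a rank, the grid tensor can arrive as an empty 1-D tensor, and indexing its columns then raises IndexError; reshaping to (-1, 3) preserves the column dimension even with zero rows:

import torch

grid_thw = torch.empty(0, dtype=torch.long)  # empty input on this rank
grid_thw = grid_thw.reshape(-1, 3)           # shape (0, 3) instead of (0,)
t, h, w = grid_thw[:, 0], grid_thw[:, 1], grid_thw[:, 2]  # safe: all empty
print(t.shape, h.shape, w.shape)             # torch.Size([0]) each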

Review comment thread: vllm/model_executor/models/qwen2_vl.py
@johncalesp (Contributor, Author) commented:

@b-mu can you review this PR when you get a chance? Thanks!
cc @wangshangsam

@shen-shanshan (Contributor) commented:

LGTM.

@b-mu (Contributor) commented May 7, 2026

LGTM

@johncalesp (Contributor, Author) commented:

@shen-shanshan can we set the ready tag to run the CI?

@shen-shanshan (Contributor) commented:

> @shen-shanshan can we set the ready tag to run the CI?

I don't have the authority to add labels...

CC @DarkLight1337 @Isotr0py

Isotr0py added the ready label (ONLY add when PR is ready to merge/full CI is needed) on May 8, 2026