
[MM][Perf][CG] Support ViT full CUDA graph for Qwen3-VL video inference#38061

Merged
DarkLight1337 merged 17 commits into vllm-project:main from shen-shanshan:vit-cg
Apr 14, 2026

Conversation

@shen-shanshan
Contributor

@shen-shanshan shen-shanshan commented Mar 25, 2026

Purpose

Following #35963 (which only supports image inference), this PR extends ViT full CUDA graph support to video inference for Qwen3-VL.

TODO:

  • Unit test.
  • E2E functional test.
  • Benchmark in some scenarios:
    • no DP ViT + eager vs no DP ViT + CUDA graph.
    • DP ViT + eager vs DP ViT + CUDA graph.
  • Update "Vision Encoder (ViT) CUDA Graphs" docs.

🤖 AI Summary

Following #35963 (ViT full CUDA graph support for image inference), this PR extends the encoder CUDA graph framework to support video inference for Qwen3-VL. Previously, the CUDA graph capture/replay path only handled image inputs (pixel_values + image_grid_thw). Video inputs use different keys (pixel_values_videos + video_grid_thw) and require larger cu_seqlens buffers because each video item contributes multiple frames (T attention sequences). This PR generalizes the protocol and manager to handle both modalities through a single shared graph manager.

Note: Video CUDA graphs are automatically disabled when EVS (Efficient Video Sampling) pruning is enabled, since EVS makes the token count data-dependent and incompatible with CUDA graph capture.
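To make the buffer-sizing point concrete, here is a minimal sketch (not the PR's actual code; `build_video_cu_seqlens` is a hypothetical helper, and the per-frame sequence length is simplified to `h * w` patches) of why the cu_seqlens buffer must be sized by the total frame budget rather than the item count:

```python
import numpy as np

def build_video_cu_seqlens(grid_thw, max_frames_per_batch):
    """Hypothetical sketch: each (t, h, w) video item contributes t
    attention sequences (one per frame), so the cumulative sequence-length
    buffer must hold max_frames_per_batch + 1 entries, not max_batch_size + 1."""
    seqlens = []
    for t, h, w in grid_thw:
        seqlens.extend([h * w] * t)  # one attention sequence per frame
    assert len(seqlens) <= max_frames_per_batch, "exceeds budget: fall back to eager"
    cu = np.zeros(max_frames_per_batch + 1, dtype=np.int32)
    cu[1 : len(seqlens) + 1] = np.cumsum(seqlens)
    # Pad the tail with the final total so padded graph replays see
    # zero-length trailing sequences instead of stale offsets.
    cu[len(seqlens) + 1 :] = cu[len(seqlens)]
    return cu
```

For example, one video item with grid (2, 4, 4) yields two frame sequences of 16 patches each, padded out to the captured frame budget.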

Key Changes:

  • EncoderCudaGraphConfig (vllm/v1/worker/encoder_cudagraph_defs.py): Replace single input_key field with input_key_by_modality dict (e.g., {"image": "pixel_values", "video": "pixel_values_videos"}) to support per-modality input tensor routing.
  • SupportsEncoderCudaGraph protocol (vllm/model_executor/models/interfaces.py): Add get_input_modality(mm_kwargs) method to determine whether inputs are image or video. Add max_frames_per_batch parameter to prepare_encoder_cudagraph_capture_inputs() and prepare_encoder_cudagraph_replay_buffers().
  • Qwen3VLForConditionalGeneration (vllm/model_executor/models/qwen3_vl.py):
    • Implement get_input_modality() to route based on mm_kwargs keys.
    • Add _get_pixel_values_by_modality() and _get_grid_thw_by_modality() helpers to abstract modality-specific key access across all protocol methods.
    • Update prepare_encoder_cudagraph_capture_inputs() to build video-format grid configs (T>1 per item) when max_frames_per_batch exceeds max_batch_size, sizing cu_seqlens buffer for video replays.
    • Add replay buffer caching (_replay_buffer_cache) keyed by (modality, grid_thw) to avoid redundant CPU-side NumPy computation for repeated grid shapes.
    • Update prepare_encoder_metadata() to accept max_frames_per_batch for cu_seqlens padding, allowing video frames to exceed max_batch_size.
  • EncoderCudaGraphManager (vllm/v1/worker/encoder_cudagraph.py):
    • Add max_frames_per_batch field to BudgetGraphMetadata and manager initialization.
    • Rename encoder_cudagraph_max_images_per_batch → encoder_cudagraph_max_mm_items_per_batch for generality.
    • Route input_key lookup through get_input_modality() during replay instead of using a fixed key.
  • CompilationConfig (vllm/config/compilation.py): Add encoder_cudagraph_max_frames_per_batch config option (0 = auto-infer). Rename encoder_cudagraph_max_images_per_batch → encoder_cudagraph_max_mm_items_per_batch.
  • Tests (tests/v1/cudagraph/test_encoder_cudagraph.py): Add SimpleMockViTVideoModel with dual-modality support, TestGetInputModality (no GPU), and TestEncoderCudaGraphVideoReplay (GPU) covering video capture/replay, fallback, counters, chunking, and mixed image+video through a shared manager. (+316 lines)
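The per-modality routing described above can be illustrated with a short sketch (a simplified illustration, not the PR's exact code; the dict and helper names mirror the summary but are written here as free functions):

```python
# Sketch of input_key_by_modality routing: the grid key present in
# mm_kwargs identifies the modality, which selects the pixel tensor key.
INPUT_KEY_BY_MODALITY = {"image": "pixel_values", "video": "pixel_values_videos"}

def get_input_modality(mm_kwargs: dict) -> str:
    # Qwen3-VL processors emit image_grid_thw for images and
    # video_grid_thw for videos, so the grid key disambiguates.
    return "image" if "image_grid_thw" in mm_kwargs else "video"

def select_input_tensor(mm_kwargs: dict):
    # Replay path: look up the pixel tensor via the per-modality key
    # instead of a single fixed input_key.
    modality = get_input_modality(mm_kwargs)
    return modality, mm_kwargs[INPUT_KEY_BY_MODALITY[modality]]
```

This is why a single shared graph manager can serve both modalities: only the key lookup changes, not the capture/replay machinery.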

Test Plan

Unit test:

pytest tests/v1/cudagraph/test_encoder_cudagraph.py -v

Functional test:

# Pass compilation_config to EngineArgs in run_qwen3_vl():
# compilation_config={
#     "cudagraph_mm_encoder": True,
#     "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048],
#     "encoder_cudagraph_max_mm_items_per_batch": 4,
#     "encoder_cudagraph_max_frames_per_batch": 32,
# }
python examples/offline_inference/vision_language.py -m qwen3_vl --modality "video"

Benchmark:

# Single GPU:
vllm bench mm-processor \
--model /shared/models/modelscope/models/Qwen/Qwen3-VL-8B-Instruct \
--max-model-len 16384 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 1}' \
--num-prompts 100 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096], "encoder_cudagraph_max_mm_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model /shared/models/modelscope/models/Qwen/Qwen3-VL-32B-Instruct \
--max-model-len 8192 \
--dataset-name random-mm \
--random-mm-base-items-per-request 4 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 4}' \
--num-prompts 100 \
--seed 42 \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASHINFER \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096], "encoder_cudagraph_max_mm_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

Test Result

✅ Unit test:

tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_exact_powers_of_2 PASSED
tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_max_not_power_of_2 PASSED
tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_min_equals_max PASSED
...
36 passed, 3 warnings in 10.04s

✅ Functional test:

--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby’s serious expression and focused demeanor while pretending to read, combined with the fact that they are so young and unable to actually read, creates a humorous contrast. The baby’s movements
--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby's serious expression and focused posture, combined with the fact that they are clearly not reading in the traditional sense, create a humorous contrast. The baby's attempts to turn the pages
--------------------------------------------------
The video is funny because it captures a toddler wearing glasses and pretending to read a book, which is an adorable and endearing sight. The child's focused expression and the way they turn the pages with their hands, as if they are truly engrossed in the book, adds to the humor. The fact that the
--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby's serious demeanor and focused expression while holding the book add to the humor, as it creates a comical contrast between the baby's innocent actions and the adult-like behavior of reading
--------------------------------------------------

✅ Benchmark:

Single GPU (Qwen3-VL-8B-Instruct, 1xA100, random-mm, 100 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | -24.52% (3.67ms -> 4.57ms) | +61.66% (17.03ms -> 6.53ms) |
| FLASHINFER | +21.84% (8.38ms -> 6.55ms) | +87.60% (58.62ms -> 7.27ms) |

Multi GPU (Qwen3-VL-32B-Instruct, 4xA100, random-mm, 100 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +13.44% (5.43ms -> 4.70ms) | +83.22% (51.25ms -> 8.60ms) |
| FLASHINFER | +21.37% (8.75ms -> 6.88ms) | +82.85% (77.77ms -> 13.34ms) |
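For reference, the percentage deltas above are relative latency reductions versus the eager baseline; a quick sketch of the arithmetic:

```python
def latency_improvement(eager_ms: float, graph_ms: float) -> float:
    # Positive = CUDA graph is faster than eager; negative = regression.
    return round((eager_ms - graph_ms) / eager_ms * 100, 2)

# Single-GPU FLASH_ATTN row: P99 improves sharply, mean regresses slightly.
print(latency_improvement(17.03, 6.53))  # 61.66
print(latency_improvement(3.67, 4.57))   # -24.52
```

The large P99 gains reflect CUDA graphs eliminating the worst-case kernel-launch overhead spikes, even where the mean is roughly flat.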

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@shen-shanshan shen-shanshan marked this pull request as draft March 25, 2026 02:59
@mergify mergify Bot added qwen Related to Qwen models nvidia v1 labels Mar 25, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request extends CUDA graph support to video inputs for multimodal models, specifically demonstrated with Qwen3-VL. It introduces a new is_image_inputs method to the SupportsEncoderCudaGraph protocol and updates the EncoderCudaGraphConfig to support modality-specific input keys. The Qwen3-VL model implementation is refactored to handle both image and video inputs dynamically, with video CUDA graph capture being conditionally enabled based on whether Efficient Video Sampling (EVS) pruning is active. The CUDA graph manager is also updated to correctly identify and process inputs based on their modality. Comprehensive tests for video CUDA graph capture, replay, fallback, and packing are added to ensure correct functionality.

@github-project-automation github-project-automation Bot moved this to Backlog in Qwen3.5 Mar 25, 2026
@wangshangsam wangshangsam added performance Performance-related issues multi-modality Related to multi-modality (#4194) labels Mar 25, 2026
@mergify
Contributor

mergify Bot commented Mar 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shen-shanshan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

shen-shanshan and others added 3 commits April 7, 2026 02:03
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
@mergify mergify Bot removed the needs-rebase label Apr 7, 2026
@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 7, 2026
@shen-shanshan
Contributor Author

@ywang96 Could you help review this PR? Since it has been pending for a long time. 😃

@ywang96
Member

ywang96 commented Apr 13, 2026

@ywang96 Could you help review this PR? Since it has been pending for a long time. 😃

@shen-shanshan Sorry for the late reply - I will take a look at this PR tonight! 🙏

Member

@ywang96 ywang96 left a comment


Apologies for the late review! I left some comments/questions!

Comment on lines -522 to +523
encoder_cudagraph_max_images_per_batch: int = 0
"""Maximum number of images per batch for encoder CUDA graph capture.
encoder_cudagraph_max_mm_items_per_batch: int = 0
"""Maximum number of images/videos per batch for encoder CUDA graph capture.
Member


I'm slightly concerned about this since the naming suggests that we basically include audio here as well.

How about encoder_cudagraph_max_vision_items_per_batch? Please also update this config name everywhere in the example/doc.

Contributor Author


I agree with this. vision_items is clearer since we don't plan to support CG for audio encode currently.

Member

@ywang96 ywang96 Apr 13, 2026


Sounds good - let's update with encoder_cudagraph_max_vision_items_per_batch then!

Contributor Author

@shen-shanshan shen-shanshan Apr 14, 2026


Updated. Let's wait for CI finish. 😃

Comment on lines +1623 to +1629
def get_input_modality(
    self,
    mm_kwargs: dict[str, Any],
) -> str:
    if "image_grid_thw" in mm_kwargs:
        return "image"
    return "video"
Member


Does this mean in order to use this feature with video input, users will have to turn off the image modality at launch time?

Comment on lines +706 to +709
max_frames_per_batch: If set, overrides max_batch_size for
cu_seqlens padding. For video inputs each item contributes
T attention sequences (frames); this sizes the buffer to
the total frame budget so video replays never overflow.
Member


Does this mean if a video has more frames than this value, it falls back to eager?

Contributor Author


Yes. If even the smallest number of frames of a video doesn't fit within this budget, the CUDA graph will not be enabled.
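A minimal sketch of that eligibility check (hypothetical helper name; the real logic lives in the graph manager):

```python
def video_fits_graph_budget(grid_thw, max_frames_per_batch: int) -> bool:
    # Total frames across all video items in the batch must fit within
    # the captured frame budget; otherwise the batch replays eagerly.
    total_frames = sum(t for t, _, _ in grid_thw)
    return total_frames <= max_frames_per_batch
```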

Signed-off-by: shen-shanshan <467638484@qq.com>
Comment thread docs/design/cuda_graphs_multimodal.md Outdated
Comment on lines 223 to 224
--tensor-parallel-size 4 --mm-encoder-tp-mode data \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4864], "encoder_cudagraph_max_images_per_batch": 8}'
Member


One final nit - update this flag too

Comment thread docs/design/cuda_graphs_multimodal.md Outdated
Comment on lines 198 to 199
Member


Ditto

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Apr 14, 2026
Signed-off-by: shen-shanshan <467638484@qq.com>
@DarkLight1337 DarkLight1337 merged commit 8011885 into vllm-project:main Apr 14, 2026
69 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in Qwen3.5 Apr 14, 2026
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 14, 2026
zxd1997066 pushed a commit to zxd1997066/vllm that referenced this pull request Apr 15, 2026
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

Labels

  • documentation: Improvements or additions to documentation
  • multi-modality: Related to multi-modality (#4194)
  • nvidia
  • performance: Performance-related issues
  • qwen: Related to Qwen models
  • ready: ONLY add when PR is ready to merge/full CI is needed
  • v1

Projects

Status: Done

Development


8 participants