
[MM][Perf][CG] Support ViT full CUDA graph for Qwen3-VL video inference#38061

Merged
DarkLight1337 merged 17 commits into vllm-project:main from shen-shanshan:vit-cg
Apr 14, 2026

Conversation

@shen-shanshan
Contributor

@shen-shanshan shen-shanshan commented Mar 25, 2026

Purpose

Following #35963 (which only supports image inference), this PR extends ViT full CUDA graph support to video inference for Qwen3-VL.

TODO:

  • Unit test.
  • E2E functional test.
  • Benchmark in some scenarios:
    • no DP ViT + eager vs no DP ViT + CUDA graph.
    • DP ViT + eager vs DP ViT + CUDA graph.
  • Update "Vision Encoder (ViT) CUDA Graphs" docs.

🤖 AI Summary

Following #35963 (ViT full CUDA graph support for image inference), this PR extends the encoder CUDA graph framework to support video inference for Qwen3-VL. Previously, the CUDA graph capture/replay path only handled image inputs (pixel_values + image_grid_thw). Video inputs use different keys (pixel_values_videos + video_grid_thw) and require larger cu_seqlens buffers because each video item contributes multiple frames (T attention sequences). This PR generalizes the protocol and manager to handle both modalities through a single shared graph manager.

Note: Video CUDA graphs are automatically disabled when EVS (Efficient Video Sampling) pruning is enabled, since EVS makes the token count data-dependent and incompatible with CUDA graph capture.
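To make the buffer-sizing point concrete, here is a minimal sketch (not the PR's actual code; `build_video_cu_seqlens` is a hypothetical helper, and the per-frame sequence length is simplified to `h * w` patches) of why the cu_seqlens buffer must be sized by the total frame budget rather than the item count:

```python
import numpy as np

def build_video_cu_seqlens(grid_thw, max_frames_per_batch):
    """Hypothetical sketch: each (t, h, w) video item contributes t
    attention sequences (one per frame), so the cumulative sequence-length
    buffer must hold max_frames_per_batch + 1 entries, not max_batch_size + 1."""
    seqlens = []
    for t, h, w in grid_thw:
        seqlens.extend([h * w] * t)  # one attention sequence per frame
    assert len(seqlens) <= max_frames_per_batch, "exceeds budget: fall back to eager"
    cu = np.zeros(max_frames_per_batch + 1, dtype=np.int32)
    cu[1 : len(seqlens) + 1] = np.cumsum(seqlens)
    # Pad the tail with the final total so padded graph replays see
    # zero-length trailing sequences instead of stale offsets.
    cu[len(seqlens) + 1 :] = cu[len(seqlens)]
    return cu
```

For example, one video item with grid (2, 4, 4) yields two frame sequences of 16 patches each, padded out to the captured frame budget.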

Key Changes:

  • EncoderCudaGraphConfig (vllm/v1/worker/encoder_cudagraph_defs.py): Replace single input_key field with input_key_by_modality dict (e.g., {"image": "pixel_values", "video": "pixel_values_videos"}) to support per-modality input tensor routing.
  • SupportsEncoderCudaGraph protocol (vllm/model_executor/models/interfaces.py): Add get_input_modality(mm_kwargs) method to determine whether inputs are image or video. Add max_frames_per_batch parameter to prepare_encoder_cudagraph_capture_inputs() and prepare_encoder_cudagraph_replay_buffers().
  • Qwen3VLForConditionalGeneration (vllm/model_executor/models/qwen3_vl.py):
    • Implement get_input_modality() to route based on mm_kwargs keys.
    • Add _get_pixel_values_by_modality() and _get_grid_thw_by_modality() helpers to abstract modality-specific key access across all protocol methods.
    • Update prepare_encoder_cudagraph_capture_inputs() to build video-format grid configs (T>1 per item) when max_frames_per_batch exceeds max_batch_size, sizing cu_seqlens buffer for video replays.
    • Add replay buffer caching (_replay_buffer_cache) keyed by (modality, grid_thw) to avoid redundant CPU-side NumPy computation for repeated grid shapes.
    • Update prepare_encoder_metadata() to accept max_frames_per_batch for cu_seqlens padding, allowing video frames to exceed max_batch_size.
  • EncoderCudaGraphManager (vllm/v1/worker/encoder_cudagraph.py):
    • Add max_frames_per_batch field to BudgetGraphMetadata and manager initialization.
    • Rename encoder_cudagraph_max_images_per_batch → encoder_cudagraph_max_mm_items_per_batch for generality.
    • Route input_key lookup through get_input_modality() during replay instead of using a fixed key.
  • CompilationConfig (vllm/config/compilation.py): Add encoder_cudagraph_max_frames_per_batch config option (0 = auto-infer). Rename encoder_cudagraph_max_images_per_batch → encoder_cudagraph_max_mm_items_per_batch.
  • Tests (tests/v1/cudagraph/test_encoder_cudagraph.py): Add SimpleMockViTVideoModel with dual-modality support, TestGetInputModality (no GPU), and TestEncoderCudaGraphVideoReplay (GPU) covering video capture/replay, fallback, counters, chunking, and mixed image+video through a shared manager. (+316 lines)
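The per-modality routing described above can be illustrated with a short sketch (a simplified illustration, not the PR's exact code; the dict and helper names mirror the summary but are written here as free functions):

```python
# Sketch of input_key_by_modality routing: the grid key present in
# mm_kwargs identifies the modality, which selects the pixel tensor key.
INPUT_KEY_BY_MODALITY = {"image": "pixel_values", "video": "pixel_values_videos"}

def get_input_modality(mm_kwargs: dict) -> str:
    # Qwen3-VL processors emit image_grid_thw for images and
    # video_grid_thw for videos, so the grid key disambiguates.
    return "image" if "image_grid_thw" in mm_kwargs else "video"

def select_input_tensor(mm_kwargs: dict):
    # Replay path: look up the pixel tensor via the per-modality key
    # instead of a single fixed input_key.
    modality = get_input_modality(mm_kwargs)
    return modality, mm_kwargs[INPUT_KEY_BY_MODALITY[modality]]
```

This is why a single shared graph manager can serve both modalities: only the key lookup changes, not the capture/replay machinery.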

Test Plan

Unit test:

pytest tests/v1/cudagraph/test_encoder_cudagraph.py -v

Functional test:

# Pass compilation_config to EngineArgs in run_qwen3_vl():
# compilation_config={
#     "cudagraph_mm_encoder": True,
#     "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048],
#     "encoder_cudagraph_max_mm_items_per_batch": 4,
#     "encoder_cudagraph_max_frames_per_batch": 32,
# }
python examples/offline_inference/vision_language.py -m qwen3_vl --modality "video"

Benchmark:

# Single GPU:
vllm bench mm-processor \
--model /shared/models/modelscope/models/Qwen/Qwen3-VL-8B-Instruct \
--max-model-len 16384 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 1}' \
--num-prompts 100 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096], "encoder_cudagraph_max_mm_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model /shared/models/modelscope/models/Qwen/Qwen3-VL-32B-Instruct \
--max-model-len 8192 \
--dataset-name random-mm \
--random-mm-base-items-per-request 4 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 4}' \
--num-prompts 100 \
--seed 42 \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASHINFER \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096], "encoder_cudagraph_max_mm_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

Test Result

✅ Unit test:

tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_exact_powers_of_2 PASSED
tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_max_not_power_of_2 PASSED
tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_min_equals_max PASSED
...
36 passed, 3 warnings in 10.04s

✅ Functional test:

--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby’s serious expression and focused demeanor while pretending to read, combined with the fact that they are so young and unable to actually read, creates a humorous contrast. The baby’s movements
--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby's serious expression and focused posture, combined with the fact that they are clearly not reading in the traditional sense, create a humorous contrast. The baby's attempts to turn the pages
--------------------------------------------------
The video is funny because it captures a toddler wearing glasses and pretending to read a book, which is an adorable and endearing sight. The child's focused expression and the way they turn the pages with their hands, as if they are truly engrossed in the book, adds to the humor. The fact that the
--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby's serious demeanor and focused expression while holding the book add to the humor, as it creates a comical contrast between the baby's innocent actions and the adult-like behavior of reading
--------------------------------------------------

✅ Benchmark:

Single GPU (Qwen3-VL-8B-Instruct, 1xA100, random-mm, 100 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | -24.52% (3.67ms -> 4.57ms) | +61.66% (17.03ms -> 6.53ms) |
| FLASHINFER | +21.84% (8.38ms -> 6.55ms) | +87.60% (58.62ms -> 7.27ms) |

Multi GPU (Qwen3-VL-32B-Instruct, 4xA100, random-mm, 100 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +13.44% (5.43ms -> 4.70ms) | +83.22% (51.25ms -> 8.60ms) |
| FLASHINFER | +21.37% (8.75ms -> 6.88ms) | +82.85% (77.77ms -> 13.34ms) |
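For reference, the percentage deltas above are relative latency reductions versus the eager baseline; a quick sketch of the arithmetic:

```python
def latency_improvement(eager_ms: float, graph_ms: float) -> float:
    # Positive = CUDA graph is faster than eager; negative = regression.
    return round((eager_ms - graph_ms) / eager_ms * 100, 2)

# Single-GPU FLASH_ATTN row: P99 improves sharply, mean regresses slightly.
print(latency_improvement(17.03, 6.53))  # 61.66
print(latency_improvement(3.67, 4.57))   # -24.52
```

The large P99 gains reflect CUDA graphs eliminating the worst-case kernel-launch overhead spikes, even where the mean is roughly flat.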

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@shen-shanshan shen-shanshan marked this pull request as draft March 25, 2026 02:59
@mergify mergify Bot added qwen Related to Qwen models nvidia v1 labels Mar 25, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request extends CUDA graph support to video inputs for multimodal models, specifically demonstrated with Qwen3-VL. It introduces a new is_image_inputs method to the SupportsEncoderCudaGraph protocol and updates the EncoderCudaGraphConfig to support modality-specific input keys. The Qwen3-VL model implementation is refactored to handle both image and video inputs dynamically, with video CUDA graph capture being conditionally enabled based on whether Efficient Video Sampling (EVS) pruning is active. The CUDA graph manager is also updated to correctly identify and process inputs based on their modality. Comprehensive tests for video CUDA graph capture, replay, fallback, and packing are added to ensure correct functionality.

@github-project-automation github-project-automation Bot moved this to Backlog in Qwen3.5 Mar 25, 2026
@wangshangsam wangshangsam added performance Performance-related issues multi-modality Related to multi-modality (#4194) labels Mar 25, 2026
@mergify
Contributor

mergify Bot commented Mar 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shen-shanshan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

shen-shanshan and others added 3 commits April 7, 2026 02:03
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
@mergify mergify Bot removed the needs-rebase label Apr 7, 2026
@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 7, 2026
@shen-shanshan
Contributor Author

@ywang96 Could you help review this PR? Since it has been pending for a long time. 😃

@ywang96
Member

ywang96 commented Apr 13, 2026

@ywang96 Could you help review this PR? Since it has been pending for a long time. 😃

@shen-shanshan Sorry for the late reply - I will take a look at this PR tonight! 🙏

Member

@ywang96 ywang96 left a comment


Apologies for the late review! I left some comments/questions!

Comment on lines -522 to +523
encoder_cudagraph_max_images_per_batch: int = 0
"""Maximum number of images per batch for encoder CUDA graph capture.
encoder_cudagraph_max_mm_items_per_batch: int = 0
"""Maximum number of images/videos per batch for encoder CUDA graph capture.
Member


I'm slightly concerned about this since the naming suggests that we basically include audio here as well.

How about encoder_cudagraph_max_vision_items_per_batch? Please also update this config name everywhere in the example/doc.

Contributor Author


I agree with this. vision_items is clearer since we don't plan to support CG for audio encode currently.

Member

@ywang96 ywang96 Apr 13, 2026


Sounds good - let's update with encoder_cudagraph_max_vision_items_per_batch then!

Contributor Author

@shen-shanshan shen-shanshan Apr 14, 2026


Updated. Let's wait for CI finish. 😃

Comment on lines +1623 to +1629
def get_input_modality(
    self,
    mm_kwargs: dict[str, Any],
) -> str:
    if "image_grid_thw" in mm_kwargs:
        return "image"
    return "video"
Member


Does this mean in order to use this feature with video input, users will have to turn off the image modality at launch time?

Comment on lines +706 to +709
max_frames_per_batch: If set, overrides max_batch_size for
cu_seqlens padding. For video inputs each item contributes
T attention sequences (frames); this sizes the buffer to
the total frame budget so video replays never overflow.
Member


Does this mean if a video has more frames than this value, it falls back to eager?

Contributor Author


Yes. If even the smallest number of frames of a video doesn't fit within this budget, the CUDA graph will not be enabled.
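A minimal sketch of that eligibility check (hypothetical helper name; the real logic lives in the graph manager):

```python
def video_fits_graph_budget(grid_thw, max_frames_per_batch: int) -> bool:
    # Total frames across all video items in the batch must fit within
    # the captured frame budget; otherwise the batch replays eagerly.
    total_frames = sum(t for t, _, _ in grid_thw)
    return total_frames <= max_frames_per_batch
```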

Signed-off-by: shen-shanshan <467638484@qq.com>
Comment thread docs/design/cuda_graphs_multimodal.md Outdated
Comment on lines 223 to 224
--tensor-parallel-size 4 --mm-encoder-tp-mode data \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4864], "encoder_cudagraph_max_images_per_batch": 8}'
Member


One final nit - update this flag too

Comment thread docs/design/cuda_graphs_multimodal.md Outdated
Comment on lines 198 to 199
Member


Ditto

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Apr 14, 2026
Signed-off-by: shen-shanshan <467638484@qq.com>
@DarkLight1337 DarkLight1337 merged commit 8011885 into vllm-project:main Apr 14, 2026
69 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in Qwen3.5 Apr 14, 2026
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 14, 2026
zxd1997066 pushed a commit to zxd1997066/vllm that referenced this pull request Apr 15, 2026
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

Labels

  • documentation: Improvements or additions to documentation
  • multi-modality: Related to multi-modality (#4194)
  • nvidia
  • performance: Performance-related issues
  • qwen: Related to Qwen models
  • ready: ONLY add when PR is ready to merge/full CI is needed
  • v1

Projects

Status: Done

Development


8 participants