[MM][Perf][CG] Support ViT full CUDA graph for Qwen3-VL video inference#38061
DarkLight1337 merged 17 commits into vllm-project:main
Conversation
Code Review
This pull request extends CUDA graph support to video inputs for multimodal models, specifically demonstrated with Qwen3-VL. It introduces a new is_image_inputs method to the SupportsEncoderCudaGraph protocol and updates the EncoderCudaGraphConfig to support modality-specific input keys. The Qwen3-VL model implementation is refactored to handle both image and video inputs dynamically, with video CUDA graph capture being conditionally enabled based on whether Efficient Video Sampling (EVS) pruning is active. The CUDA graph manager is also updated to correctly identify and process inputs based on their modality. Comprehensive tests for video CUDA graph capture, replay, fallback, and packing are added to ensure correct functionality.
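To make the modality-keyed routing described above concrete, here is a minimal sketch (illustrative names approximating the PR's `EncoderCudaGraphConfig` changes, not the exact vLLM API):

```python
from typing import Any

# Illustrative sketch of per-modality input-key routing; the dict mirrors
# the input_key_by_modality field this PR introduces.
INPUT_KEY_BY_MODALITY = {
    "image": "pixel_values",
    "video": "pixel_values_videos",
}


def get_input_modality(mm_kwargs: dict[str, Any]) -> str:
    # Image inputs carry "image_grid_thw"; everything else here is video.
    return "image" if "image_grid_thw" in mm_kwargs else "video"


def get_input_tensor(mm_kwargs: dict[str, Any]) -> Any:
    # Resolve the pixel tensor through the modality-specific key
    # instead of a single fixed input_key.
    modality = get_input_modality(mm_kwargs)
    return mm_kwargs[INPUT_KEY_BY_MODALITY[modality]]
```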
This pull request has merge conflicts that must be resolved before it can be merged.
@ywang96 Could you help review this PR? It has been pending for a long time. 😃
@shen-shanshan Sorry for the late reply - I will take a look at this PR tonight! 🙏
ywang96
left a comment
Apologies for the late review! I left some comments/questions!
```diff
-encoder_cudagraph_max_images_per_batch: int = 0
-"""Maximum number of images per batch for encoder CUDA graph capture.
+encoder_cudagraph_max_mm_items_per_batch: int = 0
+"""Maximum number of images/videos per batch for encoder CUDA graph capture.
```
I'm slightly concerned about this since the naming suggests that we basically include audio here as well.
How about encoder_cudagraph_max_vision_items_per_batch? Please also update this config name everywhere in the example/doc.
I agree with this. `vision_items` is clearer since we don't currently plan to support CUDA graphs for audio encoding.
Sounds good - let's update with encoder_cudagraph_max_vision_items_per_batch then!
Updated. Let's wait for CI to finish. 😃
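For reference, a hedged sketch of how the renamed option might be passed programmatically, assuming the field name agreed in this thread lands on `CompilationConfig` (not a verified API; the JSON CLI form appears in the launch example further down):

```python
# Sketch only: assumes the renamed field from this thread exists on
# vllm.config.CompilationConfig after the update.
from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="Qwen/Qwen3-VL-8B-Instruct",
    compilation_config=CompilationConfig(
        cudagraph_mm_encoder=True,  # enable encoder CUDA graphs (per this PR series)
        encoder_cudagraph_max_vision_items_per_batch=8,  # renamed flag from this thread
    ),
)
```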
```python
def get_input_modality(
    self,
    mm_kwargs: dict[str, Any],
) -> str:
    if "image_grid_thw" in mm_kwargs:
        return "image"
    return "video"
```
Does this mean in order to use this feature with video input, users will have to turn off the image modality at launch time?
```python
max_frames_per_batch: If set, overrides max_batch_size for
    cu_seqlens padding. For video inputs each item contributes
    T attention sequences (frames); this sizes the buffer to
    the total frame budget so video replays never overflow.
```
Does this mean if a video has more frames than this value, it falls back to eager?
Yes. If even the smallest frame count of a video doesn't fit in this budget, the CUDA graph will not be enabled.
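To make the fallback concrete, a minimal sketch of the budget check (illustrative names, not the exact implementation):

```python
def fits_frame_budget(
    video_grid_thw: list[tuple[int, int, int]],
    max_frames_per_batch: int,
) -> bool:
    # Each video item contributes T frames (attention sequences), so the
    # batch can only be replayed if its total frame count fits the budget
    # captured at graph-build time; otherwise execution falls back to eager.
    total_frames = sum(t for t, _, _ in video_grid_thw)
    return total_frames <= max_frames_per_batch


# e.g. two videos of 16 and 8 frames against a 32-frame budget
assert fits_frame_budget([(16, 24, 24), (8, 24, 24)], 32)
```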
```bash
--tensor-parallel-size 4 --mm-encoder-tp-mode data \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4864], "encoder_cudagraph_max_images_per_batch": 8}'
```
One final nit - update this flag too
Purpose
Following #35963 (which only supports image inference), this PR continues that work to support video inference for Qwen3-VL.
🤖 AI Summary
Following #35963 (ViT full CUDA graph support for image inference), this PR extends the encoder CUDA graph framework to support video inference for Qwen3-VL. Previously, the CUDA graph capture/replay path only handled image inputs (`pixel_values` + `image_grid_thw`). Video inputs use different keys (`pixel_values_videos` + `video_grid_thw`) and require larger `cu_seqlens` buffers because each video item contributes multiple frames (T attention sequences). This PR generalizes the protocol and manager to handle both modalities through a single shared graph manager.

Note: Video CUDA graphs are automatically disabled when EVS (Efficient Video Sampling) pruning is enabled, since EVS makes the token count data-dependent and incompatible with CUDA graph capture.
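As a rough illustration of the buffer-sizing point (a sketch under assumed names, not the PR's actual code):

```python
import torch


def build_video_cu_seqlens(
    grid_thw: list[tuple[int, int, int]],
    max_frames_per_batch: int,
) -> torch.Tensor:
    # Each (T, H, W) video item yields T per-frame attention sequences of
    # H * W patches, so the cu_seqlens buffer must hold up to
    # max_frames_per_batch boundaries rather than one per item
    # (for images, T == 1, so items and sequences coincide).
    seqlens = [h * w for t, h, w in grid_thw for _ in range(t)]
    assert len(seqlens) <= max_frames_per_batch, "falls back to eager"
    # Pad with zero-length sequences so a single captured graph can be
    # replayed for any smaller total frame count.
    seqlens += [0] * (max_frames_per_batch - len(seqlens))
    cu_seqlens = torch.zeros(max_frames_per_batch + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(
        torch.tensor(seqlens, dtype=torch.int32), dim=0
    )
    return cu_seqlens
```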
Key Changes:
- `EncoderCudaGraphConfig` (`vllm/v1/worker/encoder_cudagraph_defs.py`): Replace the single `input_key` field with an `input_key_by_modality` dict (e.g., `{"image": "pixel_values", "video": "pixel_values_videos"}`) to support per-modality input tensor routing.
- `SupportsEncoderCudaGraph` protocol (`vllm/model_executor/models/interfaces.py`): Add a `get_input_modality(mm_kwargs)` method to determine whether inputs are image or video. Add a `max_frames_per_batch` parameter to `prepare_encoder_cudagraph_capture_inputs()` and `prepare_encoder_cudagraph_replay_buffers()`.
- `Qwen3VLForConditionalGeneration` (`vllm/model_executor/models/qwen3_vl.py`):
  - Implement `get_input_modality()` to route based on `mm_kwargs` keys.
  - Add `_get_pixel_values_by_modality()` and `_get_grid_thw_by_modality()` helpers to abstract modality-specific key access across all protocol methods.
  - Extend `prepare_encoder_cudagraph_capture_inputs()` to build video-format grid configs (T>1 per item) when `max_frames_per_batch` exceeds `max_batch_size`, sizing the `cu_seqlens` buffer for video replays.
  - Add a replay buffer cache (`_replay_buffer_cache`) keyed by `(modality, grid_thw)` to avoid redundant CPU-side NumPy computation for repeated grid shapes.
  - Update `prepare_encoder_metadata()` to accept `max_frames_per_batch` for `cu_seqlens` padding, allowing video frames to exceed `max_batch_size`.
- `EncoderCudaGraphManager` (`vllm/v1/worker/encoder_cudagraph.py`):
  - Add a `max_frames_per_batch` field to `BudgetGraphMetadata` and manager initialization.
  - Rename `encoder_cudagraph_max_images_per_batch` → `encoder_cudagraph_max_mm_items_per_batch` for generality.
  - Route the `input_key` lookup through `get_input_modality()` during replay instead of using a fixed key.
- `CompilationConfig` (`vllm/config/compilation.py`): Add an `encoder_cudagraph_max_frames_per_batch` config option (0 = auto-infer). Rename `encoder_cudagraph_max_images_per_batch` → `encoder_cudagraph_max_mm_items_per_batch`.
- Tests (`tests/v1/cudagraph/test_encoder_cudagraph.py`): Add `SimpleMockViTVideoModel` with dual-modality support, `TestGetInputModality` (no GPU), and `TestEncoderCudaGraphVideoReplay` (GPU) covering video capture/replay, fallback, counters, chunking, and mixed image+video through a shared manager. (+316 lines)

Test Plan
Unit test:
Functional test:
Benchmark:
Test Result
✅ Unit test:
✅ Functional test:
✅ Benchmark:
Single GPU (Qwen3-VL-8B-Instruct, 1xA100, random-mm, 100 prompts):
Multi GPU (Qwen3-VL-32B-Instruct, 4xA100, random-mm, 100 prompts):