[Feature] ViT Full CUDA Graph #35963
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which starts running only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces full CUDA graph support for the Vision Transformer (ViT) encoder, aiming to reduce kernel launch overhead and improve performance, particularly in multi-GPU scenarios. The implementation is well-structured, featuring a budget-based graph capture system with greedy bin-packing for efficient batching of images. Key additions include new configuration flags for enabling and tuning the feature, data-parallel sharding utilities for multi-GPU vision processing, and a new EncoderCudaGraphManager that encapsulates the CUDA graph lifecycle management. The integration into the existing model runner is clean, and the necessary modifications to the Qwen3-VL model are minimal and well-justified. Furthermore, a comprehensive new test suite has been added to validate the functionality of the encoder CUDA graph manager, covering various scenarios including capture, replay, fallbacks, and data parallelism. Overall, this is a high-quality contribution that brings a significant performance enhancement.
Note: Security Review did not run due to the size of the PR.
Hi @b-mu, the pre-commit checks have failed. Please run:

```bash
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed pre-commit hooks will run automatically.
Isotr0py left a comment
At first glance, my concern is the generality of the current encoder CG manager. I feel the current implementation is too Qwen3-VL-specific (MRoPE + ViT RoPE), and it would be difficult to extend this CG support to other ViTs.
```python
and modality == "image"
and "pixel_values" in mm_kwargs_group
and "image_grid_thw" in mm_kwargs_group
```
Hmm, I feel this is too model-specific, and it's difficult to use for other models with different `mm_kwargs` naming.
Can we execute the `encoder_cudagraph_manager` with `mm_kwargs_group` directly?
To address the general concern about the encoder cudagraph manager being too model-specific, I'm thinking about having a `class SupportsEncoderCudaGraph(Protocol)` in `vllm/model_executor/models/interfaces.py`. To use encoder cudagraph support, a model needs to implement a list of methods that tell the encoder cudagraph manager how to extract inputs, sequence metadata, etc. The encoder cudagraph manager would be model-agnostic. The specific concern here in `gpu_model_runner.py` could then use `mm_kwargs_group` directly. What do you think?
We introduced the `SupportsEncoderCudaGraph` protocol in `interfaces.py` with 9 protocol methods. The manager is now model-agnostic; all Qwen3-VL-specific logic lives in `qwen3_vl.py`, which implements the protocol.
The specific concern here in `gpu_model_runner.py` now uses `mm_kwargs_batch` directly and calls `self.encoder_cudagraph_manager.supports_modality(modality)` instead of checking for specific keys.
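For context, a minimal sketch of what such a protocol could look like. Apart from `prepare_encoder_cudagraph_capture_inputs()` (named later in this thread), the method names and signatures below are illustrative assumptions, not the PR's actual nine methods:

```python
from typing import Protocol, runtime_checkable

import torch


@runtime_checkable
class SupportsEncoderCudaGraph(Protocol):
    """Sketch: models opt in to encoder CUDA graphs by exposing hooks for
    input handling, metadata computation, and forward dispatch."""

    def encoder_cudagraph_supports_modality(self, modality: str) -> bool:
        """E.g. return modality == "image" for an image-only ViT."""
        ...

    def get_encoder_token_count(self, mm_kwargs: dict[str, torch.Tensor]) -> int:
        """Number of encoder tokens an input produces (for budget packing)."""
        ...

    def prepare_encoder_cudagraph_capture_inputs(
        self, token_budget: int, max_batch_size: int
    ) -> dict[str, torch.Tensor]:
        """Build dummy inputs that exactly fill `token_budget` tokens."""
        ...

    def encoder_cudagraph_forward(
        self, mm_kwargs: dict[str, torch.Tensor]
    ) -> torch.Tensor:
        """The encoder forward pass that the manager captures and replays."""
        ...
```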
```python
# Generate dummy grid config for capture only
# (not used for runtime batching). This is just one arbitrary
# example configuration that produces token_budget tokens.
# At runtime, actual images will be packed in any
# combination that fits the budget.
dummy_grid_config = self._generate_grid_config_for_budget(
    token_budget, self.max_batch_size
)

dummy_pixel_values, dummy_grid_thw = self._prepare_dummy_inputs(
    dummy_grid_config
)
```
I feel this is something that belongs in `DummyInputsBuilder`; otherwise the CG manager's dummy data creation could get quite complicated if we want to support other models.
Dummy input generation has been moved out of the manager and into the protocol method `prepare_encoder_cudagraph_capture_inputs()`. Each model implements its own dummy input logic (e.g., in `qwen3_vl.py`).
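As a rough illustration of how a model-agnostic manager can use such a hook, here is a minimal capture/replay sketch built on `torch.cuda.CUDAGraph`. The hook names follow the protocol sketch above; everything else is an assumption, not the PR's actual code:

```python
import torch


class EncoderGraphSketch:
    """Capture one graph per token budget using model-provided dummy
    inputs; replay by copying real inputs into the static buffers."""

    def capture(self, model, token_budget: int, max_batch_size: int) -> None:
        # Model-provided dummy inputs sized exactly to the budget
        # (hypothetical hook; the real signature may differ).
        self.static_inputs = model.prepare_encoder_cudagraph_capture_inputs(
            token_budget, max_batch_size
        )
        # Warm up outside the graph (allocator pools, autotuning caches).
        model.encoder_cudagraph_forward(self.static_inputs)
        torch.cuda.synchronize()
        # CUDA graphs replay fixed memory addresses, so the same input
        # tensors must be reused for every replay.
        self.graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(self.graph):
            self.static_output = model.encoder_cudagraph_forward(self.static_inputs)

    def replay(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        # Inputs must already be padded to the captured token budget.
        for name, tensor in inputs.items():
            self.static_inputs[name].copy_(tensor)
        self.graph.replay()
        return self.static_output
```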
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 52c4ab7 to 9521d31.
This pull request has merge conflicts that must be resolved before it can be merged.
@b-mu @Isotr0py can we have follow-up benchmarking? Past attempts suggest that CUDA graphs might not be beneficial for all workloads.
Add documentation for the encoder CUDA graph feature (PR vllm-project#35963), covering budget-based capture/replay, greedy bin-packing, data-parallel support, SupportsEncoderCudaGraph protocol, configuration, and usage. Signed-off-by: Baorun Mu <bmu@nvidia.com>
I believe those are in the "Test Result" section of this PR description (i.e., single-GPU is no DP and the 4-GPU is DP+TP)? If what you are looking for is 4-GPU pure TP, I don't see how the (impact of) allreduces are relevant to this feature.
I think most optimization techniques (despite what marketing or academic papers may claim) can only provide a perf boost in some scenarios, and you always have to make trade-offs. Sweeping this feature across other workloads is beyond the scope of this PR, and we have other priorities at the moment, but you are very welcome to give it a try on the workloads you care about; let us know if something breaks and we will fix it.
Thanks for the feedback @wangshangsam. My intention is to understand whether we should turn this feature on by default, because past PRs that attempted to enable CUDA graphs for the ViT shared which specific use cases sped up and which did not. What are your thoughts on this feature that you and your team have integrated? Did you manage to test it on a variety of workloads? It is fine even if there is no conclusion, since this feature is still experimental. :) I will also try to benchmark this feature when I have time.
Purpose
Add full CUDA graph support for the ViT to reduce kernel launch overheads.
Features:

- Budget-based graph capture: graphs are captured for a configurable list of token budgets (e.g. `[256, 512, 1024, 2048, 4096]`); during replays, images are greedily bin-packed into the smallest budget that fits (see the sketch after this list).
- With `mm_encoder_tp_mode="data"`, each TP rank runs the ViT independently via data parallelism.
- New `SupportsEncoderCudaGraph` protocol in `interfaces.py` — models opt in by implementing 9 protocol methods for input handling, metadata computation, and forward dispatch.
- `EncoderCudaGraphManager` is fully model-agnostic; all model-specific logic (grid config, dummy inputs, embedding computation) lives in the model class.
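A hedged sketch of the greedy bin-packing described above. The budgets and the smallest-budget-that-fits rule come from this description; the function name, signature, and fallback handling are assumptions, not the PR's actual implementation:

```python
def pack_into_budgets(
    image_token_counts: list[int],
    budgets: list[int],  # e.g. [256, 512, 1024, 2048, 4096]
    max_images_per_batch: int,
) -> list[tuple[list[int], int]]:
    """Greedily group images in arrival order, then assign each group the
    smallest captured budget that fits. Returns (image_indices, budget)
    pairs; budget == -1 marks an eager-execution fallback."""
    budgets = sorted(budgets)
    batches: list[tuple[list[int], int]] = []
    current: list[int] = []
    current_tokens = 0

    def flush() -> None:
        nonlocal current, current_tokens
        if current:
            # Smallest budget that can hold the packed tokens.
            budget = next(b for b in budgets if b >= current_tokens)
            batches.append((current, budget))
            current, current_tokens = [], 0

    for idx, tokens in enumerate(image_token_counts):
        if tokens > budgets[-1]:
            # Larger than any captured graph: run this image eagerly.
            batches.append(([idx], -1))
            continue
        if (current_tokens + tokens > budgets[-1]
                or len(current) >= max_images_per_batch):
            flush()
        current.append(idx)
        current_tokens += tokens
    flush()
    return batches
```

For example, `pack_into_budgets([300, 200, 100], [256, 512, 1024], 8)` packs all three images (600 tokens) into a single replay of the 1024-token graph.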
New config flags (via `--compilation-config`):

- `cudagraph_mm_encoder: true` — enable the encoder CUDA graph
- `encoder_cudagraph_token_budgets: [...]` — list of token budget sizes to capture
- `encoder_cudagraph_max_images_per_batch: N` — max images per graph replay
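As a usage illustration, the flags could be passed programmatically like this (a hypothetical invocation: the flag names come from this PR, while the model id and values are placeholders):

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-VL-30B-A3B-Instruct",  # placeholder model id
    compilation_config={
        "cudagraph_mm_encoder": True,
        "encoder_cudagraph_token_budgets": [256, 512, 1024, 2048, 4096],
        "encoder_cudagraph_max_images_per_batch": 16,
    },
)
```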
Files changed:

- `vllm/config/compilation.py` — new config flags
- `vllm/model_executor/models/interfaces.py` — `SupportsEncoderCudaGraph` protocol and `supports_encoder_cudagraph()` type guard
- `vllm/model_executor/models/qwen3_vl.py` — implement `SupportsEncoderCudaGraph` on `Qwen3VLForConditionalGeneration`
- `vllm/v1/worker/gpu/mm/encoder_cudagraph_defs.py` — `EncoderCudaGraphConfig`, `EncoderCudaGraphCaptureInputs`, `EncoderCudaGraphReplayBuffers` dataclasses
- `vllm/v1/worker/gpu/mm/encoder_cudagraph.py` — `EncoderCudaGraphManager` (capture, replay, packing, DP)
- `vllm/v1/worker/gpu_model_runner.py` — integration into the V1 model runner
- `tests/v1/cudagraph/test_encoder_cudagraph.py` — unit and GPU tests

cc @maxyanghu @wangshangsam @Anerudhan
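A rough sketch of what the three dataclasses in `encoder_cudagraph_defs.py` might hold; the fields below are inferred from the class names and are assumptions, not the PR's actual definitions:

```python
from dataclasses import dataclass, field

import torch


@dataclass
class EncoderCudaGraphConfig:
    token_budgets: list[int] = field(
        default_factory=lambda: [256, 512, 1024, 2048, 4096]
    )
    max_images_per_batch: int = 16


@dataclass
class EncoderCudaGraphCaptureInputs:
    # Static tensors the graph was captured with; replays copy into these.
    pixel_values: torch.Tensor
    grid_thw: torch.Tensor


@dataclass
class EncoderCudaGraphReplayBuffers:
    # One captured graph plus its buffers, keyed by token budget elsewhere.
    graph: torch.cuda.CUDAGraph
    inputs: EncoderCudaGraphCaptureInputs
    output: torch.Tensor
```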
Test Plan
Unit Tests:
End-to-End Tests:
Test Result
Single GPU (Qwen3-VL-30B, 1×GB200, VisionArena-Chat, 3000 prompts):
Multi GPU (Qwen3-VL-32B, 4×GB200 TP=4 DP=4, random-mm 20img/req, 1000 prompts):
Essential Elements of an Effective PR Description Checklist

- Update `supported_models.md` and `examples` for a new model.