[Misc][ViT Cuda Graphs] Enable Piecewise CUDA Graphs for Qwen3-VL and Qwen2.5-VL ViT to Improve Performance#33232
Conversation
Documentation preview: https://vllm--33232.org.readthedocs.build/en/33232/
Hi @HirokenOvo, the pre-commit checks have failed. Please run:

```bash
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Code Review
This pull request introduces a significant performance optimization by enabling piecewise CUDA graphs for the Vision Transformer (ViT) encoder in Qwen2.5-VL and Qwen3-VL models. The changes are comprehensive, spanning model definitions, configuration, data-parallel handling, and the CUDA graph dispatching mechanism. The implementation is well-tested and demonstrates clear performance benefits, especially for low-concurrency scenarios. My primary concern is the introduction of global state to manage graph compilation options, which could affect maintainability and introduce subtle bugs in more complex execution scenarios.
vllm/compilation/backends.py
Outdated
```python
# A global flag to indicate if the current graph being compiled
# is the last one in a sequence of graphs (e.g., a sequence of blocks).
# This is a workaround to control CUDAGraph weak_ref_output behavior
# in **vit** piecewise compilation.
_is_last_graph_in_vit_sequence: bool = True


@contextmanager
def set_is_last_graph_in_vit_sequence(is_last: bool) -> Iterator[None]:
    """Context manager to indicate if the current graph being compiled
    is the last one in a sequence of graphs (e.g., a sequence of blocks).
    """
    global _is_last_graph_in_vit_sequence
    original_value = _is_last_graph_in_vit_sequence
    _is_last_graph_in_vit_sequence = is_last
    try:
        yield
    finally:
        _is_last_graph_in_vit_sequence = original_value


# A global flag to indicate if the current graph being compiled
# is the first one in a sequence of graphs (e.g., a sequence of blocks).
_is_first_graph_in_vit_sequence: bool = True


@contextmanager
def set_is_first_graph_in_vit_sequence(is_first: bool) -> Iterator[None]:
    """Context manager to indicate if the current graph being compiled
    is the first one in a sequence of graphs (e.g., a sequence of blocks).
    """
    global _is_first_graph_in_vit_sequence
    original_value = _is_first_graph_in_vit_sequence
    _is_first_graph_in_vit_sequence = is_first
    try:
        yield
    finally:
        _is_first_graph_in_vit_sequence = original_value
```
The introduction of global flags _is_last_graph_in_vit_sequence and _is_first_graph_in_vit_sequence to control CUDA graph options is a significant design concern. While the use of context managers helps to scope their usage, relying on global state makes the code harder to reason about and debug, and it is not thread-safe, which could lead to subtle bugs if compilations were ever to run concurrently.
Consider exploring alternatives to pass this state explicitly through the call stack, for instance, by extending the ForwardContext or another context object. This would make the data flow explicit and improve the overall maintainability and robustness of the compilation backend.
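As an illustration of that alternative, the same first/last-graph state could be carried in a context-local object rather than module globals; `contextvars` makes it thread- and task-safe. The names here (`GraphSequenceInfo`, `graph_sequence`) are hypothetical, not vLLM's actual `ForwardContext` API:

```python
# Sketch: pass first/last-graph state explicitly instead of via module globals.
import contextvars
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class GraphSequenceInfo:
    is_first_graph: bool = True
    is_last_graph: bool = True


# A ContextVar gives each thread/async task its own value, unlike a module global.
_graph_seq: contextvars.ContextVar[GraphSequenceInfo] = contextvars.ContextVar(
    "graph_seq", default=GraphSequenceInfo()
)


@contextmanager
def graph_sequence(is_first: bool, is_last: bool) -> Iterator[None]:
    token = _graph_seq.set(GraphSequenceInfo(is_first, is_last))
    try:
        yield
    finally:
        _graph_seq.reset(token)


def current_graph_sequence() -> GraphSequenceInfo:
    # Falls back to the defaults when no sequence is active.
    return _graph_seq.get()
```

The backend would read `current_graph_sequence()` instead of the two globals, and the token-based reset restores the previous value even under nesting.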
I want to second this comment - I have had trouble reasoning about these context-manager-backed global variables and have seen bugs introduced by them. If we can explore another way of passing this information, I think it would be much better for the code.
@Lucaskabela The original reason for using global variables was that ViT torch.compile is applied piecewise to individual sub-modules, rather than the top-level module. This means the compilation backend only sees one sub-module's graph at a time, not the full ViT graph, rendering the existing first/last graph detection logic (used for LLMs) incorrect.
In the latest commit, I have refactored this to pass the information via the forward context, eliminating the need for these global variables. cc @wangxingran222
Cursor Bugbot has reviewed your changes and found 6 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
Hey @HirokenOvo Thank you very much for contributing this PR! 🚀 This is definitely one of the ideas that we are aware of but didn't get time to experiment yet, so very much appreciated your effort here! I'm a bit busy recently so prob don't have time to review this in the next two days, but I'll definitely try to get to this PR this weekend! cc @ProExpertProg @DarkLight1337 in case you want to take a first pass.
@HirokenOvo is the torch compile graph enabled by default? Last time we tested the ViT torch compile feature (before your PR) on ROCm, there were performance regressions in certain cases, e.g. when the DP ViT feature is enabled. Can you also try running your feature with the DP ViT feature enabled, and check whether DP ViT + torch compile is faster than DP ViT without torch compile?
@tjtanaa Sorry for the late reply.

First question: No, this feature is not enabled by default. The Piecewise CUDA Graph implementation relies on `torch.compile`.

Second question: As mentioned in the Limitations & Notes section, `torch.compile` may introduce a slight regression for individual operators. However, this PR specifically targets scenarios with fewer images or when images are distributed via ViT DP. In these cases, the computational load per rank is smaller, and the execution time is dominated by "bubbles" caused by kernel launch overhead rather than the operator execution itself. The performance gain from using CUDA Graphs to eliminate these bubbles outweighs the slight regression introduced by `torch.compile`. Regarding the root cause of the `torch.compile` regression: in the future, we plan to develop a Full CUDA Graph based on FA3, which will not require `torch.compile`.
vllm/config/vllm.py
Outdated

```python
3. If no sizes are provided by the user, a default list of sizes is
   generated up to a maximum of 5120. The default sizes are:
   [512, 1024, 1536] + list(range(2048, 2048, 128)) + list(
```
Let's add a comment explaining why these are the default ranges we are using (i.e. image sizes are usually one of these, or something to that effect)
@Lucaskabela Actually, in our internal usage, we specify vit_cudagraph_capture_sizes based on specific workload requirements. For the general community default, we designed it with the following considerations:
- Patch Variance: Unlike the LLM part (which might use a step size of 8), image patch counts vary significantly.
- Small Inputs: For scenarios with very few images, the GEMM computation is too small to effectively hide kernel launch overheads. Therefore, padding to a larger starting size is acceptable.
- Step Size: Increasing the stride doesn't significantly add to the computation time for the ViT part. However, a larger stride reduces the number of graphs to be captured, which saves VRAM and reduces startup time.
We are not certain if this default fits all general community scenarios, so we are open to discussion if there are better configurations.
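A minimal sketch of that kind of schedule (a few coarse small sizes, then a fixed stride up to a cap); the function name and exact numbers are illustrative, not necessarily vLLM's actual defaults:

```python
def default_vit_capture_sizes(max_size: int = 5120, stride: int = 128) -> list[int]:
    """Sketch of a ViT capture-size schedule.

    A few coarse small sizes first (tiny inputs are padded up, which is
    cheap because the GEMMs are launch-bound anyway), then a fixed stride
    so fewer graphs are captured, saving VRAM and startup time.
    """
    small = [s for s in (512, 1024, 1536) if s <= max_size]
    strided = list(range(2048, max_size + 1, stride))
    return small + strided
```

For example, `default_vit_capture_sizes(2304)` yields `[512, 1024, 1536, 2048, 2176, 2304]`.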
Hi @HirokenOvo from the torch.compile team - this is awesome work! We were intending to investigate this enablement in the coming few months, so it is great to see it here working today! In regards to torch.compile negatively impacting ViT operator performance, we are planning on looking into this very soon - I would not expect this to be a fundamental limitation of torch.compile, but perhaps some oversight in how one of our passes impacts multimodal encoders.
@cursor review
Signed-off-by: Hongjian Zhang <hirokenovo@gmail.com>
@DarkLight1337 Thank you for the feedback. Here's the rationale for the current design:

1. Model Runner Refactoring: I extracted …

2. Why Model-Layer Modifications Are Necessary: The model-layer changes (persistent buffers and …)
vllm/v1/cudagraph_dispatcher.py
Outdated
```python
has_lora: bool = False,
disable_full: bool = False,
num_active_loras: int = 0,
is_mm_encoder: bool = False,
```
Thanks for the contribution, the performance gains are impressive!
I agree with @DarkLight1337, the complexity is too high right now, in particular inside the model runner and CudagraphDispatcher. On that note, why not just instantiate another cudagraph_dispatcher in the gpu_model_runner:

```python
# Cudagraph dispatcher for runtime cudagraph dispatching.
self.cudagraph_dispatcher = CudagraphDispatcher(self.vllm_config)
if self.supports_mm_inputs:
    self.mm_cudagraph_dispatcher = CudagraphDispatcher(self.vllm_config)
...
```

and initialize these with different keys. Then these flags are not required. We can change initialize_cudagraph_keys to accept capture_sizes instead of fetching it directly from self.compilation_config.cudagraph_capture_sizes to assist with this. We do this for eagle (it has a separate cudagraph_dispatcher instance).
Ideally we'd keep the CudagraphDispatcher as simple as possible.
see:
vllm/v1/spec_decode/eagle.py, lines 107 to 111 (at 61e632a)
vllm/v1/spec_decode/eagle.py, lines 272 to 287 (at 61e632a)
Thanks for your suggestion. Following your advice has indeed significantly reduced the modifications required in CudagraphDispatcher. Regarding gpu_model_runner, I have encapsulated the relevant logic into MMEncoderCudagraphManager to minimize changes to the runner itself.
vllm/v1/cudagraph_dispatcher.py
Outdated
| """ | ||
|
|
||
| def __init__(self, vllm_config: VllmConfig): | ||
| def __init__(self, vllm_config: VllmConfig, is_mm_encoder: bool = False): |
thanks for cleaning this up! Let's just pass max_capture_size and capture_sizes so we can avoid this and make it more extensible.
(will be helpful if we have separate sizes for drafters too)
we should pass it into initialize_cudagraph_keys instead of __init__, so it happens after adjust_cudagraph_sizes_for_spec_decode
Thanks for the suggestions! I have refactored CudagraphDispatcher as requested.
This pull request has merge conflicts that must be resolved before it can be merged.
```python
all_moe_layers: list[str] | None = None
moe_layer_index: int = 0

# mm_encoder Multi-Modal Encoder flags used by backend compiler
```
There was a problem hiding this comment.
Am I missing something? Where are these flags being read?
There was a problem hiding this comment.
Am I missing something? Where are these flags being read?
@DarkLight1337 Sorry about the confusing comment. These flags are actually consumed by the Dynamo post-processing logic in vllm/compilation/backends.py.
Since ViT uses torch.compile on submodules (unlike the top-level approach in LLMs), we end up with multiple Dynamo graphs. The backend incorrectly treats each submodule as a standalone sequence, leading to wrong gc_disable and weak_ref_output settings. This causes intermediate tensors to be prematurely garbage collected during capture.
To fix this, we introduced these flags and manually call set_is_first_graph_in_mm_encoder_sequence and set_is_last_graph_in_mm_encoder_sequence inside the ViT forward method. This explicitly hints the correct global sequence boundaries to the backend wrapper.
@HirokenOvo @Lucaskabela @DarkLight1337 @LucasWilkinson I agree with @DarkLight1337 that the PR introduces a lot of complexity. My concerns are as follows:

Concern 1: As mentioned above, this feature is only restricted to …

Concern 2: Lack of extensibility. We have to reimplement the same complex logic for all other models. The management of buffers in the model definition file will bloat the code. I am not very familiar with torch compile; can we do the buffer management elsewhere, like at the Layer class, rather than in the model definition file?
@HirokenOvo @Lucaskabela When you were planning and implementing your solutions, did you find the framework restrictive? IIRC there have been quite a number of attempts to add the torch compile feature to the ViT/multimodal part, and many introduce large complexity with limited usage. It makes me think that we may need to redesign the framework for ViT/multimodality so that torch compile can be supported natively without heavily modifying the model definition file and manually managing the persistent buffers. And given that it also requires changes to the model runner, maybe we can get more of your thoughts in the model runner V2, trying to make multimodality torch compile and cudagraph compatible. What are everyone's thoughts on this? CC @ProExpertProg to the discussion as he is working on the torch compile features, e.g. fusion passes. Maybe we can get a new perspective on this.
@tjtanaa During our development process, we indeed encountered several framework constraints, primarily in two areas:

Constraint 1: ViT DP load balancing at the model layer …

Constraint 2: To correctly chain independently compiled graph segments generated by …
To avoid torch.compile, could we maybe try going straight to FULL cudagraphs? Or is there something preventing that?
@LucasWilkinson There are some CPU operations within the forward pass that cannot be captured by CUDA Graphs (see vllm/model_executor/models/qwen3_vl.py, lines 551 to 559 at 5a5c435). This prevents us from wrapping the entire ViT in a single full CUDA graph, unlike the LLM part; we still need to wrap it at the sub-module level. Therefore, we still need to manually persist memory addresses before calling each sub-module in the forward pass. While using FULL mode would strictly eliminate kernel launch overheads, it does not avoid the code complexity associated with this piecewise capturing and manual buffer management.
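A simplified sketch of that manual buffer management (a pure-Python stand-in; in practice this would be a persistent torch.Tensor per sub-module, and the class/method names here are illustrative):

```python
# Sketch: a captured graph replays against fixed memory addresses, so a
# variable-size input must be copied in-place into a preallocated max-size
# buffer (with padding zeroed) before each replay.
class PersistentBuffer:
    def __init__(self, max_patches: int) -> None:
        self.data = [0.0] * max_patches  # allocated once; identity never changes

    def copy_in(self, values: list[float], padded_size: int) -> None:
        """Copy a variable-size input in-place and zero the padding slots,
        so a graph captured at `padded_size` replays on valid data."""
        n = len(values)
        assert n <= padded_size <= len(self.data)
        self.data[:n] = values                      # in-place slice assignment
        self.data[n:padded_size] = [0.0] * (padded_size - n)
```

The key property is that `self.data` is never reallocated between calls, which is what a replayed CUDA graph requires of its input tensors.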
For torch.compile integration on its own, I think it is quite unrestrictive, which perhaps leads to the confusion: we can apply it almost anywhere with minimal complexity (there are some support structures like …).

That said, I do think many of the current implementations for ViT models are not written with the utmost efficiency in mind, as there are a number of ops that have these CPU-sync and/or buffer issues. Additionally, having a clear and consistent integration point (like …
For the MLPerf v6.0 Qwen3-VL submission from NVIDIA, our team actually developed full cudagraph (and piece-wise graph and torch.compile too, though slightly differently from this PR). We are happy to upstream this feature. cc @b-mu @maxyanghu who developed it, and @ywang96 who's helping to upstream some other coming-from-MLPerf features currently.
Purpose
Based on the torch.compile mechanism for generic nn.Module from PR #23207 and PR #27741, this PR implements piecewise CUDA graph support for the Vision Transformer (ViT) encoder in Qwen2.5-VL and Qwen3-VL models. The primary goal is to eliminate kernel launch bubbles for operators other than attention, thereby reducing overhead and improving inference performance.
This optimization is particularly effective for scenarios with low concurrency where the number of images is insufficient to fully saturate the ViT's computational capacity. By reducing kernel launch overhead, it helps to lower the TTFT.
Key Features
- Updated `Qwen2_5_VisionTransformer` and `Qwen3_VisionTransformer` to be graph-friendly, including handling of persistent buffers for hidden states and RoPE.

Usage
Enable ViT CUDA Graph
To enable this feature, you need to set `compile_mm_encoder` to `True` via the `--compilation-config` argument:

```bash
--compilation-config '{"compile_mm_encoder": true}'
```

Configure Capture Sizes
You can specify the image patch counts for which to capture ViT CUDA graphs using `mm_encoder_cudagraph_capture_sizes`:

```bash
--compilation-config '{"compile_mm_encoder": true, "mm_encoder_cudagraph_capture_sizes": [512, 1024]}'
```

If not specified, vLLM will automatically select a set of sizes based on the model's encoder budget.
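For intuition, matching a runtime patch count to one of the captured sizes might look like the following sketch (illustrative only, not vLLM's actual dispatch code):

```python
import bisect
from typing import Optional


def select_capture_size(num_patches: int, capture_sizes: list[int]) -> Optional[int]:
    """Return the smallest captured size >= num_patches (the input is padded
    up to it), or None to fall back to eager mode when no captured graph
    fits. `capture_sizes` must be sorted ascending."""
    i = bisect.bisect_left(capture_sizes, num_patches)
    return capture_sizes[i] if i < len(capture_sizes) else None
```

For example, 600 patches against sizes `[512, 1024]` would be padded up and dispatched to the 1024-patch graph.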
Alternatively, you can specify `max_mm_encoder_cudagraph_capture_size` to generate a default list of capture sizes up to the given value:

```bash
--compilation-config '{"compile_mm_encoder": true, "max_mm_encoder_cudagraph_capture_size": 2048}'
```

Limitations & Notes
- `torch.compile` Issues: `torch.compile` consumes a significant amount of GPU memory for compiling the ViT layer and may introduce negative optimization, leading to a slight increase in latency for individual operators. However, this issue is orthogonal to this PR, which focuses on enabling CUDA graph capture to reduce launch overhead.

Performance
The performance was benchmarked under the following configuration:
(20 images / 4 DP ranks) * 128 tokens/image * 4 merge_size = 2560.

without ViT cudagraph:

with ViT cudagraph:


The remaining bubbles are due to the GEMM being too small to fully hide the kernel launch overhead of the attention operator. Future work could involve implementing a full CUDA graph with FlashAttention-3 for the ViT to completely eliminate this overhead.
Test Plan
Run the following test to verify the consistency between ViT CUDA graph execution and eager mode:
Test Result
6 passed, confirming the correctness of the piecewise CUDA graph implementation.
Essential Elements of an Effective PR Description Checklist
- `supported_models.md` and `examples` for a new model.