
[MM][Perf][CG] Enable encoder CUDA Graph for MiniCPM-V#41996

Draft
YunzhuLu wants to merge 2 commits intovllm-project:mainfrom
YunzhuLu:vit-cuda-graph-minicpmv-2.6-4.0-4.5

Conversation


@YunzhuLu YunzhuLu commented May 7, 2026

Purpose

Add encoder CUDA Graph support for MiniCPM-V 2.5, 2.6, 4.0, and 4.5 as part of tracker #38175. This implementation follows the existing workflow introduced in #38061.

The captured graph covers both the ViT encoder (VPM) and the resampler, with version-specific handling for:

  • MiniCPM-V 2.5, where the VPM does not accept tgt_sizes
  • MiniCPM-V 4.5, which passes temporal_ids to the resampler

MiniCPM-V 2.0 is not included, as it predates the slice-based vision architecture required by this implementation.

Key Updates & Fixes

This PR also refactors the Resampler to be fully compatible with CUDA Graph capture and fixes a silent pipeline bug in eager mode.

Tensorized Resampler Forward Pass

  • Refactored Resampler2_5 and Resampler4_5 to eliminate host-side logic (for loops, .tolist(), and .item()).

  • Replaced dynamic Python slicing with pure tensor operations (deriving h_idx and w_idx from a flat seq_idx). Handled out-of-bounds safety with torch.clamp and built a dynamic key_padding_mask via torch.where.
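The index math behind these bullets can be sketched as follows. This is an illustrative reconstruction, not the exact vLLM code: names like tensorized_indices are hypothetical, and torch.minimum stands in for the clamp trick mentioned above. It assumes tgt_sizes is a (B, 2) long tensor of per-slice (h, w).

```python
import torch

def tensorized_indices(tgt_sizes: torch.Tensor, max_len: int):
    """Sketch: build per-token grid indices and a padding mask for
    variable-sized vision slices with pure tensor ops, so everything
    stays on-device and is CUDA-Graph capturable (no loops, .item(),
    or .tolist()).

    tgt_sizes: (B, 2) long tensor of (h, w) per slice.
    max_len:   padded sequence length, >= max(h * w) over the batch.
    """
    h = tgt_sizes[:, 0:1]                          # (B, 1)
    w = tgt_sizes[:, 1:2]                          # (B, 1)
    seq_idx = torch.arange(max_len).unsqueeze(0)   # (1, L)

    valid = seq_idx < (h * w)                      # True for real tokens
    # Row/column of each flat position; torch.minimum keeps padded
    # positions in-bounds (the out-of-bounds safety described above)
    # instead of branching on the host.
    h_idx = torch.minimum(seq_idx // w.clamp(min=1), h - 1)
    w_idx = torch.minimum(seq_idx % w.clamp(min=1), w - 1)
    # key_padding_mask: True marks positions attention should ignore.
    key_padding_mask = ~valid
    return h_idx, w_idx, key_padding_mask
```

Because every output is a pure function of the buffer contents, replaying the captured graph with different tgt_sizes in the static buffers produces correct indices.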

Fixed temporal_ids Incompatibility for v4.5

  • Added a dispatch in Resampler4_5.forward to accept a flat 1D torch.Tensor during graph replay, safely bypassing the Python-level cross-frame merge loop.

  • Passed temporal_ids correctly through encoder_cudagraph_forward.
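The dual-path dispatch described above can be sketched like this. The helper name normalize_temporal_ids and the exact shapes are illustrative assumptions, not the PR's actual code; the -1 fill value follows the convention discussed in the review below (no temporal embedding applied).

```python
import torch

def normalize_temporal_ids(temporal_ids, num_slices: int) -> torch.Tensor:
    """Sketch of the dispatch: the eager path hands the resampler a
    list[list[int]] (per-frame ids), while CUDA-Graph replay feeds a
    pre-flattened 1D tensor from the static buffers."""
    if isinstance(temporal_ids, torch.Tensor):
        # Graph-replay path: already flat, use as-is and skip the
        # Python-level cross-frame merge loop.
        return temporal_ids
    if temporal_ids is None:
        # -1 signals "apply no temporal embedding" for non-video input.
        return torch.full((num_slices,), -1, dtype=torch.long)
    # Eager path: flatten the nested per-frame structure on the host.
    flat = [t for frame in temporal_ids for t in frame]
    return torch.tensor(flat, dtype=torch.long)
```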

Fixed Eager-Mode Video Input Bug

  • Registered temporal_ids as a video batched field in _minicpmv_field_config.

  • Defined temporal_ids safely with Annotated in MiniCPMVImagePixelInputs to pass TensorSchema validation.

  • Removed premature flatten_2d_lists in get_vision_hidden_states to preserve the list[list[int]] structure required by the resampler's eager path.

Test Plan

Unit test

pytest tests/v1/cudagraph/test_encoder_cudagraph.py -v

Functional test

GPU: RTX 5090
Model: MiniCPM-V-4_5

No CUDA Graph

vllm serve /root/autodl-tmp/huggingface/hub/MiniCPM-V-4_5/OpenBMB/MiniCPM-V-4_5 \
  --trust-remote-code \
  --served-model-name MiniCPM-V-4_5 \
  --gpu-memory-utilization 0.75 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --limit-mm-per-prompt '{"video": 1, "image": 1}'

With CUDA Graph

 vllm serve /root/autodl-tmp/huggingface/hub/MiniCPM-V-4_5/OpenBMB/MiniCPM-V-4_5 \
  --trust-remote-code \
  --served-model-name MiniCPM-V-4_5 \
  --gpu-memory-utilization 0.75 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --limit-mm-per-prompt '{"video": 1, "image": 1}' \
  --compilation-config '{
    "cudagraph_mm_encoder": true,
    "encoder_cudagraph_token_budgets": [1024],
    "encoder_cudagraph_max_vision_items_per_batch": 32
  }'

Benchmark

GPU: RTX 5090
Model: MiniCPM-V-2_6 / MiniCPM-V-4_5

vllm bench mm-processor \
  --model /root/autodl-tmp/huggingface/hub/MiniCPM-V-4_5/OpenBMB/MiniCPM-V-4_5 \
  --trust-remote-code \
  --dataset-name random-mm \
  --num-prompts 100 \
  --num-warmups 10 \
  --max-model-len 4096 \
  --seed 42 \
  --random-mm-base-items-per-request 1 \
  --random-mm-num-mm-items-range-ratio 0.0 \
  --random-mm-bucket-config '{"(448, 448, 3)": 1.0}' \
  --compilation-config '{
    "cudagraph_mm_encoder": true, 
    "encoder_cudagraph_token_budgets": [512, 1024], 
    "encoder_cudagraph_max_vision_items_per_batch": 4
  }'

E2E benchmark

GPU: RTX 5090
Model: MiniCPM-V-2_6 / MiniCPM-V-4_5

vllm serve /root/autodl-tmp/huggingface/hub/MiniCPM-V-4_5/OpenBMB/MiniCPM-V-4_5 \
  --trust-remote-code \
  --served-model-name MiniCPM-V-4_5 \
  --gpu-memory-utilization 0.75 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --limit-mm-per-prompt '{"image": 1}' \
  --compilation-config '{
    "cudagraph_mm_encoder": true,
    "encoder_cudagraph_token_budgets": [512, 1024],
    "encoder_cudagraph_max_vision_items_per_batch": 8
  }'

Benchmark

vllm bench serve \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --backend openai-chat \
  --dataset-name random-mm \
  --input-len 128 \
  --output-len 4 \
  --random-mm-base-items-per-request 1 \
  --random-mm-bucket-config '{"(448,448,3)": 1.0}' \
  --num-prompts 100 \
  --num-warmups 50 \
  --request-rate 1

Test Result

Unit test

36 passed, 16 warnings in 7.11s 

Functional test
Image

# No CUDA Graph
--------------------------------------------------
This image shows a woman standing in a room. She is wearing a green t-shirt and denim shorts. The room has light blue walls, a window, a table with chairs, and a television screen displaying images of aircraft carriers and text in Chinese. The time on the TV screen reads "下午20:54" (7:54 PM).
--------------------------------------------------
This image shows a woman standing in a room. She is wearing a green t-shirt and denim shorts. The room has light blue walls, a window, a table with chairs, and a television screen displaying images of aircraft carriers and text in Chinese. The time on the TV screen reads "下午20:54" (7:54 PM).
--------------------------------------------------
This image shows a woman standing in a room. She is wearing a green t-shirt and denim shorts. The room has light blue walls, a window, a table with chairs, and a television mounted on the wall. The TV screen displays images of aircraft carriers and text in Chinese, which includes the time "下午20:54" (7:54 PM) and the phrase "中国舰八月见" (Chinese ships in August).
--------------------------------------------------
# With CUDA Graph 
--------------------------------------------------
This image shows a woman standing in a room. She is wearing a green t-shirt and denim shorts. The room has light blue walls, a window, and a television screen displaying images of aircraft carriers and text in Chinese. There are also some chairs and a table in the background.
--------------------------------------------------
The image shows a woman standing in a room. She is wearing a green t-shirt and denim shorts. Behind her, there is a television screen displaying images of aircraft carriers and text in Chinese, which includes the time "下午20:54" (20:54 PM) and the phrase "中国舰八月见" (Chinese ships in August). The room has a light blue wall, a window, and some chairs and tables.
--------------------------------------------------
The image shows a woman standing in a room. She is wearing a green t-shirt and denim shorts. Behind her, there is a television screen displaying images of aircraft carriers and text in Chinese, which includes the time "下午20:54" (20:54 PM) and the phrase "中国舰八月见" (Chinese ships in August). The room has a light blue wall, a window, and some chairs and tables.
--------------------------------------------------

Video

# No CUDA Graph
--------------------------------------------------
The video shows a bridge over a river, with a train passing through on the bridge. The sky is cloudy, and the train is lit up with yellow lights. The train moves from left to right, and the camera follows its movement. The scene is calm and peaceful, with the train being the only source of movement.
--------------------------------------------------
The video shows a bridge over a river at night. A train is seen moving from the left side of the screen to the right. The train is lit up with yellow lights, and its reflection can be seen in the water below. The sky is dark with some clouds visible. The train continues to move across the bridge, and the scene remains unchanged.
--------------------------------------------------
The video shows a bridge with a train passing over it. The train is moving from the left to the right side of the frame. The sky is dark and cloudy, and the water below the bridge is calm. The train is lit up with yellow lights, and the bridge is also lit up with yellow lights. The train is moving at a steady pace, and the camera is stationary. The video is taken at night, and the only source of light is the train and the bridge. The train is the main focus of the video, and it is the only moving object in the scene. The video is peaceful and serene, with no sound or music. The train is the only thing that can be heard, and it is the sound of the train moving over the bridge. The video is a beautiful representation of a train passing over a bridge at night, with the yellow lights of the train and the bridge creating a stunning contrast against the dark sky and calm water.
--------------------------------------------------
# With CUDA Graph 
--------------------------------------------------
The video shows a train crossing a bridge over a river. The train is lit up with yellow lights and moves from left to right across the screen. The sky is dark and cloudy, with a hint of light visible in the background. The train continues moving along the bridge until it disappears from the frame.
--------------------------------------------------
The video shows a train crossing a bridge over a river. The train is moving from the left side of the frame to the right, and the bridge is illuminated by yellow lights. The sky is cloudy, and the water in the river is calm. The train continues to move across the bridge until it disappears from the right side of the frame.
--------------------------------------------------
The video shows a train moving from the right side of the screen to the left, passing over a bridge. The train is yellow and has a red tail light. The sky is dark and cloudy, and the water below the bridge is dark and murky. The train is the only object in motion in the video.
--------------------------------------------------

Benchmark:

MiniCPM-V-2_6 + image
No CUDA Graph:

With CUDA Graph:

MiniCPM-V-4_5 + image
No CUDA Graph:

================================================================================
Multimodal Processor Benchmark Results
================================================================================

MM Processor Metrics:
                     Stage  Mean Median  Std P99.0
          get_mm_hashes_ms  0.50   0.47 0.14  1.09
get_cache_missing_items_ms  0.03   0.03 0.01  0.05
     apply_hf_processor_ms 14.32  13.61 3.08 29.89
        merge_mm_kwargs_ms  0.09   0.07 0.08  0.36
   apply_prompt_updates_ms  5.54   5.81 1.40  9.62
     preprocessor_total_ms 20.47  19.75 4.26 40.36
        encoder_forward_ms 24.50  24.43 1.09 27.07
         num_encoder_calls  1.00   1.00 0.00  1.00

Summary: 100 total encoder calls across 100 requests.

End-to-End Latency (ms):
Metric Value (ms)
  Mean   10549.41
Median    9868.40
   Std    2881.78
 P99.0   14650.85

With CUDA Graph:

================================================================================
Multimodal Processor Benchmark Results
================================================================================

MM Processor Metrics:
                     Stage   Mean Median    Std  P99.0
          get_mm_hashes_ms   0.51   0.47   0.17   1.32
get_cache_missing_items_ms   0.03   0.03   0.01   0.06
     apply_hf_processor_ms  13.94  13.02   3.45  29.19
        merge_mm_kwargs_ms   0.08   0.07   0.03   0.22
   apply_prompt_updates_ms   5.44   5.65   1.36  10.45
     preprocessor_total_ms  19.99  19.12   4.54  38.72
        encoder_forward_ms 469.24 394.19 149.10 690.33
         num_encoder_calls   1.00   1.00   0.00   1.00

Summary: 100 total encoder calls across 100 requests.

End-to-End Latency (ms):
Metric Value (ms)
  Mean   41424.83
Median   41809.09
   Std   14973.20
 P99.0   59107.62

E2E benchmark: MiniCPM-V-4_5 + image
No CUDA Graph:

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Request rate configured (RPS):           1.00      
Benchmark duration (s):                  100.24    
Total input tokens:                      33500     
Total generated tokens:                  400       
Request throughput (req/s):              1.00      
Output token throughput (tok/s):         3.99      
Peak output token throughput (tok/s):    20.00     
Peak concurrent requests:                4.00      
Total token throughput (tok/s):          338.18    
---------------Time to First Token----------------
Mean TTFT (ms):                          150.51    
Median TTFT (ms):                        150.65    
P99 TTFT (ms):                           212.62    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.55     
Median TPOT (ms):                        9.83      
P99 TPOT (ms):                           25.29     
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.97      
Median ITL (ms):                         9.68      
P99 ITL (ms):                            38.58     
==================================================

With CUDA Graph:

============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           1.00      
Benchmark duration (s):                  501.48    
Total input tokens:                      167499    
Total generated tokens:                  2000      
Request throughput (req/s):              1.00      
Output token throughput (tok/s):         3.99      
Peak output token throughput (tok/s):    43.00     
Peak concurrent requests:                12.00     
Total token throughput (tok/s):          338.00    
---------------Time to First Token----------------
Mean TTFT (ms):                          1165.54   
Median TTFT (ms):                        1008.79   
P99 TTFT (ms):                           2903.78   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          192.60    
Median TPOT (ms):                        246.59    
P99 TPOT (ms):                           728.90    
---------------Inter-token Latency----------------
Mean ITL (ms):                           144.73    
Median ITL (ms):                         10.38     
P99 ITL (ms):                            1448.34   
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: YunzhuLu <lucia.yunzhu@gmail.com>

mergify Bot commented May 7, 2026

Documentation preview: https://vllm--41996.org.readthedocs.build/en/41996/

@mergify mergify Bot added documentation Improvements or additions to documentation multi-modality Related to multi-modality (#4194) nvidia labels May 7, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces Encoder CUDA Graph support for MiniCPM-V models (versions 2.5, 2.6, 4.0, and 4.5) by implementing a new mixin class, _MiniCPMVEncoderCudaGraphMixin, and updating the corresponding model classes. It also includes necessary test configurations and documentation updates. The review highlighted several critical issues that need to be addressed: the resampler's forward method is currently incompatible with CUDA graph capture due to host-side logic, and the handling of temporal_ids requires refinement, specifically by using -1 instead of 0 for dummy or missing values and ensuring compatibility with tensor inputs.


if self.version == (4, 5):
    temporal_ids = buffers[_MINICPMV_CUDAGRAPH_BUF_KEY_TEMPORAL_IDS]
    resampler_out = self.resampler(vision_embedding, tgt_sizes, temporal_ids)

critical

Resampler4_5.forward expects temporal_ids to be a nested list (it uses chain.from_iterable and len() on elements), but the CUDA graph path passes a 1D torch.Tensor from the replay buffers. This will cause a TypeError during graph capture or replay. The resampler's forward method must be updated to handle tensor inputs for temporal_ids to support CUDA Graphs.

Comment thread vllm/model_executor/models/minicpmv.py Outdated
Comment on lines +1489 to +1491
buffers[_MINICPMV_CUDAGRAPH_BUF_KEY_TEMPORAL_IDS] = torch.zeros(
    max_num_slices, dtype=torch.long, device=device
)

high

The temporal_ids buffer should be initialized with -1 instead of 0 for dummy inputs. In Resampler4_5.forward, a value of -1 indicates that no temporal embedding should be applied, whereas 0 will cause the model to apply the temporal embedding at index 0, which is incorrect for non-video or dummy inputs.

Suggested change:
- buffers[_MINICPMV_CUDAGRAPH_BUF_KEY_TEMPORAL_IDS] = torch.zeros(
-     max_num_slices, dtype=torch.long, device=device
- )
+ buffers[_MINICPMV_CUDAGRAPH_BUF_KEY_TEMPORAL_IDS] = torch.full(
+     (max_num_slices,), -1, dtype=torch.long, device=device
+ )

Comment thread vllm/model_executor/models/minicpmv.py Outdated
    flatten_2d_lists(temporal_ids), dtype=torch.long, device=device
)
else:
    flat_ids = torch.zeros(len(tgt_sizes), dtype=torch.long, device=device)

high

flat_ids should be initialized with -1 instead of 0 when temporal_ids is missing. This ensures that no temporal embedding is applied by the resampler for these items, matching the logic in the eager path.

Suggested change:
- flat_ids = torch.zeros(len(tgt_sizes), dtype=torch.long, device=device)
+ flat_ids = torch.full(
+     (len(tgt_sizes),), -1, dtype=torch.long, device=device
+ )

Comment on lines +1568 to +1570
    resampler_out = self.resampler(vision_embedding, tgt_sizes, temporal_ids)
else:
    resampler_out = self.resampler(vision_embedding, tgt_sizes)

high

The resampler (both Resampler2_5 and Resampler4_5) is currently incompatible with CUDA Graph capture because its forward method contains host-side logic such as Python loops over the batch dimension, .item(), and .tolist() calls. These operations are executed once during capture and will not be updated during replay, leading to incorrect results if the input tgt_sizes or patch counts differ from the capture-time dummy inputs. The resampler needs to be refactored to use pure tensor operations to be graph-friendly.
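The hazard this comment describes can be illustrated with a minimal pair. This is a generic sketch, not vLLM code: the first function reads a value to the host via .item(), so under CUDA Graph capture its branch outcome would be frozen at capture time; the second expresses the same decision with on-device ops, which replay recomputes from the current buffer contents. The two are equivalent in eager mode.

```python
import torch

def scale_host(x: torch.Tensor) -> torch.Tensor:
    # Graph-UNSAFE: .item() synchronizes to the host; during capture
    # this branch is evaluated once, so replay always takes the
    # capture-time path regardless of the buffer's current contents.
    if x.sum().item() > 0:
        return x * 2
    return x

def scale_device(x: torch.Tensor) -> torch.Tensor:
    # Graph-safe equivalent: the condition stays a tensor, so the
    # captured graph re-evaluates it on every replay.
    cond = x.sum() > 0
    return torch.where(cond, x * 2, x)
```

The same reasoning applies to loops over the batch dimension and .tolist() calls: any control flow driven by host-side reads must be rewritten as tensor operations before the forward pass can be captured safely.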

Signed-off-by: YunzhuLu <lucia.yunzhu@gmail.com>
@YunzhuLu YunzhuLu force-pushed the vit-cuda-graph-minicpmv-2.6-4.0-4.5 branch from e1f1c07 to 220cba0 Compare May 8, 2026 05:01

Labels

documentation Improvements or additions to documentation multi-modality Related to multi-modality (#4194) nvidia

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant