[MM][Perf][CG] Enable encoder CUDA Graph for MiniCPM-V#41996
YunzhuLu wants to merge 2 commits into vllm-project:main
Conversation
Signed-off-by: YunzhuLu <lucia.yunzhu@gmail.com>
Documentation preview: https://vllm--41996.org.readthedocs.build/en/41996/
Code Review
This pull request introduces Encoder CUDA Graph support for MiniCPM-V models (versions 2.5, 2.6, 4.0, and 4.5) by implementing a new mixin class, _MiniCPMVEncoderCudaGraphMixin, and updating the corresponding model classes. It also includes necessary test configurations and documentation updates. The review highlighted several critical issues that need to be addressed: the resampler's forward method is currently incompatible with CUDA graph capture due to host-side logic, and the handling of temporal_ids requires refinement, specifically by using -1 instead of 0 for dummy or missing values and ensuring compatibility with tensor inputs.
```python
if self.version == (4, 5):
    temporal_ids = buffers[_MINICPMV_CUDAGRAPH_BUF_KEY_TEMPORAL_IDS]
    resampler_out = self.resampler(vision_embedding, tgt_sizes, temporal_ids)
```
Resampler4_5.forward expects temporal_ids to be a nested list (it uses chain.from_iterable and len() on elements), but the CUDA graph path passes a 1D torch.Tensor from the replay buffers. This will cause a TypeError during graph capture or replay. The resampler's forward method must be updated to handle tensor inputs for temporal_ids to support CUDA Graphs.
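A minimal sketch of the dispatch this comment asks for. The helper name and signature are hypothetical; it shows one way `Resampler4_5.forward` could accept both the eager path's nested `list[list[int]]` and the flat 1D tensor fed from the CUDA graph replay buffers:

```python
# Hypothetical normalization helper: accept either a nested list (eager path)
# or an already-flat 1D long tensor (CUDA graph replay path).
from itertools import chain

import torch


def normalize_temporal_ids(temporal_ids, device):
    """Return a flat 1D long tensor regardless of the input form."""
    if isinstance(temporal_ids, torch.Tensor):
        # Replay path: already a flat buffer; no host-side work allowed here.
        return temporal_ids
    # Eager path: nested list[list[int]], e.g. [[0, 1], [2]] -> [0, 1, 2]
    flat = list(chain.from_iterable(temporal_ids))
    return torch.tensor(flat, dtype=torch.long, device=device)
```

With this in place, the downstream resampler logic only ever sees the flat tensor form.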
```python
buffers[_MINICPMV_CUDAGRAPH_BUF_KEY_TEMPORAL_IDS] = torch.zeros(
    max_num_slices, dtype=torch.long, device=device
)
```
The temporal_ids buffer should be initialized with -1 instead of 0 for dummy inputs. In Resampler4_5.forward, a value of -1 indicates that no temporal embedding should be applied, whereas 0 will cause the model to apply the temporal embedding at index 0, which is incorrect for non-video or dummy inputs.
Suggested change:

```diff
- buffers[_MINICPMV_CUDAGRAPH_BUF_KEY_TEMPORAL_IDS] = torch.zeros(
-     max_num_slices, dtype=torch.long, device=device
- )
+ buffers[_MINICPMV_CUDAGRAPH_BUF_KEY_TEMPORAL_IDS] = torch.full(
+     (max_num_slices,), -1, dtype=torch.long, device=device
+ )
```
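To illustrate why the sentinel matters, here is a toy version (semantics assumed from this review: `-1` means "apply no temporal embedding") of a graph-friendly masked gather. All names are illustrative, not vLLM's actual code:

```python
# Toy illustration: temporal_ids of -1 skip the temporal embedding entirely,
# so a zero-initialized dummy buffer would wrongly add the embedding at
# index 0 to non-video inputs.
import torch


def apply_temporal_embedding(x, temporal_ids, temporal_emb):
    # x: (B, D); temporal_emb: (T, D); temporal_ids: (B,), -1 = skip.
    valid = temporal_ids >= 0
    safe_ids = temporal_ids.clamp(min=0)  # keep the gather in-bounds
    add = torch.where(
        valid.unsqueeze(-1), temporal_emb[safe_ids], torch.zeros_like(x)
    )
    return x + add
```

Because the mask is computed on-device each call, the same kernel sequence works for both dummy (`-1`) and real ids under graph replay.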
```python
        flatten_2d_lists(temporal_ids), dtype=torch.long, device=device
    )
else:
    flat_ids = torch.zeros(len(tgt_sizes), dtype=torch.long, device=device)
```
flat_ids should be initialized with -1 instead of 0 when temporal_ids is missing. This ensures that no temporal embedding is applied by the resampler for these items, matching the logic in the eager path.
Suggested change:

```diff
- flat_ids = torch.zeros(len(tgt_sizes), dtype=torch.long, device=device)
+ flat_ids = torch.full(
+     (len(tgt_sizes),), -1, dtype=torch.long, device=device
+ )
```
```python
    resampler_out = self.resampler(vision_embedding, tgt_sizes, temporal_ids)
else:
    resampler_out = self.resampler(vision_embedding, tgt_sizes)
```
The resampler (both Resampler2_5 and Resampler4_5) is currently incompatible with CUDA Graph capture because its forward method contains host-side logic such as Python loops over the batch dimension, .item(), and .tolist() calls. These operations are executed once during capture and will not be updated during replay, leading to incorrect results if the input tgt_sizes or patch counts differ from the capture-time dummy inputs. The resampler needs to be refactored to use pure tensor operations to be graph-friendly.
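As a rough sketch of the graph-friendly rewrite this comment asks for (all names are illustrative, not the actual resampler code), a batched gather plus an on-device mask can replace a per-sample Python loop over `tgt_sizes`:

```python
# Illustrative pure-tensor replacement for host-side loops over tgt_sizes:
# derive 2D positional indices and a padding mask entirely on-device, so
# they are recomputed on every graph replay instead of being frozen at
# capture time.
import torch


def vectorized_pos_and_mask(tgt_sizes, max_patches, pos_table_hw):
    # tgt_sizes: (B, 2) long tensor of (h, w) per slice.
    # pos_table_hw: (H, W, D) positional embedding table.
    bsz = tgt_sizes.shape[0]
    h, w = tgt_sizes[:, 0:1], tgt_sizes[:, 1:2]  # (B, 1) each
    seq_idx = torch.arange(max_patches, device=tgt_sizes.device).expand(bsz, -1)
    # Recover 2D indices from the flat sequence index, clamped for safety.
    h_idx = (seq_idx // w).clamp(max=pos_table_hw.shape[0] - 1)
    w_idx = (seq_idx % w).clamp(max=pos_table_hw.shape[1] - 1)
    pos = pos_table_hw[h_idx, w_idx]      # (B, L, D) batched gather
    mask = seq_idx >= (h * w)             # True = padding position
    return pos, mask
```

The key property is that no `.item()`, `.tolist()`, or data-dependent Python branching appears between capture and replay.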
Signed-off-by: YunzhuLu <lucia.yunzhu@gmail.com>
Force-pushed from e1f1c07 to 220cba0
Purpose
Add encoder CUDA Graph support for MiniCPM-V 2.5, 2.6, 4.0, and 4.5 as part of tracker #38175. This implementation follows the existing workflow introduced in #38061.
The captured graph covers both the ViT encoder (VPM) and the resampler, with version-specific handling for:
- `tgt_sizes`
- passing `temporal_ids` to the resampler

MiniCPM-V 2.0 is not included, as it predates the slice-based vision architecture required by this implementation.
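Conceptually, the encoder CUDA Graph workflow replays a fixed kernel sequence over static buffers, so each call first copies the live inputs into those buffers. A device-agnostic sketch of that staging step (buffer keys and the helper name are illustrative, not vLLM's actual implementation):

```python
# Illustrative staging step for CUDA graph replay: copy variable-size live
# inputs into the prefix of fixed-size capture-time buffers.
import torch


def stage_inputs(buffers, pixel_values, tgt_sizes, temporal_ids=None):
    n = pixel_values.shape[0]
    # Static buffers are sized for the capture-time maximum batch; only the
    # first n rows are overwritten, the rest keep their dummy values.
    buffers["pixel_values"][:n].copy_(pixel_values)
    buffers["tgt_sizes"][:n].copy_(tgt_sizes)
    if temporal_ids is not None:
        buffers["temporal_ids"][:n].copy_(temporal_ids)
    return n
```

After staging, the captured graph is replayed and the first `n` rows of the output buffer are read back.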
Key Updates & Fixes
This PR also refactors `Resampler` to be fully compatible with CUDA Graph capture and fixes a silent pipeline bug in eager mode.

Tensorized Resampler Forward Pass
- Refactored `Resampler2_5` and `Resampler4_5` to eliminate host-side logic (`for` loops, `.tolist()`, and `.item()`).
- Replaced dynamic Python slicing with pure tensor operations (`h_idx`, `w_idx` from a flat `seq_idx`). Handled out-of-bounds safety using `torch.clamp` and a dynamic `key_padding_mask` via `torch.where`.

Fixed temporal_ids Incompatibility for v4.5
- Added a dispatch in `Resampler4_5.forward` to accept a flat 1D `torch.Tensor` during graph replay, safely bypassing the Python-level cross-frame merge loop.
- Passed `temporal_ids` correctly through `encoder_cudagraph_forward`.

Fixed Eager-Mode Video Input Bug
- Registered `temporal_ids` as a video batched field in `_minicpmv_field_config`.
- Defined `temporal_ids` safely with `Annotated` in `MiniCPMVImagePixelInputs` to pass `TensorSchema` validation.
- Removed premature `flatten_2d_lists` in `get_vision_hidden_states` to preserve the `list[list[int]]` structure required by the resampler's eager path.

Test Plan
Unit test
Functional test
GPU: RTX 5090
Model: MiniCPM-V-4_5
No CUDA Graph
```shell
vllm serve /root/autodl-tmp/huggingface/hub/MiniCPM-V-4_5/OpenBMB/MiniCPM-V-4_5 \
    --trust-remote-code \
    --served-model-name MiniCPM-V-4_5 \
    --gpu-memory-utilization 0.75 \
    --max-model-len 4096 \
    --max-num-batched-tokens 4096 \
    --limit-mm-per-prompt '{"video": 1, "image": 1}'
```

With CUDA Graph
benchmark
GPU: RTX 5090
Model: MiniCPM-V-2_6 / MiniCPM-V-4_5
E2E benchmark
GPU: RTX 5090
Model: MiniCPM-V-2_6 / MiniCPM-V-4_5
benchmark
```shell
vllm bench serve \
    --base-url http://localhost:8000 \
    --endpoint /v1/chat/completions \
    --backend openai-chat \
    --dataset-name random-mm \
    --input-len 128 \
    --output-len 4 \
    --random-mm-base-items-per-request 1 \
    --random-mm-bucket-config '{"(448,448,3)": 1.0}' \
    --num-prompts 100 \
    --num-warmups 50 \
    --request-rate 1
```

Test Result
✅ Unit test
`36 passed, 16 warnings in 7.11s`

✅ Functional test
Image
Video
✅ Benchmark:
MiniCPM-V-2_6 + image
No CUDA Graph:
With CUDA Graph:
MiniCPM-V-4_5 + image
No CUDA Graph:
```
================================================================================
                   Multimodal Processor Benchmark Results
================================================================================
MM Processor Metrics:
Stage                           Mean    Median  Std     P99.0
get_mm_hashes_ms                0.50    0.47    0.14    1.09
get_cache_missing_items_ms      0.03    0.03    0.01    0.05
apply_hf_processor_ms           14.32   13.61   3.08    29.89
merge_mm_kwargs_ms              0.09    0.07    0.08    0.36
apply_prompt_updates_ms         5.54    5.81    1.40    9.62
preprocessor_total_ms           20.47   19.75   4.26    40.36
encoder_forward_ms              24.50   24.43   1.09    27.07
num_encoder_calls               1.00    1.00    0.00    1.00

Summary: 100 total encoder calls across 100 requests.

End-to-End Latency (ms):
Metric      Value (ms)
Mean        10549.41
Median      9868.40
Std         2881.78
P99.0       14650.85
```

With CUDA Graph:
```
================================================================================
                   Multimodal Processor Benchmark Results
================================================================================
MM Processor Metrics:
Stage                           Mean    Median  Std     P99.0
get_mm_hashes_ms                0.51    0.47    0.17    1.32
get_cache_missing_items_ms      0.03    0.03    0.01    0.06
apply_hf_processor_ms           13.94   13.02   3.45    29.19
merge_mm_kwargs_ms              0.08    0.07    0.03    0.22
apply_prompt_updates_ms         5.44    5.65    1.36    10.45
preprocessor_total_ms           19.99   19.12   4.54    38.72
encoder_forward_ms              469.24  394.19  149.10  690.33
num_encoder_calls               1.00    1.00    0.00    1.00

Summary: 100 total encoder calls across 100 requests.

End-to-End Latency (ms):
Metric      Value (ms)
Mean        41424.83
Median      41809.09
Std         14973.20
P99.0       59107.62
```

MiniCPM-V-4_5 + image
No CUDA Graph:
With CUDA Graph:
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.