[Fix] Misc Fixes in ViT CUDA Graph #38040
Conversation
Code Review
This pull request refactors the initialization logic of `EncoderCudaGraphManager` to enforce the invariant that `max_batch_size` must be less than or equal to the smallest token budget, preventing potential reshape crashes during CUDA graph capture. The changes add validation for user-specified configurations and ensure that model-provided budget ranges are valid (positive, with min <= max). New tests cover the various configuration paths and validate these invariants. However, the review highlights a critical issue: user-provided `encoder_cudagraph_token_budgets` are not validated for positivity, which could lead to a `ZeroDivisionError` if a non-positive budget is supplied.
```python
self.token_budgets = sorted(user_budgets)
self.max_batch_size = user_max_images
min_tok = min(self.token_budgets)
```
User-provided `encoder_cudagraph_token_budgets` are not validated for positivity. If a user provides a non-positive budget (e.g., `[0, 128]`), `min(self.token_budgets)` could be zero or negative. This can lead to `self.max_batch_size` being set to zero, which will cause a `ZeroDivisionError` later during CUDA graph capture preparation.
You should add validation to ensure all user-provided budgets are positive. A similar check is needed in the `elif user_budgets:` block.
```python
self.token_budgets = sorted(user_budgets)
if self.token_budgets[0] <= 0:
    raise ValueError(
        f"Invalid encoder_cudagraph_token_budgets: {user_budgets}. "
        "All budget values must be positive."
    )
self.max_batch_size = user_max_images
min_tok = self.token_budgets[0]
```
This is a fair point, but I feel that the better way to handle this is through pydantic (cuz ultimately this is an input validation problem), e.g., in the definition of CompilationConfig:
```python
from pydantic.types import PositiveInt

@config
class CompilationConfig:
    ...
    encoder_cudagraph_token_budgets: list[PositiveInt]
```
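For reference, a minimal standalone sketch of what `PositiveInt` buys here, using a plain pydantic `BaseModel` rather than vLLM's `@config` dataclass (the model and field names below are illustrative, not vLLM's):

```python
from pydantic import BaseModel, ValidationError
from pydantic.types import PositiveInt


class EncoderCudaGraphSettings(BaseModel):
    # Illustrative stand-in for the CompilationConfig field.
    token_budgets: list[PositiveInt]


EncoderCudaGraphSettings(token_budgets=[64, 128])  # ok

try:
    EncoderCudaGraphSettings(token_budgets=[0, 128])
except ValidationError as e:
    # Rejected at parse time, before any manager code runs.
    print(e)
```

This moves the failure from a runtime `ZeroDivisionError` deep inside graph capture to a clear validation error at config-parsing time.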
I have added the check in `vllm/config/compilation.py`.
```python
self.token_budgets = sorted(user_budgets)
self.max_batch_size = min(
    max_budget // min_budget,
    min(self.token_budgets),
)
```
Similar to the other block, user-provided `encoder_cudagraph_token_budgets` are not validated for positivity here. This can lead to a `ZeroDivisionError` if a non-positive budget is provided.
```python
self.token_budgets = sorted(user_budgets)
if self.token_budgets[0] <= 0:
    raise ValueError(
        f"Invalid encoder_cudagraph_token_budgets: {user_budgets}. "
        "All budget values must be positive."
    )
self.max_batch_size = min(
    max_budget // min_budget,
    self.token_budgets[0],
)
```
This has been addressed by a check in `vllm/config/compilation.py`.
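A hedged sketch of the kind of check this refers to, as a `__post_init__` on a dataclass-style config (the field name comes from this PR's discussion; the actual code in `vllm/config/compilation.py` may differ):

```python
from dataclasses import dataclass


@dataclass
class CompilationConfig:  # sketch only; the real class has many more fields
    encoder_cudagraph_token_budgets: list[int] | None = None

    def __post_init__(self) -> None:
        # Reject non-positive budgets at config-parse time so the
        # EncoderCudaGraphManager never sees them.
        budgets = self.encoder_cudagraph_token_budgets
        if budgets is not None and min(budgets) <= 0:
            raise ValueError(
                f"Invalid encoder_cudagraph_token_budgets: {budgets}. "
                "All budget values must be positive."
            )
```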
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from db7c2fd to 8a2c74f
Move non-positive budget validation into `CompilationConfig.__post_init__` so invalid values are caught early during config parsing rather than at runtime in `EncoderCudaGraphManager`. Addresses PR vllm-project#38040 review feedback.

Signed-off-by: Baorun Mu <bmu@nvidia.com>
When auto-inferring `max_batch_size` as `max_budget // min_budget`, the result can exceed `min_budget` (e.g. `16384 // 64 = 256 > 64`). During graph capture, `prepare_encoder_cudagraph_capture_inputs` divides `token_budget` by `max_batch_size` to get `per_image_output`, which yields 0 for small budgets (`64 // 256 = 0`), causing a reshape crash on empty tensors. Fix by capping `max_batch_size` to `min(max_budget // min_budget, min_budget)`, ensuring `per_image_output >= 1` for all budgets.

Signed-off-by: Baorun Mu <bmu@nvidia.com>
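A quick worked example of the failure and the cap, using the numbers from the commit message (plain Python, no vLLM imports):

```python
max_budget, min_budget = 16384, 64

# Uncapped auto-inference: the batch size exceeds the smallest budget.
max_batch_size = max_budget // min_budget        # 256 > 64
per_image_output = min_budget // max_batch_size  # 64 // 256 == 0 -> empty reshape

# With the cap, the smallest budget still yields at least one token per image.
max_batch_size = min(max_budget // min_budget, min_budget)  # 64
per_image_output = min_budget // max_batch_size             # 64 // 64 == 1
assert per_image_output >= 1
```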
… graphs

The previous fix (bbb31043b) only capped auto-inferred `max_batch_size`, but several configuration paths still violated the invariant `max_batch_size <= min(token_budgets)`, causing `per_image_output = 0` and reshape crashes during CUDA graph capture:

1. Fully user-specified: no validation at all
2. User provides only max_images: value used directly without cap
3. User provides only budgets: auto-inferred cap used model's `min_budget` instead of `min(user_budgets)`

Fix by handling each configuration path explicitly:

- User-specified both: validate and raise informative ValueError
- User max_images only: adjust budget generation to start at `max(min_budget, user_max_images)`
- User budgets only: cap `max_batch_size` to `min(user_budgets)`
- Fully auto-inferred: cap `max_batch_size` to `min(generated_budgets)`

Also validate model-returned budget range (positive, min <= max).

Signed-off-by: Baorun Mu <bmu@nvidia.com>
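A self-contained sketch of the four paths as described above. All names are assumptions, and the helper `_doubling_budgets` is a hypothetical stand-in for the model-provided budget generation scheme; the real `EncoderCudaGraphManager.__init__` differs in detail:

```python
def _doubling_budgets(start: int, stop: int) -> list[int]:
    # Hypothetical budget generator (doubling steps).
    out, b = [], start
    while b <= stop:
        out.append(b)
        b *= 2
    return out


def resolve(
    user_budgets: list[int] | None,
    user_max_images: int | None,
    min_budget: int,
    max_budget: int,
) -> tuple[list[int], int]:
    if user_budgets and user_max_images:
        # Path 1: fully user-specified -- validate and raise an informative error.
        if user_max_images > min(user_budgets):
            raise ValueError(
                f"max_images={user_max_images} exceeds the smallest "
                f"token budget {min(user_budgets)}"
            )
        return sorted(user_budgets), user_max_images
    if user_budgets:
        # Path 3: budgets only -- cap to min(user_budgets), not the model's min_budget.
        budgets = sorted(user_budgets)
        return budgets, min(max_budget // min_budget, budgets[0])
    if user_max_images:
        # Path 2: max_images only -- start budget generation at
        # max(min_budget, user_max_images) so every budget satisfies the invariant.
        budgets = _doubling_budgets(max(min_budget, user_max_images), max_budget)
        return budgets, user_max_images
    # Path 4: fully auto-inferred -- cap to min(generated_budgets).
    budgets = _doubling_budgets(min_budget, max_budget)
    return budgets, min(max_budget // min_budget, budgets[0])
```

In every branch the returned batch size is at most the smallest budget, so `per_image_output >= 1` holds for all captured graphs.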
Post-rebase fallout from upstream commit 936e0b7 (`[MM][CG] Optimize default max_frames_per_batch auto-infer`):

- Mock fixtures: add `_MockMultimodalConfig.get_limit_per_prompt` returning 0 for "video" so the new max_frames_per_batch branch short-circuits before the model-side `get_max_frames_per_video` call (image-only mocks don't need video-frame plumbing). Also switch `encoder_cudagraph_max_frames_per_batch` default in the mock to None to match upstream's `int | None = None` field type.
- ruff-format reflow in `EncoderCudaGraphManager.__init__` where the conflict resolution had wrapped a single-line call.

Signed-off-by: Baorun Mu <bmu@nvidia.com>
Two correctness fixes on top of the existing Qwen3-VL ViT CUDA graph implementation, surfaced during review of vllm-project#40830 (qwen2.5-vl):

1. Use ceil (not floor) for `per_mm_item_output` in `prepare_encoder_cudagraph_capture_inputs`. Floor sizes the captured `pixel_values` buffer at `max_batch_size * (token_budget // max_batch_size)`, which is less than `token_budget` whenever the budget is not divisible by `max_batch_size`. Replay copy for the worst-case single-item batch then raises a shape mismatch. The non-divisible case is reachable in every config path: the manager enforces `max_batch_size <= min(budgets)` but not divisibility. Mirrors the fix in vllm-project#40830 for qwen2.5-vl.

2. Disable ViT CUDA graph for ALL modalities (not just video) when EVS pruning is enabled. `embed_multimodal` post-processes both image and video embeddings (mrope positions appended for image, prune+append for video) when `is_multimodal_pruning_enabled` is True. The encoder CUDA graph path bypasses that hook, producing inconsistent embedding formats vs eager. Matches the pattern in qwen2.5-vl PR vllm-project#40830: `modalities = [] if pruning else ["image", "video"]`.

Signed-off-by: Baorun Mu <bmu@nvidia.com>
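The arithmetic behind fix 1, as a standalone snippet (the numbers are illustrative; `token_budget` is deliberately not divisible by `max_batch_size`):

```python
import math

token_budget, max_batch_size = 100, 3

floor_per_item = token_budget // max_batch_size           # 33
ceil_per_item = math.ceil(token_budget / max_batch_size)  # 34

# Floor under-allocates the capture buffer: 3 * 33 = 99 < 100, so replaying
# a worst-case single-item batch of 100 tokens hits a shape mismatch.
assert max_batch_size * floor_per_item < token_budget

# Ceil keeps the buffer large enough: 3 * 34 = 102 >= 100.
assert max_batch_size * ceil_per_item >= token_budget
```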
Force-pushed from 8a2c74f to b3b32b4
Purpose
- `max_batch_size = max_budget // min_budget` could exceed `min_budget`, causing `prepare_encoder_cudagraph_capture_inputs` to compute `per_image_output = token_budget // max_batch_size = 0` for small budgets, leading to a reshape crash on empty tensors in `Qwen3_VisionPatchEmbed.forward`. Fixed by capping to `min(max_budget // min_budget, min_budget)` if both budgets and max batch size are auto-inferred. For the paths where either budget or max batch size are provided by the user, we adjust the other (i.e. the one that is auto-inferred) to satisfy the invariant: `max_batch_size <= min_budget`.
- Use ceil for the per-item output size so that, even when a token budget is not divisible by `max_batch_size`, the input buffer is not under-allocated.

Test Plan
4×GB200 NVLink (TP=4, ViT DP=4), Qwen3-VL-32B-Instruct, random-mm dataset (synthetic)
Test Result
Encoder Forward Latency charts (mean, median, P99)
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.