
[Multimodal] Simplify ViT CUDA graph interfaces #41234

Open
Isotr0py wants to merge 11 commits into vllm-project:main from Isotr0py:refactor-vit-cg


@Isotr0py
Member

@Isotr0py Isotr0py commented Apr 29, 2026

Purpose

  • To support ViT CUDA graphs, a model implementation currently has to add about 11 new class methods, which makes the model code messy.
  • This PR consolidates get_encoder_cudagraph_num_items, get_encoder_cudagraph_per_item_output_tokens and get_encoder_cudagraph_per_item_input_sizes into one get_encoder_cudagraph_item_specs function.
  • It also consolidates encoder_cudagraph_forward and encoder_eager_forward into one encoder_forward function.
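
As a rough illustration of the consolidation described above, the three per-item queries can be folded into one list of small records. This is only a sketch: the `EncoderItemSpec` field names (`input_size`, `output_tokens`), the `ToyEncoder` class, the `image_grid_thw` key, and the merge-size arithmetic are assumptions made for the example, not the definitions used in this PR.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class EncoderItemSpec:
    # Hypothetical fields; the real class in the PR may look different.
    input_size: int     # e.g. patch count fed to the encoder
    output_tokens: int  # tokens the encoder emits for this item


class ToyEncoder:
    """Stand-in model for illustration, not a vLLM class."""

    spatial_merge_size = 2  # assumed value for the example

    def get_encoder_cudagraph_item_specs(
        self,
        mm_kwargs: dict[str, Any],
    ) -> list[EncoderItemSpec]:
        merge = self.spatial_merge_size ** 2
        specs = []
        for t, h, w in mm_kwargs["image_grid_thw"]:
            patches = t * h * w
            specs.append(EncoderItemSpec(patches, patches // merge))
        return specs


specs = ToyEncoder().get_encoder_cudagraph_item_specs(
    {"image_grid_thw": [(1, 4, 4), (1, 8, 6)]}
)

# The three values the old methods returned separately can all be
# derived from the single list of specs.
num_items = len(specs)
per_item_output_tokens = [s.output_tokens for s in specs]
per_item_input_sizes = [s.input_size for s in specs]
```

With a single return value, a caller no longer has to keep three parallel lists in sync, which is presumably part of the motivation for the merge.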

Test Plan

pytest -s -v tests/v1/cudagraph/test_encoder_cudagraph.py
pytest -s -v tests/models/multimodal/generation/test_vit_cudagraph.py

Test Result

All tests should pass.



Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify mergify Bot added the qwen (Related to Qwen models), nvidia, and v1 labels Apr 29, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the encoder CUDA graph interface to simplify model implementations. It replaces several specific methods, such as get_input_modality and get_max_frames_per_video, with a unified get_encoder_cudagraph_item_specs method and a consolidated encoder_forward method. The EncoderCudaGraphManager now auto-detects input keys based on configuration. Feedback suggests using ValueError instead of AssertionError for unreachable code in qwen3_vl.py and improving the specificity of error messages in the CUDA graph manager to aid debugging.
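
The feedback about error handling can be sketched as follows. Unlike `assert`, which Python strips when run with `-O`, an explicit `ValueError` always fires, and a message that names the offending value and the accepted ones is easier to debug. The helper name `resolve_input_key` and the example mapping are hypothetical; only the `input_key_by_modality` name comes from the diff in this PR.

```python
def resolve_input_key(input_key_by_modality: dict[str, str], modality: str) -> str:
    """Look up the input key for a modality, failing loudly if unknown."""
    try:
        return input_key_by_modality[modality]
    except KeyError:
        supported = ", ".join(sorted(input_key_by_modality))
        # Name both the bad value and the valid choices in the message.
        raise ValueError(
            f"Unsupported modality {modality!r} for encoder CUDA graph; "
            f"supported modalities: {supported}"
        ) from None


# Hypothetical mapping, for illustration only.
example_keys = {"image": "pixel_values", "video": "pixel_values_videos"}
image_key = resolve_input_key(example_keys, "image")
```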

Comment thread vllm/model_executor/models/qwen3_vl.py
Comment thread vllm/v1/worker/encoder_cudagraph.py Outdated
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Isotr0py marked this pull request as ready for review May 6, 2026 09:01

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@Isotr0py
Member Author

Isotr0py commented May 6, 2026

cc @shen-shanshan @b-mu about ViT CUDA graph cleanup.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify mergify Bot added the multi-modality (Related to multi-modality, #4194) label May 6, 2026
Comment thread vllm/v1/worker/encoder_cudagraph.py Outdated
  # actual inputs may be smaller. Zero then slice-copy so padded
  # positions are invisible to attention (cu_seqlens masks them out).
- input_key = self.config.input_key_by_modality[
+ input_key = input_key = self.config.input_key_by_modality[
Contributor


Was this input_key = input_key = ... changed by mistake?

Member Author


Oops, good catch!
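
For context, the code comment in this thread describes a "zero then slice-copy" step for a fixed-size capture buffer. A minimal sketch of that idea, using plain Python lists so it runs without a GPU, is shown below. The function name `fill_static_buffer` is hypothetical; with torch tensors the same pattern would typically be `buf.zero_()` followed by `buf[:n].copy_(x)`.

```python
def fill_static_buffer(buffer: list[float], actual: list[float]) -> None:
    """Write `actual` into the front of a fixed-size buffer, in place."""
    if len(actual) > len(buffer):
        raise ValueError(
            f"input of size {len(actual)} exceeds buffer capacity {len(buffer)}"
        )
    # Zero first so values left over from a previous replay cannot
    # leak into the padded tail.
    for i in range(len(buffer)):
        buffer[i] = 0.0
    # Slice-copy: only the first len(actual) slots hold real data.
    buffer[: len(actual)] = actual


static_buf = [9.0] * 6  # pretend this holds stale data from an earlier run
fill_static_buffer(static_buf, [1.0, 2.0])
```

The buffer object is reused rather than reallocated because a captured CUDA graph replays against fixed memory addresses.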

Comment on lines +1552 to +1555
+ def get_encoder_cudagraph_item_specs(
  self,
  mm_kwargs: dict[str, Any],
- ) -> int:
- """Return the number of items (e.g. images) in the batch."""
- ...
-
- def get_encoder_cudagraph_per_item_output_tokens(
- self,
- mm_kwargs: dict[str, Any],
- ) -> list[int]:
- """Return output token count for each item.
-
- Used for greedy packing and DP load balancing.
- """
- ...
-
- def get_encoder_cudagraph_per_item_input_sizes(
- self,
- mm_kwargs: dict[str, Any],
- ) -> list[int]:
- """Return input size (e.g. patch count) for each item.
+ ) -> list["EncoderItemSpec"]:
Contributor


Since #40830 has been merged, maybe we should also make Qwen2.5-VL adapt to these new interfaces.
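
The docstring quoted in this thread says per-item output token counts are used for greedy packing and DP load balancing. One common greedy scheme, shown here purely as an illustration and not as vLLM's actual algorithm, assigns items from largest to smallest to whichever rank currently has the lightest load. The function name `balance_items` is hypothetical.

```python
import heapq


def balance_items(output_tokens: list[int], num_ranks: int) -> list[list[int]]:
    """Greedily assign item indices to ranks, largest items first."""
    # Min-heap of (current load, rank) so the lightest rank pops first.
    heap = [(0, rank) for rank in range(num_ranks)]
    assignment: list[list[int]] = [[] for _ in range(num_ranks)]
    order = sorted(range(len(output_tokens)), key=lambda i: -output_tokens[i])
    for idx in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + output_tokens[idx], rank))
    return assignment


# Four items with different output token counts, split over two ranks.
assignment = balance_items([4, 12, 8, 8], num_ranks=2)
loads = [sum([4, 12, 8, 8][i] for i in items) for items in assignment]
```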

Isotr0py added 2 commits May 7, 2026 10:42
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>