Skip to content

[MM][Perf][CG] Support ViT full CUDA graph for Kimi-VL#41992

Open
oguzhankir wants to merge 2 commits intovllm-project:mainfrom
oguzhankir:oguzhankir/vit-cuda-graph-kimivl
Open

[MM][Perf][CG] Support ViT full CUDA graph for Kimi-VL#41992
oguzhankir wants to merge 2 commits intovllm-project:mainfrom
oguzhankir:oguzhankir/vit-cuda-graph-kimivl

Conversation

@oguzhankir
Copy link
Copy Markdown

Purpose

Add ViT CUDA Graph support for Kimi-VL (KimiVLForConditionalGeneration), following the pattern established in #38061 (Qwen3-VL). Part of the tracker issue #38175.

Kimi-VL's MoonVitPretrainedModel contains .tolist() calls in its forward path (pos embedding interpolation, RoPE frequency computation, patch merging) that are incompatible with CUDA graph capture. This PR refactors moonvit.py to add a graph-safe path via precomputed metadata buffers, then wires kimi_vl.py to the SupportsEncoderCudaGraph protocol.

Test Plan

  • Unit tests: pytest tests/v1/cudagraph/test_encoder_cudagraph.py -v
  • Added Kimi-VL entry to tests/models/multimodal/generation/test_vit_cudagraph.py
  • E2E benchmark on RTX 4090

Test Result

Unit tests: 36 passed ✅

E2E Benchmark — RTX 4090, moonshotai/Kimi-VL-A3B-Instruct, 2 images/req, 200 prompts @ 8 RPS:

Metric No CG With CG Δ
Mean TTFT 132.91 ms 111.13 ms ↓ 16.4%
Median TTFT 130.88 ms 102.64 ms ↓ 21.6%
P99 TTFT 290.47 ms 193.25 ms ↓ 33.5%

CG config: encoder_cudagraph_token_budgets=[256,512,1024], encoder_cudagraph_max_vision_items_per_batch=4

Documentation

  • Added KimiVLForConditionalGeneration row to docs/design/cuda_graphs_multimodal.md
  • Added kimi_vl to MODELS_SUPPORT_VIT_CUDA_GRAPH in examples/generate/multimodal/vision_language_offline.py

oguzhankir added 2 commits May 7, 2026 20:41
Signed-off-by: oguz <oguzhankir17@gmail.com>
Signed-off-by: oguz <oguzhankir17@gmail.com>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 7, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 7, 2026

Documentation preview: https://vllm--41992.org.readthedocs.build/en/41992/

@mergify mergify Bot added documentation Improvements or additions to documentation multi-modality Related to multi-modality (#4194) nvidia labels May 7, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements encoder CUDA graph support for the Kimi-VL model. It introduces the SupportsEncoderCudaGraph protocol to KimiVLForConditionalGeneration and refactors the underlying MoonVit components to allow precomputing grid-dependent metadata—such as positional embeddings, RoPE frequencies, and sequence lengths—outside of the captured CUDA graph. Additionally, it adds a CUDA-graph-safe patch merging implementation and includes tests and documentation updates for the new functionality. I have no feedback to provide.

@oguzhankir oguzhankir changed the title Oguzhankir/vit cuda graph kimivl [MM][Perf][CG] Support ViT full CUDA graph for Kimi-VL May 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation multi-modality Related to multi-modality (#4194) nvidia

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant