
[MM][Perf][CG] Support ViT full CUDA graph for glm4_1v image and video inference #40576

Open
grYe99 wants to merge 19 commits into vllm-project:main from grYe99:support_vit_cudagraph_glm4_1v

Conversation

grYe99 (Contributor) commented Apr 22, 2026

Purpose

Following #38175, this PR implements ViT full CUDA graph support for image and video inference with glm4_1v models. The implementation draws on #35963 (image) and #38061 (video); a minimal capture/replay sketch follows the list below.

  1. Functional test
  2. Benchmark in several scenarios:
  • non-DP ViT + eager vs. non-DP ViT + CUDA graph
  • DP ViT + eager vs. DP ViT + CUDA graph
  3. Bench serve
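At a high level, a full CUDA graph for the ViT captures the encoder forward once per padded token-budget bucket and then replays it against static input/output buffers. Below is a minimal sketch of that pattern; the class and buffer names are assumptions for illustration, not the PR's actual implementation.

```python
import torch


class ViTGraphRunner:
    """Sketch: capture a vision encoder forward into a CUDA graph for one
    fixed token budget, then replay it on padded inputs."""

    def __init__(self, encoder: torch.nn.Module, num_tokens: int, hidden_size: int):
        self.encoder = encoder
        # Static buffers: CUDA graph replay reuses fixed memory addresses.
        self.static_in = torch.zeros(num_tokens, hidden_size, device="cuda")
        # Warm up on a side stream so one-time allocations are not captured.
        s = torch.cuda.Stream()
        with torch.cuda.stream(s):
            self.encoder(self.static_in)
        torch.cuda.current_stream().wait_stream(s)
        self.graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(self.graph):
            self.static_out = self.encoder(self.static_in)

    def run(self, x: torch.Tensor) -> torch.Tensor:
        n = x.shape[0]
        self.static_in[:n].copy_(x)   # copy real tokens into the graph buffer
        self.static_in[n:].zero_()    # padding slots see zeros
        self.graph.replay()
        return self.static_out[:n]    # slice off the padding
```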

Test Plan

1. Functional Test
python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
pytest tests/models/multimodal/generation/test_vit_cudagraph.py -k "glm4_1v"
2. Benchmark
# Image
# Single GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.1V-9B-Thinking \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 1): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 1, "video": 0}' \
--num-prompts 1000 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.1V-9B-Thinking \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 1): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 2, "video": 0}' \
--num-prompts 1000 \
--seed 42 \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Video
# Single GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 1}' \
--num-prompts 1000 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 2}' \
--num-prompts 1000 \
--seed 42 \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'
3. Bench Serve
python -m vllm.entrypoints.cli.main serve zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--trust-remote-code \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data  \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

vllm bench serve \
--backend openai-chat \
--model zai-org/GLM-4.6V-Flash \
--base-url http://localhost:8000 \
--endpoint /v1/chat/completions \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 2}' \
--num-prompts 100 \
--seed 42
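For context, encoder_cudagraph_token_budgets lists the bucket sizes for which graphs are captured; conceptually, a request's visual token count is rounded up to the smallest captured budget, and anything beyond the largest budget falls back to eager execution. A minimal sketch of that selection logic, for illustration only (not vLLM's actual code):

```python
import bisect


def select_token_budget(num_tokens: int, budgets: list[int]) -> int | None:
    """Return the smallest captured budget >= num_tokens,
    or None to fall back to eager execution."""
    budgets = sorted(budgets)
    i = bisect.bisect_left(budgets, num_tokens)
    return budgets[i] if i < len(budgets) else None


# With the budgets used in the commands above:
budgets = [128, 256, 512, 1024, 1536, 2048]
assert select_token_budget(300, budgets) == 512
assert select_token_budget(2048, budgets) == 2048
assert select_token_budget(2049, budgets) is None  # eager fallback
```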

Test Result

1. Functional Test
# python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, I think. The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content includes
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in full bloom, framing the tower. So the content includes
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content includes
--------------------------------------------------

# python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child wearing oversized glasses, which are way too big for their face. That's a common humorous setup—kids in adult-sized items look comical. Then, the child is engrossed in a book
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a baby wearing oversized glasses, which are way too big for a baby's face. Babies are usually cute, but the mismatch between the big glasses and the baby's small face creates a humorous contrast. Also, the
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child with oversized glasses, which are probably not meant for them, so that's a funny contrast. The child is "reading" a book, but maybe the glasses make them look like an adult trying to
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child wearing oversized glasses, which are way too big for their face. That's a common humorous setup—kids in adult accessories. Then, the child is engrossed in a book, maybe pretending to
--------------------------------------------------
2. Benchmark (encoder_forward_ms)

Image inference:

Single GPU (zai-org/GLM-4.1V-9B-Thinking, 1x RTX 4090, random-mm, 1000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +2.88% (4.17 ms -> 4.05 ms) | +48.34% (9.33 ms -> 4.82 ms) |

Multi GPU (zai-org/GLM-4.1V-9B-Thinking, 2x RTX 4090, random-mm, 1000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +62.73% (6.01 ms -> 2.24 ms) | +62.43% (24.73 ms -> 9.29 ms) |

Video inference:

Single GPU (zai-org/GLM-4.6V-Flash, 1x RTX 4090, random-mm, 1000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +37.47% (7.26 ms -> 4.54 ms) | +63.07% (25.78 ms -> 9.52 ms) |

Multi GPU (zai-org/GLM-4.6V-Flash, 2x RTX 4090, random-mm, 1000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +66.77% (9.51 ms -> 3.16 ms) | +62.44% (26.17 ms -> 9.83 ms) |
3. Bench Serve

Eager:
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  28.34     
Total input tokens:                      110200    
Total generated tokens:                  12800     
Request throughput (req/s):              3.53      
Output token throughput (tok/s):         451.65    
Peak output token throughput (tok/s):    4100.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4340.10   
---------------Time to First Token----------------
Mean TTFT (ms):                          18770.72  
Median TTFT (ms):                        19287.97  
P99 TTFT (ms):                           25433.90  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.09     
Median TPOT (ms):                        67.74     
P99 TPOT (ms):                           187.21    
---------------Inter-token Latency----------------
Mean ITL (ms):                           70.53     
Median ITL (ms):                         24.00     
P99 ITL (ms):                            236.64    
==================================================

CUDA graph:

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  26.45     
Total input tokens:                      110200    
Total generated tokens:                  12800     
Request throughput (req/s):              3.78      
Output token throughput (tok/s):         483.91    
Peak output token throughput (tok/s):    4200.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4650.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          16884.68  
Median TTFT (ms):                        17274.28  
P99 TTFT (ms):                           23563.19  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.04     
Median TPOT (ms):                        68.08     
P99 TPOT (ms):                           170.80    
---------------Inter-token Latency----------------
Mean ITL (ms):                           70.48     
Median ITL (ms):                         23.95     
P99 ITL (ms):                            239.59    
==================================================

Note

Glm4vVisionAttention does not support --mm-encoder-attn-backend FLASHINFER yet, so these tests use FLASH_ATTN only. FLASHINFER support will be added in a separate PR.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.


claude (Bot) left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

grYe99 (Contributor, Author) commented Apr 22, 2026

@claude review

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements CUDA graph support for the GLM-4V model by introducing a fused Triton kernel for position-embedding interpolation and refactoring the vision encoder's metadata preparation. Key changes include the addition of a native PyTorch fallback for interpolation, the implementation of the SupportsEncoderCudaGraph protocol, and optimizations to rotary position ID generation using lru_cache. Review feedback identified a potential regression in model accuracy due to the switch from bicubic to bilinear interpolation and highlighted a lack of error handling for empty input lists in the metadata preparation logic.
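For context on the lru_cache optimization mentioned above: ViT rotary position IDs depend only on the grid shape, so they can be cached per (h, w). A minimal sketch of the pattern follows; the helper name and layout are assumptions for illustration, not the PR's actual code.

```python
from functools import lru_cache

import torch


@lru_cache(maxsize=128)
def rot_pos_ids(h: int, w: int, merge_size: int) -> torch.Tensor:
    # Hypothetical sketch: build (h*w, 2) row/col position ids ordered in
    # merge_size x merge_size blocks to match spatial merging. Cached per
    # grid shape because identically sized images recur across requests.
    m = merge_size
    hpos = torch.arange(h).unsqueeze(1).expand(h, w)
    wpos = torch.arange(w).unsqueeze(0).expand(h, w)
    # (h, w) -> (h/m, m, w/m, m) -> (h/m, w/m, m, m) -> flat per-block order
    hpos = hpos.reshape(h // m, m, w // m, m).permute(0, 2, 1, 3).flatten()
    wpos = wpos.reshape(h // m, m, w // m, m).permute(0, 2, 1, 3).flatten()
    return torch.stack([hpos, wpos], dim=-1)
```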

Comment thread: vllm/model_executor/models/glm4_1v.py
@@ -1385,7 +1704,12 @@ def get_video_replacement_glm4v(item_idx: int):
dummy_inputs=Glm4vDummyInputsBuilder,
)
class Glm4vForConditionalGeneration(

grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from a9b7652 to 2ffaca2 on April 25, 2026 08:50
mergify (Bot) commented Apr 25, 2026

Documentation preview: https://vllm--40576.org.readthedocs.build/en/40576/

mergify (Bot) added the documentation label Apr 25, 2026
mergify (Bot) added the multi-modality (#4194) label Apr 25, 2026
grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch 4 times, most recently from ae6631e to d824bba on April 25, 2026 14:44
grYe99 (Contributor, Author) commented Apr 25, 2026

@DarkLight1337 Hi, could you give this PR a 'ready' label to run CI tests? Thanks!

grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from d824bba to 893f600 on May 6, 2026 08:42
grYe99 added 11 commits May 6, 2026 22:28
grYe99 added 7 commits May 6, 2026 22:28
This reverts commit 87184c4.
This reverts commit 142b265.
grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from b05a230 to 3a67b98 on May 6, 2026 14:28
grYe99 (Contributor, Author) commented May 6, 2026

@shen-shanshan @b-mu Hi, could you help review this PR when you have time? I recently updated the code to support auto-inferring the compilation config, and it passes the following tests:

python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
pytest tests/models/multimodal/generation/test_vit_cudagraph.py -k "glm4_1v"

It also works when serving with --compilation-config '{"cudagraph_mm_encoder": true}'.
Any suggestions are welcome!

Comment thread: vllm/model_executor/models/glm4_1v.py (outdated)
Comment on lines +956 to +973
def fast_pos_embed_interpolate(self, grid_thw: list[list[int]]) -> torch.Tensor:
    # Use the fused Triton kernel when available; otherwise fall back to
    # the native PyTorch implementation. Both interpolate the learned
    # position-embedding table to each item's patch grid.
    interpolate_fn = (
        triton_pos_embed_interpolate if HAS_TRITON else pos_embed_interpolate_native
    )
    outputs = []
    for t, h, w in grid_thw:
        # One interpolation per media item: resample the fixed
        # num_grid_per_side x num_grid_per_side grid to (h, w),
        # repeated across t frames.
        outputs.append(
            interpolate_fn(
                self.embeddings.position_embedding.weight,
                int(t),
                int(h),
                int(w),
                self.num_grid_per_side,
                self.spatial_merge_size,
                self.dtype,
            )
        )
    # Concatenate per-item embeddings along the token dimension.
    return torch.cat(outputs, dim=0)
Isotr0py (Member) commented:

Can you double-check whether Qwen3-VL's and GLM-4.1V's VisionEmbed implementations are fully equivalent, with converged numeric results?

I think GLM-4.1V uses bicubic interpolation instead of Qwen3-VL's bilinear interpolation.

grYe99 (Author) replied:

@Isotr0py Thanks for spotting this. Indeed, the original GLM-4.1V uses bicubic interpolation for vision position embeddings. I have updated the code and it passes the functional tests.
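For reference, a minimal native PyTorch sketch of bicubic position-embedding interpolation; the function and parameter names are illustrative, not the signature of the PR's actual pos_embed_interpolate_native:

```python
import torch
import torch.nn.functional as F


def interpolate_pos_embed_bicubic(
    pos_embed: torch.Tensor,  # (num_grid_per_side**2, hidden_dim)
    h: int,
    w: int,
    num_grid_per_side: int,
) -> torch.Tensor:
    # Reshape the flat embedding table into a 2D grid of feature vectors:
    # (1, hidden_dim, side, side), the layout F.interpolate expects.
    side = num_grid_per_side
    grid = (
        pos_embed.reshape(side, side, -1)
        .permute(2, 0, 1)
        .unsqueeze(0)
        .float()
    )
    # Bicubic resampling to the target (h, w) patch grid, matching the
    # original GLM-4.1V behavior (bilinear here would change the numerics).
    resized = F.interpolate(grid, size=(h, w), mode="bicubic", align_corners=False)
    # Back to (h*w, hidden_dim) so it can be added to patch embeddings.
    return resized.squeeze(0).permute(1, 2, 0).reshape(h * w, -1)
```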


Labels

documentation (Improvements or additions to documentation), multi-modality (Related to multi-modality (#4194)), nvidia
