
[MM][Perf][CG] Support ViT full CUDA graph for glm4_1v image and video inference #40576

Open
grYe99 wants to merge 19 commits into vllm-project:main from grYe99:support_vit_cudagraph_glm4_1v

Conversation

grYe99 (Contributor) commented Apr 22, 2026

Purpose

Following #38175, this PR implements ViT full CUDA graph support for image and video inference with glm4_1v models. The implementation draws on #35963 (image) and #38061 (video); a minimal capture/replay sketch follows the list below.

  1. Functional test
  2. Benchmark in several scenarios:
  • non-DP ViT + eager vs. non-DP ViT + CUDA graph
  • DP ViT + eager vs. DP ViT + CUDA graph
  3. Bench serve
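At a high level, a full CUDA graph for the ViT captures the encoder forward once per padded token-budget bucket and then replays it against static input/output buffers. Below is a minimal sketch of that pattern; the class and buffer names are assumptions for illustration, not the PR's actual implementation.

```python
import torch


class ViTGraphRunner:
    """Sketch: capture a vision encoder forward into a CUDA graph for one
    fixed token budget, then replay it on padded inputs."""

    def __init__(self, encoder: torch.nn.Module, num_tokens: int, hidden_size: int):
        self.encoder = encoder
        # Static buffers: CUDA graph replay reuses fixed memory addresses.
        self.static_in = torch.zeros(num_tokens, hidden_size, device="cuda")
        # Warm up on a side stream so one-time allocations are not captured.
        s = torch.cuda.Stream()
        with torch.cuda.stream(s):
            self.encoder(self.static_in)
        torch.cuda.current_stream().wait_stream(s)
        self.graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(self.graph):
            self.static_out = self.encoder(self.static_in)

    def run(self, x: torch.Tensor) -> torch.Tensor:
        n = x.shape[0]
        self.static_in[:n].copy_(x)   # copy real tokens into the graph buffer
        self.static_in[n:].zero_()    # padding slots see zeros
        self.graph.replay()
        return self.static_out[:n]    # slice off the padding
```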

Test Plan

1. Functional Test
python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
pytest tests/models/multimodal/generation/test_vit_cudagraph.py -k "glm4_1v"
2. Benchmark
# Image
# Single GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.1V-9B-Thinking \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 1): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 1, "video": 0}' \
--num-prompts 1000 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.1V-9B-Thinking \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 1): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 2, "video": 0}' \
--num-prompts 1000 \
--seed 42 \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Video
# Single GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 1}' \
--num-prompts 1000 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 2}' \
--num-prompts 1000 \
--seed 42 \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'
3. Bench Serve
python -m vllm.entrypoints.cli.main serve zai-org/GLM-4.6V-Flash \
--max-model-len 4096 \
--trust-remote-code \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data  \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048], "encoder_cudagraph_max_vision_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

vllm bench serve \
--backend openai-chat \
--model zai-org/GLM-4.6V-Flash \
--base-url http://localhost:8000 \
--endpoint /v1/chat/completions \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 2}' \
--num-prompts 100 \
--seed 42
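For context, encoder_cudagraph_token_budgets lists the bucket sizes for which graphs are captured; conceptually, a request's visual token count is rounded up to the smallest captured budget, and anything beyond the largest budget falls back to eager execution. A minimal sketch of that selection logic, for illustration only (not vLLM's actual code):

```python
import bisect


def select_token_budget(num_tokens: int, budgets: list[int]) -> int | None:
    """Return the smallest captured budget >= num_tokens,
    or None to fall back to eager execution."""
    budgets = sorted(budgets)
    i = bisect.bisect_left(budgets, num_tokens)
    return budgets[i] if i < len(budgets) else None


# With the budgets used in the commands above:
budgets = [128, 256, 512, 1024, 1536, 2048]
assert select_token_budget(300, budgets) == 512
assert select_token_budget(2048, budgets) == 2048
assert select_token_budget(2049, budgets) is None  # eager fallback
```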

Test Result

1. Functional Test
# python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, I think. The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content includes
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in full bloom, framing the tower. So the content includes
--------------------------------------------------
<think>Got it, let's look at the image. There are cherry blossom trees with pink flowers, and in the background, there's a tall tower, which is the Tokyo Skytree, right? The sky is clear blue, and the cherry blossoms are in the foreground, framing the tower. So the content includes
--------------------------------------------------

# python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child wearing oversized glasses, which are way too big for their face. That's a common humorous setup—kids in adult-sized items look comical. Then, the child is engrossed in a book
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a baby wearing oversized glasses, which are way too big for a baby's face. Babies are usually cute, but the mismatch between the big glasses and the baby's small face creates a humorous contrast. Also, the
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child with oversized glasses, which are probably not meant for them, so that's a funny contrast. The child is "reading" a book, but maybe the glasses make them look like an adult trying to
--------------------------------------------------
<think>Got it, let's figure out why this video is funny. First, look at the elements: a young child wearing oversized glasses, which are way too big for their face. That's a common humorous setup—kids in adult accessories. Then, the child is engrossed in a book, maybe pretending to
--------------------------------------------------
2. Benchmark (encoder_forward_ms)

Image inference:

Single GPU (zai-org/GLM-4.1V-9B-Thinking, 1x RTX 4090, random-mm, 1000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +2.88% (4.17 ms -> 4.05 ms) | +48.34% (9.33 ms -> 4.82 ms) |

Multi GPU (zai-org/GLM-4.1V-9B-Thinking, 2x RTX 4090, random-mm, 1000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +62.73% (6.01 ms -> 2.24 ms) | +62.43% (24.73 ms -> 9.29 ms) |

Video inference:

Single GPU (zai-org/GLM-4.6V-Flash, 1x RTX 4090, random-mm, 1000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +37.47% (7.26 ms -> 4.54 ms) | +63.07% (25.78 ms -> 9.52 ms) |

Multi GPU (zai-org/GLM-4.6V-Flash, 2x RTX 4090, random-mm, 1000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +66.77% (9.51 ms -> 3.16 ms) | +62.44% (26.17 ms -> 9.83 ms) |
3. Bench Serve

Eager:
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  28.34     
Total input tokens:                      110200    
Total generated tokens:                  12800     
Request throughput (req/s):              3.53      
Output token throughput (tok/s):         451.65    
Peak output token throughput (tok/s):    4100.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4340.10   
---------------Time to First Token----------------
Mean TTFT (ms):                          18770.72  
Median TTFT (ms):                        19287.97  
P99 TTFT (ms):                           25433.90  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.09     
Median TPOT (ms):                        67.74     
P99 TPOT (ms):                           187.21    
---------------Inter-token Latency----------------
Mean ITL (ms):                           70.53     
Median ITL (ms):                         24.00     
P99 ITL (ms):                            236.64    
==================================================

CUDA graph:

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  26.45     
Total input tokens:                      110200    
Total generated tokens:                  12800     
Request throughput (req/s):              3.78      
Output token throughput (tok/s):         483.91    
Peak output token throughput (tok/s):    4200.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4650.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          16884.68  
Median TTFT (ms):                        17274.28  
P99 TTFT (ms):                           23563.19  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.04     
Median TPOT (ms):                        68.08     
P99 TPOT (ms):                           170.80    
---------------Inter-token Latency----------------
Mean ITL (ms):                           70.48     
Median ITL (ms):                         23.95     
P99 ITL (ms):                            239.59    
==================================================

Note

Glm4vVisionAttention does not support --mm-encoder-attn-backend FLASHINFER yet, so these tests use FLASH_ATTN only. FLASHINFER support will be added in a separate PR.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.


claude (Bot) left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

grYe99 (Contributor, Author) commented Apr 22, 2026

@claude review

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements CUDA graph support for the GLM-4V model by introducing a fused Triton kernel for position-embedding interpolation and refactoring the vision encoder's metadata preparation. Key changes include the addition of a native PyTorch fallback for interpolation, the implementation of the SupportsEncoderCudaGraph protocol, and optimizations to rotary position ID generation using lru_cache. Review feedback identified a potential regression in model accuracy due to the switch from bicubic to bilinear interpolation and highlighted a lack of error handling for empty input lists in the metadata preparation logic.
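For context on the lru_cache optimization mentioned above: ViT rotary position IDs depend only on the grid shape, so they can be cached per (h, w). A minimal sketch of the pattern follows; the helper name and layout are assumptions for illustration, not the PR's actual code.

```python
from functools import lru_cache

import torch


@lru_cache(maxsize=128)
def rot_pos_ids(h: int, w: int, merge_size: int) -> torch.Tensor:
    # Hypothetical sketch: build (h*w, 2) row/col position ids ordered in
    # merge_size x merge_size blocks to match spatial merging. Cached per
    # grid shape because identically sized images recur across requests.
    m = merge_size
    hpos = torch.arange(h).unsqueeze(1).expand(h, w)
    wpos = torch.arange(w).unsqueeze(0).expand(h, w)
    # (h, w) -> (h/m, m, w/m, m) -> (h/m, w/m, m, m) -> flat per-block order
    hpos = hpos.reshape(h // m, m, w // m, m).permute(0, 2, 1, 3).flatten()
    wpos = wpos.reshape(h // m, m, w // m, m).permute(0, 2, 1, 3).flatten()
    return torch.stack([hpos, wpos], dim=-1)
```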

Comment thread: vllm/model_executor/models/glm4_1v.py
@@ -1385,7 +1704,12 @@ def get_video_replacement_glm4v(item_idx: int):
dummy_inputs=Glm4vDummyInputsBuilder,
)
class Glm4vForConditionalGeneration(

grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from a9b7652 to 2ffaca2 on April 25, 2026 08:50
mergify (Bot) commented Apr 25, 2026

Documentation preview: https://vllm--40576.org.readthedocs.build/en/40576/

mergify (Bot) added the documentation label Apr 25, 2026
mergify (Bot) added the multi-modality (#4194) label Apr 25, 2026
grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch 4 times, most recently from ae6631e to d824bba on April 25, 2026 14:44
grYe99 (Contributor, Author) commented Apr 25, 2026

@DarkLight1337 Hi, could you give this PR a 'ready' label to run CI tests? Thanks!

grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from d824bba to 893f600 on May 6, 2026 08:42
grYe99 added 11 commits May 6, 2026 22:28
grYe99 added 7 commits May 6, 2026 22:28
This reverts commit 87184c4.
This reverts commit 142b265.
grYe99 force-pushed the support_vit_cudagraph_glm4_1v branch from b05a230 to 3a67b98 on May 6, 2026 14:28
grYe99 (Contributor, Author) commented May 6, 2026

@shen-shanshan @b-mu Hi, could you help review this PR when you have time? I recently updated the code to support auto-inferring the compilation config, and it passes the following tests:

python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality image
python examples/generate/multimodal/vision_language_offline.py -m glm4_1v --enable_vit_cuda_graph --modality video
pytest tests/models/multimodal/generation/test_vit_cudagraph.py -k "glm4_1v"

It also works when serving with --compilation-config '{"cudagraph_mm_encoder": true}'.
Any suggestions are welcome!

Comment thread: vllm/model_executor/models/glm4_1v.py (outdated)
Comment on lines +956 to +973
def fast_pos_embed_interpolate(self, grid_thw: list[list[int]]) -> torch.Tensor:
    # Use the fused Triton kernel when available; otherwise fall back to
    # the native PyTorch implementation. Both interpolate the learned
    # position-embedding table to each item's patch grid.
    interpolate_fn = (
        triton_pos_embed_interpolate if HAS_TRITON else pos_embed_interpolate_native
    )
    outputs = []
    for t, h, w in grid_thw:
        # One interpolation per media item: resample the fixed
        # num_grid_per_side x num_grid_per_side grid to (h, w),
        # repeated across t frames.
        outputs.append(
            interpolate_fn(
                self.embeddings.position_embedding.weight,
                int(t),
                int(h),
                int(w),
                self.num_grid_per_side,
                self.spatial_merge_size,
                self.dtype,
            )
        )
    # Concatenate per-item embeddings along the token dimension.
    return torch.cat(outputs, dim=0)
Isotr0py (Member) commented:

Can you double-check whether Qwen3-VL's and GLM-4.1V's VisionEmbed implementations are fully equivalent, with converged numeric results?

I think GLM-4.1V uses bicubic interpolation instead of Qwen3-VL's bilinear interpolation.

grYe99 (Author) replied:

@Isotr0py Thanks for spotting this. Indeed, the original GLM-4.1V uses bicubic interpolation for vision position embeddings. I have updated the code and it passes the functional tests.
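For reference, a minimal native PyTorch sketch of bicubic position-embedding interpolation; the function and parameter names are illustrative, not the signature of the PR's actual pos_embed_interpolate_native:

```python
import torch
import torch.nn.functional as F


def interpolate_pos_embed_bicubic(
    pos_embed: torch.Tensor,  # (num_grid_per_side**2, hidden_dim)
    h: int,
    w: int,
    num_grid_per_side: int,
) -> torch.Tensor:
    # Reshape the flat embedding table into a 2D grid of feature vectors:
    # (1, hidden_dim, side, side), the layout F.interpolate expects.
    side = num_grid_per_side
    grid = (
        pos_embed.reshape(side, side, -1)
        .permute(2, 0, 1)
        .unsqueeze(0)
        .float()
    )
    # Bicubic resampling to the target (h, w) patch grid, matching the
    # original GLM-4.1V behavior (bilinear here would change the numerics).
    resized = F.interpolate(grid, size=(h, w), mode="bicubic", align_corners=False)
    # Back to (h*w, hidden_dim) so it can be added to patch embeddings.
    return resized.squeeze(0).permute(1, 2, 0).reshape(h * w, -1)
```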


Labels

documentation (Improvements or additions to documentation), multi-modality (Related to multi-modality (#4194)), nvidia
