[MM][Perf][CG] Support ViT full CUDA graph for glm4_1v image and video inference #40576
grYe99 wants to merge 19 commits into vllm-project:main
Conversation
@claude review
Code Review
This pull request implements CUDA graph support for the GLM-4V model by introducing a fused Triton kernel for position-embedding interpolation and refactoring the vision encoder's metadata preparation. Key changes include the addition of a native PyTorch fallback for interpolation, the implementation of the SupportsEncoderCudaGraph protocol, and optimizations to rotary position ID generation using lru_cache. Review feedback identified a potential regression in model accuracy due to the switch from bicubic to bilinear interpolation and highlighted a lack of error handling for empty input lists in the metadata preparation logic.
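As a rough illustration of the `lru_cache` optimization mentioned in the review above (a sketch with assumed names and a simplified layout, not the PR's actual helper):

```python
# Sketch only: `rot_pos_ids` is a hypothetical stand-in for the cached rotary
# position-ID helper; the real code also interleaves tokens by merge blocks.
from functools import lru_cache

import torch


@lru_cache(maxsize=1024)
def rot_pos_ids(h: int, w: int) -> torch.Tensor:
    """Per-token (row, col) indices for an h x w patch grid, cached by shape."""
    hpos = torch.arange(h).unsqueeze(1).expand(h, w)
    wpos = torch.arange(w).unsqueeze(0).expand(h, w)
    # (h * w, 2); many requests share grid sizes, so the cache hit rate is high
    return torch.stack([hpos, wpos], dim=-1).reshape(-1, 2)
```

Caching by grid shape avoids recomputing these indices for repeated image/video sizes, which helps when the encoder is replayed under fixed-shape CUDA graph buckets.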
```diff
@@ -1385,7 +1704,12 @@ def get_video_replacement_glm4v(item_idx: int):
     dummy_inputs=Glm4vDummyInputsBuilder,
 )
 class Glm4vForConditionalGeneration(
```
Please also update this model to:
Example: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/vision_language.py#L2466-L2469 (if needed)
Force-pushed from a9b7652 to 2ffaca2.
Documentation preview: https://vllm--40576.org.readthedocs.build/en/40576/
Force-pushed from ae6631e to d824bba.
@DarkLight1337 Hi, could you give this PR a 'ready' label to run CI tests? Thanks!
Force-pushed from d824bba to 893f600.
19 commits, all signed off by grYe99 <guorongye99@gmail.com>, including reverts of 87184c4 and 142b265.
Force-pushed from b05a230 to 3a67b98.
@shen-shanshan @b-mu Hi, could you help review this PR when you have time? I recently updated the code to support "auto-infer compilation-config", and it passed the tests below as well as serving.
```python
def fast_pos_embed_interpolate(self, grid_thw: list[list[int]]) -> torch.Tensor:
    interpolate_fn = (
        triton_pos_embed_interpolate if HAS_TRITON else pos_embed_interpolate_native
    )
    outputs = []
    for t, h, w in grid_thw:
        outputs.append(
            interpolate_fn(
                self.embeddings.position_embedding.weight,
                int(t),
                int(h),
                int(w),
                self.num_grid_per_side,
                self.spatial_merge_size,
                self.dtype,
            )
        )
    return torch.cat(outputs, dim=0)
```
Can you double-check whether Qwen3-VL's and GLM-4.1V's VisionEmbed implementations are fully equivalent, with converged numeric results?
I think GLM-4.1V uses bicubic interpolation instead of Qwen3-VL's bilinear interpolation.
@Isotr0py Thanks for spotting this. Indeed, the original GLM-4.1V uses bicubic interpolation for its vision position embeddings. I have updated the code and it passes the functional tests.
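For reference, a minimal sketch of what a native bicubic fallback could look like (a simplified stand-in for `pos_embed_interpolate_native`, not the PR's implementation; the real code also reorders tokens into `spatial_merge_size` blocks, which this omits):

```python
# Simplified sketch: bicubically resamples the learned
# (num_grid_per_side**2, dim) position table to an (h, w) grid and repeats it
# across t frames. Merge-block token ordering is intentionally omitted.
import torch
import torch.nn.functional as F


def pos_embed_interpolate_native(
    weight: torch.Tensor,
    t: int,
    h: int,
    w: int,
    num_grid_per_side: int,
    dtype: torch.dtype,
) -> torch.Tensor:
    dim = weight.shape[1]
    # (1, dim, side, side) layout expected by F.interpolate
    grid = weight.reshape(num_grid_per_side, num_grid_per_side, dim)
    grid = grid.permute(2, 0, 1).unsqueeze(0).float()  # fp32 for stable bicubic
    resized = F.interpolate(grid, size=(h, w), mode="bicubic", align_corners=False)
    # back to (h * w, dim), then tile over the t temporal frames
    out = resized.squeeze(0).permute(1, 2, 0).reshape(h * w, dim)
    return out.repeat(t, 1).to(dtype)
```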
Purpose
Following #38175, this PR implements ViT full CUDA graph support for glm4_1v models' image and video inference. The implementation draws on #35963 (image) and #38061 (video).
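At a high level, the model opts its vision encoder into CUDA graph capture. A rough sketch of the shape of that opt-in (the actual `SupportsEncoderCudaGraph` interface lives in vLLM's model-interface code and may differ):

```python
# Sketch with assumed names: a marker protocol telling the runner that the
# multimodal encoder is safe to capture (static shapes per bucket, no
# data-dependent host-side control flow during replay).
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsEncoderCudaGraph(Protocol):
    supports_encoder_cuda_graph: bool


class Glm4vForConditionalGeneration:  # heavily abbreviated
    supports_encoder_cuda_graph = True
```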
Test Plan
1. Functional Test (a minimal sketch follows this list)
2. Benchmark
3. Bench Serve
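A minimal offline sketch of the functional-test setup (the `compilation_config` value and the GLM-4.1V image placeholder string are assumptions, not the PR's exact commands; see vLLM's `examples/offline_inference/vision_language.py` for the canonical prompt format):

```python
# Sketch only: config keys and the image placeholder are assumptions; check
# the vLLM docs/examples for the release you run.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.1V-9B-Thinking",
    compilation_config={"cudagraph_mode": "FULL"},  # assumed full-CUDA-graph knob
)

# A synthetic image keeps the smoke test self-contained.
prompt = {
    # "<|begin_of_image|><|image|><|end_of_image|>" is assumed to be the
    # model's image placeholder; verify against the official example.
    "prompt": "<|begin_of_image|><|image|><|end_of_image|>Describe the image.",
    "multi_modal_data": {"image": Image.new("RGB", (448, 448), "gray")},
}
out = llm.generate(prompt, SamplingParams(temperature=0, max_tokens=64))
print(out[0].outputs[0].text)
```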
Test Result
1. Functional Test
Single GPU (zai-org/GLM-4.1V-9B-Thinking, 1xRTX4090, random-mm, 1000 prompts):
Multi GPU (zai-org/GLM-4.1V-9B-Thinking, 2xRTX4090, random-mm, 1000 prompts):
Single GPU (zai-org/GLM-4.6V-Flash, 1xRTX4090, random-mm, 1000 prompts):
Multi GPU (zai-org/GLM-4.6V-Flash, 2xRTX4090, random-mm, 1000 prompts):
Benchmark results: eager vs. cuda graph.
Note
Glm4vVisionAttention does not support --mm-encoder-attn-backend FLASHINFER yet, so testing used FLASH_ATTN only; FLASHINFER support will land in another PR.

Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.