[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames by Isotr0py · Pull Request #24161 · vllm-project/vllm

Isotr0py · 2025-09-03T08:18:12Z

Purpose

To ensure correct video processing for GLM4.5V and the upcoming Qwen3-VL, we currently need to pass --media-io-kwargs '{"video": {"num_frames": -1}}'. This is not safe: if the input video is long, it causes extremely high RAM usage that can crash the server.
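To give a sense of scale for the RAM concern, here is a back-of-the-envelope estimate. It assumes every decoded frame is held in memory as an uncompressed 8-bit RGB array; the helper name and the example resolution, duration, and frame rate are illustrative assumptions, not values taken from vLLM.

```python
def raw_rgb_bytes(num_frames: int, height: int, width: int) -> int:
    """Bytes needed to hold `num_frames` uncompressed 8-bit RGB frames."""
    return num_frames * height * width * 3


# Hypothetical input: a 10-minute 1080p clip at 30 fps, fully decoded.
frames = 10 * 60 * 30
total = raw_rgb_bytes(frames, 1080, 1920)
print(f"{frames} frames -> {total / 2**30:.1f} GiB")
```

Under these assumptions the fully decoded clip is on the order of 100 GiB, which is why decoding every frame up front does not scale to long videos.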
This PR adds a new video loader to support GLM4.5V-style dynamic sampling, so we don't need to decode all frames anymore.
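The idea can be sketched roughly as follows: compute the wanted frame indices from the video metadata first, then walk the stream and decode only those frames. This is a minimal sketch, not the loader added by this PR. The function names and the sampling policy are assumptions for illustration, and `cap` is any object with `grab()` and `retrieve()` methods in the style of OpenCV's `cv2.VideoCapture`.

```python
import math
from typing import List


def sample_frame_indices(total_frames: int, source_fps: float,
                         target_fps: float, max_duration: float) -> List[int]:
    """Pick frame indices to decode (hypothetical policy, not vLLM's code)."""
    if total_frames <= 0 or source_fps <= 0 or target_fps <= 0:
        return []
    last = total_frames - 1
    duration = total_frames / source_fps
    if duration <= max_duration:
        # Short video: sample at target_fps along the real timeline.
        count = max(1, math.floor(duration * target_fps))
        step = source_fps / target_fps
        picked = {min(last, math.ceil(i * step)) for i in range(count)}
    else:
        # Long video: cap the sample count and spread it evenly.
        count = max(1, int(max_duration * target_fps))
        if count == 1:
            picked = {0}
        else:
            picked = {min(last, round(i * last / (count - 1)))
                      for i in range(count)}
    # A set removes duplicates; sorting restores temporal order.
    return sorted(picked)


def decode_only(cap, wanted: List[int]) -> list:
    """Decode only the frames in `wanted`, skipping the rest cheaply."""
    wanted_set = set(wanted)
    if not wanted_set:
        return []
    stop = max(wanted_set)
    frames = []
    index = 0
    # grab() advances the stream without a full decode;
    # retrieve() decodes the frame that was just grabbed.
    while index <= stop and cap.grab():
        if index in wanted_set:
            ok, frame = cap.retrieve()
            if ok:
                frames.append(frame)
        index += 1
    return frames
```

With OpenCV, `cap` would be a `cv2.VideoCapture(path)`, and the frame count and frame rate could be read via `cap.get(cv2.CAP_PROP_FRAME_COUNT)` and `cap.get(cv2.CAP_PROP_FPS)`. Memory then scales with the number of sampled frames rather than with the length of the video.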

Test Plan

Test Result

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Isotr0py · 2025-09-10T16:11:20Z

Benchmark results

Script: https://gist.github.com/Isotr0py/921b17edaeef1ed8bc211e22b47c84b4
Hardware: AMD Ryzen Threadripper 3970X 32-Core Processor

[Full decoding backend processing] start
[Full decoding backend processing] memory cost: 4396.312MB
[Full decoding backend processing] time cost: 27.156s

[Dynamic decoding backend processing] start
[Dynamic decoding backend processing] memory cost: 500.066MB
[Dynamic decoding backend processing] time cost: 12.666s
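For a quick reading of the numbers above, the ratios can be computed directly from the reported figures. This is only arithmetic on the posted log lines; the variable names are illustrative.

```python
# Figures copied from the benchmark log above.
full = {"memory_mb": 4396.312, "time_s": 27.156}
dynamic = {"memory_mb": 500.066, "time_s": 12.666}

memory_ratio = full["memory_mb"] / dynamic["memory_mb"]
speedup = full["time_s"] / dynamic["time_s"]
print(f"memory: {memory_ratio:.1f}x lower, time: {speedup:.1f}x faster")
# prints "memory: 8.8x lower, time: 2.1x faster"
```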

DarkLight1337 · 2025-09-11T08:58:00Z

vllm/model_executor/models/glm4_1v.py

-                    input_ids)[0]
+                if "do_sample_frames" in mm_kwargs and not mm_kwargs[
+                        "do_sample_frames"]:
+                    # Transformers v4.55 has incorrect timestamps issue for


Is there a link to the relevant issue so we know when to remove this workaround?

The root issue is the hardcoded 24 fps in the no-sampling code path of Transformers v4.55:
https://github.com/huggingface/transformers/blob/d79b2d981f28b2730d402244ac3c2e9a8c054eee/src/transformers/models/glm4v/video_processing_glm4v.py#L173-L176

I think huggingface/transformers#39600 should have fixed this issue, so we can remove this workaround after the Transformers v4.56 update. (The current vLLM multimodal processor for GLM4.1V is broken on Transformers v4.56, though; I would like to fix that in a follow-up PR 😅)
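To illustrate why a hardcoded frame rate yields wrong timestamps, here is a small sketch. It does not reproduce the Transformers code; the helper name, the frame indices, and the 30 fps source rate are assumptions chosen for the example.

```python
from typing import List


def frame_timestamps(indices: List[int], fps: float) -> List[float]:
    """Timestamp in seconds of each frame index at the given frame rate."""
    return [i / fps for i in indices]


# Hypothetical frames sampled from a video recorded at 30 fps.
indices = [0, 30, 60, 90]
actual = frame_timestamps(indices, 30.0)   # uses the real frame rate
assumed = frame_timestamps(indices, 24.0)  # uses a hardcoded 24 fps
print(actual)   # [0.0, 1.0, 2.0, 3.0]
print(assumed)  # [0.0, 1.25, 2.5, 3.75]
```

The drift grows with the frame index, so timestamps computed with the wrong rate diverge further from the real ones the longer the video runs.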

init

0e8d811

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

mergify bot added the multi-modality label (Related to multi-modality (#4194)) Sep 3, 2025

Isotr0py added 3 commits September 10, 2025 01:47

fix assertion

11f322a

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

add test

536bccb

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Merge branch 'main' into glm-video-loader

1952720

Isotr0py added 2 commits September 11, 2025 00:12

Merge branch 'main' into glm-video-loader

d6c8ce5

cleanup

f895449

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Isotr0py marked this pull request as ready for review September 11, 2025 08:34

Isotr0py requested review from DarkLight1337 and ywang96 as code owners September 11, 2025 08:34

code format

27d5c15

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

DarkLight1337 reviewed Sep 11, 2025

vllm/multimodal/video.py (outdated, resolved)

DarkLight1337 reviewed Sep 11, 2025

address comment

ac8d1de

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Isotr0py requested a review from NickLucche as a code owner September 11, 2025 09:27

add sorted back

f7f4942

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

DarkLight1337 reviewed Sep 11, 2025

vllm/multimodal/video.py (outdated, resolved)

add sorted back to set and video loader test

67fd6f0

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

DarkLight1337 reviewed Sep 11, 2025

vllm/multimodal/video.py (outdated, resolved)

remove rename and add reference

8c8994a

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

DarkLight1337 approved these changes Sep 11, 2025

DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 11, 2025

Merge branch 'main' into glm-video-loader

4cb3abd

vllm-bot merged commit bcbe2a4 into vllm-project:main Sep 11, 2025
38 of 41 checks passed

Isotr0py deleted the glm-video-loader branch September 11, 2025 18:05

DarkLight1337 mentioned this pull request Sep 12, 2025

Update to Transformers v4.56.2 #24638

Merged

skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025

[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames (vllm-project#24161)

4692578

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Isotr0py mentioned this pull request Sep 14, 2025

[Bugfix] Fix GLM4.1V multimodal processor with compatability for Transformers v4.56 #24822

Merged

5 tasks

dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025

[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames (vllm-project#24161)

03e655f

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025

[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames (vllm-project#24161)

ea3fb5c

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

vllm-bot merged 12 commits into vllm-project:main from Isotr0py:glm-video-loader
