[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames#24161
vllm-bot merged 12 commits into vllm-project:main
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Benchmark results
Script: https://gist.github.com/Isotr0py/921b17edaeef1ed8bc211e22b47c84b4
        input_ids)[0]
    if "do_sample_frames" in mm_kwargs and not mm_kwargs[
            "do_sample_frames"]:
        # Transformers v4.55 has incorrect timestamps issue for
Is there a link to the relevant issue so we know when to remove this workaround?
The root issue is the hardcoded 24 fps in the no-sampling code path of Transformers v4.55:
https://github.com/huggingface/transformers/blob/d79b2d981f28b2730d402244ac3c2e9a8c054eee/src/transformers/models/glm4v/video_processing_glm4v.py#L173-L176
I think huggingface/transformers#39600 should have fixed this issue, so we can remove this workaround after the Transformers v4.56 update. (GLM4.1V's vLLM multimodal processor is currently broken on Transformers v4.56, though; I would like to fix that together in a follow-up PR 😅)
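To illustrate why a hardcoded frame rate yields wrong timestamps, here is a minimal sketch. It does not reproduce the Transformers code; `frame_timestamps` is a hypothetical helper, and the 30 fps source video is an assumed example.

```python
def frame_timestamps(indices, fps):
    """Return the timestamp in seconds of each frame index at the given fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [i / fps for i in indices]


# Assumed example: one frame kept per second of a 30 fps video.
indices = [0, 30, 60, 90]

# Timestamps computed from the video's real frame rate.
actual = frame_timestamps(indices, fps=30.0)

# Timestamps computed with a hardcoded 24 fps, as in the issue described above.
assumed = frame_timestamps(indices, fps=24.0)

print(actual)   # [0.0, 1.0, 2.0, 3.0]
print(assumed)  # [0.0, 1.25, 2.5, 3.75]
```

The drift grows with the frame index, so any video whose real frame rate is not 24 fps gets increasingly wrong timestamps when the caller does the sampling itself.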
…ry frames (vllm-project#24161) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Purpose
--media-io-kwargs '{"video": {"num_frames": -1}}', which is not safe enough and can cause extremely high RAM usage that crashes the server if the input video is long.
Test Plan
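As a rough illustration of the memory concern raised under Purpose, the sketch below compares decoding every frame with decoding only a sampled subset. It assumes plain uniform sampling and uncompressed 8-bit RGB frames, which is not necessarily the sampling scheme this PR implements; `sample_frame_indices` and `decoded_bytes` are hypothetical helpers, and the video dimensions are made-up example values.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick up to num_samples evenly spaced frame indices (uniform sampling)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def decoded_bytes(num_frames, height, width, channels=3):
    """Bytes needed to hold num_frames uncompressed 8-bit frames in memory."""
    return num_frames * height * width * channels


# Assumed example: a 10-minute 1080p video at 30 fps.
total_frames = 10 * 60 * 30  # 18000 frames

# Decoding everything, then sampling afterwards.
all_frames = decoded_bytes(total_frames, 1080, 1920)

# Choosing indices first, then decoding only those frames.
indices = sample_frame_indices(total_frames, 32)
needed = decoded_bytes(len(indices), 1080, 1920)

print(f"all frames:     {all_frames / 1e9:.1f} GB")
print(f"sampled frames: {needed / 1e6:.1f} MB")
```

Because the indices depend only on the frame count, they can be computed from the video metadata before any frame is decoded.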
Test Result