Training Qwen2-VL with video data #70

Answered by zjysteven
caesaralpha asked this question in Q&A

Just to add to what @lavinal712 commented: as far as I know, current models all use sampled image frames to represent video, exactly as you have tried. So I don't think there is a "workaround".
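
For reference, here is a minimal sketch of that kind of uniform frame sampling, using OpenCV. The frame count and the idea of passing the frames straight to the processor are illustrative assumptions on my part, not the exact preprocessing used by Qwen2-VL.

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 8) -> list[np.ndarray]:
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV returns BGR; convert to RGB before handing off to the image processor.
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# The sampled frames can then be treated as a multi-image input,
# e.g. passed to the model's processor as a list of images.
```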

I would instead suggest trying other models, especially ones that support video or multi-image settings by design, such as Llava-Next-Video and Llava-OneVision. The reason is that these models downsample the image frames, so you don't end up with so many image tokens that they explode your memory.
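
As a rough back-of-the-envelope illustration of why un-downsampled frames blow up the token count: the figure of one visual token per 28x28 pixel region below assumes Qwen2-VL-style preprocessing (14x14 ViT patches merged 2x2); treat the exact numbers as an assumption, not a spec.

```python
# Rough visual-token estimate when video frames are fed as full images.
# Assumption: ~1 visual token per 28x28 pixel patch (14x14 patches, 2x2 merge).
def tokens_per_frame(height: int, width: int, patch: int = 28) -> int:
    return (height // patch) * (width // patch)

frames = 64
h, w = 448, 448
per_frame = tokens_per_frame(h, w)   # 16 * 16 = 256 tokens per frame
total = frames * per_frame            # 64 * 256 = 16,384 visual tokens
print(f"{per_frame} tokens/frame -> {total} visual tokens for {frames} frames")
```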
