Training Qwen2-vl with video data #70
-
Hi, I know that Qwen2-VL currently only supports fine-tuning with images. I was wondering if there is a way to use videos for the fine-tuning process. Previously, I tried sampling some frames and using the multi-image format, but I often encountered CUDA out-of-memory issues. Please let me know if there is a workaround for this, or whether there is a new development that lets Qwen2-VL support video for fine-tuning.
Replies: 2 comments 1 reply
-
Fine-tuning with video data requires a large amount of GPU memory. May I ask what your training GPU configuration is? Here are a few suggestions: reduce the batch size, lower the LoRA rank, use DeepSpeed ZeRO-3 for training, or switch to a model with fewer parameters (a rough sketch of these settings is below). If none of these options work, then you might need to upgrade your device for this fine-tuning (Todd Howard).
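A minimal sketch of what these memory-saving suggestions could look like with `peft` and `transformers`, assuming a standard HF-Trainer-style fine-tuning setup. The target module names, hyperparameter values, and the `ds_zero3.json` DeepSpeed config path are illustrative assumptions, not values taken from this repo:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Lower LoRA rank to shrink the number of trainable parameters.
# target_modules are an assumption for the language-model projections;
# adjust to match the actual module names in your model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qwen2vl-video-lora",
    per_device_train_batch_size=1,    # reduce the per-GPU batch size
    gradient_accumulation_steps=8,    # keep the effective batch size reasonable
    gradient_checkpointing=True,      # trade extra compute for lower memory
    bf16=True,
    deepspeed="ds_zero3.json",        # path to a DeepSpeed ZeRO-3 config (assumed to exist)
    learning_rate=1e-4,
    num_train_epochs=1,
    logging_steps=10,
)
```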
-
Just to add on to what @lavinal712 commented: as far as I know, current models all use sampled image frames to represent video, exactly like what you have tried, so I don't think there is a "workaround". I would instead suggest trying other models, especially ones that by design support video or multi-image settings, like LLaVA-NeXT-Video and LLaVA-OneVision. The reason is that these models downsample the image frames, so there won't be so many image tokens that they blow up your memory.
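For reference, a minimal sketch of the "sampled and downsampled frames" idea using OpenCV. The frame count and maximum size are illustrative assumptions, not Qwen2-VL or LLaVA defaults; the point is just that fewer, smaller frames mean fewer image tokens:

```python
import cv2

def sample_frames(video_path, num_frames=8, max_size=448):
    """Uniformly sample a few frames from a video and downsample them,
    so a multi-image prompt does not produce an excessive number of image tokens."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        h, w = frame.shape[:2]
        scale = max_size / max(h, w)
        if scale < 1.0:
            # Downsample so the longer side is at most max_size pixels.
            frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```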