Training Qwen2-VL with video data #70

Answered by zjysteven
caesaralpha asked this question in Q&A

Just to add to what @lavinal712 commented: as far as I know, current models all use sampled image frames to represent video, exactly as you have tried. So I don't think there is a "workaround".
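
For reference, here is a minimal sketch of that kind of uniform frame sampling, using OpenCV. The frame count and the idea of passing the frames straight to the processor are illustrative assumptions on my part, not the exact preprocessing used by Qwen2-VL.

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 8) -> list[np.ndarray]:
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV returns BGR; convert to RGB before handing off to the image processor.
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# The sampled frames can then be treated as a multi-image input,
# e.g. passed to the model's processor as a list of images.
```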

I would instead suggest trying other models, especially ones that support video or multi-image settings by design, such as Llava-Next-Video and Llava-OneVision. The reason is that these models downsample the image frames, so you don't end up with so many image tokens that they explode your memory.
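
As a rough back-of-the-envelope illustration of why un-downsampled frames blow up the token count: the figure of one visual token per 28x28 pixel region below assumes Qwen2-VL-style preprocessing (14x14 ViT patches merged 2x2); treat the exact numbers as an assumption, not a spec.

```python
# Rough visual-token estimate when video frames are fed as full images.
# Assumption: ~1 visual token per 28x28 pixel patch (14x14 patches, 2x2 merge).
def tokens_per_frame(height: int, width: int, patch: int = 28) -> int:
    return (height // patch) * (width // patch)

frames = 64
h, w = 448, 448
per_frame = tokens_per_frame(h, w)   # 16 * 16 = 256 tokens per frame
total = frames * per_frame            # 64 * 256 = 16,384 visual tokens
print(f"{per_frame} tokens/frame -> {total} visual tokens for {frames} frames")
```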
