mtmd: support "frame merge" for qwen-vl-based models #21858
This comment was marked as resolved.
|
@ggml-org/maintainers appreciate if someone can give approval(s), thanks! |
|
I'm fairly sure it can't be assumed that two consecutive images in any prompt should be merged. They may be totally unrelated. AFAIK "conv3d" in the Qwen ViT is an MLP that performs an arbitrary transformation of image patches. For example, if a user uploads two screenshots of the same size (one showing a handbag, the other an eBay listing of handbags) with the prompt "OCR the listing for this particular handbag", the LM decoder wouldn't be able to do the job. In general, |
|
The server adds a newline between two images, so there is no chance of them being merged. This is only useful for custom prompt formatting (aka the upcoming mtmd_helper_video interface) that will take advantage of it. And even if users report problems with this, we can simply add a new API |
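To make the adjacency argument above concrete, here is a minimal Python sketch of the idea, not the mtmd implementation. It assumes that only back-to-back images of identical size are candidates for merging, that at most two frames go into one merged unit, and that any text chunk (such as the newline the server inserts) breaks the run. The names `Text`, `Image`, and `group_frames` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    content: str


@dataclass
class Image:
    width: int
    height: int


Chunk = Union[Text, Image]


def group_frames(chunks: List[Chunk], frames_per_merge: int = 2) -> List[List[Chunk]]:
    """Group back-to-back, same-size images; every other chunk stays alone.

    Hypothetical rule for illustration only: an image joins the previous
    group when that group is not full and ends with an image of the same
    width and height. Any text chunk in between prevents the merge.
    """
    groups: List[List[Chunk]] = []
    for chunk in chunks:
        prev = groups[-1] if groups else None
        can_merge = (
            isinstance(chunk, Image)
            and prev is not None
            and len(prev) < frames_per_merge
            and isinstance(prev[-1], Image)
            and (prev[-1].width, prev[-1].height) == (chunk.width, chunk.height)
        )
        if can_merge:
            prev.append(chunk)
        else:
            groups.append([chunk])
    return groups
```

Under these assumptions, `[Image, Image]` forms one merged group, while `[Image, Text("\n"), Image]` stays as three separate chunks, which is the situation described for the server's prompt formatting.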
|
@ggml-org/maintainers another approval please 🙏 |
|
Noting a couple of technical points post-merge:
|
Overview
Important
This PR is part of #18389; it doesn't yet provide an end-to-end pipeline for video input. Please only report bugs if you really understand the changes in this PR.
Continues #20224 (moved here because I cannot push to the old PR)
How it works:
Requirements