mtmd: support "frame merge" for qwen-vl-based models #21858

Merged
ngxson merged 9 commits into master from mtmd-video-api
Jun 6, 2026

@ngxson ngxson commented Apr 13, 2026

Overview

Important

This PR is part of #18389; it does not yet provide an end-to-end pipeline for video input. Please only report bugs if you really understand the changes in this PR.

Continues #20224 (moved here because I cannot push to the old PR).

How it works:

  • Qwen-VL-based models have a conv3d for the input image, which allows merging 2 input images (they must be the same size) into the same output embeddings. This speeds up decoding and reduces memory usage.
  • This implementation automatically merges 2 input images if they are (1) placed right next to each other and (2) exactly the same size. No additional public API is added.
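
The merge rule described above can be sketched roughly as follows. This is only an illustration in Python, not the actual mtmd code: `ImageChunk`, `TextChunk`, and `plan_frame_merges` are hypothetical names, and the greedy left-to-right pairing is an assumption about how adjacent frames get grouped.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class ImageChunk:  # hypothetical stand-in for an image chunk
    width: int
    height: int


@dataclass
class TextChunk:  # hypothetical stand-in for a text chunk
    text: str


Chunk = Union[ImageChunk, TextChunk]


def plan_frame_merges(chunks: List[Chunk]) -> List[Tuple[int, ...]]:
    """Group chunk indices: (i, i + 1) for a merged image pair, (i,) otherwise.

    Assumption: pairs are formed greedily from left to right.
    """
    groups: List[Tuple[int, ...]] = []
    i = 0
    while i < len(chunks):
        cur = chunks[i]
        nxt = chunks[i + 1] if i + 1 < len(chunks) else None
        if (
            isinstance(cur, ImageChunk)
            and isinstance(nxt, ImageChunk)  # (1) directly adjacent
            and (cur.width, cur.height) == (nxt.width, nxt.height)  # (2) same size
        ):
            groups.append((i, i + 1))
            i += 2
        else:
            groups.append((i,))
            i += 1
    return groups


frames = [
    ImageChunk(448, 448),
    ImageChunk(448, 448),
    ImageChunk(448, 448),
    TextChunk("describe"),
    ImageChunk(224, 224),
]
print(plan_frame_merges(frames))  # [(0, 1), (2,), (3,), (4,)]
```

In this sketch the third frame stays unmerged because its only remaining neighbour is a text chunk, and the last image is never considered for merging because it is not adjacent to another image.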

Requirements

@ngxson ngxson changed the title mtmd: support input sequence of images (initial video support) mtmd: support "frame merge" for qwen-vl-based models Jun 6, 2026
@ngxson ngxson marked this pull request as ready for review June 6, 2026 16:25
@ngxson ngxson requested a review from a team as a code owner June 6, 2026 16:25

ngxson commented Jun 6, 2026

@ggml-org/maintainers I'd appreciate it if someone could give approval(s), thanks!

Farmadupe commented Jun 6, 2026

I'm fairly sure it can't be assumed that two consecutive images in any prompt should be merged. They may be totally unrelated. AFAIK "conv3d" in the Qwen ViT is an MLP that does an arbitrary transformation of image patches.

For example, if a user uploads two screenshots of the same size (one is an image of a handbag, the other is an eBay listing of handbags) with the prompt "OCR the listing for this particular handbag", the LM decoder wouldn't be able to do the job.

In general, the v1/chat/completions API cannot unambiguously describe temporal data. Adding video support to llama-server will almost certainly require a nonstandard extension.

ngxson commented Jun 6, 2026

The server adds a newline between 2 images, so there is no chance of them being merged.

This is only useful for custom prompt formatting (aka the upcoming mtmd_helper_video interface) that will take advantage of it. Even if users report problems with this, we can simply add a new API, mtmd_bitmap_set_merge(false), to explicitly disable this logic.
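
As a rough illustration of what such an opt-out could look like, here is a minimal Python sketch. The `Bitmap` class, its `allow_merge` field, and `can_merge` are all hypothetical; they only mirror the idea behind the proposed mtmd_bitmap_set_merge(false), which is not part of this PR.

```python
from dataclasses import dataclass


@dataclass
class Bitmap:  # hypothetical stand-in for an input bitmap
    width: int
    height: int
    allow_merge: bool = True  # hypothetical per-bitmap opt-out flag


def can_merge(a: Bitmap, b: Bitmap) -> bool:
    """Two adjacent bitmaps may merge only if neither opted out and sizes match."""
    return (
        a.allow_merge
        and b.allow_merge
        and (a.width, a.height) == (b.width, b.height)
    )


handbag = Bitmap(448, 448, allow_merge=False)  # caller knows this is not a video frame
listing = Bitmap(448, 448)
print(can_merge(handbag, listing))  # False
```

With a flag like this, a caller that knows two same-size images are unrelated could keep them separate regardless of how the prompt is formatted.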

ngxson commented Jun 6, 2026

@ggml-org/maintainers another approval please 🙏

@ngxson ngxson merged commit 31e8249 into master Jun 6, 2026
24 of 25 checks passed
@Farmadupe

Noting a couple of technical points post-merger:

  • I'm fairly sure that this is a temporal merging layer and is only intended to run on frames that are known to be temporally related.
  • Adjacency of image chunks in the token stream does not prove or disprove that frames are temporally related.
  • The presence of newlines between content array elements should be a function of the jinja template. I don't think it's possible to assert that this will always be the case unless special measures are taken.
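
To make the last point concrete, here is a small Python sketch of how the separator chosen at render time decides whether two image markers end up adjacent. It does not use jinja or mtmd: `MARKER`, `render`, `split_chunks`, and `has_adjacent_images` are hypothetical, and the `separator` argument stands in for whatever the chat template emits between content parts.

```python
import re
from typing import List

MARKER = "<image>"  # placeholder marker, not necessarily the one mtmd uses


def render(parts: List[str], separator: str) -> str:
    """Stand-in for a chat template that joins content parts."""
    return separator.join(parts)


def split_chunks(prompt: str) -> List[str]:
    """Split a rendered prompt into marker chunks and text chunks."""
    pieces = re.split(f"({re.escape(MARKER)})", prompt)
    return [p for p in pieces if p]  # drop empty strings between markers


def has_adjacent_images(chunks: List[str]) -> bool:
    """True if two image markers sit next to each other with no text between."""
    return any(a == MARKER and b == MARKER for a, b in zip(chunks, chunks[1:]))


parts = [MARKER, MARKER, "Compare these."]
with_newline = split_chunks(render(parts, "\n"))
without_newline = split_chunks(render(parts, ""))

print(has_adjacent_images(with_newline))     # False
print(has_adjacent_images(without_newline))  # True
```

The same two images are separated by a text chunk in one rendering and directly adjacent in the other, so whether they qualify for merging depends on the template rather than on the images themselves.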
