mtmd: support "frame merge" for qwen-vl-based models by ngxson · Pull Request #21858 · ggml-org/llama.cpp

ngxson · 2026-04-13T15:45:53Z

Overview

Important

This PR is part of #18389; it does not yet provide an end-to-end pipeline for video input. Please only report bugs if you really understand the changes in this PR.

Continue #20224 (moved here because I cannot push to the old PR)

How it works:

Qwen-VL-based models have a conv3d on the input image that allows merging 2 input images (which must be the same size) into the same output embeddings. This speeds up decoding and reduces memory usage.
This implementation automatically merges 2 input images if they are (1) placed right next to each other and (2) exactly the same size. No additional public API is added.
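The two conditions above can be sketched as a small planning pass over the chunk list. This is only an illustration in Python, not the mtmd code: `ImageChunk`, `TextChunk`, and `plan_frame_merges` are hypothetical names, and the greedy left-to-right pairing is an assumption about how adjacent frames would be grouped.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class ImageChunk:  # hypothetical stand-in for an image chunk
    width: int
    height: int


@dataclass
class TextChunk:  # hypothetical stand-in for a text chunk
    text: str


Chunk = Union[ImageChunk, TextChunk]


def plan_frame_merges(chunks: List[Chunk]) -> List[Tuple[int, ...]]:
    """Group image chunk indices into encoder inputs.

    A pair (i, i + 1) means both images are merged into one set of
    embeddings; a single (i,) means the image is encoded on its own.
    Text chunks are skipped. Pairing is greedy, left to right (assumption).
    """
    groups: List[Tuple[int, ...]] = []
    i = 0
    while i < len(chunks):
        cur = chunks[i]
        if not isinstance(cur, ImageChunk):
            i += 1
            continue
        nxt = chunks[i + 1] if i + 1 < len(chunks) else None
        same_size = (
            isinstance(nxt, ImageChunk)
            and (nxt.width, nxt.height) == (cur.width, cur.height)
        )
        if same_size:
            groups.append((i, i + 1))  # (1) adjacent and (2) same size
            i += 2
        else:
            groups.append((i,))
            i += 1
    return groups


frames = [
    ImageChunk(448, 448),
    ImageChunk(448, 448),
    ImageChunk(448, 448),
    ImageChunk(224, 224),
]
print(plan_frame_merges(frames))  # [(0, 1), (2,), (3,)]
```

In this sketch four input images produce three encoder inputs instead of four, which is where the reduction in decoding work and memory would come from.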

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: no

ngxson · 2026-06-06T16:26:03Z

@ggml-org/maintainers I'd appreciate it if someone could give approval(s), thanks!

Farmadupe · 2026-06-06T17:01:07Z

I'm fairly sure it can't be assumed that two subsequent images in any prompt should be merged. They may be totally unrelated. AFAIK "conv3d" in the qwen vit is an MLP that does arbitrary transformation of image patches.

For example, if a user uploads two screenshots of same size (one image is a handbag, and one is an ebay listing of handbags), with prompt "OCR the listing for this particular handbag", the LM decoder wouldn't be able to do the job.

In general, v1/chat/completions API cannot unambiguously describe temporal data. Adding video support to llama-server will almost definitely require a nonstandard extension.

ngxson · 2026-06-06T19:14:39Z

The server adds a newline between 2 images, so there is no chance of them being merged.

This is only useful for custom prompt formatting (aka the upcoming mtmd_helper_video interface) that will take advantage of it. Even if users report problems with this, we can simply add a new API, mtmd_bitmap_set_merge(false), to explicitly disable this logic.
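To make the two points above concrete, here is a minimal Python sketch of the adjacency check, assuming a text chunk between two images (such as the newline the server inserts) breaks adjacency. The `allow_merge` field is hypothetical and only models what the proposed `mtmd_bitmap_set_merge(false)` could do; that API does not exist in this PR, and `Item`, `can_merge`, and `merge_candidates` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Item:
    kind: str                               # "image" or "text"
    size: Optional[Tuple[int, int]] = None  # (width, height) for images
    allow_merge: bool = True                # hypothetical per-bitmap opt-out


def can_merge(a: Item, b: Item) -> bool:
    """Two neighbouring items merge only if both are same-size images
    and neither has opted out (the opt-out is an assumption)."""
    return (
        a.kind == "image"
        and b.kind == "image"
        and a.size == b.size
        and a.allow_merge
        and b.allow_merge
    )


def merge_candidates(items: List[Item]) -> List[Tuple[int, int]]:
    """Return index pairs of directly adjacent items that could be merged."""
    return [
        (i, i + 1)
        for i in range(len(items) - 1)
        if can_merge(items[i], items[i + 1])
    ]


img = Item("image", (448, 448))
server_style = [img, Item("text"), img]  # newline chunk between images
helper_style = [img, img]                # frames placed back to back
opted_out = [img, Item("image", (448, 448), allow_merge=False)]

print(merge_candidates(server_style))  # []
print(merge_candidates(helper_style))  # [(0, 1)]
print(merge_candidates(opted_out))     # []
```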

ngxson · 2026-06-06T19:17:08Z

@ggml-org/maintainers another approval please 🙏

Farmadupe · 2026-06-06T20:34:01Z

Noting a couple of technical points post-merger:

I'm fairly sure that this is a temporal merging layer and is only intended to run on frames that are known to be temporally related.
Adjacency of image chunks in the token stream does not prove or disprove that frames are temporally related.
The presence of newlines between content array elements should be a function of the Jinja template. I don't think it's possible to assert that this will always be the case unless special measures are taken.

andrewmd5 and others added 3 commits March 6, 2026 21:37

feat: add video support for Qwen3.5

573f2cf

Merge branch 'master' into video-support

f558360

various clean up

c5b682b

github-actions Bot added the examples label Apr 13, 2026

Merge branch 'master' into mtmd-video-api

e5b3d6d

This comment was marked as resolved.

ngxson mentioned this pull request Jun 5, 2026

mtmd, server: add "placeholder bitmap" for counting tokens , add */input_tokens API #23913

Merged

ngxson added 2 commits June 6, 2026 13:38

Merge branch 'master' into mtmd-video-api

96e24ca

revise the design

a404c4e

ngxson mentioned this pull request Jun 6, 2026

mtmd: plan to add video input support #18389

Open

ngxson changed the title ~~mtmd: support input sequence of images (initial video support)~~ mtmd: support "frame merge" for qwen-vl-based models Jun 6, 2026

ngxson added 3 commits June 6, 2026 18:20

fix llava-uhd case

82b4821

nits

9819ad4

nits 2

b031b60

ngxson marked this pull request as ready for review June 6, 2026 16:25

ngxson requested a review from a team as a code owner June 6, 2026 16:25

ServeurpersoCom approved these changes Jun 6, 2026

CISC approved these changes Jun 6, 2026

ngxson merged commit 31e8249 into master Jun 6, 2026
24 of 25 checks passed

ngxson merged 9 commits into master from mtmd-video-api

5 participants
