feat: add video support to mtmd by andrewmd5 · Pull Request #20224 · ggml-org/llama.cpp

andrewmd5 · 2026-03-08T02:25:37Z

As discussed in #18389 (comment), this PR adds video support to mtmd. The results below are from a scene-by-scene summary of a ~5 minute video, generated with Qwen3.5.

andrewmd5 · 2026-03-13T06:26:11Z

I noticed upon further testing that #20284 seems to have degraded the performance of this implementation substantially; I'll collect more data and see what's going on.

libin049 · 2026-04-09T02:20:05Z

Great! Are there any test examples available? Is it ready to be tested now?

libin049 · 2026-04-10T02:20:54Z

I tried this branch but encountered some issues. Did I miss any steps? Here are the commands I used:

./build/bin/llama-mtmd-cli -m ./Qwen2.5-Omni-3B.gguf --mmproj ./mmproj-Qwen2.5-Omni-3B.gguf \
  --image ./draw.mp4 --temp 0 -n 128 --flash-attn on -p "what is this?"

The draw.mp4 file is from:
https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4?spm=5176.28103460.0.0.4d216308swiAIN&file=draw.mp4

Error output:

WARN: This is an experimental CLI for testing multimodal capability.
      For normal use cases, please use the standard llama-cli
mtmd_helper_bitmap_init_from_buf: failed to decode image bytes

ngxson

I'm opening some comments for discussion. I'm taking over this PR and will push commits directly here, so please do not push new commits to avoid conflicts.

ngxson · 2026-04-13T14:00:17Z

+    uint32_t nt = 1; // number of temporal positions (1 for images, > 1 for video)
    bool use_mrope_pos = false; // use M-RoPE position counting (the whole image is 1 temporal position)
-    uint32_t n_tokens() const { return nx * ny; }
+    uint32_t n_tokens() const { return nt * nx * ny; }


this seems to be wrong for Qwen: it merges 2 frames into one output, so the output token count should stay nx * ny
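
A minimal sketch of the counting rule this comment points at, assuming the encoder folds a fixed number of consecutive frames into one temporal position. The function names and the `n_merge_t` parameter are hypothetical and not part of mtmd; rounding up for an unpaired trailing frame is also an assumption.

```cpp
#include <cstdint>

// Number of temporal positions left after merging `n_merge_t` consecutive
// frames into one (hypothetical parameter; 2 matches the Qwen-style merge
// described above, 1 means no merging). A trailing unpaired frame is assumed
// to still occupy one position, hence the round-up.
inline uint32_t n_temporal_positions(uint32_t n_frames, uint32_t n_merge_t) {
    return (n_frames + n_merge_t - 1) / n_merge_t;
}

// Output token count: one nx * ny grid per temporal position, not per frame.
inline uint32_t n_tokens_after_merge(uint32_t nx, uint32_t ny,
                                     uint32_t n_frames, uint32_t n_merge_t) {
    return n_temporal_positions(n_frames, n_merge_t) * nx * ny;
}
```

With `n_merge_t = 2`, a single image and a pair of frames both yield `nx * ny` tokens, which is the behaviour the comment expects.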

ngxson · 2026-04-13T14:45:20Z

+    if (is_video) {
+        const size_t nb1 = ggml_row_size(inp_raw->type, img.nx);
+        const size_t nb2 = nb1 * img.ny;
+        ggml_tensor * inp_even = ggml_view_3d(ctx0, inp_raw, img.nx, img.ny, 3, nb1, nb2, 0);
+        ggml_tensor * inp_odd  = ggml_view_3d(ctx0, inp_raw, img.nx, img.ny, 3, nb1, nb2, nb2 * 3);
+        inp = ggml_add(ctx0,
+            ggml_conv_2d(ctx0, model.patch_embeddings_0, inp_even, patch_size, patch_size, 0, 0, 1, 1),
+            ggml_conv_2d(ctx0, model.patch_embeddings_1, inp_odd,  patch_size, patch_size, 0, 0, 1, 1));
+    } else {
+        inp = ggml_add(ctx0,
+            ggml_conv_2d(ctx0, model.patch_embeddings_0, inp_raw, patch_size, patch_size, 0, 0, 1, 1),
+            ggml_conv_2d(ctx0, model.patch_embeddings_1, inp_raw, patch_size, patch_size, 0, 0, 1, 1));
+    }


if I read this correctly, the number of output tokens stays unchanged whether we input a single image or 2 frames
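
To make that reading concrete, here is a simplified stand-alone sketch (plain C++, single channel, one output feature, no ggml) of summing two strided patch convolutions. `patch_conv` and `patch_embed_pair` are hypothetical helpers written for illustration; they only mirror the shape behaviour of the two `ggml_conv_2d` calls in the hunk above.

```cpp
#include <cstddef>
#include <vector>

// One output value per non-overlapping p x p patch (stride == kernel size,
// no padding). Output grid is (nx / p) x (ny / p).
std::vector<float> patch_conv(const std::vector<float> & frame, int nx, int ny,
                              const std::vector<float> & kernel, int p) {
    const int ox = nx / p;
    const int oy = ny / p;
    std::vector<float> out(static_cast<size_t>(ox) * oy, 0.0f);
    for (int py = 0; py < oy; py++) {
        for (int px = 0; px < ox; px++) {
            float acc = 0.0f;
            for (int ky = 0; ky < p; ky++) {
                for (int kx = 0; kx < p; kx++) {
                    acc += frame[(py * p + ky) * nx + (px * p + kx)] * kernel[ky * p + kx];
                }
            }
            out[py * ox + px] = acc;
        }
    }
    return out;
}

// Sum of two patch convolutions: kernel k0 on the first frame, k1 on the
// second. Passing the same frame twice corresponds to the still-image branch.
std::vector<float> patch_embed_pair(const std::vector<float> & frame0,
                                    const std::vector<float> & frame1,
                                    int nx, int ny,
                                    const std::vector<float> & k0,
                                    const std::vector<float> & k1, int p) {
    std::vector<float> out = patch_conv(frame0, nx, ny, k0, p);
    const std::vector<float> o1 = patch_conv(frame1, nx, ny, k1, p);
    for (size_t i = 0; i < out.size(); i++) {
        out[i] += o1[i];
    }
    return out;
}
```

Because the two results are added element-wise, the output grid has the same size for a frame pair as for a single image.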

ngxson · 2026-04-13T14:46:01Z

+        //
+        // for 6-channel video input, same layout but with 6 planar channels
+
+        for (int b = 0; b < batch_size; b++) {
+            const int cur_nx = imgs.entries[b]->nx;
+            const int cur_ny = imgs.entries[b]->ny;
+            const int cur_n  = cur_nx * cur_ny;
+
+            float * batch_entry = inp_raw.data() + b * (n_channels * cur_n);
+            for (int y = 0; y < cur_ny; y++) {
+                for (int x = 0; x < cur_nx; x++) {
+                    size_t base_src = n_channels * (y * cur_nx + x);
+                    size_t base_dst =              y * cur_nx + x;
+                    for (int c = 0; c < n_channels; c++) {
+                        batch_entry[c * cur_n + base_dst] = imgs.entries[b]->buf[base_src + c];
                    }


probably better to enter multiple images via the batch dimension, rather than using 6 channels
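
One way the batch-dimension layout could look, as a rough sketch only: each frame keeps 3 planar channels and frames are stacked back to back, rather than being fused into a 6-channel entry. `pack_frames_as_batch` is a hypothetical helper, and it assumes every frame is interleaved RGB with identical dimensions.

```cpp
#include <cstddef>
#include <vector>

// Convert interleaved RGB frames (RGBRGB...) into a planar buffer laid out as
// [frame][channel][y][x]. Every frame is assumed to be nx * ny pixels.
std::vector<float> pack_frames_as_batch(const std::vector<std::vector<float>> & frames,
                                        int nx, int ny) {
    const size_t n_channels = 3;
    const size_t n_px = static_cast<size_t>(nx) * ny;
    std::vector<float> dst(frames.size() * n_channels * n_px);
    for (size_t f = 0; f < frames.size(); f++) {
        // Each frame occupies its own slot along the batch dimension.
        float * entry = dst.data() + f * n_channels * n_px;
        for (size_t i = 0; i < n_px; i++) {
            for (size_t c = 0; c < n_channels; c++) {
                entry[c * n_px + i] = frames[f][n_channels * i + c];
            }
        }
    }
    return dst;
}
```

With this layout the per-entry copy loop is the same for still images and video; only the number of batch entries changes.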

ngxson · 2026-04-13T14:54:52Z

+    void set_position_mrope_3d(llama_pos pos_0, int nx, int ny, int nt, llama_seq_id seq_id) {
+        GGML_ASSERT(n_pos_per_embd == 4);
+        seq_id_0[0] = seq_id;
+        for (int t = 0; t < nt; t++) {
+            for (int y = 0; y < ny; y++) {
+                for (int x = 0; x < nx; x++) {
+                    int i = t * ny * nx + y * nx + x;
+                    pos[i                     ] = pos_0 + t;
+                    pos[i + batch.n_tokens    ] = pos_0 + y;
+                    pos[i + batch.n_tokens * 2] = pos_0 + x;
+                    pos[i + batch.n_tokens * 3] = 0;
+                }
+            }
+        }
+        for (int i = 0; i < batch.n_tokens; i++) {
+            batch.n_seq_id[i] = 1;
+            batch.seq_id  [i] = seq_id_0.data();
+            batch.logits  [i] = false;
+        }
+    }


note: to be replaced by #21851
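
For reference, the position layout written by the hunk above can be reproduced in isolation. This is a stand-alone sketch that mirrors the loop in the diff (four position planes of `n_tokens` entries each, the last one left at 0); `mrope_positions_3d` is a hypothetical name and says nothing about what #21851 does.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns 4 * n_tokens positions laid out plane by plane:
// [temporal | height | width | unused], with n_tokens = nt * ny * nx.
std::vector<int32_t> mrope_positions_3d(int32_t pos_0, int nx, int ny, int nt) {
    const size_t n_tokens = static_cast<size_t>(nt) * ny * nx;
    std::vector<int32_t> pos(4 * n_tokens, 0);
    for (int t = 0; t < nt; t++) {
        for (int y = 0; y < ny; y++) {
            for (int x = 0; x < nx; x++) {
                const size_t i = (static_cast<size_t>(t) * ny + y) * nx + x;
                pos[i               ] = pos_0 + t; // temporal plane
                pos[i + n_tokens    ] = pos_0 + y; // height plane
                pos[i + n_tokens * 2] = pos_0 + x; // width plane
                // fourth plane stays 0, as in the diff
            }
        }
    }
    return pos;
}
```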

ngxson · 2026-04-13T15:42:07Z

hmm ok, I cannot push to this PR because it was created from an org account

closing this and moving to a new one

feat: add video support for Qwen3.5

573f2cf

andrewmd5 requested a review from ngxson as a code owner March 8, 2026 02:25

github-actions Bot added the examples label Mar 8, 2026

ngxson self-assigned this Apr 5, 2026

ngxson reviewed Apr 13, 2026

ngxson closed this Apr 13, 2026

ngxson mentioned this pull request Apr 13, 2026

mtmd: support "frame merge" for qwen-vl-based models #21858

Merged

feat: add video support to mtmd #20224
andrewmd5 wants to merge 1 commit into ggml-org:master from 6over3:video-support

3 participants