
feat: add video support to mtmd#20224

Closed
andrewmd5 wants to merge 1 commit into ggml-org:master from 6over3:video-support

@andrewmd5

Contributor

As discussed in #18389 (comment), this PR adds video support to mtmd. The results below are from a scene-by-scene summary of a ~5 minute video, generated by Qwen3.5.

Image

@andrewmd5 andrewmd5 requested a review from ngxson as a code owner March 8, 2026 02:25
@andrewmd5

Contributor Author

Upon further testing, I noticed that #20284 seems to have substantially degraded the performance of this implementation; I'll collect more data and see what's going on.

@ngxson ngxson self-assigned this Apr 5, 2026
@libin049

libin049 commented Apr 9, 2026

Great! Are there any test examples available? Is it ready to be tested now?

@libin049

I tried this branch but encountered some issues. Did I miss any steps? Here are the commands I used:

./build/bin/llama-mtmd-cli -m ./Qwen2.5-Omni-3B.gguf --mmproj ./mmproj-Qwen2.5-Omni-3B.gguf \
  --image ./draw.mp4 --temp 0 -n 128 --flash-attn on -p "what is this?"

The draw.mp4 file is from:
https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4?spm=5176.28103460.0.0.4d216308swiAIN&file=draw.mp4

Error output:

WARN: This is an experimental CLI for testing multimodal capability.
      For normal use cases, please use the standard llama-cli
mtmd_helper_bitmap_init_from_buf: failed to decode image bytes

@ngxson ngxson left a comment

Collaborator

I'm opening some comments for discussion. I'm taking over this PR and will push commits directly here, so please do not push new commits to avoid conflicts.

Comment thread tools/mtmd/mtmd.cpp
Comment on lines +39 to +41
uint32_t nt = 1; // number of temporal positions (1 for images, > 1 for video)
bool use_mrope_pos = false; // use M-RoPE position counting (the whole image is 1 temporal position)
- uint32_t n_tokens() const { return nx * ny; }
+ uint32_t n_tokens() const { return nt * nx * ny; }

Collaborator

This seems to be wrong for Qwen: it merges 2 frames into one output, so the output token count should stay nx * ny.
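
To illustrate the point about merged frames, here is a minimal sketch of how the token count could be computed when every group of consecutive frames collapses into one output grid. The helper name `video_n_tokens` and the `temporal_merge` parameter are hypothetical and not part of mtmd; padding an odd trailing frame into a full group is also an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an mtmd function): number of output tokens for a
// clip of n_frames frames, each yielding an nx * ny token grid, when the
// encoder merges every `temporal_merge` consecutive frames into one output.
// Assumption: an incomplete trailing group is padded and still counts as one.
inline uint32_t video_n_tokens(uint32_t nx, uint32_t ny, uint32_t n_frames, uint32_t temporal_merge) {
    assert(n_frames > 0 && temporal_merge > 0);
    // ceil(n_frames / temporal_merge) without floating point
    const uint32_t n_groups = (n_frames + temporal_merge - 1) / temporal_merge;
    return n_groups * nx * ny;
}
```

With `temporal_merge = 2`, two frames give the same count as a single image (`nx * ny`), which is the behavior the comment describes.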

Comment on lines +25 to +37
if (is_video) {
    const size_t nb1 = ggml_row_size(inp_raw->type, img.nx);
    const size_t nb2 = nb1 * img.ny;
    ggml_tensor * inp_even = ggml_view_3d(ctx0, inp_raw, img.nx, img.ny, 3, nb1, nb2, 0);
    ggml_tensor * inp_odd = ggml_view_3d(ctx0, inp_raw, img.nx, img.ny, 3, nb1, nb2, nb2 * 3);
    inp = ggml_add(ctx0,
        ggml_conv_2d(ctx0, model.patch_embeddings_0, inp_even, patch_size, patch_size, 0, 0, 1, 1),
        ggml_conv_2d(ctx0, model.patch_embeddings_1, inp_odd, patch_size, patch_size, 0, 0, 1, 1));
} else {
    inp = ggml_add(ctx0,
        ggml_conv_2d(ctx0, model.patch_embeddings_0, inp_raw, patch_size, patch_size, 0, 0, 1, 1),
        ggml_conv_2d(ctx0, model.patch_embeddings_1, inp_raw, patch_size, patch_size, 0, 0, 1, 1));
}

@ngxson ngxson Apr 13, 2026

Collaborator

If I read this correctly, the number of output tokens stays unchanged whether we input a single image or 2 frames.
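
A rough way to check this: both branches add two convolution outputs elementwise, so the result has the shape of a single convolution output. The sketch below applies the standard convolution output-size formula with kernel and stride equal to the patch size, no padding, and dilation 1, matching the arguments in the snippet above. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Standard output-size formula for one spatial dimension of a convolution.
inline int64_t conv_out_size(int64_t in, int64_t kernel, int64_t stride, int64_t pad, int64_t dilation) {
    assert(in > 0 && kernel > 0 && stride > 0 && dilation > 0 && pad >= 0);
    return (in + 2 * pad - dilation * (kernel - 1) - 1) / stride + 1;
}

// Hypothetical helper: patches per frame when kernel == stride == patch_size,
// padding is 0 and dilation is 1. Adding a second conv output of the same
// shape (the odd frame) does not change this count.
inline int64_t n_patches(int64_t nx, int64_t ny, int64_t patch_size) {
    return conv_out_size(nx, patch_size, patch_size, 0, 1) *
           conv_out_size(ny, patch_size, patch_size, 0, 1);
}
```

For example, a 224 x 224 input with a patch size of 14 gives a 16 x 16 grid, for one frame or for an even/odd pair.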

Comment thread tools/mtmd/clip.cpp
Comment on lines +3623 to 3638
//
// for 6-channel video input, same layout but with 6 planar channels

for (int b = 0; b < batch_size; b++) {
    const int cur_nx = imgs.entries[b]->nx;
    const int cur_ny = imgs.entries[b]->ny;
    const int cur_n = cur_nx * cur_ny;

    float * batch_entry = inp_raw.data() + b * (n_channels * cur_n);
    for (int y = 0; y < cur_ny; y++) {
        for (int x = 0; x < cur_nx; x++) {
            size_t base_src = n_channels * (y * cur_nx + x);
            size_t base_dst = y * cur_nx + x;
            for (int c = 0; c < n_channels; c++) {
                batch_entry[c * cur_n + base_dst] = imgs.entries[b]->buf[base_src + c];
            }

Collaborator

Probably better to pass multiple images via the batch dimension rather than using 6 channels.
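
For context, the loop under review converts interleaved pixel data into a planar layout. The following standalone sketch shows that conversion with `std::vector` in place of the clip types; `interleaved_to_planar` is a hypothetical name, not a clip.cpp function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone helper: convert an interleaved buffer indexed as
// [y][x][c] into a planar buffer indexed as [c][y][x].
inline std::vector<float> interleaved_to_planar(const std::vector<float> & src, int nx, int ny, int n_channels) {
    assert(nx > 0 && ny > 0 && n_channels > 0);
    const std::size_t n_pixels = static_cast<std::size_t>(nx) * static_cast<std::size_t>(ny);
    const std::size_t n_ch     = static_cast<std::size_t>(n_channels);
    assert(src.size() == n_pixels * n_ch);

    std::vector<float> dst(src.size());
    for (std::size_t p = 0; p < n_pixels; p++) {
        for (std::size_t c = 0; c < n_ch; c++) {
            // channel plane c starts at offset c * n_pixels
            dst[c * n_pixels + p] = src[p * n_ch + c];
        }
    }
    return dst;
}
```

Under the batch-dimension approach suggested above, each frame would presumably go through this conversion separately with `n_channels = 3`, instead of two frames being packed into one 6-channel entry.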

Comment on lines +178 to +197
void set_position_mrope_3d(llama_pos pos_0, int nx, int ny, int nt, llama_seq_id seq_id) {
    GGML_ASSERT(n_pos_per_embd == 4);
    seq_id_0[0] = seq_id;
    for (int t = 0; t < nt; t++) {
        for (int y = 0; y < ny; y++) {
            for (int x = 0; x < nx; x++) {
                int i = t * ny * nx + y * nx + x;
                pos[i                     ] = pos_0 + t;
                pos[i + batch.n_tokens    ] = pos_0 + y;
                pos[i + batch.n_tokens * 2] = pos_0 + x;
                pos[i + batch.n_tokens * 3] = 0;
            }
        }
    }
    for (int i = 0; i < batch.n_tokens; i++) {
        batch.n_seq_id[i] = 1;
        batch.seq_id  [i] = seq_id_0.data();
        batch.logits  [i] = false;
    }
}

Collaborator

note: to be replaced by #21851
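
For readers following the position layout in the snippet above, here is a self-contained sketch of the same indexing without the batch types. It assumes the token count equals `nt * ny * nx` and uses `int32_t` in place of `llama_pos`; the function name `mrope_positions_3d` is hypothetical, and this says nothing about what #21851 will do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical standalone version of the layout above. Returns 4 * n_tokens
// positions stored as four consecutive sections:
//   [temporal | height | width | unused (zeros)]
inline std::vector<int32_t> mrope_positions_3d(int32_t pos_0, int nx, int ny, int nt) {
    assert(nx > 0 && ny > 0 && nt > 0);
    const std::size_t n_tokens = static_cast<std::size_t>(nt) * ny * nx;
    std::vector<int32_t> pos(4 * n_tokens, 0);
    for (int t = 0; t < nt; t++) {
        for (int y = 0; y < ny; y++) {
            for (int x = 0; x < nx; x++) {
                const std::size_t i = (static_cast<std::size_t>(t) * ny + y) * nx + x;
                pos[i               ] = pos_0 + t; // temporal section
                pos[i + n_tokens    ] = pos_0 + y; // height section
                pos[i + n_tokens * 2] = pos_0 + x; // width section
                // fourth section stays 0
            }
        }
    }
    return pos;
}
```

For `pos_0 = 10`, `nx = 2`, `ny = 1`, `nt = 2`, the temporal section is `10, 10, 11, 11` and the width section is `10, 11, 10, 11`.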

@ngxson

ngxson commented Apr 13, 2026

Collaborator

hmm ok I cannot push to this PR because it's created from an org account

closing this and moving to a new one

3 participants