feat: add video support to mtmd #20224
On further testing, I noticed that #20284 seems to have substantially degraded the performance of this implementation. I'll collect more data and see what's going on.
Great! Are there any test examples available? Is it ready to be tested now?
I tried this branch but encountered some issues. Did I miss any steps? Here are the commands I used:
./build/bin/llama-mtmd-cli -m ./Qwen2.5-Omni-3B.gguf --mmproj ./mmproj-Qwen2.5-Omni-3B.gguf \
--image ./draw.mp4 --temp 0 -n 128 --flash-attn on -p "what is this?"
The draw.mp4 file is from:
Error output:
ngxson
left a comment
I'm opening some comments for discussion. I'm taking over this PR and will push commits directly here, so please do not push new commits to avoid conflicts.
| uint32_t nt = 1; // number of temporal positions (1 for images, > 1 for video) | ||
| bool use_mrope_pos = false; // use M-RoPE position counting (the whole image is 1 temporal position) | ||
| - uint32_t n_tokens() const { return nx * ny; } | ||
| + uint32_t n_tokens() const { return nt * nx * ny; } |
This seems to be wrong for Qwen. It merges 2 frames into one output, so the output token count should stay nx * ny.
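To make the token-count point concrete, here is a minimal sketch of the arithmetic. `n_output_tokens` and `temporal_merge` are hypothetical names, not part of mtmd, and rounding up for an unpaired trailing frame is an assumption (the encoder could equally pad or drop it).

```cpp
#include <cstdint>

// Hypothetical helper (not an mtmd function): number of output tokens when the
// encoder merges `temporal_merge` consecutive frames into one temporal position.
// nx, ny describe the per-frame output token grid.
static uint32_t n_output_tokens(uint32_t n_frames, uint32_t nx, uint32_t ny,
                                uint32_t temporal_merge = 2) {
    // Assumption: an unpaired trailing frame still occupies one temporal slot.
    const uint32_t nt = (n_frames + temporal_merge - 1) / temporal_merge;
    return nt * nx * ny;
}
```

With a merge factor of 2, one image and a pair of frames both give nx * ny tokens, which is the behaviour described in the comment above.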
| if (is_video) { | ||
| const size_t nb1 = ggml_row_size(inp_raw->type, img.nx); | ||
| const size_t nb2 = nb1 * img.ny; | ||
| ggml_tensor * inp_even = ggml_view_3d(ctx0, inp_raw, img.nx, img.ny, 3, nb1, nb2, 0); | ||
| ggml_tensor * inp_odd = ggml_view_3d(ctx0, inp_raw, img.nx, img.ny, 3, nb1, nb2, nb2 * 3); | ||
| inp = ggml_add(ctx0, | ||
| ggml_conv_2d(ctx0, model.patch_embeddings_0, inp_even, patch_size, patch_size, 0, 0, 1, 1), | ||
| ggml_conv_2d(ctx0, model.patch_embeddings_1, inp_odd, patch_size, patch_size, 0, 0, 1, 1)); | ||
| } else { | ||
| inp = ggml_add(ctx0, | ||
| ggml_conv_2d(ctx0, model.patch_embeddings_0, inp_raw, patch_size, patch_size, 0, 0, 1, 1), | ||
| ggml_conv_2d(ctx0, model.patch_embeddings_1, inp_raw, patch_size, patch_size, 0, 0, 1, 1)); | ||
| } |
If I read this correctly, the number of output tokens stays unchanged whether we input a single image or 2 frames.
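A toy illustration of why the count does not change: summing two patch embeddings produces one value per patch, regardless of whether both inputs are the same image or two different frames. This is a single-channel, CPU-only stand-in for the `ggml_conv_2d` calls above; `patch_embed` and `embed_pair` are hypothetical names used only for this sketch.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a conv with stride == kernel size, one input channel,
// one output channel. `img` is row-major nx * ny, `kernel` is patch * patch.
static std::vector<float> patch_embed(const std::vector<float> & img, int nx, int ny,
                                      const std::vector<float> & kernel, int patch) {
    const int ox = nx / patch;
    const int oy = ny / patch;
    std::vector<float> out((size_t) ox * oy, 0.0f);
    for (int py = 0; py < oy; py++) {
        for (int px = 0; px < ox; px++) {
            float acc = 0.0f;
            for (int ky = 0; ky < patch; ky++) {
                for (int kx = 0; kx < patch; kx++) {
                    acc += img[(size_t) (py * patch + ky) * nx + (px * patch + kx)]
                         * kernel[(size_t) ky * patch + kx];
                }
            }
            out[(size_t) py * ox + px] = acc;
        }
    }
    return out;
}

// Embed two frames with two kernels and add the results element-wise.
// Passing the same image as both frames mirrors the single-image branch.
static std::vector<float> embed_pair(const std::vector<float> & f0, const std::vector<float> & f1,
                                     int nx, int ny,
                                     const std::vector<float> & k0, const std::vector<float> & k1,
                                     int patch) {
    std::vector<float> a = patch_embed(f0, nx, ny, k0, patch);
    const std::vector<float> b = patch_embed(f1, nx, ny, k1, patch);
    for (size_t i = 0; i < a.size(); i++) {
        a[i] += b[i];
    }
    return a;
}
```

For a 4x4 input with a patch size of 2, both `embed_pair(img, img, ...)` and `embed_pair(frame0, frame1, ...)` return 4 values.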
| // | ||
| // for 6-channel video input, same layout but with 6 planar channels | ||
| for (int b = 0; b < batch_size; b++) { | ||
| const int cur_nx = imgs.entries[b]->nx; | ||
| const int cur_ny = imgs.entries[b]->ny; | ||
| const int cur_n = cur_nx * cur_ny; | ||
| float * batch_entry = inp_raw.data() + b * (n_channels * cur_n); | ||
| for (int y = 0; y < cur_ny; y++) { | ||
| for (int x = 0; x < cur_nx; x++) { | ||
| size_t base_src = n_channels * (y * cur_nx + x); | ||
| size_t base_dst = y * cur_nx + x; | ||
| for (int c = 0; c < n_channels; c++) { | ||
| batch_entry[c * cur_n + base_dst] = imgs.entries[b]->buf[base_src + c]; | ||
| } |
probably better to enter multiple images via the batch dimension, rather than using 6 channels
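One way to read this suggestion, as a rough sketch: keep every frame as an ordinary 3-channel image and stack frames along the batch dimension. The `frame_rgb` struct and `pack_frames_planar` function are hypothetical and not part of clip or mtmd; the sketch assumes all frames share the same size and store interleaved RGB floats.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical frame type for illustration: interleaved RGB, row-major.
struct frame_rgb {
    int nx;
    int ny;
    std::vector<float> buf; // size nx * ny * 3
};

// Pack frames into a planar buffer laid out as [frame][channel][y][x].
// Assumption: every frame has the same nx and ny as the first one.
static std::vector<float> pack_frames_planar(const std::vector<frame_rgb> & frames) {
    std::vector<float> out;
    if (frames.empty()) {
        return out;
    }
    const size_t n = (size_t) frames[0].nx * frames[0].ny;
    out.resize(frames.size() * 3 * n);
    for (size_t f = 0; f < frames.size(); f++) {
        float * dst = out.data() + f * 3 * n; // start of this frame's planes
        for (size_t i = 0; i < n; i++) {
            for (int c = 0; c < 3; c++) {
                dst[c * n + i] = frames[f].buf[3 * i + c];
            }
        }
    }
    return out;
}
```

For two frames this yields the same memory layout as six planar channels (frame 0 RGB followed by frame 1 RGB), but the frame count lives in the batch dimension instead of being encoded in the channel count.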
| void set_position_mrope_3d(llama_pos pos_0, int nx, int ny, int nt, llama_seq_id seq_id) { | ||
| GGML_ASSERT(n_pos_per_embd == 4); | ||
| seq_id_0[0] = seq_id; | ||
| for (int t = 0; t < nt; t++) { | ||
| for (int y = 0; y < ny; y++) { | ||
| for (int x = 0; x < nx; x++) { | ||
| int i = t * ny * nx + y * nx + x; | ||
| pos[i ] = pos_0 + t; | ||
| pos[i + batch.n_tokens ] = pos_0 + y; | ||
| pos[i + batch.n_tokens * 2] = pos_0 + x; | ||
| pos[i + batch.n_tokens * 3] = 0; | ||
| } | ||
| } | ||
| } | ||
| for (int i = 0; i < batch.n_tokens; i++) { | ||
| batch.n_seq_id[i] = 1; | ||
| batch.seq_id [i] = seq_id_0.data(); | ||
| batch.logits [i] = false; | ||
| } | ||
| } |
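The position layout written by `set_position_mrope_3d` can be reproduced in a standalone sketch, which may make it easier to check by hand. `pos_t` is a stand-in for `llama_pos`, `mrope_positions_3d` is a hypothetical name, and the sketch assumes the batch holds exactly nt * nx * ny tokens, as the indexing in the diff implies.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using pos_t = int32_t; // stand-in for llama_pos in this sketch

// Returns 4 * n_tokens positions laid out as four consecutive sections:
// [temporal | height | width | unused], each of length n_tokens.
static std::vector<pos_t> mrope_positions_3d(pos_t pos_0, int nx, int ny, int nt) {
    const int n_tokens = nt * nx * ny;
    std::vector<pos_t> pos((size_t) n_tokens * 4, 0);
    for (int t = 0; t < nt; t++) {
        for (int y = 0; y < ny; y++) {
            for (int x = 0; x < nx; x++) {
                const int i = t * ny * nx + y * nx + x;
                pos[i               ] = pos_0 + t; // temporal section
                pos[i + n_tokens    ] = pos_0 + y; // height section
                pos[i + n_tokens * 2] = pos_0 + x; // width section
                // fourth section stays 0
            }
        }
    }
    return pos;
}
```

For example, with pos_0 = 10 and a 2x2 grid over 2 temporal positions, the last token (t = 1, y = 1, x = 1) gets 11 in each of the first three sections and 0 in the fourth.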
Hmm, OK, I cannot push to this PR because it was created from an org account. Closing this and moving to a new one.
As discussed in #18389 (comment), this PR adds video support to mtmd. The results below are from a scene-by-scene summary of a ~5 minute video, generated with Qwen3.5.