mtmd: refactor image preprocessing by ngxson · Pull Request #21031 · ggml-org/llama.cpp

ngxson · 2026-03-26T15:08:09Z

Overview

Refactor clip_image_preprocess into dedicated classes that inherit from mtmd_image_preprocessor, separating the preprocessing logic from clip.cpp
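
For readers unfamiliar with the pattern, here is a rough sketch of what "dedicated classes inheriting from a common preprocessor base" can look like. Only the name mtmd_image_preprocessor comes from this PR. The method signature, the image_u8 struct, the two subclasses, the nearest-neighbor resize, and the make_preprocessor factory are all assumptions made for illustration and do not mirror the real code in llama.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Simplified stand-in for an RGB image (assumption, not the real llama.cpp type).
struct image_u8 {
    int nx = 0;
    int ny = 0;
    std::vector<uint8_t> buf; // nx * ny * 3 bytes, row-major
};

// Nearest-neighbor resize, used here only to keep the sketch self-contained.
static image_u8 resize_nearest(const image_u8 & src, int nx, int ny) {
    assert(src.nx > 0 && src.ny > 0 && nx > 0 && ny > 0);
    image_u8 dst;
    dst.nx = nx;
    dst.ny = ny;
    dst.buf.resize(static_cast<size_t>(nx) * ny * 3);
    for (int y = 0; y < ny; ++y) {
        const int sy = y * src.ny / ny;
        for (int x = 0; x < nx; ++x) {
            const int sx = x * src.nx / nx;
            for (int c = 0; c < 3; ++c) {
                dst.buf[(static_cast<size_t>(y) * nx + x) * 3 + c] =
                    src.buf[(static_cast<size_t>(sy) * src.nx + sx) * 3 + c];
            }
        }
    }
    return dst;
}

// Base class name taken from the PR; the interface below is assumed.
struct mtmd_image_preprocessor {
    virtual ~mtmd_image_preprocessor() = default;
    virtual std::vector<image_u8> preprocess(const image_u8 & img) const = 0;
};

// Hypothetical subclass: always resizes to a fixed square.
struct sketch_fixed_size : mtmd_image_preprocessor {
    int target;
    explicit sketch_fixed_size(int t) : target(t) { assert(t > 0); }
    std::vector<image_u8> preprocess(const image_u8 & img) const override {
        return { resize_nearest(img, target, target) };
    }
};

// Hypothetical subclass: rounds each side up to a multiple of the patch size.
struct sketch_patch_aligned : mtmd_image_preprocessor {
    int patch;
    explicit sketch_patch_aligned(int p) : patch(p) { assert(p > 0); }
    std::vector<image_u8> preprocess(const image_u8 & img) const override {
        const int nx = (img.nx + patch - 1) / patch * patch;
        const int ny = (img.ny + patch - 1) / patch * patch;
        return { resize_nearest(img, nx, ny) };
    }
};

// Hypothetical factory: callers pick a strategy once and stay model-agnostic.
static std::unique_ptr<mtmd_image_preprocessor> make_preprocessor(bool aligned, int value) {
    if (aligned) {
        return std::make_unique<sketch_patch_aligned>(value);
    }
    return std::make_unique<sketch_fixed_size>(value);
}
```

The point of the design is that model-specific resizing rules live in small subclasses, so the caller only holds a pointer to the base class and never branches on the model type.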

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: No

ngxson · 2026-03-26T15:10:38Z

Test results (Llama 4 failed due to OOM; no idea why, but I guess no one really uses Llama 4)

[vision] OK:   ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   ggml-org/LFM2-VL-450M-GGUF:Q8_0
[vision] OK:   ggml-org/granite-docling-258M-GGUF:Q8_0
[vision] OK:   ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0
[vision] OK:   ggml-org/DeepSeek-OCR-GGUF:Q8_0
[audio]  OK:   ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[audio]  OK:   ggml-org/LFM2-Audio-1.5B-GGUF:Q8_0
[vision] OK:   ggml-org/pixtral-12b-GGUF:Q4_K_M
[vision] OK:   ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
[vision] OK:   ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen3-VL-2B-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[vision] OK:   ggml-org/GLM-4.6V-Flash-GGUF:Q4_K_M
[audio]  OK:   ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M
[audio]  OK:   ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M
[vision] FAIL: ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S

ngxson · 2026-03-26T17:41:15Z

@ggerganov this PR is mostly dedup / moving code around, could we get this merged soon? Thanks!

ngxson · 2026-03-26T17:43:19Z

-        } else if (proj == PROJECTOR_TYPE_ULTRAVOX) {
-            // [BEGIN_AUDIO] ... (embeddings) ...
-            aud_beg = "[BEGIN_AUDIO]";


Note that there was a mistake on master: [BEGIN_AUDIO] is used by Voxtral, not Ultravox. This PR fixes it.
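
A minimal sketch of the corrected mapping, for illustration only. The enum below is a simplified stand-in with just two values, audio_begin_marker is a hypothetical helper that does not exist in the codebase, and returning an empty marker for Ultravox is an assumption based solely on the comment above.

```cpp
#include <string>

// Simplified stand-in for the projector type enum (only two values shown).
enum projector_type {
    PROJECTOR_TYPE_ULTRAVOX,
    PROJECTOR_TYPE_VOXTRAL,
};

// Hypothetical helper: returns the text marker placed before audio embeddings.
static std::string audio_begin_marker(projector_type proj) {
    switch (proj) {
        case PROJECTOR_TYPE_VOXTRAL:
            // [BEGIN_AUDIO] ... (embeddings) ...
            return "[BEGIN_AUDIO]";
        case PROJECTOR_TYPE_ULTRAVOX:
            // Assumed: no begin marker for this projector.
            return "";
    }
    return "";
}
```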

The field image_resize_algo was introduced in upstream PR ggml-org#21031 (mtmd refactor) which we haven't merged yet. Our preprocessing pipeline uses explicit img_tool::resize() calls with direct algo parameters, so this field is not needed for Gemma4V support in our branch. Made-with: Cursor

* mtmd: refactor image pre-processing
* correct some places
* correct lfm2
* fix deepseek-ocr on server
* add comment to clarify about mtmd_image_preprocessor_dyn_size

ngxson added 2 commits March 26, 2026 15:50

mtmd: refactor image pre-processing

906b46e

correct some places

f48a34a

ngxson requested a review from a team as a code owner March 26, 2026 15:08

ngxson requested review from CISC and ggerganov March 26, 2026 15:08

correct lfm2

ed67376

github-actions Bot added the examples label Mar 26, 2026

fix deepseek-ocr on server

eae3930

ngxson commented Mar 26, 2026

add comment to clarify about mtmd_image_preprocessor_dyn_size

d58d979

CISC approved these changes Mar 26, 2026

ggerganov approved these changes Mar 26, 2026

ngxson merged commit a73bbd5 into ggml-org:master Mar 26, 2026
44 of 45 checks passed

googol4u mentioned this pull request May 7, 2026

Misc. bug: Server: Qwen-VL vision accuracy degraded in fine-grained OCR tasks since b8545 (persists in latest), while llama-cli remains unaffected #22785

Open

ngxson merged 5 commits into ggml-org:master from ngxson:xsn/mtmd_refactor_image_preproc
