mtmd: refactor image preprocessing #21031

Merged
ngxson merged 5 commits into ggml-org:master from ngxson:xsn/mtmd_refactor_image_preproc
Mar 26, 2026
ngxson commented Mar 26, 2026

Overview

Refactor clip_image_preprocess into dedicated classes that inherit from mtmd_image_preprocessor, separating the preprocessing logic from clip.cpp.
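
For readers unfamiliar with the pattern, here is a minimal sketch of the shape such a refactor could take. It is not the code from this PR: every name in the `sketch` namespace, the family strings, and the numbers 224 and 14 are assumptions made up for illustration. Only the idea of one base class with per-model subclasses comes from the description above.

```cpp
#include <cassert>
#include <memory>
#include <string>

namespace sketch {

// Hypothetical image size type; not the real clip/mtmd struct.
struct image_size {
    int width;
    int height;
};

// Abstract base, standing in for the role mtmd_image_preprocessor plays.
struct preprocessor {
    virtual ~preprocessor() = default;
    // Returns the size an input image would be resized to before encoding.
    virtual image_size target_size(image_size input) const = 0;
};

// Always resizes to one fixed square resolution.
struct fixed_size_preprocessor : preprocessor {
    explicit fixed_size_preprocessor(int side) : side_(side) { assert(side > 0); }
    image_size target_size(image_size) const override { return {side_, side_}; }
private:
    int side_;
};

// Keeps the input size, rounding each dimension up to a multiple of the patch size.
struct dyn_size_preprocessor : preprocessor {
    explicit dyn_size_preprocessor(int patch) : patch_(patch) { assert(patch > 0); }
    image_size target_size(image_size in) const override {
        return {round_up(in.width), round_up(in.height)};
    }
private:
    int round_up(int v) const { return (v + patch_ - 1) / patch_ * patch_; }
    int patch_;
};

// Factory: per-model dispatch happens once here (assumed family names and sizes),
// instead of inside one large preprocessing function.
inline std::unique_ptr<preprocessor> make_preprocessor(const std::string & family) {
    if (family == "dynamic") {
        return std::make_unique<dyn_size_preprocessor>(14);
    }
    return std::make_unique<fixed_size_preprocessor>(224);
}

} // namespace sketch
```

With this layout, a caller would hold a `std::unique_ptr<sketch::preprocessor>` and call `target_size` without knowing which model family is behind it, which is the kind of separation the description is aiming for.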

ngxson requested a review from a team as a code owner March 26, 2026 15:08
ngxson requested review from CISC and ggerganov March 26, 2026 15:08
ngxson commented Mar 26, 2026

Test results (Llama 4 failed due to OOM; no idea why, but I guess no one really uses Llama 4):

[vision] OK:   ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   ggml-org/LFM2-VL-450M-GGUF:Q8_0
[vision] OK:   ggml-org/granite-docling-258M-GGUF:Q8_0
[vision] OK:   ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0
[vision] OK:   ggml-org/DeepSeek-OCR-GGUF:Q8_0
[audio]  OK:   ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[audio]  OK:   ggml-org/LFM2-Audio-1.5B-GGUF:Q8_0
[vision] OK:   ggml-org/pixtral-12b-GGUF:Q4_K_M
[vision] OK:   ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
[vision] OK:   ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen3-VL-2B-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[vision] OK:   ggml-org/GLM-4.6V-Flash-GGUF:Q4_K_M
[audio]  OK:   ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M
[audio]  OK:   ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M
[vision] FAIL: ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S

ngxson commented Mar 26, 2026

@ggerganov this PR is mostly dedup / moving code around, could we get this merged soon? Thanks!

Comment thread tools/mtmd/mtmd.cpp
Comment on lines -372 to -374
} else if (proj == PROJECTOR_TYPE_ULTRAVOX) {
// [BEGIN_AUDIO] ... (embeddings) ...
aud_beg = "[BEGIN_AUDIO]";
Note that there was a mistake on master: [BEGIN_AUDIO] is used by Voxtral, not Ultravox. It's fixed in this PR.
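
To make the fix concrete, a small sketch of the corrected selection is given below. The `sketch::projector` enum and `audio_begin_marker` function are hypothetical stand-ins, not the identifiers used in mtmd.cpp, and returning an empty marker for Ultravox is an assumption: the comment above only establishes that [BEGIN_AUDIO] belongs to Voxtral.

```cpp
#include <string>

namespace sketch {

// Hypothetical stand-in for the projector type enum; only two values are shown.
enum class projector { ultravox, voxtral };

// Returns the text marker placed before the audio embeddings.
// Per the review comment, only Voxtral uses "[BEGIN_AUDIO]".
inline std::string audio_begin_marker(projector p) {
    switch (p) {
        case projector::voxtral:
            return "[BEGIN_AUDIO]";
        case projector::ultravox:
            return ""; // assumption: no begin marker for Ultravox in this sketch
    }
    return "";
}

} // namespace sketch
```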

ngxson merged commit a73bbd5 into ggml-org:master Mar 26, 2026
44 of 45 checks passed
Vect0rM added a commit to AtomicBot-ai/atomic-llama-cpp-turboquant that referenced this pull request Apr 2, 2026
The field image_resize_algo was introduced in upstream PR ggml-org#21031 (mtmd
refactor) which we haven't merged yet. Our preprocessing pipeline uses
explicit img_tool::resize() calls with direct algo parameters, so this
field is not needed for Gemma4V support in our branch.

Made-with: Cursor
slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 12, 2026
* mtmd: refactor image pre-processing

* correct some places

* correct lfm2

* fix deepseek-ocr on server

* add comment to clarify about mtmd_image_preprocessor_dyn_size