mtmd: refactor preprocessing + support max/min pixels #16878
Conversation
All tests passed
Which command and image should be used for this test?
@ggerganov you can use the command here: #16825. The script for overlaying bboxes (for visual inspection) is also included: https://gist.github.com/ngxson/039024fb2bdaf2e3c15db702f9fddaff
In my Qwen3-VL-8b testing, it seems this branch decreases the accuracy: #16842 (comment)
tools/mtmd/clip.cpp (outdated)
```cpp
const clip_image_size scaled_size = img_tool::calc_size_preserved_ratio(
    original_size,
    1, // no need to align here since we will composite onto canvas
    std::min(canvas.nx, canvas.ny)); // fit into the canvas
```
Should this be:

```diff
-    std::min(canvas.nx, canvas.ny)); // fit into the canvas
+    std::max(canvas.nx, canvas.ny)); // fit into the canvas
```
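For context, here is a simplified illustration of the trade-off, assuming the third argument acts as a cap on the longest output side (a stand-in sketch, not the actual `img_tool::calc_size_preserved_ratio` implementation): capping by `min(canvas)` always fits inside the canvas, while capping by `max(canvas)` can overflow the shorter canvas dimension for the opposite orientation.

```python
# Simplified stand-in: scale the image so its longest side does not exceed
# `longest_side`, preserving the aspect ratio (never upscale).
def fit_longest_side(w, h, longest_side):
    scale = min(longest_side / w, longest_side / h, 1.0)
    return round(w * scale), round(h * scale)

canvas_w, canvas_h = 1024, 768

# Cap by min(canvas): guaranteed to fit, but may underuse the larger dimension.
print(fit_longest_side(2000, 500, min(canvas_w, canvas_h)))  # (768, 192)

# Cap by max(canvas): a tall image can end up taller than the canvas.
print(fit_longest_side(500, 2000, max(canvas_w, canvas_h)))  # (256, 1024) > 768 high
```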
Is this missing a factor of 2?

```diff
diff --git a/tools/mtmd/clip.cpp b/tools/mtmd/clip.cpp
index 355b42435..62e64040e 100644
--- a/tools/mtmd/clip.cpp
+++ b/tools/mtmd/clip.cpp
@@ -4268,7 +4268,7 @@ int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 *
     const auto & params = ctx->model.hparams;
     const int n_total = clip_n_output_tokens(ctx, img);
     if (ctx->proj_type() == PROJECTOR_TYPE_QWEN2VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN25VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN3VL) {
-        return img->nx / (params.patch_size * 2) + (int)(img->nx % params.patch_size > 0);
+        return img->nx / (params.patch_size * 2) + (int)(img->nx % (params.patch_size * 2) > 0);
     }
     return n_total;
 }
@@ -4276,7 +4276,7 @@ int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 *
 int clip_n_output_tokens_y(const struct clip_ctx * ctx, struct clip_image_f32 * img) {
     const auto & params = ctx->model.hparams;
     if (ctx->proj_type() == PROJECTOR_TYPE_QWEN2VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN25VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN3VL) {
-        return img->ny / (params.patch_size * 2) + (int)(img->ny % params.patch_size > 0);
+        return img->ny / (params.patch_size * 2) + (int)(img->ny % (params.patch_size * 2) > 0);
     }
     return 1;
 }
```
The code above doesn't have that issue. The token count should also be correct:

```python
def align_next_multiple(value, multiple):
    return ((value + multiple - 1) // multiple) * multiple

w = 320
h = 240
patch_size = 16

n_tok_x = align_next_multiple(w, patch_size*2) // (patch_size*2)
n_tok_y = align_next_multiple(h, patch_size*2) // (patch_size*2)
print(f"n_tok_x: {n_tok_x}, n_tok_y: {n_tok_y}, n_tok: {n_tok_x * n_tok_y}")
# print: n_tok_x: 10, n_tok_y: 8, n_tok: 80
```
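As a side note, a quick check (illustrative only, not from the PR) of where the two modulus variants in the diff above diverge, assuming `patch_size = 16`: they only differ when the dimension is a multiple of `patch_size` but not of `patch_size * 2`.

```python
patch_size = 16

def n_tok_x_old(nx):
    # remainder checked against patch_size (the expression before the suggested change)
    return nx // (patch_size * 2) + int(nx % patch_size > 0)

def n_tok_x_new(nx):
    # remainder checked against patch_size * 2 (the suggested change)
    return nx // (patch_size * 2) + int(nx % (patch_size * 2) > 0)

for nx in (320, 336, 350):
    print(nx, n_tok_x_old(nx), n_tok_x_new(nx))
# 320: 10 10  - multiple of 32, both agree
# 336: 10 11  - multiple of 16 but not of 32: the old form misses the partial 2x2 merge
# 350: 11 11  - not a multiple of 16, both round up
```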
@tarruda the last commit should give a pretty much accurate result even with small input images, please give it a try
Looking better now @ngxson:
Comparing with my previous run, it is a bit worse in this invoice example because the "Quantity" column boxes seem a bit off to the right, but it is better on another example I have that is more complex than this invoice (can't share it because it is a customer's screenshot, but it is basically an AS400 terminal with a lot more text). Here's a computer desktop example:
And a nature picture:
Overall the bounding boxes on this branch look solid, but I can't say for sure that this is better or worse than what's on master because it had similar performance on the few examples I tested.
I ran the test again and it seems like all models passed. This PR should now be ready. Could you do a review @ggerganov? For the custom
I've been testing quite a bit with the Qwen3 VL models and I think they work OK. I am not sure how to test the Qwen2.5 VL models since your script does not support absolute coordinates.
Based on the discussions and tests in #16842, here are a few recommendations about mtmd that I think would be nice to do in the future:
- The default number of minimum tokens for Qwen3 VL should be larger in order to work correctly with grounding tasks. Having 80 tokens vs 1024 image tokens is almost negligible for modern hardware (especially with #16837), so I don't think it's worth sparing tokens. There could be an option to change that (for example for edge/mobile use cases), but this should not be the default IMO
- `libmtmd` could use some more detailed logging and debugging options in order to be able to trace what happens with the images (preprocessing, patch sizes, number of tokens, etc.)
- The OCR tests seem really useful - we should have a version-controlled script in `examples` that can work both with absolute and relative coordinates (a rough sketch of the coordinate handling follows below)
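As a rough idea of what handling "both absolute and relative coordinates" could look like in such a script (a hypothetical helper, not existing llama.cpp or gist code; the 0-1000 normalization and the resized-image convention are assumptions based on how Qwen-style models typically report boxes):

```python
# Hypothetical helper: map a model-reported bounding box back to pixel
# coordinates of the original image. Supports two conventions:
#  - "relative": coordinates normalized to a 0-1000 grid
#  - "absolute": coordinates in pixels of the preprocessed (resized) image
def bbox_to_pixels(bbox, img_w, img_h, mode, proc_w=None, proc_h=None):
    x1, y1, x2, y2 = bbox
    if mode == "relative":
        return (x1 * img_w / 1000, y1 * img_h / 1000,
                x2 * img_w / 1000, y2 * img_h / 1000)
    if mode == "absolute":
        # scale from the preprocessed image size back to the original size
        sx, sy = img_w / proc_w, img_h / proc_h
        return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)
    raise ValueError(f"unknown mode: {mode}")

# e.g. a relative box on a 1920x1080 screenshot
print(bbox_to_pixels((100, 200, 500, 800), 1920, 1080, "relative"))
# (192.0, 216.0, 960.0, 864.0)
```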
Agree, I think some kind of "cookbook" like what Qwen has could serve as both a test script + a demo
I reported an extreme slowdown after this PR was merged into main: #17012. I understand this PR was meant to give users the freedom to choose the max size/tokens for processed images. That said, something else was done that impacts performance significantly: when I match the old max size, processing is multiple times slower. See my comment here. Wondering if this is Vulkan backend specific. Note that all builds prior to b6915 don't have this behavior.
* mtmd: refactor preprocessing + support max/min pixels
* fix mlp type
* implement mix/max pixels
* improve hparams
* better image preproc for qwen
* fix
* fix out of bound composite
* fix (2)
* fix token calculation
* get_merge_kernel_size()
* fix llama4 and lfm2
* gonna fix them all
* use simple resize for qwen
* qwen: increase min tokens
* no resize if dst size == src size
* restore to initial min/max tokens value for qwen




Fix #16842 Fix #13077
Implement "smart resize", aka
max_pixels/min_pixelsfor preprocessing. This will now allow us to limit the input by number of tokens, no more hard limit on image height/width. This is particularly useful if you have an image which have large height but small width.The preprocessing of QwenVL model is also improved. We now only pad the bottom-right corner of the image, so the x/y coordinate of elements stay unchanged. This should improve the accuracy of bbox application.
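As a rough sketch of what "smart resize" means conceptually (in the spirit of Qwen's preprocessing, not the exact clip.cpp code; the pixel budgets below are illustrative placeholders, not the PR's actual defaults): keep the aspect ratio, round each side to the patch/merge alignment, and rescale so the total pixel count, and hence the token count, stays within a [min, max] budget.

```python
import math

# Sketch of "smart resize": preserve the aspect ratio, align each side to a
# multiple of `factor` (patch size x merge size), and rescale so the total
# pixel count stays within [min_pixels, max_pixels]. Budgets are placeholders.
def smart_resize(w, h, factor=32, min_pixels=32*32*4, max_pixels=32*32*1024):
    def round_to(v, f):
        return max(f, round(v / f) * f)

    new_w, new_h = round_to(w, factor), round_to(h, factor)
    if new_w * new_h > max_pixels:
        scale = math.sqrt(max_pixels / (w * h))
        new_w = math.floor(w * scale / factor) * factor
        new_h = math.floor(h * scale / factor) * factor
    elif new_w * new_h < min_pixels:
        scale = math.sqrt(min_pixels / (w * h))
        new_w = math.ceil(w * scale / factor) * factor
        new_h = math.ceil(h * scale / factor) * factor
    return new_w, new_h

# A very tall, narrow image is limited by total pixel/token budget instead of
# a hard cap on each dimension:
print(smart_resize(512, 8192))  # (256, 4096): scaled down until w*h <= max_pixels
print(smart_resize(30, 40))     # (64, 96): scaled up toward min_pixels
```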