
mtmd: add mtmd_image_tokens_get_decoder_pos() API #21851

Merged: ngxson merged 3 commits into ggml-org:master from ngxson:xsn/mtmd_get_decoder_pos_api on Apr 14, 2026



ngxson commented Apr 13, 2026

Collaborator

Overview

  • Add a new mtmd API: mtmd_image_tokens_get_decoder_pos()
  • Deprecate mtmd_image_tokens_get_nx() and mtmd_image_tokens_get_ny()

Additional information

Targets support for #21045.

That PR proposes a new API, mtmd_image_tokens_get_n_prefix, which is not very flexible. mtmd_image_tokens_get_decoder_pos addresses this by providing a (possibly non-linear) t, x, y position for each token.
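For contrast, the nx/ny accessors being deprecated only describe a uniform grid. Below is a minimal C sketch of the per-token positions a caller could rebuild from nx and ny alone, assuming row-major token order and one shared temporal position. `decoder_pos` and `fill_grid_positions` are hypothetical names for illustration, not the actual mtmd.h types or signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-token position record. The t/x/y fields follow the PR
 * description; this struct is NOT the real mtmd.h type. */
typedef struct {
    int t; /* temporal position */
    int x; /* column */
    int y; /* row */
} decoder_pos;

/* Fill `out` (length nx * ny) with positions for a plain image grid:
 * every token shares the temporal position t0 and gets its own (x, y).
 * Row-major token order is an assumption made for this sketch. */
static void fill_grid_positions(int nx, int ny, int t0, decoder_pos *out) {
    assert(nx > 0 && ny > 0 && out != NULL);
    for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
            decoder_pos *p = &out[(size_t)y * nx + x];
            p->t = t0;
            p->x = x;
            p->y = y;
        }
    }
}
```

Calling `fill_grid_positions(3, 2, 7, pos)` gives t=7 for all six tokens, with (x, y) running (0,0), (1,0), (2,0), (0,1), (1,1), (2,1). A layout where some tokens advance t instead of sharing it cannot be described by nx and ny alone, which is the gap a per-token position API fills.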

For falcon-ocr, mtmd_image_tokens_get_decoder_pos returns a linearly increasing temporal position t with x=0, y=0 for the first few text tokens, then switches to spatial positions for the image tokens:

  • <|image_cls|>: t=0,x=0,y=0
  • <|image_reg_1|>: t=1,x=0,y=0
  • ...
  • <|image_reg_4|>: t=4,x=0,y=0
  • image patch 0: t=5,x=0,y=0
  • image patch 1: t=5,x=1,y=0
  • ...
  • image patch N (the last one): t=5,x=nx-1,y=ny-1
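The falcon-ocr layout in the list above can be sketched as a small C helper. This is a hypothetical illustration, not the mtmd implementation: `decoder_pos`, `fill_falcon_ocr_positions` and the `n_lead` parameter are made-up names. It assumes the leading tokens are <|image_cls|> plus the register tokens (so `n_lead` = 5 for the four registers shown) and that patches are ordered row by row, as patch 0 and patch 1 suggest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-token position record (not the real mtmd.h type). */
typedef struct {
    int t;
    int x;
    int y;
} decoder_pos;

/* Hypothetical builder for the falcon-ocr example: the n_lead leading tokens
 * (<|image_cls|> plus register tokens) get t = 0, 1, ..., n_lead - 1 with
 * x = y = 0; every image patch then shares t = n_lead and gets its own (x, y)
 * in row-major order (an assumption). `out` must hold n_lead + nx * ny entries.
 * Returns the number of positions written. */
static size_t fill_falcon_ocr_positions(int n_lead, int nx, int ny, decoder_pos *out) {
    assert(n_lead >= 0 && nx > 0 && ny > 0 && out != NULL);
    size_t i = 0;
    /* linear temporal positions for the leading tokens */
    for (int t = 0; t < n_lead; t++) {
        out[i].t = t;
        out[i].x = 0;
        out[i].y = 0;
        i++;
    }
    /* spatial positions for the image patches, all at the same t */
    for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
            out[i].t = n_lead;
            out[i].x = x;
            out[i].y = y;
            i++;
        }
    }
    return i;
}
```

With `n_lead = 5`, `nx = 3` and `ny = 2`, it writes 11 positions: t = 0..4 for the leading tokens, then t = 5 for all six patches, ending at x=2, y=1.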

Requirements

ngxson requested review from a team and ggerganov as code owners April 13, 2026 13:19

ngxson commented Apr 13, 2026

Collaborator Author

Tests seem OK:

[vision] OK:   ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   ggml-org/LFM2-VL-450M-GGUF:Q8_0
[vision] OK:   ggml-org/granite-docling-258M-GGUF:Q8_0
[vision] OK:   ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0
[vision] OK:   ggml-org/DeepSeek-OCR-GGUF:Q8_0
[vision] OK:   ggml-org/dots.ocr-GGUF:Q8_0
[vision] OK:   ggml-org/HunyuanOCR-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-4-E2B-it-GGUF:Q8_0
[audio]  OK:   ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[audio]  OK:   ggml-org/LFM2-Audio-1.5B-GGUF:Q8_0
[audio]  OK:   ggml-org/gemma-4-E2B-it-GGUF:Q8_0

ngxson requested a review from a team April 13, 2026 13:40
github-actions bot added the testing and examples labels Apr 13, 2026

ngxson commented Apr 14, 2026

Collaborator Author

pinging @ggml-org/maintainers for approval

ngxson merged commit 707c0b7 into ggml-org:master Apr 14, 2026
41 of 47 checks passed
mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
* mtmd: add mtmd_image_tokens_get_decoder_pos() API

* consistent naming

* fix build
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
* mtmd: add mtmd_image_tokens_get_decoder_pos() API

* consistent naming

* fix build
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026
* mtmd: add mtmd_image_tokens_get_decoder_pos() API

* consistent naming

* fix build
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
* mtmd: add mtmd_image_tokens_get_decoder_pos() API

* consistent naming

* fix build
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
* mtmd: add mtmd_image_tokens_get_decoder_pos() API

* consistent naming

* fix build
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
* mtmd: add mtmd_image_tokens_get_decoder_pos() API

* consistent naming

* fix build
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
* mtmd: add mtmd_image_tokens_get_decoder_pos() API

* consistent naming

* fix build

Labels

examples, testing


3 participants