mtmd, server: add "placeholder bitmap" for counting tokens, add */input_tokens API #23913
ngxson wants to merge 16 commits into
not quite sure why the server linux CI fails (same on master branch), but I ran the test locally on my mac and it passes 100%
Looks like the server process is getting killed with SIGILL (Illegal operation),
Yep, this is a
gentle ping @ggml-org/llama-server , I kinda need this to continue with other fixes
@ggml-org/maintainers could someone give approval(s) for this PR please 🙏 need this one to unblock #21858. I've already tested it (tested with vision + audio input), plus unit tests are added so I guess that should ok the
Overview
Preprocessing multimodal input is much more CPU-intensive than tokenizing text, because it runs on a single thread. This is wasteful if the user just wants to count the number of tokens occupied by an image/audio chunk, without actually using the underlying data.
This PR allows creating a "placeholder" bitmap that only contains the dimensions; no data buffer is allocated. Preprocessing ops (i.e. image manipulation) skip processing it.
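To illustrate the idea, here is a minimal, self-contained sketch. It is not the code from this PR: `sketch_bitmap`, `make_bitmap`, `make_placeholder`, `preprocess` and `count_tokens` are hypothetical names, and the "one token per patch" formula is only an assumption chosen to keep the example small. The point it tries to show is that the token count can depend on the dimensions alone, so a placeholder with no pixel buffer gives the same count as a real bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bitmap; NOT the real mtmd bitmap type.
struct sketch_bitmap {
    uint32_t nx = 0;                  // width in pixels
    uint32_t ny = 0;                  // height in pixels
    std::vector<unsigned char> data;  // RGB bytes; left empty for a placeholder
    bool is_placeholder = false;
};

// Regular bitmap: allocates nx * ny * 3 bytes (RGB), zero-filled.
static sketch_bitmap make_bitmap(uint32_t nx, uint32_t ny) {
    sketch_bitmap b;
    b.nx = nx;
    b.ny = ny;
    b.data.assign(static_cast<size_t>(nx) * ny * 3, 0);
    return b;
}

// Placeholder bitmap: records the dimensions only, allocates no pixel buffer.
static sketch_bitmap make_placeholder(uint32_t nx, uint32_t ny) {
    sketch_bitmap b;
    b.nx = nx;
    b.ny = ny;
    b.is_placeholder = true;
    return b;
}

// Hypothetical preprocessing op (inverts every byte).
// Returns true if pixel work was done, false if it was skipped.
static bool preprocess(sketch_bitmap & b) {
    if (b.is_placeholder) {
        return false; // nothing to touch: there is no data buffer
    }
    for (auto & px : b.data) {
        px = static_cast<unsigned char>(255 - px);
    }
    return true;
}

// Assumed token formula: one token per (patch x patch) tile, rounded up.
// Only the dimensions are read, never the pixel data.
static size_t count_tokens(const sketch_bitmap & b, uint32_t patch) {
    assert(patch > 0);
    const size_t tx = (static_cast<size_t>(b.nx) + patch - 1) / patch;
    const size_t ty = (static_cast<size_t>(b.ny) + patch - 1) / patch;
    return tx * ty;
}
```

With these assumptions, `count_tokens(make_placeholder(64, 32), 16)` and `count_tokens(make_bitmap(64, 32), 16)` return the same value, while only the second call allocates a pixel buffer and only the real bitmap does any work in `preprocess`.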
New server APIs are also added to demonstrate this (they support counting both tool input tokens and multimodal input tokens):
- /v1/chat/completions/input_tokens
- /v1/responses/input_tokens
In next PRs:
- process_mtmd_prompt is being used for counting tokens
Requirements