Conversation
llama-cpp-2/src/mtmd.rs
Outdated
```rust
        Self {
            use_gpu: false,
            print_timings: true,
            n_threads: 4,
            media_marker: CString::new(mtmd_default_marker()).unwrap_or_default(),
        }
    }
```
this seems like it should rely on mtmd_context_params_default
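A minimal sketch of what that could look like, assuming `llama_cpp_sys_2` exposes the raw `mtmd_context_params_default()` binding from `mtmd.h` and that the wrapper's fields are the ones shown above:

```rust
use std::ffi::CStr;

impl Default for MtmdContextParams {
    fn default() -> Self {
        // Pull llama.cpp's own defaults instead of hard-coding them here.
        let raw = unsafe { llama_cpp_sys_2::mtmd_context_params_default() };
        Self {
            use_gpu: raw.use_gpu,
            print_timings: raw.print_timings,
            n_threads: raw.n_threads,
            // media_marker points at a static C string in llama.cpp; copy it.
            media_marker: unsafe { CStr::from_ptr(raw.media_marker) }.to_owned(),
        }
    }
}
```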
llama-cpp-2/src/mtmd.rs
Outdated
```rust
context.use_gpu = params.use_gpu;
context.print_timings = params.print_timings;
context.n_threads = params.n_threads;
context.media_marker = params.media_marker.as_ptr();
```
Destructure params so you get a compiler error here if new fields are added, ensuring we update this impl.
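A sketch of the destructuring pattern being asked for, assuming `params` is a `&MtmdContextParams` with the fields used above:

```rust
// Exhaustive destructuring: adding a field to MtmdContextParams now breaks
// this `let` at compile time until the new field is forwarded below.
let MtmdContextParams {
    use_gpu,
    print_timings,
    n_threads,
    ref media_marker,
} = *params;
context.use_gpu = use_gpu;
context.print_timings = print_timings;
context.n_threads = n_threads;
context.media_marker = media_marker.as_ptr();
```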
llama-cpp-2/src/mtmd.rs
Outdated
```rust
if context.is_null() {
    return Err(MtmdInitError::NullResult);
}
```
llama-cpp-2/src/mtmd.rs
Outdated
```rust
    bitmaps: &[&MtmdBitmap],
) -> Result<MtmdInputChunks, MtmdTokenizeError> {
    let chunks = MtmdInputChunks::new();
    let text_cstring = CString::new(text.text).unwrap_or_default();
```
Silently eating the error here seems like it could be improved on. I would return an Err if MtmdInputText can contain invalid C strings, or panic if containing valid C strings is an invariant of MtmdInputText (and document the invariant if not already done).
Thanks, that's a good point. I added an explicit MtmdTokenizeError::CStringError now
d025465
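A sketch of that propagation, assuming the new variant wraps `std::ffi::NulError`:

```rust
// Surface interior NUL bytes to the caller instead of swallowing them.
let text_cstring = CString::new(text.text).map_err(MtmdTokenizeError::CStringError)?;
```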
llama-cpp-2/src/mtmd.rs
Outdated
```rust
unsafe impl Send for MtmdContext {}
unsafe impl Sync for MtmdContext {}
```
These impls need some sort of comment on why this is safe.
Good catch, I actually meant MtmdContextParams, and that is already Send + Sync. AFAICS mtmd_context is not thread safe. Removed in d025465
llama-cpp-2/src/mtmd.rs
Outdated
```rust
unsafe { CStr::from_ptr(ptr) }
    .to_string_lossy()
    .into_owned()
    .into()
```
nit: `into_owned` + `into` makes it somewhat opaque what this chain does. A named variable or explicit types would be nice.
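One way to address the nit, with the intermediate types spelled out (assuming the surrounding function returns `Option<String>`):

```rust
// SAFETY: assumes ptr is a valid, non-null, NUL-terminated C string.
let c_str: &CStr = unsafe { CStr::from_ptr(ptr) };
// Lossily convert to UTF-8 and take ownership of the resulting String.
let marker: String = c_str.to_string_lossy().into_owned();
Some(marker)
```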
```rust
if chunk_ptr.is_null() {
    None
} else {
    // Note: We don't own this chunk, it's owned by the chunks collection
```
should this return a reference then?
MtmdInputChunk is currently just a container for a llama.cpp mtmd_input_chunk pointer. We do not separately keep track of the chunks, but just have a pointer to the mtmd_input_chunks.
Therefore we get the pointer to the llama.cpp chunk here and wrap it in a MtmdInputChunk. The owned field is just there because there is also a way to create an input chunk manually and then encode that one. In this case the underlying mtmd_input_chunk should be dropped when the container is dropped.
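A rough sketch of the ownership scheme described here (field names hypothetical, and assuming the sys crate exposes `mtmd_input_chunk_free` from mtmd.h):

```rust
use std::ptr::NonNull;

// `owned` is only true for chunks created manually; chunks borrowed from an
// mtmd_input_chunks collection must not be freed by the wrapper.
pub struct MtmdInputChunk {
    chunk: NonNull<llama_cpp_sys_2::mtmd_input_chunk>,
    owned: bool,
}

impl Drop for MtmdInputChunk {
    fn drop(&mut self) {
        if self.owned {
            unsafe { llama_cpp_sys_2::mtmd_input_chunk_free(self.chunk.as_ptr()) };
        }
    }
}
```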
MarcusDunn left a comment
This looks good code-wise. Thanks for the really well documented and thought out PR.
I left some comments that should be addressed, but nothing major.
My main concern is that maintaining this will prevent updates to the main llama.h bindings as I expect this to break often as llama.cpp moves forward.
If ALL this code (bindings, wrappers, addition to our stub header file, etc.) could be behind the feature flag, such that I can update the library and break multi-modal support provided I do not use the feature, that would be great.
I will merge this provided that breaks to mtmd would not stop updates to the submodule.
Again, thanks for the PR - this is great.
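In Rust terms, the ask boils down to gating every mtmd item behind the `mtmd` feature the PR already introduces, along the lines of:

```rust
// lib.rs: nothing mtmd-related is even compiled unless the feature is on,
// so breakage in libmtmd cannot block updating the main llama.h bindings.
#[cfg(feature = "mtmd")]
pub mod mtmd;
```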
Hi, not sure if that helps, but I tried running this with Qwen2.5-Omni-3B. To get the files:

```bash
wget https://huggingface.co/Mungert/Qwen2.5-Omni-3B-GGUF/resolve/main/Qwen2.5-Omni-3B-q4_k_m.gguf
wget https://huggingface.co/Mungert/Qwen2.5-Omni-3B-GGUF/resolve/main/Qwen2.5-Omni-3B-q8_0.mmproj
```
Thanks for reporting, @haixuanTao, I have indeed only tried it with Gemma models. I will take a look.
No worries. I think that text only runs on the llama-cpp-python version.
* Remove unsafe Send, Sync
* Cleanup error handling
* Use default mtmd_context directly
The issue was that the default context length was too short and I did not pass the arg to the context as well. Let me know if there are any other issues, @haixuanTao
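A sketch of that fix; llama-cpp-2's `LlamaContextParams::with_n_ctx` takes an `Option<NonZeroU32>`, and `args.n_ctx` here stands in for a hypothetical CLI argument:

```rust
use std::num::NonZeroU32;

// Forward the user-supplied context length to the llama context as well,
// instead of leaving the (too short) default in place.
let ctx_params = LlamaContextParams::default()
    .with_n_ctx(NonZeroU32::new(args.n_ctx)); // args.n_ctx: hypothetical u32 arg
```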
@MarcusDunn, thanks for the thorough review. I think I addressed all your comments in d025465 and f149f11. Please let me know if there are still mtmd dependencies that I might have missed. I assume that the include is fine?
The include is no worries. I'll take a look this weekend.
```rust
    /// Text input chunk
    Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT as isize,
    /// Image input chunk
    Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE as isize,
    /// Audio input chunk
    Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO as isize,
}
```
When the enum uses a u32 representation, the cast is no longer required: #819
Thanks, good point
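That is, roughly (assuming bindgen emits the chunk-type constants as u32):

```rust
/// Type of an input chunk, mirroring mtmd_input_chunk_type.
#[repr(u32)]
pub enum MtmdInputChunkType {
    Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT,
    Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE,
    Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO,
}
```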
```rust
    llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT => MtmdInputChunkType::Text,
    llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE => MtmdInputChunkType::Image,
    llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO => MtmdInputChunkType::Audio,
    _ => panic!("Unknown MTMD input chunk type"),
```
include the chunk_type in the panic message.
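For example (assuming the match scrutinee is named `chunk_type`):

```rust
// Include the offending value so the panic is actionable.
_ => panic!("Unknown MTMD input chunk type: {chunk_type}"),
```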
```rust
    /// Get audio bitrate in Hz (e.g., 16000 for Whisper).
    /// Returns -1 if audio is not supported.
    #[must_use]
    pub fn get_audio_bitrate(&self) -> i32 {
        unsafe { llama_cpp_sys_2::mtmd_get_audio_bitrate(self.context.as_ptr()) }
    }
```
I would prefer if we returned an Option<u32> (or Option<i32>) for this and dealt with the -1 case, to keep it a bit more Rusty.
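A sketch of the suggested signature, mapping the C-style -1 sentinel to None:

```rust
/// Get audio bitrate in Hz (e.g., 16000 for Whisper).
/// Returns None if audio is not supported.
#[must_use]
pub fn get_audio_bitrate(&self) -> Option<u32> {
    let hz = unsafe { llama_cpp_sys_2::mtmd_get_audio_bitrate(self.context.as_ptr()) };
    u32::try_from(hz).ok() // -1 (audio unsupported) becomes None
}
```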
Can confirm the model mentioned above is now working wonders! Thanks a lot!
More of an optimisation thing, but are we using the maximum number of batches for encoding/decoding images and audio?

```rust
self.n_past = chunks.eval_chunks(&self.mtmd_ctx, context, 0, 0, 1, true)?; // batch size of image decoding
```

It seems that this could be higher than 1, but I'm not an expert on this. Related: ggml-org/llama.cpp#14527. If you search through the issue, he's actually decoding at 64 n_tokens_batch.
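For illustration, the same call with the batch size raised (64 mirrors the value from the linked issue; whether eval_chunks accepts it unchanged is an assumption):

```rust
// Same call as above, but with n_batch = 64 instead of 1 for image decoding.
self.n_past = chunks.eval_chunks(&self.mtmd_ctx, context, 0, 0, 64, true)?;
```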
FYI, I tried to pass image+audio at the same time and it didn't work, as there was only one marker within the prompt :D Not an urgent bug.
I am able to batch by 10 using qwen-omni, reducing the decoding time from 120 ms to 50 ms, but I truly have no idea why. Most of the time when giving an image I get:

```
slice encoded in 326 ms
decoding image batch 1/3, n_tokens_batch = 20
image decoded (batch 1/3) in 52 ms
decoding image batch 2/3, n_tokens_batch = 20
image decoded (batch 2/3) in 53 ms
decoding image batch 3/3, n_tokens_batch = 20
image decoded (batch 3/3) in 53 ms
get_logits_ith: invalid logits id 0, reason: batch.logits[0] != true
```
This stage took a massive amount of time in Metal. Don't know if it works correctly in CUDA.
It's super fast in CUDA. But it's a llama.cpp thing apparently.
Related: ggml-org/llama.cpp#15426
Thanks. Experiencing exactly the same issue. Tried many things; I thought it was my fault. I've been working on it since yesterday. I tried running it on different threads, but no matter what I did, my optimization only improved it by 2-3 seconds at most. Also, I was doing all the image/video processing on the GPU side (using wgpu).
Have you tried lowering the resolution of the image?
I did it for my tests/debugging (like 64x32 etc.). This only causes a 1-2 second improvement and doesn't solve the problem. I also have a Mac M3 16GB; a single output takes 1 min on my side.
OK, I have an M4 Pro and it's about 30 sec, more or less. On a cloud 5090 with CUDA, the encoding is 120 ms.
Thanks, @haixuanTao, for spotting this. I fixed the issue in #819 and added a new CLI arg for it. It should now work with arbitrary batch sizes. Let me know if you still see the invalid logits error. Unfortunately I can't help with the Metal performance problems as I don't own a compatible device. It sounds like an issue in llama.cpp though.
Add bindings for the libmtmd multimodality support of llama.cpp. This allows using models like e.g. Gemma3 with image or audio input. The bindings are only created with a new `mtmd` feature. To explain the usage, I added a simple example loosely based on the `mtmd-cli.cpp`.

To run the mtmd example, you first need to download the model gguf file and the multimodal projection file, e.g. for Gemma3 you may use:

To then run the example on CPU, provide an image file `my_image.jpg` and run:

Closes #744