-
Hi @tc-mb, thanks for starting the discussion. It will take me some time to respond in a meaningful way. In the meantime, tagging @ngxson in case he has some thoughts or suggestions on how to best support this.
-
@ngxson, if you could take the time to review this discussion, I'd appreciate your feedback.
-
Also, I'll share some actual iPad deployment results (using a modified version of llama.cpp for inference).
-
Hey @tc-mb, to respond to your questions:
Yes, we already support audio+image input for some Omni models. Video support is a bit trickier, as each model seems to have a different strategy for interleaving image and audio tokens, so we haven't supported it yet. Help on this part would be appreciated. For the audio output capability, we're still working on it: models that generate audio from semantic/acoustic tokens (like the Mimi decoder) are trivial to add, but diffusion-based models (which generate a mel spectrogram) will be much more difficult. For multi-head output from the backbone, this actually requires a new API from
It can be something under You can include internal components from
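To make the video point concrete, here is a minimal sketch of what "each model interleaves image and audio tokens differently" means in practice. This is not the real mtmd API; the types, fields, and the one-image-then-one-audio-per-second ordering are all assumptions for illustration, not MiniCPM-o's actual scheme.

```cpp
#include <cstdio>
#include <vector>

enum class ChunkType { Text, Image, Audio };

struct InputChunk {
    ChunkType type;
    int       t_sec;   // timestamp of the media segment (hypothetical field)
};

// Hypothetical helper: build an interleaved chunk sequence for n_sec seconds
// of video. Another Omni model might group all frames first, or use a
// finer-grained pattern; the ordering below is an assumption.
static std::vector<InputChunk> interleave_video(int n_sec) {
    std::vector<InputChunk> chunks;
    for (int t = 0; t < n_sec; ++t) {
        chunks.push_back({ChunkType::Image, t}); // 1 frame sampled at second t
        chunks.push_back({ChunkType::Audio, t}); // 1 s of audio ending at t+1
    }
    return chunks;
}

int main() {
    for (const auto & c : interleave_video(3)) {
        std::printf("%s @ %ds\n",
                    c.type == ChunkType::Image ? "image" : "audio", c.t_sec);
    }
    return 0;
}
```

Supporting video generically would mean letting each model plug in its own version of this interleaving step rather than hard-coding one pattern.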
-
Hi, I'm a member of the MiniCPM-V & MiniCPM-o teams.
In January of this year, we launched the MiniCPM-o 2.6 model. I'd like to merge the Omni model's capabilities into llama.cpp. I've already implemented a version, and the code used for the iPad demo is based on that modified llama.cpp code. However, when I tried to merge it, I found that llama.cpp was undergoing a code refactor and the code logic had already diverged significantly, so I haven't submitted a complete Omni implementation yet.
I still hope to bring Omni capabilities to the open source community through llama.cpp. However, the Omni model is quite complex (the structure diagram is at the end of this post), which makes modifications difficult and prone to accuracy issues. I'd like to ask these questions in advance to minimize major structural changes during the Omni merge.
I noticed that mtmd in llama.cpp now includes image capabilities, and whisper support has also been added, so it already contains many of the basic components Omni requires (such as the CLIP encoder). However, Omni implements a "live streaming" capability, which is significantly different in terms of upper-level scheduling. In particular, mtmd currently uses the input_chunks model, which differs significantly from a streaming model in how work is scheduled, so submitting the code to mtmd in the usual way is extremely difficult. This is the main reason I'm asking these questions; a rough sketch of the difference follows.
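The sketch below only contrasts the two scheduling styles; every name in it is a placeholder I made up, not the actual mtmd or llama.cpp API, and a real port would call into mtmd/libllama instead of these stubs.

```cpp
#include <cstdio>
#include <vector>

struct Chunk { int id; };                  // stand-in for an mtmd-style input chunk

// --- stubs so the sketch compiles; all hypothetical ---
static std::vector<Chunk> tokenize_all()             { return {{0}, {1}, {2}}; }
static void eval_chunks(const std::vector<Chunk> & c){ std::printf("prefill %zu chunk(s)\n", c.size()); }
static Chunk encode_segment(int t)                   { return {t}; }
static void  decode_some_tokens()                    { std::printf("decode a few tokens\n"); }

// input_chunks model: the whole prompt is prepared up front, then evaluated.
static void run_offline() {
    eval_chunks(tokenize_all());           // one big prefill, then generation
}

// live-streaming model: segments (e.g. 1 s of video + audio) arrive over time,
// so prefill and decode have to be interleaved per segment.
static void run_streaming(int n_segments) {
    for (int t = 0; t < n_segments; ++t) {
        eval_chunks({encode_segment(t)});  // prefill only the newest segment
        decode_some_tokens();              // optionally emit a partial reply
    }
}

int main() {
    run_offline();
    run_streaming(3);
    return 0;
}
```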
I'd like to ask the following questions:
I look forward to your answers and appreciate your help. @ggerganov
MiniCPM-o 2.6 live streaming structure