-
Hi @tc-mb, thanks for starting the discussion. It will take me some time to respond in a meaningful way. In the meantime, tagging @ngxson in case he has some thoughts or suggestions on how to best support this.
-
@ngxson, if you could take the time to review this discussion, I'd appreciate your feedback.
-
Also, I'll share some actual iPad deployment results (using a modified version of llama.cpp for inference).
-
Hey @tc-mb, to respond to your questions:
Yes, we already support audio+image input for some Omni models. Video support is a bit trickier, as each model seems to have a different strategy for interleaving image and audio tokens, so we haven't supported it yet. Help on this part would be appreciated. For the audio output capability, we're still working on it: models that generate audio from semantic/acoustic tokens (like the Mimi decoder) are trivial to add, but diffusion-based models (which generate a mel spectrogram) will be much more difficult. For multi-head output from the backbone, this actually requires a new API from
It can be something under You can include internal components from
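To make the video point concrete, here is a minimal sketch of what "each model interleaves image and audio tokens differently" means in practice. This is not the real mtmd API; the types, fields, and the one-image-then-one-audio-per-second ordering are all assumptions for illustration, not MiniCPM-o's actual scheme.

```cpp
#include <cstdio>
#include <vector>

enum class ChunkType { Text, Image, Audio };

struct InputChunk {
    ChunkType type;
    int       t_sec;   // timestamp of the media segment (hypothetical field)
};

// Hypothetical helper: build an interleaved chunk sequence for n_sec seconds
// of video. Another Omni model might group all frames first, or use a
// finer-grained pattern; the ordering below is an assumption.
static std::vector<InputChunk> interleave_video(int n_sec) {
    std::vector<InputChunk> chunks;
    for (int t = 0; t < n_sec; ++t) {
        chunks.push_back({ChunkType::Image, t}); // 1 frame sampled at second t
        chunks.push_back({ChunkType::Audio, t}); // 1 s of audio ending at t+1
    }
    return chunks;
}

int main() {
    for (const auto & c : interleave_video(3)) {
        std::printf("%s @ %ds\n",
                    c.type == ChunkType::Image ? "image" : "audio", c.t_sec);
    }
    return 0;
}
```

Supporting video generically would mean letting each model plug in its own version of this interleaving step rather than hard-coding one pattern.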
-
Hi, I'm a member of the MiniCPM-V & MiniCPM-o teams.
In January of this year, we launched the MiniCPM-o 2.6 model. I'd like to merge the Omni model's capabilities into llama.cpp. I've already implemented a version, and the code used for the iPad demo is based on that modified llama.cpp code. However, when I tried to merge it, I found that llama.cpp was undergoing a code refactor and the code logic had already diverged significantly, so I haven't submitted a complete Omni implementation yet.
I still hope to bring Omni capabilities to the open source community through llama.cpp. However, the Omni model is quite complex (the structure diagram is at the end of this post), which makes modifications difficult and prone to accuracy issues. I'd like to ask these questions in advance to minimize major structural changes during the Omni merge.
I noticed that mtmd in llama.cpp now includes image capabilities, and whisper support has also been added, so it already contains many of the basic components Omni requires (such as the CLIP encoder). However, Omni implements a "live streaming" capability, which is significantly different in terms of upper-level scheduling. In particular, mtmd currently uses the input_chunks model, which differs significantly from a streaming model in how work is scheduled, so submitting the code to mtmd in the usual way is extremely difficult. This is the main reason I'm asking these questions; a rough sketch of the difference follows.
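The sketch below only contrasts the two scheduling styles; every name in it is a placeholder I made up, not the actual mtmd or llama.cpp API, and a real port would call into mtmd/libllama instead of these stubs.

```cpp
#include <cstdio>
#include <vector>

struct Chunk { int id; };                  // stand-in for an mtmd-style input chunk

// --- stubs so the sketch compiles; all hypothetical ---
static std::vector<Chunk> tokenize_all()             { return {{0}, {1}, {2}}; }
static void eval_chunks(const std::vector<Chunk> & c){ std::printf("prefill %zu chunk(s)\n", c.size()); }
static Chunk encode_segment(int t)                   { return {t}; }
static void  decode_some_tokens()                    { std::printf("decode a few tokens\n"); }

// input_chunks model: the whole prompt is prepared up front, then evaluated.
static void run_offline() {
    eval_chunks(tokenize_all());           // one big prefill, then generation
}

// live-streaming model: segments (e.g. 1 s of video + audio) arrive over time,
// so prefill and decode have to be interleaved per segment.
static void run_streaming(int n_segments) {
    for (int t = 0; t < n_segments; ++t) {
        eval_chunks({encode_segment(t)});  // prefill only the newest segment
        decode_some_tokens();              // optionally emit a partial reply
    }
}

int main() {
    run_offline();
    run_streaming(3);
    return 0;
}
```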
I'd like to ask the following questions:
I look forward to your answers and appreciate your help. @ggerganov
MiniCPM-o 2.6 live streaming structure