mtmd : add Nemotron 3 Nano Omni support (parakeet) #22520

Draft
danbev wants to merge 19 commits into ggml-org:master from danbev:nemotron-3-omni-mtmd-audio

@danbev danbev commented Apr 29, 2026

Member

Overview

This is a work in progress. It will not be merged until the whisper.cpp/parakeet.cpp PR has been merged. Working on both at once helps surface improvements and pain points that can feed back in both directions.

This commit adds support for the subsampling and encoder parts of the Nemotron 3 Nano Omni model.

Additional information

The Parakeet subsampling and encoder code was taken from parakeet.cpp, which is currently a pull request against whisper.cpp. I've tried to copy the code as closely as possible to hopefully enable easy patching between these two projects later.

Refs: ggml-org/whisper.cpp#3735


For testing, a converted model can be found here and run with the following command:

llama-mtmd-cli -hf danbev/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16-mtmd-GGUF --no-warmup --audio jfk.wav -p "Transcribe this audio clip, only the transcription and nothing else."

This commit adds support for the subsampling and encoder parts of the
Nemotron 3 Nano Omni model.

The Parakeet subsampling and encoder code was taken from parakeet.cpp,
which is currently a pull request against whisper.cpp. I've tried to copy
the code as closely as possible to hopefully enable easy patching between
these two projects later.

Refs: ggml-org/whisper.cpp#3735

@ngxson ngxson left a comment

Contributor

looks good, I'm leaving some early-review comments

Comment thread convert_hf_to_gguf.py Outdated
Comment thread tools/mtmd/mtmd-audio.cpp Outdated
@github-actions github-actions bot added the examples and python (python script changes) labels Apr 29, 2026
This commit removes the generation of the relative positional tensor in
the model conversion script and instead computes it in the encoder
graph. This is only done for the window of positions required for the
current audio sample.
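For reference, the idea in the commit message above can be sketched in plain C++. This assumes Conformer-style sinusoidal relative positions (as used in Transformer-XL); the PR itself builds the tensor inside the ggml encoder graph, which is not shown here. The function name `rel_pos_window`, its parameters, and the row layout are hypothetical and not taken from the PR.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): builds sinusoidal embeddings for the
// relative positions (n_frames - 1) down to -(n_frames - 1), i.e. only the
// 2 * n_frames - 1 rows needed for an input of n_frames encoder frames.
// The result is row-major with d_model columns per position.
std::vector<float> rel_pos_window(int n_frames, int d_model) {
    const int n_pos = 2 * n_frames - 1;
    std::vector<float> out((size_t) n_pos * d_model, 0.0f);
    for (int r = 0; r < n_pos; ++r) {
        // Row 0 holds the largest positive offset, the middle row holds offset 0.
        const float pos = (float) (n_frames - 1 - r);
        for (int i = 0; i < d_model; i += 2) {
            // Standard sinusoidal frequency: 10000^(-i / d_model).
            const float freq = std::exp(-std::log(10000.0f) * (float) i / (float) d_model);
            out[(size_t) r * d_model + i] = std::sin(pos * freq);
            if (i + 1 < d_model) {
                out[(size_t) r * d_model + i + 1] = std::cos(pos * freq);
            }
        }
    }
    return out;
}
```

Computing only this window keeps the table proportional to the clip length instead of storing a table for some maximum length in the GGUF file.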
Comment thread tools/mtmd/clip.h Outdated
Comment thread tools/mtmd/mtmd-audio.cpp Outdated
danbev added 2 commits April 30, 2026 14:50
This commit adds a function to get access to the clip_model. It also
removes the two functions clip_get_mel_filter_tensor and
clip_get_window_tensor(const struct clip_ctx * ctx); their callers can now
use clip_get_model to access the model tensors that they need.
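
A minimal sketch of the refactoring pattern described in this commit message: one accessor for the whole model replaces a dedicated getter per tensor. Every type and function name below (`demo_tensor`, `demo_model`, `demo_ctx`, `demo_get_model`) is a hypothetical stand-in; the real clip_ctx and clip_model have different members and the real clip_get_model signature may differ.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only.
struct demo_tensor {
    std::string        name;
    std::vector<float> data;
};

struct demo_model {
    demo_tensor mel_filters;
    demo_tensor window;
};

struct demo_ctx {
    demo_model model;
};

// Single accessor: callers pick the tensors they need from the returned model,
// so no per-tensor getter has to be added for each new audio preprocessor.
const demo_model & demo_get_model(const demo_ctx & ctx) {
    return ctx.model;
}
```

A caller that previously used a dedicated window getter would instead write something like `demo_get_model(ctx).window`.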

@ngxson ngxson left a comment

Contributor

looking good so far

Comment thread tools/mtmd/mtmd-audio.cpp Outdated
Comment thread tools/mtmd/clip.cpp Outdated
@danbev danbev self-assigned this Jun 2, 2026