Add parakeet-cpp: NVIDIA Parakeet ASR + Sortformer diarization in pure C++/ggml #11
…review)

Address @gianni-cor review on PR ggml-org#11: switch the bundled ggml filename prefix from `libparakeet-ggml-*` to `libspeech-ggml-*` so the QVAC speech stack (whisper, parakeet, chatterbox, supertonic, ...) can co-vendor a single ggml file set instead of each library shipping its own copy.

- parakeet-cpp/CMakeLists.txt: OUTPUT_NAME prefix `parakeet-` -> `speech-`, GGML_BACKEND_DL_PROJECT_PREFIX macro `"parakeet-"` -> `"speech-"`, option blurb + status message updated.
- parakeet-cpp/README.md, patches/README.md, scripts/setup-ggml.sh, patches/ggml-backend-reg-filename-prefix.patch: doc / comment / example updated to reference the new `speech-` prefix.

Verified: setup-ggml.sh re-applies all patches cleanly; CMake configure prints `bundled ggml libraries will be emitted as libspeech-ggml-*`; build emits libspeech-ggml{,-base,-cpu,-blas,-metal}.{0,0.9.11}.dylib; parakeet binary's otool -L now references `libspeech-ggml*` exclusively.

Co-authored-by: Cursor <cursoragent@cursor.com>
gianni-cor
left a comment
Big +1 on landing this. The follow-up commit c6c3fd7 switches the bundled-ggml prefix to libspeech-ggml-*, which is exactly what I asked for in the prior review and makes the speech stack (whisper, parakeet, chatterbox, supertonic, …) co-vendor a single ggml file set. Verified locally:
- `scripts/setup-ggml.sh` clones `ggml@58c38058`, applies all three patches cleanly, and is safe to re-run (it resets to pristine and re-applies, so it is functionally idempotent rather than a no-op on the second run, but you end at the same state).
- `cmake -S parakeet-cpp -B parakeet-cpp/build` configures cleanly and the top-level whisper.cpp build is untouched (the subtree is opt-in).
This is a high-quality, well-scoped contribution. The C++ surface is small and well-documented (include/parakeet/), the engine is PIMPL'd for ABI stability, the CTest harnesses auto-disable when fixtures are missing, and the three ggml patches all default to upstream-byte-equal behaviour when their triggers are off. Approving so the PR is unblocked. The notes below are mostly polish or footguns to address in a follow-up; none of them block landing this subtree.
Approving — must-fix-soonish (correctness/footguns)
- `parakeet-cpp/src/parakeet_log.cpp`: `g_user_data` is not atomic.

  ```cpp
  std::atomic<ggml_log_callback> g_callback{nullptr};
  void * g_user_data = nullptr;
  ```

  `log_set_callback` writes to `g_user_data` (non-atomic) and `g_callback` (atomic release). `log_impl` reads `g_callback` (acquire) and then `g_user_data`. A concurrent `log_set_callback` racing with a `log_impl` can deliver the new callback with the old `user_data` (or vice versa). Make it `std::atomic<void *>` or, simpler, pack `(cb, user_data)` into a single `shared_ptr<struct {…}>` and do an atomic swap.

- `parakeet-cpp/src/parakeet_engine.cpp`: `SortformerStreamSession` borrows the engine's `mel_state`.

  ```cpp
  DiarizationResult diar;
  {
      const float * win = ring.data() + off;
      diar = engine_impl_diarize_helper(*engine_impl, win, n, opts.sample_rate, diopts);
  }
  ```

  `engine_impl_diarize_helper` writes to `engine_impl->mel_state`. `StreamSession::Impl` already carries its own `MelState mel_state` (correctly), but `SortformerStreamSession::Impl` uses the engine's. Two `diarize_start`s on one Engine, even if you serialise their `feed_pcm_*` calls, would alias the mel scratch in surprising ways. Either move `MelState` onto `SortformerStreamSession::Impl` like the other path, or document that "one diarize stream session per Engine instance" is required (similar to `Engine::transcribe*`, which is documented as single-threaded per instance).

- `parakeet-cpp/src/parakeet_ctc.h`: `run_encoder(capture_intermediates=true)` is a default-on footgun.

  ```cpp
  int run_encoder(ParakeetCtcModel & model,
                  const float * mel, int n_mel_frames, int n_mels,
                  EncoderOutputs & out,
                  int max_layers = -1,
                  bool capture_intermediates = true);
  ```

  Every internal caller (engine.cpp, prewarm, all transcribe/stream/diarize paths, profile harnesses) passes `false`. The 5+ MB device→host copy the comment warns about ships only to new external callers who didn't read the doc. Consider flipping the default to `false`, or splitting into `run_encoder` (fast, only `encoder_out` + CTC `logits`) and `run_encoder_capture` (parity-test convenience). Today the only thing keeping the default at `true` is the per-stage parity harnesses, and they all pass the flag explicitly anyway.

- `parakeet-cpp/include/parakeet/export.h`: `PARAKEET_API` is empty under STATIC.

  ```cpp
  #pragma once
  ...
  #ifdef PARAKEET_SHARED
  ...
  #else
  # define PARAKEET_API
  #endif
  ```

  With the default static build, `PARAKEET_API` expands to nothing. The parakeet target is also `CXX_VISIBILITY_PRESET hidden`, so symbols compiled into `libparakeet.a` carry hidden visibility. That's fine for "static lib → executable", but consumers wrapping `libparakeet.a` into their own shared library (e.g. a Node addon for QVAC) cannot re-export the API surface even with their own `__attribute__((visibility("default")))` wrappers, because the inner symbol is already marked hidden. Either:

  - flip `PARAKEET_API` to `__attribute__((visibility("default")))` always (ELF) / nothing (static-link Windows), so static-lib symbols land at default visibility regardless of build mode, or
  - explicitly document in `export.h` that consumers wrapping the static lib in a shared object must compile parakeet with `-DPARAKEET_SHARED -DPARAKEET_BUILD` to get the right visibility.

- `parakeet-cpp/src/parakeet_engine.cpp`: divide-by-zero waiting to happen.

  ```cpp
  inline double encoder_frame_stride_ms(const ParakeetCtcModel & model) {
      const int hop = model.mel_cfg.hop_length;
      const int sub = model.encoder_cfg.subsampling_factor > 0 ? model.encoder_cfg.subsampling_factor : 8;
      const int sr = model.mel_cfg.sample_rate > 0 ? model.mel_cfg.sample_rate : 16000;
      return 1000.0 * (double) (hop * sub) / (double) sr;
  }
  ```

  `sub` and `sr` are fallback-defended; `hop` is not. If a future GGUF arrives with `hop_length` missing or 0 the function returns 0 ms and downstream code (`frames_per_window = chunk_ms / 0`, `frame_samples = round(sr * 0 / 1000)`) divides by zero. Same one-liner:

  ```cpp
  const int hop = model.mel_cfg.hop_length > 0 ? model.mel_cfg.hop_length : 160;
  ```
Approving — nice-to-fix polish
- `parakeet-cpp/src/parakeet_engine.cpp`: `~StreamSession()` `try { bool = true; } catch(...)`.

  ```cpp
  StreamSession::~StreamSession() {
      if (pimpl_ && !pimpl_->finalized && !pimpl_->cancelled) {
          try { pimpl_->cancelled = true; } catch (...) {}
      }
  }
  ```

  Assigning a bool can't throw, so the try/catch is dead code. Same pattern in `~SortformerStreamSession()`.

- Streaming detokenize is O(n²) per session.

  ```cpp
  const size_t prev_cumulative_len = result.text.size();
  result.token_ids.insert(result.token_ids.end(), win_tokens.begin(), win_tokens.end());
  result.text = detokenize(pimpl_->model.vocab, result.token_ids);
  const std::string win_text = result.text.substr(prev_cumulative_len);
  ```

  `result.text = detokenize(model.vocab, result.token_ids)` re-detokenises the whole cumulative token list each chunk. Same shape in `StreamSession::Impl::process_window` (line ~847). For typical 30s utterances this is fine; for hour-long live captioning sessions it adds a quadratic tail. Cheap fix: detokenise only `win_tokens` and append, mirroring how `cumulative_token_ids` is grown.

- `parakeet-cpp/src/main.cpp`: `#define setenv` pollutes the TU.

  ```cpp
  #ifdef _WIN32
  static int parakeet_setenv(const char * name, const char * value, int /*overwrite*/) {
      return _putenv_s(name, value);
  }
  #define setenv parakeet_setenv
  #endif
  ```

  Macro-redefining a libc symbol in a translation unit is fragile (any later `<cstdlib>` re-include path reaches a tokenised `setenv`). Inline-rename the call sites to `parakeet_setenv` and drop the `#define`, or wrap as `inline int compat_setenv(...) { ... }` and call that.

- `PARAKEET_FLASH_ATTN` ON-by-default vs `PARAKEET_EXPERIMENTAL_FLASH_ATTN` macro.

  ```cmake
  if (GGML_METAL)
      set(PARAKEET_FLASH_ATTN_DEFAULT ON)
  else()
      set(PARAKEET_FLASH_ATTN_DEFAULT OFF)
  endif()
  option(PARAKEET_FLASH_ATTN
      "parakeet: enable fused flash-attn in MHA (default ON for Metal; OFF elsewhere pending per-backend A/B)"
      ${PARAKEET_FLASH_ATTN_DEFAULT})
  if (PARAKEET_FLASH_ATTN)
      target_compile_definitions(parakeet PRIVATE PARAKEET_EXPERIMENTAL_FLASH_ATTN)
  endif()
  ```

  The CMake option is shipped on-by-default on Metal but the C++ side gates on `PARAKEET_EXPERIMENTAL_FLASH_ATTN`. If the path is good enough to be the Metal default, drop the `EXPERIMENTAL_` prefix from the macro. Otherwise gate the option default to OFF until the prefix goes away.

- `parakeet-cpp/scripts/download-all-models.sh`: no integrity verification.

  ```bash
  fetch() {
      local url="$1" dest="$2"
      if [[ -f "$dest" ]]; then
          local sz; sz=$(stat -f%z "$dest" 2>/dev/null || stat -c%s "$dest")
          echo " exists: $dest ($(bytes_human "$sz")) — skipping"
          return 0
      fi
      mkdir -p "$(dirname "$dest")"
      echo " fetching: $url"
      echo " -> $dest"
      curl -L --fail --progress-bar -o "$dest.tmp" "$url"
      mv "$dest.tmp" "$dest"
      ...
  }
  ```

  A corrupted partial download silently succeeds: the failure surfaces later in `convert-nemo-to-gguf.py` with a confusing "tar: unexpected EOF" rather than at fetch time. Pin to a specific HF revision (`/resolve/<sha>/...`) and add a `sha256sum -c` step, or at minimum a size sanity check vs an expected number.

- `parakeet-cpp/scripts/convert-nemo-to-gguf.py`: `--hf-repo` defaults to CTC 0.6B.

  ```python
  p.add_argument("--hf-repo", default="nvidia/parakeet-ctc-0.6b",
                 help="HF model id to download from if --ckpt is missing.")
  ```

  Self-documented as a footgun in the docstring (line 27-29). Auto-derive `--hf-repo` from the `--ckpt` filename when it follows the `<model>.nemo` convention (e.g. `parakeet-tdt-0.6b-v3.nemo` → `nvidia/parakeet-tdt-0.6b-v3`), and fall back to error-out instead of CTC when the filename doesn't match a known prefix. Saves a class of "I downloaded the wrong weights" support tickets.

- `parakeet-cpp/test/test_streaming.cpp` and `test_decoder_determinism.cpp`: fragile WAV reader.

  ```cpp
  FILE * f = std::fopen(opts.wav_path.c_str(), "rb");
  ...
  std::fseek(f, 0, SEEK_END);
  long sz = std::ftell(f);
  std::fseek(f, 44, SEEK_SET);
  std::vector<int16_t> i16((sz - 44) / 2);
  std::fread(i16.data(), 2, i16.size(), f);
  ```

  Hard-codes a 44-byte WAV header. Fine for the committed `test/samples/*.wav` fixtures, but any RIFF chunk besides the canonical `fmt ` (e.g. `LIST INFO`, BWF `bext`) shifts the data offset and the test silently mis-parses samples. Either route through `parakeet::load_wav_mono_f32` (already linked via `parakeet`), or scan for the `data` chunk header.

- `parakeet-cpp/src/energy_vad.h`: fixed 64 KB buffer per session.

  ```cpp
  int window_pos_ = 0;     // write index into window_sq_
  float window_sq_[16000]; // big enough for window_ms <= 1 s @ 16 kHz
  ```

  `EnergyVad` is heap-allocated per `StreamSession` so this is 64 KB per live session. That is fine for typical use, but the 1-s @ 16 kHz cap is an implicit limit hidden in the field declaration. Either move to `std::vector<float>` sized at construction (cheap; one allocation per session) or document the cap on the constructor.
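For the fragile WAV reader item above, the "scan for the `data` chunk header" option could look roughly like the sketch below. It is not the project's code: `WavData`, `read_u32_le`, and `find_wav_data_chunk` are hypothetical names, and it assumes the whole file has already been read into memory. It walks the RIFF chunk list (4-byte id, 4-byte little-endian size, payload, one pad byte after odd-sized payloads) until it reaches `data`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Byte range of the PCM payload inside a RIFF/WAVE buffer (hypothetical type).
struct WavData {
    std::size_t offset = 0;     // first byte of the `data` chunk payload
    std::size_t size   = 0;     // payload size in bytes, clamped to the buffer
    bool        found  = false;
};

static std::uint32_t read_u32_le(const std::uint8_t * p) {
    return (std::uint32_t) p[0] | ((std::uint32_t) p[1] << 8) |
           ((std::uint32_t) p[2] << 16) | ((std::uint32_t) p[3] << 24);
}

// Walk the chunk list instead of assuming the payload starts at byte 44.
WavData find_wav_data_chunk(const std::vector<std::uint8_t> & buf) {
    WavData out;
    if (buf.size() < 12 ||
        std::memcmp(buf.data(), "RIFF", 4) != 0 ||
        std::memcmp(buf.data() + 8, "WAVE", 4) != 0) {
        return out;  // not a RIFF/WAVE container
    }
    std::size_t pos = 12;
    while (pos + 8 <= buf.size()) {
        const std::uint32_t chunk_size = read_u32_le(buf.data() + pos + 4);
        const std::size_t   payload    = pos + 8;
        const std::size_t   remaining  = buf.size() - payload;
        if (std::memcmp(buf.data() + pos, "data", 4) == 0) {
            out.offset = payload;
            out.size   = chunk_size < remaining ? chunk_size : remaining;
            out.found  = true;
            assert(out.offset + out.size <= buf.size());
            return out;
        }
        // Chunks are word-aligned: an odd-sized payload is followed by one pad byte.
        const std::size_t advance = (std::size_t) chunk_size + (chunk_size & 1u);
        if (advance > remaining) {
            break;  // truncated or malformed chunk
        }
        pos = payload + advance;
    }
    return out;
}
```

A test harness would then convert the `size / 2` little-endian 16-bit samples starting at `offset`, rather than seeking to a fixed byte 44.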
Documentation polish
- `parakeet-cpp/README.md` still reads as the standalone repo's README.

  ````markdown
  ## 1. Clone and build
  ```bash
  git clone <this-repo> parakeet.cpp
  cd parakeet.cpp
  ./scripts/setup-ggml.sh
  cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
  ````

  When this is consumed as a subtree under `qvac-ext-lib-whisper.cpp/parakeet-cpp/`, the `git clone` instruction doesn't apply and the CWD is `parakeet-cpp/`, not `parakeet.cpp/`. Add a header note clarifying this is the in-tree subtree variant, or update §1 to do `cd parakeet-cpp` after a top-level repo clone.

- Top-level `README.md` (whisper.cpp) doesn't mention parakeet-cpp. `grep -i parakeet README.md` returns nothing. The PR description correctly notes the existing whisper build/headers/vcpkg surface is untouched, but a one-line discovery pointer ("This repository also vendors `parakeet-cpp/`; see parakeet-cpp/README.md") would help new contributors / users who land on the top-level README and would otherwise miss the new subtree entirely.
What I liked
- Three ggml patches are tight, well-documented, and individually drop-out-able once upstream catches up. The byte-equal-to-upstream guarantee when the trigger is off is exactly the right contract for a downstream fork.
- `EncoderGraph` LRU cache (`k_encoder_graph_cache_max = 3`) + shape-keyed `(T_mel, n_layers, all_valid)` lookup is the right shape for streaming workloads.
- Mel preprocess optimisations (real-FFT pack, `thread_local` twiddle cache, `MelState` reuse) are clean and clearly bench-driven.
- `EngineOptions::prewarm` + `test-decoder-determinism --prewarm` gate is a great pattern: test the prewarm contract instead of trusting it.
- Adreno-6xx CPU fallback policy and the `PARAKEET_ALLOW_ADRENO_6XX=1` opt-out are exactly the right shape for a real production environment.
- `parakeet_apply_ggml_prefix` + the companion `ggml-backend-reg-filename-prefix.patch` cleanly solves the in-process ggml-collision problem the QVAC speech stack will hit.
- `BackendDevice` + `backend_name()` reflect the resolved backend after fallbacks, not the requested one, which matches what consumers actually need to log.
- Test labels (`unit` / `fixture` / `perf` / `gpu`) + `parakeet_register_test(REQUIRES …)` auto-disable for missing fixtures keeps `ctest` green on a fresh checkout.
Approving with the polish notes above.
gianni-cor
left a comment
Posting items 1-5 from my approval as inline review comments so they're easier to thread / triage in the file diff. Same content as the approval body, just relocated to the offending lines.
```cpp
namespace {

std::atomic<ggml_log_callback> g_callback{nullptr};
void * g_user_data = nullptr;
```
Race condition: g_user_data is non-atomic.
log_set_callback writes g_user_data (non-atomic) before storing g_callback with release ordering; log_impl reads g_callback (acquire) and then g_user_data. Concurrent log_set_callback vs log_impl can deliver the new callback with the old user_data (or vice versa) — the publish of (cb, user_data) is not atomic as a pair.
Fix options:
- Make it `std::atomic<void *> g_user_data{nullptr}` and store `user_data` before `cb` (so the acquire-load of `cb` synchronises with both writes). Reads are then data-race-free as long as `log_impl` loads `cb` first and then `user_data`, although a reader that still observes the old `cb` can pick up the new `user_data`, so the pair is still not published atomically.
- Or pack `(cb, user_data)` into a single `struct Sink { ggml_log_callback cb; void * ud; }`, hold it in a `std::shared_ptr<Sink>`, and use `std::atomic_store` / `std::atomic_load` on the shared_ptr (or `std::atomic<std::shared_ptr<Sink>>` in C++20). Atomic swap of the pair makes the race impossible.
Low-impact in the QVAC use case (we're unlikely to call parakeet_log_set concurrently with logging in flight), but the public C entry point in <parakeet/log.h> invites that pattern from host applications.
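As a rough illustration of the second option, here is a minimal sketch of publishing the callback and its user data as one immutable object. It is an assumption-laden stand-in rather than the project's code: `log_callback` replaces `ggml_log_callback` with a simplified signature, and `g_sink`, `log_set_callback`, and `log_impl` are reduced to the bare swap/load logic. It uses the `std::atomic_load` / `std::atomic_store` overloads for `std::shared_ptr` from `<memory>`, which are deprecated in C++20 in favour of `std::atomic<std::shared_ptr<T>>`.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Simplified, hypothetical stand-in for ggml_log_callback.
using log_callback = void (*)(int level, const char * text, void * user_data);

// Callback and user data travel together in one immutable object.
struct Sink {
    log_callback cb;
    void *       ud;
};

// Only ever touched through std::atomic_load / std::atomic_store.
static std::shared_ptr<const Sink> g_sink;

void log_set_callback(log_callback cb, void * ud) {
    std::shared_ptr<const Sink> next;
    if (cb) {
        next = std::make_shared<const Sink>(Sink{cb, ud});
    }
    // One atomic publish: readers see either the old pair or the new pair.
    std::atomic_store(&g_sink, std::move(next));
}

// Returns true when a callback was installed and invoked.
bool log_impl(int level, const char * text) {
    const std::shared_ptr<const Sink> s = std::atomic_load(&g_sink);
    if (!s) {
        return false;
    }
    assert(s->cb != nullptr);  // sinks are only created for non-null callbacks
    s->cb(level, text, s->ud);
    return true;
}
```

Because a reader holds its own `shared_ptr` copy for the duration of the call, a concurrent `log_set_callback` cannot free the sink out from under it, and the callback always runs with the `user_data` it was registered with.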
```cpp
DiarizationResult diar;
{
    const float * win = ring.data() + off;
    diar = engine_impl_diarize_helper(*engine_impl, win, n, opts.sample_rate, diopts);
```
SortformerStreamSession aliases the engine's MelState.
engine_impl_diarize_helper writes to engine_impl->mel_state. StreamSession::Impl already carries its own MelState mel_state member (line 743) for exactly this reason — the encoder + decoder pipelines run independently of the parent Engine's mel scratch. The Sortformer streaming path doesn't follow that pattern: every process_chunk here clobbers the engine-owned state.
This matters in two scenarios that the public API allows:
- Two `diarize_start` sessions on the same Engine. Even with serialised `feed_pcm_*` calls, both sessions' `process_chunk` would alias the same `mel_state` buffer.
- `Engine::diarize()` while a `SortformerStreamSession` is running. The engine's own one-shot diarize path also uses `pimpl_->mel_state` (line ~552), so a `diarize_samples()` racing with a streaming session's `feed_pcm_f32()` triggers the same alias.
Fix: lift MelState mel_state onto SortformerStreamSession::Impl (mirrors what StreamSession::Impl does) and pass it through to engine_impl_diarize_helper via an extra param, or document the constraint as "one stream/diarize call per Engine instance at a time" and audit engine.h to make that explicit alongside the existing single-thread-per-Engine note.
Fix #1 is preferable; it's a 4-line change and removes the constraint entirely.
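The shape of fix #1 can be sketched as follows. Everything here is a hypothetical stand-in: `MelState` is reduced to a bare scratch buffer, `diarize_window` plays the role of a helper that receives the scratch as an extra parameter, and `SortformerSessionSketch` is not the real `SortformerStreamSession::Impl`. The point is only that each session passes its own scratch, so two sessions can no longer clobber each other.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real MelState: a reusable scratch buffer.
struct MelState {
    std::vector<float> scratch;
};

// Hypothetical helper shape: the caller supplies the scratch, so the helper
// never reaches into engine-owned state. Returns the window energy as a
// placeholder for the real diarization result.
float diarize_window(MelState & mel, const float * pcm, std::size_t n) {
    assert(pcm != nullptr || n == 0);
    mel.scratch.assign(pcm, pcm + n);  // stage the window in caller-owned scratch
    float energy = 0.0f;
    for (float s : mel.scratch) {
        energy += s * s;
    }
    return energy;
}

// Each streaming session owns its MelState, mirroring what the review says
// StreamSession::Impl already does.
struct SortformerSessionSketch {
    MelState mel_state;

    float feed(const float * pcm, std::size_t n) {
        return diarize_window(mel_state, pcm, n);
    }
};
```

With this layout, interleaved `feed` calls on two sessions leave each session's scratch untouched by the other.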
```cpp
                int n_mels,
                EncoderOutputs & out,
                int max_layers = -1,
                bool capture_intermediates = true);
```
Default capture_intermediates = true is a footgun for new external callers.
Every internal caller passes false (engine.cpp transcribe, transcribe_samples_stream, diarize, prewarm; main.cpp run_once; live-mic / live-mic-attributed via Engine). The 5+ MB device→host roundtrip the comment above warns about only ships to new callers — exactly the population this header surface targets — who didn't read the doc.
The per-stage parity harnesses (test-encoder, test-tdt-encoder-parity, test-sortformer-parity, test-encoder-capture-parity) all pass capture_intermediates=true explicitly today, so flipping the default is safe.
Two equivalent fixes, pick one:
```cpp
int run_encoder(ParakeetCtcModel & model,
                const float * mel, int n_mel_frames, int n_mels,
                EncoderOutputs & out,
                int max_layers = -1,
                bool capture_intermediates = false); // flip default
```

or split into two functions so the choice is at the call site:

```cpp
int run_encoder(...);         // production: encoder_out + CTC logits only
int run_encoder_capture(...); // parity: + per-stage host copies
```

The split shape also makes it easier to grep for parity-only call sites in tests.
```cpp
# define PARAKEET_API __attribute__((visibility("default")))
# endif
#else
# define PARAKEET_API
```
Static-build symbols are emitted with hidden visibility.
Under the default static build (PARAKEET_SHARED undefined), PARAKEET_API expands to nothing. The parakeet target is compiled with CXX_VISIBILITY_PRESET hidden + VISIBILITY_INLINES_HIDDEN ON (CMakeLists.txt line 302-303), so symbols compiled into libparakeet.a carry STV_HIDDEN.
That's fine for static lib → executable. Breaks for our QVAC use case: a Node addon (or any host shared library) that links libparakeet.a and tries to re-export the API surface to JavaScript via its own __attribute__((visibility("default"))) wrappers. The inner parakeet::Engine::transcribe symbol is already marked hidden in the .o, so the addon's wrapper compiles and links but dlsym / N-API can't resolve the indirect call back into the static lib's hidden symbols on some toolchains.
Two fixes, pick one:
1. Make `PARAKEET_API` always visible on ELF/Mach-O even in static builds by changing the `#else` branch to:

   ```cpp
   #else
   # if defined(__GNUC__) || defined(__clang__)
   #  define PARAKEET_API __attribute__((visibility("default")))
   # else
   #  define PARAKEET_API
   # endif
   #endif
   ```

   No effect when the static lib lands in an executable (visibility is irrelevant for the final link); makes the symbols re-exportable from a wrapping shared object.

2. Document in this header (and in `README.md` §1) that consumers wrapping `libparakeet.a` in their own `.so` / `.dylib` must compile parakeet with `-DPARAKEET_SHARED -DPARAKEET_BUILD`.
Option 1 is the right answer for the QVAC speech stack — every addon that consumes this should not need to know the internal visibility convention.
```cpp
// happen to land at 80 ms (16 kHz x hop=160 x sub=8) but new GGUFs may
// differ -- e.g. a 24 kHz checkpoint or a 4x subsampling variant.
inline double encoder_frame_stride_ms(const ParakeetCtcModel & model) {
    const int hop = model.mel_cfg.hop_length;
```
Divide-by-zero waiting to happen when a future GGUF lacks hop_length.
sub and sr below are > 0 ? : default-guarded; hop is not. If a future GGUF is converted with parakeet.preproc.hop_length missing or 0, this returns 0.0, then downstream:
- `transcribe_samples_stream`: `frames_per_window = floor(chunk_ms / 0)` → UB / inf
- `StreamSession::process_window`: `frame_samples = round(sr * 0 / 1000) = 0`, then `left_drop_frames = center_start_sample / 0` → SIGFPE

One-line fix matching the other two fields:

```cpp
const int hop = model.mel_cfg.hop_length > 0 ? model.mel_cfg.hop_length : 160;
```

No current GGUF triggers this; the converter writes `parakeet.preproc.hop_length=160` for every shipped checkpoint. Guarding it here means a future converter / model variant with a missing or zero `hop_length` falls back to the 160-sample default instead of crashing inside the streaming math.
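To make the arithmetic concrete, here is a self-contained sketch of the guarded function. `MelCfg`, `EncoderCfg`, and `ModelSketch` are hypothetical minimal stand-ins for the real `ParakeetCtcModel` fields named above; only the three guarded reads and the stride formula are taken from the review.

```cpp
#include <cassert>

// Hypothetical minimal stand-ins for the config fields named in the review.
struct MelCfg      { int hop_length = 0; int sample_rate = 0; };
struct EncoderCfg  { int subsampling_factor = 0; };
struct ModelSketch { MelCfg mel_cfg; EncoderCfg encoder_cfg; };

// Milliseconds of audio covered by one encoder output frame.
inline double encoder_frame_stride_ms(const ModelSketch & model) {
    // Every field falls back to the current checkpoints' value when the
    // metadata is missing or zero.
    const int hop = model.mel_cfg.hop_length > 0 ? model.mel_cfg.hop_length : 160;
    const int sub = model.encoder_cfg.subsampling_factor > 0 ? model.encoder_cfg.subsampling_factor : 8;
    const int sr  = model.mel_cfg.sample_rate > 0 ? model.mel_cfg.sample_rate : 16000;
    assert(hop > 0 && sub > 0 && sr > 0);  // invariant the downstream divisions rely on
    return 1000.0 * (double) (hop * sub) / (double) sr;
}
```

With all fields missing, the defaults give 1000 × (160 × 8) / 16000 = 80 ms, the stride quoted in the source comment; a 4x subsampling variant at the same hop and sample rate gives 40 ms. In neither case can the result be zero.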
Android app packaging keeps native libraries compressed inside the APK
with no on-disk directory to scan (AGP's `useLegacyPackaging=false`
default since 3.6). The directory-iterator pass in
`ggml_backend_load_best` therefore finds nothing on Android and the
existing per-search_path `fs::exists` filename fallback also returns
false, leaving the loader to return nullptr and the consumer to fail
`init_cpu_backend()`.

For backends that ship as a single library (Vulkan / OpenCL / ...)
the bare `lib<prefix>ggml-<name>.so` filename is enough to resolve
via Android's in-APK linker lookup, but with
`GGML_CPU_ALL_VARIANTS=ON` (the qvac-registry-vcpkg whisper-cpp port
default for Android per QVAC-18993) the CPU backend ships only as
per-arch variants -- there is no plain `libggml-cpu.so` for the
fallback to compose, so the CPU backend silently never registers.

Enumerate the known per-arch Android variants as additional candidate
names for the "cpu" backend and run each through the standard
`ggml_backend_score` selection so the device's HWCAP picks the right
tier (armv8.0 baseline through armv9.2_2; matches the variants list
emitted by `ggml_add_cpu_backend_variant()` in ggml/src/CMakeLists.txt
around lines 410-416).

Fast-path for the size-1 candidate case (every backend on every
non-Android platform, plus Vulkan / OpenCL / Metal / ... on Android):
single load_backend call, identical cost to the previous code path.
The score-then-reload loop only runs when there's an actual choice
to make.

Mirrors qvac-ext-ggml@speech commit 9562ed04 ("ggml-backend: android
per-arch CPU variant dlopen fallback", @GustavoA1604, PR #11). Carried
here as a separate commit on top of the v1.8.4.3 upstream-sync branch
so the whisper-cpp vcpkg port can ship Android dynamic-backend mode
without a port-level patch (`patches/0002-...`).

Validated by an NDK r29 cross-compile of bundled ggml + whisper.cpp
with -DGGML_BACKEND_DL=ON -DBUILD_SHARED_LIBS=OFF
-DGGML_CPU_ALL_VARIANTS=ON -DGGML_CPU_REPACK=ON:
- all 7 per-arch libggml-cpu-android_armv*_*.so produced clean;
- `strings ggml-backend-reg.cpp.o | grep cpu-android_armv`
  confirms the __ANDROID__ block compiles into the dispatcher
  object.

Co-authored-by: Cursor <cursoragent@cursor.com>
Note: of the 125k added lines, most (91k) come from adding `parakeet-cpp/examples/miniaudio.h`, which mirrors `whisper.cpp/examples/miniaudio.h`. I didn't add a symbolic link because I wanted to keep the projects separated.
Summary
Vendors `parakeet-cpp` under `parakeet-cpp/`, a pure C++/ggml inference port of the NVIDIA Parakeet family (FastConformer ASR + Sortformer diarization). One `parakeet::Engine` loads CTC, TDT, EOU, or Sortformer GGUFs and dispatches by metadata; no Python / PyTorch / ONNX Runtime at runtime.

The subtree is fully self-contained:

- `parakeet-cpp/CMakeLists.txt` (`PARAKEET_BUILD_LIBRARY` / `PARAKEET_BUILD_EXECUTABLES` / `PARAKEET_BUILD_TESTS` / `PARAKEET_BUILD_EXAMPLES`); the top-level `whisper.cpp` build is not touched, so existing whisper consumers keep their current shape.
- `ggml` via `add_subdirectory(ggml)` after `scripts/setup-ggml.sh` clones it at the pinned commit (`58c38058`) and applies the three local patches under `parakeet-cpp/patches/` (see below).
- `install(EXPORT parakeet-cpp-targets NAMESPACE parakeet::)` + `parakeet-cppConfig.cmake.in`, so downstream code resolves `parakeet::parakeet` and `ggml::ggml` from a `find_package(parakeet-cpp CONFIG REQUIRED)`.

What ships

- … `parakeet-ctc-{0.6b,1.1b}`), TDT (`parakeet-tdt-{0.6b-v3,1.1b}`, multilingual), EOU (`parakeet_realtime_eou_120m-v1`, `<EOU>` turn-detection token), Sortformer (`diar_sortformer_4spk-v1` / `diar_streaming_sortformer_4spk-v2`, up to 4 speakers)
- … `--diarization-model`.
- `StreamEvent` umbrella (`EndOfTurn` from EOU's `<EOU>`, `VadStateChanged` from Sortformer probs and an opt-in CPU energy-VAD on CTC/TDT).
- `parakeet` CLI with `--bench` / `--profile` / JSONL emit / OpenCL knobs; `live-mic`, `live-mic-attributed` examples on miniaudio.
- … `test-vk-vs-cpu`. CTest labels: `unit`, `fixture`, `perf`, `gpu`.
- `ggml_backend_load_all()` + registry walk; Adreno-6xx OpenCL fallback to CPU is preserved.
- `scripts/convert-nemo-to-gguf.py` (`.nemo` → `.gguf` with f32 / f16 / q8_0 / q5_0 / q4_0); `scripts/dump-{ctc,tdt,eou,sortformer}-reference.py` for the parity harnesses.
- `parakeet-cpp/README.md` (quickstart, GPU build matrix, benchmarks vs onnxruntime, CMake knobs), `parakeet-cpp/PROGRESS.md` (full development history).

ggml patches

Three patches live under `parakeet-cpp/patches/` and are applied in lex order by `scripts/setup-ggml.sh`. Each is a strict no-op when its trigger is not active, so they don't disturb stock ggml consumers.

- `ggml-backend-reg-filename-prefix.patch`: adds a compile-time `GGML_BACKEND_DL_PROJECT_PREFIX` macro to `backend_filename_prefix()` so a host project that renames its bundled `libggml-*` files (parakeet does this via `PARAKEET_GGML_LIB_PREFIX=ON`, default) does not break runtime backend discovery under `GGML_BACKEND_DL=ON`. Macro undefined ⇒ behaviour byte-equal to upstream.
- `ggml-opencl-allow-non-adreno.patch`: opt-in (`GGML_OPENCL_ALLOW_UNKNOWN_GPU=1`) relax of the Adreno/Intel-only device whitelist so dev hosts on NVIDIA / AMD / Intel iGPU can build and parity-test the OpenCL backend without an Adreno device. Adreno production path unchanged.
- `ggml-opencl-program-binary-cache.patch`: persistent on-disk cache for compiled OpenCL kernel binaries via `clCreateProgramWithBinary`, keyed on `(src, opts, driver, dev)` FNV-1a-64 hashes; honours `$GGML_OPENCL_CACHE_DIR` (with `$XDG_CACHE_HOME/ggml/opencl` → `$HOME/.cache/ggml/opencl` fallbacks). Removes the multi-second cold-start `clBuildProgram` wave on Adreno / Mesa / Mali.

`parakeet-cpp/patches/README.md` documents each patch and the drop conditions.

Layout

The `parakeet-cpp/.gitignore` keeps the cloned `ggml/`, `models/`, `artifacts/*.npy`, and any `build*/` trees out of git.

Why land it under `qvac-ext-lib-whisper.cpp/`

Centralises the in-house speech stack alongside whisper.cpp; both libraries share the same ggml backend ecosystem and the same packaging / vcpkg pipeline, so reviewers, CI, and consumers (vcpkg port, addon hosts) only need to track one upstream repo. The subtree is opt-in at build time and does not change the existing whisper build, headers, or vcpkg surface.
Build
From `parakeet-cpp/`:

GPU backends are configure-time:

`ctest --test-dir build --output-on-failure` after the optional `scripts/dump-*-reference.py` step. Missing fixtures auto-disable individual tests so a fresh checkout still gives a green run.

Consumption

The matching vcpkg overlay port (`parakeet-cpp` in `qvac-registry-vcpkg`) consumes the standalone GitHub mirror today; once this PR lands we can flip the port to point at this subtree directly if desired.

Test plan

- `parakeet-cpp/scripts/setup-ggml.sh` applies all three patches cleanly and is idempotent on re-run.
- `cmake -S parakeet-cpp -B parakeet-cpp/build -DCMAKE_BUILD_TYPE=Release && cmake --build parakeet-cpp/build` succeeds with no targets renamed in whisper's build.
- `ctest --test-dir parakeet-cpp/build --output-on-failure` passes (`unit` + `fixture` labels) with the standard CTC / TDT / Sortformer GGUFs + NeMo `.npy` references staged under `parakeet-cpp/{models,artifacts}/`.
- `parakeet-cpp/build/parakeet --model models/parakeet-ctc-0.6b.q8_0.gguf --wav test/samples/jfk.wav` produces the expected JFK transcript.
- … `cmake -S . -B build && cmake --build build`) still succeeds unchanged; the new subtree is not pulled into it.