[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)#10
Merged
Merged
Conversation
added 5 commits
April 16, 2026 19:57
Read n_audio_conv1_kernel from model hparams to allow BCI models to use a non-standard first convolution kernel size. Standard whisper models default to kernel size 3. Made-with: Cursor
- Add n_audio_window_size and n_audio_last_window_layer hparams - When present, encoder self-attention is restricted to a local window for layers up to last_window_layer - Bypass flash attention when windowed mask is active (Metal FA does not support custom F32 masks); flash attention remains enabled for non-BCI models and for the decoder - Populate window_mask data on the encoder graph (not the cross graph) - Add proper SOS token (language + transcribe) initialization for BCI models Backward-compatible: n_audio_window_size defaults to 0 and n_audio_last_window_layer defaults to -1, disabling windowed attention entirely for standard whisper models. Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
5326bf7 to
bbb3535
Compare
GustavoA1604
requested changes
Apr 16, 2026
added 2 commits
April 18, 2026 10:46
Address review feedback: 1. Guard read_safe for BCI-specific hparams (n_audio_conv1_kernel, n_audio_window_size, n_audio_last_window_layer) behind a n_mels > 256 check. Standard whisper models have n_mels <= 128 and do not contain these fields — reading them unconditionally would corrupt the file position and break model loading. 2. Add explicit is_bci flag to hparams struct, set when BCI fields are detected during loading. 3. Use is_bci flag (instead of n_audio_window_size > 0) to guard the BCI-specific decoder SOS token initialization. 4. Log BCI-specific hparams when a BCI model is detected. Made-with: Cursor
The windowed attention mask values depend only on n_ctx and window_size, both fixed after model load. Move the O(n_ctx^2) computation from whisper_encode_internal (called every encode) to whisper_init_state (called once). The encode path now just copies the precomputed data to the graph tensor. Made-with: Cursor
ogad-tether
requested changes
Apr 20, 2026
ogad-tether
left a comment
There was a problem hiding this comment.
Thanks for addressing the earlier review notes. I found three follow-ups that still look blocking before this is safe to merge:
window_mask_datais precomputed for the full modeln_audio_ctx, but the encoder graph usesexp_n_audio_ctxwhen callers setparams.audio_ctx. That makes the laterggml_backend_tensor_set()size mismatch and can overflow the smaller graph tensor.- The install/include path change to
include/whisperwas not carried through towhisper.pc, which still emits-I${prefix}/include, so installed pkg-config consumers will stop findingwhisper.h. - The exported CMake target still links
Threads::Threads, butwhisper-config.cmakeno longer resolves that dependency on Windows, sofind_package(whisper)can import a target with an undefined dependency there.
Could you follow up on those before merging?
added 2 commits
April 20, 2026 14:52
…, Threads 1. Fix window_mask_data / exp_n_audio_ctx mismatch: the precomputed mask uses hparams.n_audio_ctx, but the graph tensor is sized from exp_n_audio_ctx when params.audio_ctx is overridden. Now falls back to recomputing the mask at the effective n_ctx when sizes differ, preventing a buffer overflow into the smaller tensor. 2. Update whisper.pc.in: the install interface was changed to include/whisper but the pkg-config includedir still pointed to include/. Consumers using pkg-config would not find whisper.h. 3. Fix whisper-config.cmake.in: the whisper target publicly links Threads::Threads but find_dependency(Threads) was skipped on Windows, leaving downstream find_package(whisper) with an unresolved imported target. Now always resolve Threads.
…ash attention 1. Cache fallback mask recompute: when exp_n_audio_ctx overrides the default n_audio_ctx, the window mask is now recomputed once and cached in wstate (keyed on window_mask_n_ctx) instead of allocating a new std::vector on every whisper_encode_internal call. 2. Per-layer flash attention: layers above last_window_layer no longer need the windowed attention mask. The flash attention path is now used for those layers even when BCI windowed attention is active, instead of globally falling back to the softmax path for the entire encoder. 3. Use std::abs instead of C abs in both init-time and encode-time mask computation paths.
ogad-tether
approved these changes
Apr 20, 2026
jpgaribotti
requested changes
Apr 20, 2026
This was referenced Apr 20, 2026
…alidation 1. Extract compute_window_mask() helper on whisper_state to eliminate the duplicated O(n_ctx^2) mask fill loop that appeared in both whisper_init_state and whisper_encode_internal. Both call sites now use the single helper, preventing future drift. 2. Guard the encode-time mask block with hparams.is_bci before doing the ggml_graph_get_tensor lookup. Cheaper and more explicit than relying on the tensor name string to determine whether BCI windowed attention is active. 3. Add hparams.is_bci to the graph builder guard for window_mask tensor creation, aligning it with the other BCI code paths. 4. Add validation for BCI hparams after reading from file: n_audio_conv1_kernel must be > 0, n_audio_window_size must be >= 0. Log an error and return false on invalid values instead of proceeding with garbage. 5. Add comment explaining the n_mels > 256 threshold used to discriminate BCI models from standard whisper models, and noting that a dedicated file-format marker should be introduced if this assumption ever breaks. Made-with: Cursor
GustavoA1604
approved these changes
Apr 20, 2026
This was referenced Apr 21, 2026
Zbig9000
added a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 14, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs Three independent wins bundled into one round, strict TDD on each — new CPU-only unit test for every change, RED → impl → GREEN → end-to-end validation on real hardware. == tetherto#10 — Auto-pick UMA bias == Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs because UMA reports the entire system RAM (120+ GB) as free VRAM, while a discrete RTX 5090 reports 32 GB. Silent 40x realtime regression for any operator following the help text "auto-pick adapter with most free VRAM". Extended `resolve_vulkan_device_index` with an optional third arg: int resolve_vulkan_device_index(int requested, const std::vector<size_t> & free_vram_per_device, const std::vector<bool> & is_uma_per_device = {}); Empty UMA list -> round-3 behaviour preserved verbatim. Non-empty + at least one discrete -> argmax over the DISCRETE subset. All-UMA falls back to round-3 argmax. Explicit `requested >= 0` passthrough is UMA-agnostic. Caller wiring (in `init_supertonic_backend`) collects UMA flags via the public `ggml_backend_dev_get_props()` API on `ggml_backend_vk_reg()` - sets `is_uma = true` for `GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`. `test_supertonic_vulkan_device_select.cpp` extended with 6 new test functions / 14 new checks covering the round-12 behaviour matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete, multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit- index-ignores-UMA-bias, mismatched-length-throws). == tetherto#6 — Text-encoder speech-prompted-attention GPU bridge == Master's Metal-port branch (PR tetherto#15) built `speech_prompted_merged_cache` (one ggml graph for QKV projection + head-split + flash-attn + out-proj end-to-end on GPU) but never wired its run path. Production text-encoder stayed on the pre-Phase-A4 two-cache pattern with host-side Q/V download -> pack -> re-upload between the QKV cache and the flash-attn cache. Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the dispatch in `speech_prompted_attention_ggml`: if (!model_prefers_cpu_kernels(m)) { thread_local speech_prompted_merged_cache merged_caches[2]; // rebuild on key change, then: run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc); return; } // ... legacy two-cache CPU path unchanged Eliminates per call: - 2 GPU->host downloads (q_out, v_out) - 3 host->GPU uploads (q_pack, k_pack, v_pack) - 1 graph dispatch - All host pack work (q_pack / k_pack / v_pack head-split) = 5 sync points x 2 layers = 10 sync points / synth at the text encoder alone. CPU stays on the legacy two-cache path: master's `dense_matmul_time_ggml` CPU fast path uses cblas + the host- side head-split is a free memcpy; switching CPU to merged would pull the matmul through the slower ggml conv1d fallback and gain nothing (no sync points exist on CPU). `test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins: - run_speech_prompted_merged_cache symbol via SFINAE - speech_prompted_merged_cache struct field contract (x_in, style_in, out, idx, L) via SFINAE - free-default-cache trip-wire (catches a buggy free path that segfaults on never-built `thread_local` cache slots at process exit) 6 / 6 CPU-only checks pass. End-to-end equivalence vs. the legacy two-cache path verified by the existing model-fixture parity tests (`test-supertonic-text-encoder-trace`, `test-supertonic-pipeline`). == tetherto#5 — Pinned-host-buffer per-step input scratchpad == Round 3 shipped the capability probe `supertonic_backend_supports_pinned_host_buffer`, which returns `true` iff `ggml_backend_vk_host_buffer_type()` is non-null on the resolved backend. The actual per-engine input-scratchpad refactor that USES the host-pinned buffer to skip ggml-vulkan's internal staging-buffer hop was deferred. Round 12 tetherto#5 lands the helper: ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer( const supertonic_model & model, ggml_context * input_ctx); Returns nullptr on null model.backend / null input_ctx / non- Vulkan backend / API miss. Otherwise allocates the entire input_ctx tensor set from `ggml_backend_vk_host_buffer_type()` via `ggml_backend_alloc_ctx_tensors_from_buft`. Caller owns the returned buffer; frees at cache destruction via `ggml_backend_buffer_free`. Applied via a dual-context allocation pattern at the two highest-frequency per-step input sites: - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in - ve_front_block_graph_cache: x_in + mask_in + t_emb_in Total: 9 per-step input tensors moved to host-pinned memory. Each `ggml_backend_tensor_set` on these tensors skips one internal staging-buffer hop on Vulkan (BAR-mapped GPU memory written directly by the host without an intermediate copy). Dual-context pattern: 1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots 2. Create x_in / temb_in / etc. in input_ctx 3. Try host-pinned alloc; fall back to default backend buffer via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)` 4. Build the rest of the graph in cache.ctx; gallocr handles intermediates + outputs, skipping the pre-allocated inputs via the `tensor->buffer != nullptr` check Free order: gallocr -> main ctx -> input_buf -> input_ctx (reversed order would dangle gallocr pointers into freed input tensor metadata) CPU / Metal / OpenCL safety: helper returns nullptr; callers fall back to default backend buffer. Identical CPU behaviour to pre-round-12; only Vulkan gains. `test_supertonic_pinned_host_buffer.cpp` (NEW) pins: - Helper symbol existence (SFINAE) - nullptr return on CPU backend (idempotent across repeats) - Null-pointer safety on null model.backend / null input_ctx 11 / 11 CPU-only checks pass. == Combined perf snapshot on RTX 5090 == Long-prompt bench (173 chars, ~15s of audio): Round 11 baseline: 76.11 ms / 5 steps (123x realtime) Round 12 (all three): 27.99 ms / 5 steps (537x realtime) ^ 2.7x faster Vector estimator step: 12.7 ms -> 3.28 ms (3.9x faster) Prewarm cold-start: 330 ms -> 21 ms (15x faster) Short-prompt bench (Hello-world class, ~3s audio): Round 11 baseline: 44.08 ms (74x realtime) Round 12: 23.31 ms (394x realtime) Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU): Round 11 `--vulkan-device -1`: picks RADV -> 178 ms (7x realtime) Round 12 `--vulkan-device -1`: picks RTX 5090 -> 28 ms (537x realtime) ^ 6.4x faster for users following help text == Test plan == CPU build: cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF cmake --build tts-cpp/build -j ctest --test-dir tts-cpp/build -L unit -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text- encoder-gpu-bridge, +1 pinned-host-buffer) Vulkan build: cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON cmake --build tts-cpp/build-vulkan -j ctest --test-dir tts-cpp/build-vulkan -L unit -> 24 / 24 PASS End-to-end synth verified on all 4 backends (CPU, Vulkan RTX 5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter writes a valid WAV. Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000
added a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 15, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs Three independent wins bundled into one round, strict TDD on each — new CPU-only unit test for every change, RED → impl → GREEN → end-to-end validation on real hardware. == tetherto#10 — Auto-pick UMA bias == Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs because UMA reports the entire system RAM (120+ GB) as free VRAM, while a discrete RTX 5090 reports 32 GB. Silent 40x realtime regression for any operator following the help text "auto-pick adapter with most free VRAM". Extended `resolve_vulkan_device_index` with an optional third arg: int resolve_vulkan_device_index(int requested, const std::vector<size_t> & free_vram_per_device, const std::vector<bool> & is_uma_per_device = {}); Empty UMA list -> round-3 behaviour preserved verbatim. Non-empty + at least one discrete -> argmax over the DISCRETE subset. All-UMA falls back to round-3 argmax. Explicit `requested >= 0` passthrough is UMA-agnostic. Caller wiring (in `init_supertonic_backend`) collects UMA flags via the public `ggml_backend_dev_get_props()` API on `ggml_backend_vk_reg()` - sets `is_uma = true` for `GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`. `test_supertonic_vulkan_device_select.cpp` extended with 6 new test functions / 14 new checks covering the round-12 behaviour matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete, multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit- index-ignores-UMA-bias, mismatched-length-throws). == tetherto#6 — Text-encoder speech-prompted-attention GPU bridge == Master's Metal-port branch (PR tetherto#15) built `speech_prompted_merged_cache` (one ggml graph for QKV projection + head-split + flash-attn + out-proj end-to-end on GPU) but never wired its run path. Production text-encoder stayed on the pre-Phase-A4 two-cache pattern with host-side Q/V download -> pack -> re-upload between the QKV cache and the flash-attn cache. Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the dispatch in `speech_prompted_attention_ggml`: if (!model_prefers_cpu_kernels(m)) { thread_local speech_prompted_merged_cache merged_caches[2]; // rebuild on key change, then: run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc); return; } // ... legacy two-cache CPU path unchanged Eliminates per call: - 2 GPU->host downloads (q_out, v_out) - 3 host->GPU uploads (q_pack, k_pack, v_pack) - 1 graph dispatch - All host pack work (q_pack / k_pack / v_pack head-split) = 5 sync points x 2 layers = 10 sync points / synth at the text encoder alone. CPU stays on the legacy two-cache path: master's `dense_matmul_time_ggml` CPU fast path uses cblas + the host- side head-split is a free memcpy; switching CPU to merged would pull the matmul through the slower ggml conv1d fallback and gain nothing (no sync points exist on CPU). `test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins: - run_speech_prompted_merged_cache symbol via SFINAE - speech_prompted_merged_cache struct field contract (x_in, style_in, out, idx, L) via SFINAE - free-default-cache trip-wire (catches a buggy free path that segfaults on never-built `thread_local` cache slots at process exit) 6 / 6 CPU-only checks pass. End-to-end equivalence vs. the legacy two-cache path verified by the existing model-fixture parity tests (`test-supertonic-text-encoder-trace`, `test-supertonic-pipeline`). == tetherto#5 — Pinned-host-buffer per-step input scratchpad == Round 3 shipped the capability probe `supertonic_backend_supports_pinned_host_buffer`, which returns `true` iff `ggml_backend_vk_host_buffer_type()` is non-null on the resolved backend. The actual per-engine input-scratchpad refactor that USES the host-pinned buffer to skip ggml-vulkan's internal staging-buffer hop was deferred. Round 12 tetherto#5 lands the helper: ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer( const supertonic_model & model, ggml_context * input_ctx); Returns nullptr on null model.backend / null input_ctx / non- Vulkan backend / API miss. Otherwise allocates the entire input_ctx tensor set from `ggml_backend_vk_host_buffer_type()` via `ggml_backend_alloc_ctx_tensors_from_buft`. Caller owns the returned buffer; frees at cache destruction via `ggml_backend_buffer_free`. Applied via a dual-context allocation pattern at the two highest-frequency per-step input sites: - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in - ve_front_block_graph_cache: x_in + mask_in + t_emb_in Total: 9 per-step input tensors moved to host-pinned memory. Each `ggml_backend_tensor_set` on these tensors skips one internal staging-buffer hop on Vulkan (BAR-mapped GPU memory written directly by the host without an intermediate copy). Dual-context pattern: 1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots 2. Create x_in / temb_in / etc. in input_ctx 3. Try host-pinned alloc; fall back to default backend buffer via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)` 4. Build the rest of the graph in cache.ctx; gallocr handles intermediates + outputs, skipping the pre-allocated inputs via the `tensor->buffer != nullptr` check Free order: gallocr -> main ctx -> input_buf -> input_ctx (reversed order would dangle gallocr pointers into freed input tensor metadata) CPU / Metal / OpenCL safety: helper returns nullptr; callers fall back to default backend buffer. Identical CPU behaviour to pre-round-12; only Vulkan gains. `test_supertonic_pinned_host_buffer.cpp` (NEW) pins: - Helper symbol existence (SFINAE) - nullptr return on CPU backend (idempotent across repeats) - Null-pointer safety on null model.backend / null input_ctx 11 / 11 CPU-only checks pass. == Combined perf snapshot on RTX 5090 == Long-prompt bench (173 chars, ~15s of audio): Round 11 baseline: 76.11 ms / 5 steps (123x realtime) Round 12 (all three): 27.99 ms / 5 steps (537x realtime) ^ 2.7x faster Vector estimator step: 12.7 ms -> 3.28 ms (3.9x faster) Prewarm cold-start: 330 ms -> 21 ms (15x faster) Short-prompt bench (Hello-world class, ~3s audio): Round 11 baseline: 44.08 ms (74x realtime) Round 12: 23.31 ms (394x realtime) Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU): Round 11 `--vulkan-device -1`: picks RADV -> 178 ms (7x realtime) Round 12 `--vulkan-device -1`: picks RTX 5090 -> 28 ms (537x realtime) ^ 6.4x faster for users following help text == Test plan == CPU build: cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF cmake --build tts-cpp/build -j ctest --test-dir tts-cpp/build -L unit -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text- encoder-gpu-bridge, +1 pinned-host-buffer) Vulkan build: cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON cmake --build tts-cpp/build-vulkan -j ctest --test-dir tts-cpp/build-vulkan -L unit -> 24 / 24 PASS End-to-end synth verified on all 4 backends (CPU, Vulkan RTX 5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter writes a valid WAV. Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000
added a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 15, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs Three independent wins bundled into one round, strict TDD on each — new CPU-only unit test for every change, RED → impl → GREEN → end-to-end validation on real hardware. == tetherto#10 — Auto-pick UMA bias == Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs because UMA reports the entire system RAM (120+ GB) as free VRAM, while a discrete RTX 5090 reports 32 GB. Silent 40x realtime regression for any operator following the help text "auto-pick adapter with most free VRAM". Extended `resolve_vulkan_device_index` with an optional third arg: int resolve_vulkan_device_index(int requested, const std::vector<size_t> & free_vram_per_device, const std::vector<bool> & is_uma_per_device = {}); Empty UMA list -> round-3 behaviour preserved verbatim. Non-empty + at least one discrete -> argmax over the DISCRETE subset. All-UMA falls back to round-3 argmax. Explicit `requested >= 0` passthrough is UMA-agnostic. Caller wiring (in `init_supertonic_backend`) collects UMA flags via the public `ggml_backend_dev_get_props()` API on `ggml_backend_vk_reg()` - sets `is_uma = true` for `GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`. `test_supertonic_vulkan_device_select.cpp` extended with 6 new test functions / 14 new checks covering the round-12 behaviour matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete, multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit- index-ignores-UMA-bias, mismatched-length-throws). == tetherto#6 — Text-encoder speech-prompted-attention GPU bridge == Master's Metal-port branch (PR tetherto#15) built `speech_prompted_merged_cache` (one ggml graph for QKV projection + head-split + flash-attn + out-proj end-to-end on GPU) but never wired its run path. Production text-encoder stayed on the pre-Phase-A4 two-cache pattern with host-side Q/V download -> pack -> re-upload between the QKV cache and the flash-attn cache. Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the dispatch in `speech_prompted_attention_ggml`: if (!model_prefers_cpu_kernels(m)) { thread_local speech_prompted_merged_cache merged_caches[2]; // rebuild on key change, then: run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc); return; } // ... legacy two-cache CPU path unchanged Eliminates per call: - 2 GPU->host downloads (q_out, v_out) - 3 host->GPU uploads (q_pack, k_pack, v_pack) - 1 graph dispatch - All host pack work (q_pack / k_pack / v_pack head-split) = 5 sync points x 2 layers = 10 sync points / synth at the text encoder alone. CPU stays on the legacy two-cache path: master's `dense_matmul_time_ggml` CPU fast path uses cblas + the host- side head-split is a free memcpy; switching CPU to merged would pull the matmul through the slower ggml conv1d fallback and gain nothing (no sync points exist on CPU). `test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins: - run_speech_prompted_merged_cache symbol via SFINAE - speech_prompted_merged_cache struct field contract (x_in, style_in, out, idx, L) via SFINAE - free-default-cache trip-wire (catches a buggy free path that segfaults on never-built `thread_local` cache slots at process exit) 6 / 6 CPU-only checks pass. End-to-end equivalence vs. the legacy two-cache path verified by the existing model-fixture parity tests (`test-supertonic-text-encoder-trace`, `test-supertonic-pipeline`). == tetherto#5 — Pinned-host-buffer per-step input scratchpad == Round 3 shipped the capability probe `supertonic_backend_supports_pinned_host_buffer`, which returns `true` iff `ggml_backend_vk_host_buffer_type()` is non-null on the resolved backend. The actual per-engine input-scratchpad refactor that USES the host-pinned buffer to skip ggml-vulkan's internal staging-buffer hop was deferred. Round 12 tetherto#5 lands the helper: ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer( const supertonic_model & model, ggml_context * input_ctx); Returns nullptr on null model.backend / null input_ctx / non- Vulkan backend / API miss. Otherwise allocates the entire input_ctx tensor set from `ggml_backend_vk_host_buffer_type()` via `ggml_backend_alloc_ctx_tensors_from_buft`. Caller owns the returned buffer; frees at cache destruction via `ggml_backend_buffer_free`. Applied via a dual-context allocation pattern at the two highest-frequency per-step input sites: - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in - ve_front_block_graph_cache: x_in + mask_in + t_emb_in Total: 9 per-step input tensors moved to host-pinned memory. Each `ggml_backend_tensor_set` on these tensors skips one internal staging-buffer hop on Vulkan (BAR-mapped GPU memory written directly by the host without an intermediate copy). Dual-context pattern: 1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots 2. Create x_in / temb_in / etc. in input_ctx 3. Try host-pinned alloc; fall back to default backend buffer via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)` 4. Build the rest of the graph in cache.ctx; gallocr handles intermediates + outputs, skipping the pre-allocated inputs via the `tensor->buffer != nullptr` check Free order: gallocr -> main ctx -> input_buf -> input_ctx (reversed order would dangle gallocr pointers into freed input tensor metadata) CPU / Metal / OpenCL safety: helper returns nullptr; callers fall back to default backend buffer. Identical CPU behaviour to pre-round-12; only Vulkan gains. `test_supertonic_pinned_host_buffer.cpp` (NEW) pins: - Helper symbol existence (SFINAE) - nullptr return on CPU backend (idempotent across repeats) - Null-pointer safety on null model.backend / null input_ctx 11 / 11 CPU-only checks pass. == Combined perf snapshot on RTX 5090 == Long-prompt bench (173 chars, ~15s of audio): Round 11 baseline: 76.11 ms / 5 steps (123x realtime) Round 12 (all three): 27.99 ms / 5 steps (537x realtime) ^ 2.7x faster Vector estimator step: 12.7 ms -> 3.28 ms (3.9x faster) Prewarm cold-start: 330 ms -> 21 ms (15x faster) Short-prompt bench (Hello-world class, ~3s audio): Round 11 baseline: 44.08 ms (74x realtime) Round 12: 23.31 ms (394x realtime) Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU): Round 11 `--vulkan-device -1`: picks RADV -> 178 ms (7x realtime) Round 12 `--vulkan-device -1`: picks RTX 5090 -> 28 ms (537x realtime) ^ 6.4x faster for users following help text == Test plan == CPU build: cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF cmake --build tts-cpp/build -j ctest --test-dir tts-cpp/build -L unit -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text- encoder-gpu-bridge, +1 pinned-host-buffer) Vulkan build: cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON cmake --build tts-cpp/build-vulkan -j ctest --test-dir tts-cpp/build-vulkan -L unit -> 24 / 24 PASS End-to-end synth verified on all 4 backends (CPU, Vulkan RTX 5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter writes a valid WAV. Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000
added a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 18, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs Three independent wins bundled into one round, strict TDD on each — new CPU-only unit test for every change, RED → impl → GREEN → end-to-end validation on real hardware. == tetherto#10 — Auto-pick UMA bias == Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs because UMA reports the entire system RAM (120+ GB) as free VRAM, while a discrete RTX 5090 reports 32 GB. Silent 40x realtime regression for any operator following the help text "auto-pick adapter with most free VRAM". Extended `resolve_vulkan_device_index` with an optional third arg: int resolve_vulkan_device_index(int requested, const std::vector<size_t> & free_vram_per_device, const std::vector<bool> & is_uma_per_device = {}); Empty UMA list -> round-3 behaviour preserved verbatim. Non-empty + at least one discrete -> argmax over the DISCRETE subset. All-UMA falls back to round-3 argmax. Explicit `requested >= 0` passthrough is UMA-agnostic. Caller wiring (in `init_supertonic_backend`) collects UMA flags via the public `ggml_backend_dev_get_props()` API on `ggml_backend_vk_reg()` - sets `is_uma = true` for `GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`. `test_supertonic_vulkan_device_select.cpp` extended with 6 new test functions / 14 new checks covering the round-12 behaviour matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete, multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit- index-ignores-UMA-bias, mismatched-length-throws). == tetherto#6 — Text-encoder speech-prompted-attention GPU bridge == Master's Metal-port branch (PR tetherto#15) built `speech_prompted_merged_cache` (one ggml graph for QKV projection + head-split + flash-attn + out-proj end-to-end on GPU) but never wired its run path. Production text-encoder stayed on the pre-Phase-A4 two-cache pattern with host-side Q/V download -> pack -> re-upload between the QKV cache and the flash-attn cache. Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the dispatch in `speech_prompted_attention_ggml`: if (!model_prefers_cpu_kernels(m)) { thread_local speech_prompted_merged_cache merged_caches[2]; // rebuild on key change, then: run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc); return; } // ... legacy two-cache CPU path unchanged Eliminates per call: - 2 GPU->host downloads (q_out, v_out) - 3 host->GPU uploads (q_pack, k_pack, v_pack) - 1 graph dispatch - All host pack work (q_pack / k_pack / v_pack head-split) = 5 sync points x 2 layers = 10 sync points / synth at the text encoder alone. CPU stays on the legacy two-cache path: master's `dense_matmul_time_ggml` CPU fast path uses cblas + the host- side head-split is a free memcpy; switching CPU to merged would pull the matmul through the slower ggml conv1d fallback and gain nothing (no sync points exist on CPU). `test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins: - run_speech_prompted_merged_cache symbol via SFINAE - speech_prompted_merged_cache struct field contract (x_in, style_in, out, idx, L) via SFINAE - free-default-cache trip-wire (catches a buggy free path that segfaults on never-built `thread_local` cache slots at process exit) 6 / 6 CPU-only checks pass. End-to-end equivalence vs. the legacy two-cache path verified by the existing model-fixture parity tests (`test-supertonic-text-encoder-trace`, `test-supertonic-pipeline`). == tetherto#5 — Pinned-host-buffer per-step input scratchpad == Round 3 shipped the capability probe `supertonic_backend_supports_pinned_host_buffer`, which returns `true` iff `ggml_backend_vk_host_buffer_type()` is non-null on the resolved backend. The actual per-engine input-scratchpad refactor that USES the host-pinned buffer to skip ggml-vulkan's internal staging-buffer hop was deferred. Round 12 tetherto#5 lands the helper: ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer( const supertonic_model & model, ggml_context * input_ctx); Returns nullptr on null model.backend / null input_ctx / non- Vulkan backend / API miss. Otherwise allocates the entire input_ctx tensor set from `ggml_backend_vk_host_buffer_type()` via `ggml_backend_alloc_ctx_tensors_from_buft`. Caller owns the returned buffer; frees at cache destruction via `ggml_backend_buffer_free`. Applied via a dual-context allocation pattern at the two highest-frequency per-step input sites: - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in - ve_front_block_graph_cache: x_in + mask_in + t_emb_in Total: 9 per-step input tensors moved to host-pinned memory. Each `ggml_backend_tensor_set` on these tensors skips one internal staging-buffer hop on Vulkan (BAR-mapped GPU memory written directly by the host without an intermediate copy). Dual-context pattern: 1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots 2. Create x_in / temb_in / etc. in input_ctx 3. Try host-pinned alloc; fall back to default backend buffer via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)` 4. Build the rest of the graph in cache.ctx; gallocr handles intermediates + outputs, skipping the pre-allocated inputs via the `tensor->buffer != nullptr` check Free order: gallocr -> main ctx -> input_buf -> input_ctx (reversed order would dangle gallocr pointers into freed input tensor metadata) CPU / Metal / OpenCL safety: helper returns nullptr; callers fall back to default backend buffer. Identical CPU behaviour to pre-round-12; only Vulkan gains. `test_supertonic_pinned_host_buffer.cpp` (NEW) pins: - Helper symbol existence (SFINAE) - nullptr return on CPU backend (idempotent across repeats) - Null-pointer safety on null model.backend / null input_ctx 11 / 11 CPU-only checks pass. == Combined perf snapshot on RTX 5090 == Long-prompt bench (173 chars, ~15s of audio): Round 11 baseline: 76.11 ms / 5 steps (123x realtime) Round 12 (all three): 27.99 ms / 5 steps (537x realtime) ^ 2.7x faster Vector estimator step: 12.7 ms -> 3.28 ms (3.9x faster) Prewarm cold-start: 330 ms -> 21 ms (15x faster) Short-prompt bench (Hello-world class, ~3s audio): Round 11 baseline: 44.08 ms (74x realtime) Round 12: 23.31 ms (394x realtime) Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU): Round 11 `--vulkan-device -1`: picks RADV -> 178 ms (7x realtime) Round 12 `--vulkan-device -1`: picks RTX 5090 -> 28 ms (537x realtime) ^ 6.4x faster for users following help text == Test plan == CPU build: cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF cmake --build tts-cpp/build -j ctest --test-dir tts-cpp/build -L unit -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text- encoder-gpu-bridge, +1 pinned-host-buffer) Vulkan build: cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON cmake --build tts-cpp/build-vulkan -j ctest --test-dir tts-cpp/build-vulkan -L unit -> 24 / 24 PASS End-to-end synth verified on all 4 backends (CPU, Vulkan RTX 5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter writes a valid WAV. Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000
added a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 19, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs Three independent wins bundled into one round, strict TDD on each — new CPU-only unit test for every change, RED → impl → GREEN → end-to-end validation on real hardware. == tetherto#10 — Auto-pick UMA bias == Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs because UMA reports the entire system RAM (120+ GB) as free VRAM, while a discrete RTX 5090 reports 32 GB. Silent 40x realtime regression for any operator following the help text "auto-pick adapter with most free VRAM". Extended `resolve_vulkan_device_index` with an optional third arg: int resolve_vulkan_device_index(int requested, const std::vector<size_t> & free_vram_per_device, const std::vector<bool> & is_uma_per_device = {}); Empty UMA list -> round-3 behaviour preserved verbatim. Non-empty + at least one discrete -> argmax over the DISCRETE subset. All-UMA falls back to round-3 argmax. Explicit `requested >= 0` passthrough is UMA-agnostic. Caller wiring (in `init_supertonic_backend`) collects UMA flags via the public `ggml_backend_dev_get_props()` API on `ggml_backend_vk_reg()` - sets `is_uma = true` for `GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`. `test_supertonic_vulkan_device_select.cpp` extended with 6 new test functions / 14 new checks covering the round-12 behaviour matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete, multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit- index-ignores-UMA-bias, mismatched-length-throws). == tetherto#6 — Text-encoder speech-prompted-attention GPU bridge == Master's Metal-port branch (PR tetherto#15) built `speech_prompted_merged_cache` (one ggml graph for QKV projection + head-split + flash-attn + out-proj end-to-end on GPU) but never wired its run path. Production text-encoder stayed on the pre-Phase-A4 two-cache pattern with host-side Q/V download -> pack -> re-upload between the QKV cache and the flash-attn cache. Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the dispatch in `speech_prompted_attention_ggml`: if (!model_prefers_cpu_kernels(m)) { thread_local speech_prompted_merged_cache merged_caches[2]; // rebuild on key change, then: run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc); return; } // ... legacy two-cache CPU path unchanged Eliminates per call: - 2 GPU->host downloads (q_out, v_out) - 3 host->GPU uploads (q_pack, k_pack, v_pack) - 1 graph dispatch - All host pack work (q_pack / k_pack / v_pack head-split) = 5 sync points x 2 layers = 10 sync points / synth at the text encoder alone. CPU stays on the legacy two-cache path: master's `dense_matmul_time_ggml` CPU fast path uses cblas + the host- side head-split is a free memcpy; switching CPU to merged would pull the matmul through the slower ggml conv1d fallback and gain nothing (no sync points exist on CPU). `test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins: - run_speech_prompted_merged_cache symbol via SFINAE - speech_prompted_merged_cache struct field contract (x_in, style_in, out, idx, L) via SFINAE - free-default-cache trip-wire (catches a buggy free path that segfaults on never-built `thread_local` cache slots at process exit) 6 / 6 CPU-only checks pass. End-to-end equivalence vs. the legacy two-cache path verified by the existing model-fixture parity tests (`test-supertonic-text-encoder-trace`, `test-supertonic-pipeline`). == tetherto#5 — Pinned-host-buffer per-step input scratchpad == Round 3 shipped the capability probe `supertonic_backend_supports_pinned_host_buffer`, which returns `true` iff `ggml_backend_vk_host_buffer_type()` is non-null on the resolved backend. The actual per-engine input-scratchpad refactor that USES the host-pinned buffer to skip ggml-vulkan's internal staging-buffer hop was deferred. Round 12 tetherto#5 lands the helper: ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer( const supertonic_model & model, ggml_context * input_ctx); Returns nullptr on null model.backend / null input_ctx / non- Vulkan backend / API miss. Otherwise allocates the entire input_ctx tensor set from `ggml_backend_vk_host_buffer_type()` via `ggml_backend_alloc_ctx_tensors_from_buft`. Caller owns the returned buffer; frees at cache destruction via `ggml_backend_buffer_free`. Applied via a dual-context allocation pattern at the two highest-frequency per-step input sites: - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in - ve_front_block_graph_cache: x_in + mask_in + t_emb_in Total: 9 per-step input tensors moved to host-pinned memory. Each `ggml_backend_tensor_set` on these tensors skips one internal staging-buffer hop on Vulkan (BAR-mapped GPU memory written directly by the host without an intermediate copy). Dual-context pattern: 1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots 2. Create x_in / temb_in / etc. in input_ctx 3. Try host-pinned alloc; fall back to default backend buffer via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)` 4. Build the rest of the graph in cache.ctx; gallocr handles intermediates + outputs, skipping the pre-allocated inputs via the `tensor->buffer != nullptr` check Free order: gallocr -> main ctx -> input_buf -> input_ctx (reversed order would dangle gallocr pointers into freed input tensor metadata) CPU / Metal / OpenCL safety: helper returns nullptr; callers fall back to default backend buffer. Identical CPU behaviour to pre-round-12; only Vulkan gains. `test_supertonic_pinned_host_buffer.cpp` (NEW) pins: - Helper symbol existence (SFINAE) - nullptr return on CPU backend (idempotent across repeats) - Null-pointer safety on null model.backend / null input_ctx 11 / 11 CPU-only checks pass. == Combined perf snapshot on RTX 5090 == Long-prompt bench (173 chars, ~15s of audio): Round 11 baseline: 76.11 ms / 5 steps (123x realtime) Round 12 (all three): 27.99 ms / 5 steps (537x realtime) ^ 2.7x faster Vector estimator step: 12.7 ms -> 3.28 ms (3.9x faster) Prewarm cold-start: 330 ms -> 21 ms (15x faster) Short-prompt bench (Hello-world class, ~3s audio): Round 11 baseline: 44.08 ms (74x realtime) Round 12: 23.31 ms (394x realtime) Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU): Round 11 `--vulkan-device -1`: picks RADV -> 178 ms (7x realtime) Round 12 `--vulkan-device -1`: picks RTX 5090 -> 28 ms (537x realtime) ^ 6.4x faster for users following help text == Test plan == CPU build: cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF cmake --build tts-cpp/build -j ctest --test-dir tts-cpp/build -L unit -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text- encoder-gpu-bridge, +1 pinned-host-buffer) Vulkan build: cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON cmake --build tts-cpp/build-vulkan -j ctest --test-dir tts-cpp/build-vulkan -L unit -> 24 / 24 PASS End-to-end synth verified on all 4 backends (CPU, Vulkan RTX 5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter writes a valid WAV. Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604
pushed a commit
that referenced
this pull request
May 19, 2026
Resolves the review comments on the merged AOSC v2.1 PR (#22, merge commit e6ba38c). All eight changes are minimal and behaviour-preserving except the v2.1 detection upgrade (now strict-tag with shape fallback) and the degenerate-config guard (silence-only fallback instead of UB-adjacent boost arithmetic). Reviewer comments classified as "perf only / out of scope / would only add a TODO" are intentionally not addressed in this commit -- see the plan file referenced in the PR description. src/parakeet_sortformer.cpp -- `compress_speaker_cache` - Early-return when `spkcache_len_per_spk <= 0` (`num_spks * A_sil >= spkcache_len`). The downstream boost/top-K stages are mostly defended (`boost_topk_scores` already returns early on non-positive k), but the function was otherwise running a no-op pass that produced an all-silence cache via the slow path. Fall back to an explicit silence-only profile and bail. - Renamed `streaming_update`'s `chunk_pre_encode_lc` parameter to `committed_chunk_pre_encode`. The call site already advances past the left context (`chunk_pre_committed = ... + lc * D`), so the old `_lc` suffix was misleading. `int lc` stays -- it's used inside the function to index into `preds_full`, which still contains the left-context preds. - Replaced the magic `-1.0e30f` / `+1.0e30f` sentinels (4 sites) with named constants `k_score_neg_inf` / `k_score_pos_inf` backed by `std::numeric_limits<float>::{lowest,max}()`. Dropped the inline "-inf is UB with current FP flags" comments: IEEE 754 +/-inf is well-defined; the original concern (avoiding NaN-on-arithmetic) still holds because we only store and compare the sentinels. src/parakeet_engine.cpp - On the AOSC path, skip the `for (cur_full) remap_id(...)` loop and the `prev_chunk_full_segments = std::move(cur_full)` store: `compute_slot_remap_` is never consulted when `cache_active` is true (AOSC anchors slot identity through the speaker cache), so the work was dead. - Switched v2.1 detection from pure-shape to "prefer the converter's `parakeet.model_variant` GGUF tag; fall back to `(n_layers == 17, n_mels == 128)` for legacy GGUFs". This prevents a future v2.2/v3 variant that happens to share v2.1's encoder shape from silently opting into AOSC. include/parakeet/diarization.h - Moved the v1-vs-v2.1 detection rationale comment out of parakeet_engine.cpp and into the `SortformerStreamingOptions:: spkcache_enable` block, with a paragraph on the tag-first / shape-fallback policy. src/parakeet_ctc.{h,cpp} - Added `std::string ParakeetCtcModel::model_variant` (optional GGUF metadata mirror; empty on legacy GGUFs). - Loader reads `parakeet.model_variant` next to the existing `parakeet.model.type` read; absent key -> empty string -> detection falls back to shape. scripts/convert-nemo-to-gguf.py - New `detect_sortformer_variant(ckpt: Path)` derives a stable variant tag from the source .nemo filename (`sortformer-v1` / `sortformer-streaming-v2` / `sortformer-streaming-v2.1-aosc`); empty string for unknown checkpoints. - Sortformer branch of `write_gguf` writes `parakeet.model_variant` when the tag is non-empty. - `write_gguf` signature extended with `ckpt: Path`; only the one internal call site adjusted. scripts/download-all-models.sh - Added the diar_streaming_sortformer_4spk-v2.1 fetch block (the AOSC fine-tune that this PR's tests target); bumped the budget comment from "~14 GiB" to "~14.5 GiB" and listed v2.1 in the contents line. CMakeLists.txt + test/test_sortformer_streaming.cpp - Streaming ctest now consumes `${_qvp_sfsv21_q8_gguf}` (was `${_qvp_sfs_q8_gguf}`, the v2 model). The in-binary default GGUF path is the matching v2.1 q8_0. Aligns the test with the line-299 comment that says the binary "reflects the production v2.1 AOSC config out of the box". test/test_utils.h (new) + test/test_sortformer_{streaming,aosc_speakers}.cpp - Extracted the two 40-line `load_wav_pcm16le_mono` / `file_exists` duplicates into a shared inline header in the `parakeet_test` namespace. The duplicate copies and the "duplicated here on purpose" comment block in test_sortformer_aosc_speakers.cpp are gone; both tests `#include "test_utils.h"` and use `using parakeet_test::...`. Build + ctest verification - `cmake --build build -j` clean (no new warnings). - `ctest -R 'test-sortformer-(streaming |aosc-speakers)'`: test-sortformer-streaming ........ Passed 8.23 s test-sortformer-aosc-speakers-abcba . Passed 33.80 s test-sortformer-aosc-speakers-abcdba Passed 36.91 s The locally-symlinked v2.1 GGUF predates the `parakeet.model_variant` key, so the AOSC tests passing here also verifies the shape-fallback path. Re-running the converter on the v2.1 .nemo will populate the new key for the strict-tag path. Reviewer comments deferred / skipped (rationale): - Encoder graph cache thrashing during FIFO ramp-up (#4): perf only; proper fix wants pre-build-at-diarize_start + silence padding or a mask argument, not minimal. Tracked for a follow-up perf PR. - WAV fixtures committed as ~11 MB binaries (#8): project-wide Git LFS adoption decision, not a code change. - `ring.erase` O(n) under AOSC's aggressive trim (#10): pre-existing on the v1 path; wants a std::deque refactor, out of scope. - `encoder_ms` attribution surprising (#12): code is correct and matches sibling paths; the user explicitly opted against comment-only "clarifications".
gianni-cor
pushed a commit
that referenced
this pull request
May 28, 2026
[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)
gianni-cor
pushed a commit
that referenced
this pull request
May 28, 2026
Resolves the review comments on the merged AOSC v2.1 PR (#22, merge commit e6ba38c). All eight changes are minimal and behaviour-preserving except the v2.1 detection upgrade (now strict-tag with shape fallback) and the degenerate-config guard (silence-only fallback instead of UB-adjacent boost arithmetic). Reviewer comments classified as "perf only / out of scope / would only add a TODO" are intentionally not addressed in this commit -- see the plan file referenced in the PR description. src/parakeet_sortformer.cpp -- `compress_speaker_cache` - Early-return when `spkcache_len_per_spk <= 0` (`num_spks * A_sil >= spkcache_len`). The downstream boost/top-K stages are mostly defended (`boost_topk_scores` already returns early on non-positive k), but the function was otherwise running a no-op pass that produced an all-silence cache via the slow path. Fall back to an explicit silence-only profile and bail. - Renamed `streaming_update`'s `chunk_pre_encode_lc` parameter to `committed_chunk_pre_encode`. The call site already advances past the left context (`chunk_pre_committed = ... + lc * D`), so the old `_lc` suffix was misleading. `int lc` stays -- it's used inside the function to index into `preds_full`, which still contains the left-context preds. - Replaced the magic `-1.0e30f` / `+1.0e30f` sentinels (4 sites) with named constants `k_score_neg_inf` / `k_score_pos_inf` backed by `std::numeric_limits<float>::{lowest,max}()`. Dropped the inline "-inf is UB with current FP flags" comments: IEEE 754 +/-inf is well-defined; the original concern (avoiding NaN-on-arithmetic) still holds because we only store and compare the sentinels. src/parakeet_engine.cpp - On the AOSC path, skip the `for (cur_full) remap_id(...)` loop and the `prev_chunk_full_segments = std::move(cur_full)` store: `compute_slot_remap_` is never consulted when `cache_active` is true (AOSC anchors slot identity through the speaker cache), so the work was dead. - Switched v2.1 detection from pure-shape to "prefer the converter's `parakeet.model_variant` GGUF tag; fall back to `(n_layers == 17, n_mels == 128)` for legacy GGUFs". This prevents a future v2.2/v3 variant that happens to share v2.1's encoder shape from silently opting into AOSC. include/parakeet/diarization.h - Moved the v1-vs-v2.1 detection rationale comment out of parakeet_engine.cpp and into the `SortformerStreamingOptions:: spkcache_enable` block, with a paragraph on the tag-first / shape-fallback policy. src/parakeet_ctc.{h,cpp} - Added `std::string ParakeetCtcModel::model_variant` (optional GGUF metadata mirror; empty on legacy GGUFs). - Loader reads `parakeet.model_variant` next to the existing `parakeet.model.type` read; absent key -> empty string -> detection falls back to shape. scripts/convert-nemo-to-gguf.py - New `detect_sortformer_variant(ckpt: Path)` derives a stable variant tag from the source .nemo filename (`sortformer-v1` / `sortformer-streaming-v2` / `sortformer-streaming-v2.1-aosc`); empty string for unknown checkpoints. - Sortformer branch of `write_gguf` writes `parakeet.model_variant` when the tag is non-empty. - `write_gguf` signature extended with `ckpt: Path`; only the one internal call site adjusted. scripts/download-all-models.sh - Added the diar_streaming_sortformer_4spk-v2.1 fetch block (the AOSC fine-tune that this PR's tests target); bumped the budget comment from "~14 GiB" to "~14.5 GiB" and listed v2.1 in the contents line. CMakeLists.txt + test/test_sortformer_streaming.cpp - Streaming ctest now consumes `${_qvp_sfsv21_q8_gguf}` (was `${_qvp_sfs_q8_gguf}`, the v2 model). The in-binary default GGUF path is the matching v2.1 q8_0. Aligns the test with the line-299 comment that says the binary "reflects the production v2.1 AOSC config out of the box". test/test_utils.h (new) + test/test_sortformer_{streaming,aosc_speakers}.cpp - Extracted the two 40-line `load_wav_pcm16le_mono` / `file_exists` duplicates into a shared inline header in the `parakeet_test` namespace. The duplicate copies and the "duplicated here on purpose" comment block in test_sortformer_aosc_speakers.cpp are gone; both tests `#include "test_utils.h"` and use `using parakeet_test::...`. Build + ctest verification - `cmake --build build -j` clean (no new warnings). - `ctest -R 'test-sortformer-(streaming |aosc-speakers)'`: test-sortformer-streaming ........ Passed 8.23 s test-sortformer-aosc-speakers-abcba . Passed 33.80 s test-sortformer-aosc-speakers-abcdba Passed 36.91 s The locally-symlinked v2.1 GGUF predates the `parakeet.model_variant` key, so the AOSC tests passing here also verifies the shape-fallback path. Re-running the converter on the v2.1 .nemo will populate the new key for the strict-tag path. Reviewer comments deferred / skipped (rationale): - Encoder graph cache thrashing during FIFO ramp-up (#4): perf only; proper fix wants pre-build-at-diarize_start + silence padding or a mask argument, not minimal. Tracked for a follow-up perf PR. - WAV fixtures committed as ~11 MB binaries (#8): project-wide Git LFS adoption decision, not a code change. - `ring.erase` O(n) under AOSC's aggressive trim (#10): pre-existing on the v1 path; wants a std::deque refactor, out of scope. - `encoder_ms` attribution surprising (#12): code is correct and matches sibling paths; the user explicitly opted against comment-only "clarifications".
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds two changes to whisper.cpp to support brain-computer interface (BCI) neural signal transcription. Based on v1.8.4.1.
1. Variable conv1 kernel size
n_audio_conv1_kernelfrom model hparams (defaults to 3 for standard whisper models)2. Windowed self-attention for encoder layers
n_audio_window_sizeandn_audio_last_window_layerhparamslast_window_layerBackward compatibility
Both changes are backward-compatible:
n_audio_conv1_kerneldefaults to3(standard whisper behavior)n_audio_window_sizedefaults to0andn_audio_last_window_layerdefaults to-1, which disables windowed attention entirelyContext
Required by the new @qvac/bci-whispercpp addon: tetherto/qvac#1583
Test plan
Test results
BCI package (
@qvac/bci-whispercpp)Standard whisper package (
@qvac/transcription-whispercpp)