Skip to content

[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)#10

Merged
sharmaraju352 merged 10 commits into
masterfrom
feat/bci-patches-v184
Apr 20, 2026
Merged

[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)#10
sharmaraju352 merged 10 commits into
masterfrom
feat/bci-patches-v184

Conversation

@sharmaraju352

@sharmaraju352 sharmaraju352 commented Apr 16, 2026

Copy link
Copy Markdown

Summary

Adds two changes to whisper.cpp to support brain-computer interface (BCI) neural signal transcription. Based on v1.8.4.1.

1. Variable conv1 kernel size

  • Reads n_audio_conv1_kernel from model hparams (defaults to 3 for standard whisper models)
  • Allows BCI models to use a different first convolution kernel size

2. Windowed self-attention for encoder layers

  • Adds n_audio_window_size and n_audio_last_window_layer hparams
  • When present, encoder self-attention is restricted to a local window for layers up to last_window_layer
  • When windowed attention is active, the encoder bypasses flash attention and uses the standard softmax path (Metal flash attention does not support custom F32 masks)
  • Flash attention remains enabled for non-BCI models and for the decoder
  • Adds proper SOS token (language + transcribe) initialization for BCI models

Backward compatibility

Both changes are backward-compatible:

  • n_audio_conv1_kernel defaults to 3 (standard whisper behavior)
  • n_audio_window_size defaults to 0 and n_audio_last_window_layer defaults to -1, which disables windowed attention entirely
  • Standard whisper models are completely unaffected

Context

Required by the new @qvac/bci-whispercpp addon: tetherto/qvac#1583

Test plan

  • Standard whisper transcription still works (no regression)
  • BCI model loads and transcribes neural signals correctly on v1.8.4.1
  • Verified locally: 10.4% average WER across 5 BCI test samples (identical to v1.7.6 baseline)

Test results

BCI package (@qvac/bci-whispercpp)

  • 4/4 integration tests pass (9/9 assertions)
  • 10.4% average WER across 5 neural signal samples — identical output to v1.7.6 baseline

Standard whisper package (@qvac/transcription-whispercpp)

  • Integration: 10/10 tests pass (40/40 assertions)
  • Unit: 13/13 tests pass (57/57 assertions)
  • No regression in standard audio transcription (s16le + f32le formats verified)

@sharmaraju352 sharmaraju352 requested review from a team as code owners April 16, 2026 10:22
Raju added 5 commits April 16, 2026 19:57
Read n_audio_conv1_kernel from model hparams to allow BCI models
to use a non-standard first convolution kernel size. Standard
whisper models default to kernel size 3.

Made-with: Cursor
- Add n_audio_window_size and n_audio_last_window_layer hparams
- When present, encoder self-attention is restricted to a local window
  for layers up to last_window_layer
- Bypass flash attention when windowed mask is active (Metal FA does
  not support custom F32 masks); flash attention remains enabled for
  non-BCI models and for the decoder
- Populate window_mask data on the encoder graph (not the cross graph)
- Add proper SOS token (language + transcribe) initialization for BCI
  models

Backward-compatible: n_audio_window_size defaults to 0 and
n_audio_last_window_layer defaults to -1, disabling windowed
attention entirely for standard whisper models.

Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
@sharmaraju352 sharmaraju352 force-pushed the feat/bci-patches-v184 branch from 5326bf7 to bbb3535 Compare April 16, 2026 14:28
Comment thread src/whisper.cpp Outdated
Comment thread src/whisper.cpp Outdated
Comment thread src/whisper.cpp Outdated
Raju added 2 commits April 18, 2026 10:46
Address review feedback:

1. Guard read_safe for BCI-specific hparams (n_audio_conv1_kernel,
   n_audio_window_size, n_audio_last_window_layer) behind a
   n_mels > 256 check. Standard whisper models have n_mels <= 128
   and do not contain these fields — reading them unconditionally
   would corrupt the file position and break model loading.

2. Add explicit is_bci flag to hparams struct, set when BCI fields
   are detected during loading.

3. Use is_bci flag (instead of n_audio_window_size > 0) to guard
   the BCI-specific decoder SOS token initialization.

4. Log BCI-specific hparams when a BCI model is detected.

Made-with: Cursor
The windowed attention mask values depend only on n_ctx and
window_size, both fixed after model load. Move the O(n_ctx^2)
computation from whisper_encode_internal (called every encode)
to whisper_init_state (called once). The encode path now just
copies the precomputed data to the graph tensor.

Made-with: Cursor

@ogad-tether ogad-tether left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the earlier review notes. I found three follow-ups that still look blocking before this is safe to merge:

  • window_mask_data is precomputed for the full model n_audio_ctx, but the encoder graph uses exp_n_audio_ctx when callers set params.audio_ctx. That makes the later ggml_backend_tensor_set() size mismatch and can overflow the smaller graph tensor.
  • The install/include path change to include/whisper was not carried through to whisper.pc, which still emits -I${prefix}/include, so installed pkg-config consumers will stop finding whisper.h.
  • The exported CMake target still links Threads::Threads, but whisper-config.cmake no longer resolves that dependency on Windows, so find_package(whisper) can import a target with an undefined dependency there.

Could you follow up on those before merging?

Comment thread src/whisper.cpp
Comment thread src/CMakeLists.txt
Comment thread cmake/whisper-config.cmake.in Outdated
Raju added 2 commits April 20, 2026 14:52
…, Threads

1. Fix window_mask_data / exp_n_audio_ctx mismatch: the precomputed
   mask uses hparams.n_audio_ctx, but the graph tensor is sized from
   exp_n_audio_ctx when params.audio_ctx is overridden. Now falls back
   to recomputing the mask at the effective n_ctx when sizes differ,
   preventing a buffer overflow into the smaller tensor.

2. Update whisper.pc.in: the install interface was changed to
   include/whisper but the pkg-config includedir still pointed to
   include/. Consumers using pkg-config would not find whisper.h.

3. Fix whisper-config.cmake.in: the whisper target publicly links
   Threads::Threads but find_dependency(Threads) was skipped on
   Windows, leaving downstream find_package(whisper) with an
   unresolved imported target. Now always resolve Threads.
…ash attention

1. Cache fallback mask recompute: when exp_n_audio_ctx overrides the
   default n_audio_ctx, the window mask is now recomputed once and
   cached in wstate (keyed on window_mask_n_ctx) instead of allocating
   a new std::vector on every whisper_encode_internal call.

2. Per-layer flash attention: layers above last_window_layer no longer
   need the windowed attention mask. The flash attention path is now
   used for those layers even when BCI windowed attention is active,
   instead of globally falling back to the softmax path for the entire
   encoder.

3. Use std::abs instead of C abs in both init-time and encode-time
   mask computation paths.
@ogad-tether ogad-tether self-requested a review April 20, 2026 10:21
Comment thread src/whisper.cpp
Comment thread src/whisper.cpp Outdated
Comment thread src/whisper.cpp
…alidation

1. Extract compute_window_mask() helper on whisper_state to eliminate
   the duplicated O(n_ctx^2) mask fill loop that appeared in both
   whisper_init_state and whisper_encode_internal. Both call sites
   now use the single helper, preventing future drift.

2. Guard the encode-time mask block with hparams.is_bci before doing
   the ggml_graph_get_tensor lookup. Cheaper and more explicit than
   relying on the tensor name string to determine whether BCI
   windowed attention is active.

3. Add hparams.is_bci to the graph builder guard for window_mask
   tensor creation, aligning it with the other BCI code paths.

4. Add validation for BCI hparams after reading from file:
   n_audio_conv1_kernel must be > 0, n_audio_window_size must be >= 0.
   Log an error and return false on invalid values instead of
   proceeding with garbage.

5. Add comment explaining the n_mels > 256 threshold used to
   discriminate BCI models from standard whisper models, and noting
   that a dedicated file-format marker should be introduced if this
   assumption ever breaks.

Made-with: Cursor
@sharmaraju352 sharmaraju352 merged commit 2b1e04f into master Apr 20, 2026
67 of 74 checks passed
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 14, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
… + text-encoder GPU bridge + pinned-host-buffer per-step inputs

Three independent wins bundled into one round, strict TDD on
each — new CPU-only unit test for every change, RED → impl →
GREEN → end-to-end validation on real hardware.

== tetherto#10 — Auto-pick UMA bias ==

Round 3's `argmax(free_vram)` picks UMA iGPUs on hybrid rigs
because UMA reports the entire system RAM (120+ GB) as free
VRAM, while a discrete RTX 5090 reports 32 GB.  Silent 40x
realtime regression for any operator following the help text
"auto-pick adapter with most free VRAM".

Extended `resolve_vulkan_device_index` with an optional third
arg:
  int resolve_vulkan_device_index(int requested,
                                  const std::vector<size_t> & free_vram_per_device,
                                  const std::vector<bool>   & is_uma_per_device = {});

Empty UMA list -> round-3 behaviour preserved verbatim.
Non-empty + at least one discrete -> argmax over the DISCRETE
subset.  All-UMA falls back to round-3 argmax.  Explicit
`requested >= 0` passthrough is UMA-agnostic.

Caller wiring (in `init_supertonic_backend`) collects UMA
flags via the public `ggml_backend_dev_get_props()` API on
`ggml_backend_vk_reg()` - sets `is_uma = true` for
`GGML_BACKEND_DEVICE_TYPE_IGPU` / `_CPU` / `_ACCEL`.

`test_supertonic_vulkan_device_select.cpp` extended with 6 new
test functions / 14 new checks covering the round-12 behaviour
matrix (empty-UMA-preserves-round3, hybrid-prefer-discrete,
multi-discrete-argmax-over-subset, all-UMA-falls-back, explicit-
index-ignores-UMA-bias, mismatched-length-throws).

== tetherto#6 — Text-encoder speech-prompted-attention GPU bridge ==

Master's Metal-port branch (PR tetherto#15) built
`speech_prompted_merged_cache` (one ggml graph for QKV projection
+ head-split + flash-attn + out-proj end-to-end on GPU) but
never wired its run path.  Production text-encoder stayed on
the pre-Phase-A4 two-cache pattern with host-side Q/V download
-> pack -> re-upload between the QKV cache and the flash-attn
cache.

Round 12 tetherto#6 adds `run_speech_prompted_merged_cache` and the
dispatch in `speech_prompted_attention_ggml`:

  if (!model_prefers_cpu_kernels(m)) {
      thread_local speech_prompted_merged_cache merged_caches[2];
      // rebuild on key change, then:
      run_speech_prompted_merged_cache(merged, m, x_lc, L, style_ttl, out_lc);
      return;
  }
  // ... legacy two-cache CPU path unchanged

Eliminates per call:
  - 2 GPU->host downloads (q_out, v_out)
  - 3 host->GPU uploads (q_pack, k_pack, v_pack)
  - 1 graph dispatch
  - All host pack work (q_pack / k_pack / v_pack head-split)
= 5 sync points x 2 layers = 10 sync points / synth at the
text encoder alone.

CPU stays on the legacy two-cache path: master's
`dense_matmul_time_ggml` CPU fast path uses cblas + the host-
side head-split is a free memcpy; switching CPU to merged
would pull the matmul through the slower ggml conv1d fallback
and gain nothing (no sync points exist on CPU).

`test_supertonic_text_encoder_gpu_bridge.cpp` (NEW) pins:
  - run_speech_prompted_merged_cache symbol via SFINAE
  - speech_prompted_merged_cache struct field contract
    (x_in, style_in, out, idx, L) via SFINAE
  - free-default-cache trip-wire (catches a buggy free path
    that segfaults on never-built `thread_local` cache slots
    at process exit)

6 / 6 CPU-only checks pass.  End-to-end equivalence vs. the
legacy two-cache path verified by the existing model-fixture
parity tests (`test-supertonic-text-encoder-trace`,
`test-supertonic-pipeline`).

== tetherto#5 — Pinned-host-buffer per-step input scratchpad ==

Round 3 shipped the capability probe
`supertonic_backend_supports_pinned_host_buffer`, which returns
`true` iff `ggml_backend_vk_host_buffer_type()` is non-null on
the resolved backend.  The actual per-engine input-scratchpad
refactor that USES the host-pinned buffer to skip ggml-vulkan's
internal staging-buffer hop was deferred.

Round 12 tetherto#5 lands the helper:

  ggml_backend_buffer_t try_alloc_inputs_in_pinned_host_buffer(
      const supertonic_model & model,
      ggml_context * input_ctx);

Returns nullptr on null model.backend / null input_ctx / non-
Vulkan backend / API miss.  Otherwise allocates the entire
input_ctx tensor set from `ggml_backend_vk_host_buffer_type()`
via `ggml_backend_alloc_ctx_tensors_from_buft`.  Caller owns
the returned buffer; frees at cache destruction via
`ggml_backend_buffer_free`.

Applied via a dual-context allocation pattern at the two
highest-frequency per-step input sites:

  - vector_group_graph_cache (x 3 for g1/g2/g3): x_in + temb_in
  - ve_front_block_graph_cache: x_in + mask_in + t_emb_in

Total: 9 per-step input tensors moved to host-pinned memory.
Each `ggml_backend_tensor_set` on these tensors skips one
internal staging-buffer hop on Vulkan (BAR-mapped GPU memory
written directly by the host without an intermediate copy).

Dual-context pattern:
  1. Create input_ctx (no_alloc=true) with ~8 tensor-overhead slots
  2. Create x_in / temb_in / etc. in input_ctx
  3. Try host-pinned alloc; fall back to default backend buffer
     via `ggml_backend_alloc_ctx_tensors(input_ctx, backend)`
  4. Build the rest of the graph in cache.ctx; gallocr handles
     intermediates + outputs, skipping the pre-allocated inputs
     via the `tensor->buffer != nullptr` check
  Free order: gallocr -> main ctx -> input_buf -> input_ctx
  (reversed order would dangle gallocr pointers into freed
  input tensor metadata)

CPU / Metal / OpenCL safety: helper returns nullptr; callers
fall back to default backend buffer.  Identical CPU behaviour
to pre-round-12; only Vulkan gains.

`test_supertonic_pinned_host_buffer.cpp` (NEW) pins:
  - Helper symbol existence (SFINAE)
  - nullptr return on CPU backend (idempotent across repeats)
  - Null-pointer safety on null model.backend / null input_ctx

11 / 11 CPU-only checks pass.

== Combined perf snapshot on RTX 5090 ==

Long-prompt bench (173 chars, ~15s of audio):
  Round 11 baseline:        76.11 ms / 5 steps  (123x realtime)
  Round 12 (all three):     27.99 ms / 5 steps  (537x realtime)
                            ^ 2.7x faster
  Vector estimator step:    12.7 ms -> 3.28 ms  (3.9x faster)
  Prewarm cold-start:       330 ms -> 21 ms     (15x faster)

Short-prompt bench (Hello-world class, ~3s audio):
  Round 11 baseline:        44.08 ms (74x realtime)
  Round 12:                 23.31 ms (394x realtime)

Auto-pick on hybrid rig (RTX 5090 + AMD RADV iGPU):
  Round 11 `--vulkan-device -1`:  picks RADV -> 178 ms (7x realtime)
  Round 12 `--vulkan-device -1`:  picks RTX 5090 -> 28 ms (537x realtime)
                                  ^ 6.4x faster for users following help text

== Test plan ==

CPU build:
  cmake -S tts-cpp -B tts-cpp/build -DTTS_CPP_USE_SYSTEM_GGML=OFF
  cmake --build tts-cpp/build -j
  ctest --test-dir tts-cpp/build -L unit
  -> 24 / 24 PASS, 0 regressions (was 22 / 22 in round 11; +1 text-
     encoder-gpu-bridge, +1 pinned-host-buffer)

Vulkan build:
  cmake -S tts-cpp -B tts-cpp/build-vulkan -DTTS_CPP_USE_SYSTEM_GGML=OFF -DGGML_VULKAN=ON
  cmake --build tts-cpp/build-vulkan -j
  ctest --test-dir tts-cpp/build-vulkan -L unit
  -> 24 / 24 PASS

End-to-end synth verified on all 4 backends (CPU, Vulkan RTX
5090, Vulkan RADV iGPU, Vulkan Mesa lavapipe) - every adapter
writes a valid WAV.

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 pushed a commit that referenced this pull request May 19, 2026
Resolves the review comments on the merged AOSC v2.1 PR
(#22, merge commit e6ba38c). All
eight changes are minimal and behaviour-preserving except the v2.1
detection upgrade (now strict-tag with shape fallback) and the
degenerate-config guard (silence-only fallback instead of UB-adjacent
boost arithmetic). Reviewer comments classified as "perf only / out
of scope / would only add a TODO" are intentionally not addressed in
this commit -- see the plan file referenced in the PR description.

src/parakeet_sortformer.cpp -- `compress_speaker_cache`
  - Early-return when `spkcache_len_per_spk <= 0`
    (`num_spks * A_sil >= spkcache_len`). The downstream boost/top-K
    stages are mostly defended (`boost_topk_scores` already returns
    early on non-positive k), but the function was otherwise running
    a no-op pass that produced an all-silence cache via the slow
    path. Fall back to an explicit silence-only profile and bail.
  - Renamed `streaming_update`'s `chunk_pre_encode_lc` parameter to
    `committed_chunk_pre_encode`. The call site already advances
    past the left context (`chunk_pre_committed = ... + lc * D`),
    so the old `_lc` suffix was misleading. `int lc` stays -- it's
    used inside the function to index into `preds_full`, which
    still contains the left-context preds.
  - Replaced the magic `-1.0e30f` / `+1.0e30f` sentinels (4 sites)
    with named constants `k_score_neg_inf` / `k_score_pos_inf`
    backed by `std::numeric_limits<float>::{lowest,max}()`. Dropped
    the inline "-inf is UB with current FP flags" comments: IEEE
    754 +/-inf is well-defined; the original concern (avoiding
    NaN-on-arithmetic) still holds because we only store and
    compare the sentinels.

src/parakeet_engine.cpp
  - On the AOSC path, skip the `for (cur_full) remap_id(...)` loop
    and the `prev_chunk_full_segments = std::move(cur_full)` store:
    `compute_slot_remap_` is never consulted when `cache_active` is
    true (AOSC anchors slot identity through the speaker cache), so
    the work was dead.
  - Switched v2.1 detection from pure-shape to "prefer the
    converter's `parakeet.model_variant` GGUF tag; fall back to
    `(n_layers == 17, n_mels == 128)` for legacy GGUFs". This
    prevents a future v2.2/v3 variant that happens to share v2.1's
    encoder shape from silently opting into AOSC.

include/parakeet/diarization.h
  - Moved the v1-vs-v2.1 detection rationale comment out of
    parakeet_engine.cpp and into the `SortformerStreamingOptions::
    spkcache_enable` block, with a paragraph on the tag-first /
    shape-fallback policy.

src/parakeet_ctc.{h,cpp}
  - Added `std::string ParakeetCtcModel::model_variant` (optional
    GGUF metadata mirror; empty on legacy GGUFs).
  - Loader reads `parakeet.model_variant` next to the existing
    `parakeet.model.type` read; absent key -> empty string ->
    detection falls back to shape.

scripts/convert-nemo-to-gguf.py
  - New `detect_sortformer_variant(ckpt: Path)` derives a stable
    variant tag from the source .nemo filename
    (`sortformer-v1` / `sortformer-streaming-v2` /
    `sortformer-streaming-v2.1-aosc`); empty string for unknown
    checkpoints.
  - Sortformer branch of `write_gguf` writes
    `parakeet.model_variant` when the tag is non-empty.
  - `write_gguf` signature extended with `ckpt: Path`; only the
    one internal call site adjusted.

scripts/download-all-models.sh
  - Added the diar_streaming_sortformer_4spk-v2.1 fetch block (the
    AOSC fine-tune that this PR's tests target); bumped the budget
    comment from "~14 GiB" to "~14.5 GiB" and listed v2.1 in the
    contents line.

CMakeLists.txt + test/test_sortformer_streaming.cpp
  - Streaming ctest now consumes `${_qvp_sfsv21_q8_gguf}` (was
    `${_qvp_sfs_q8_gguf}`, the v2 model). The in-binary default
    GGUF path is the matching v2.1 q8_0. Aligns the test with the
    line-299 comment that says the binary "reflects the production
    v2.1 AOSC config out of the box".

test/test_utils.h (new) + test/test_sortformer_{streaming,aosc_speakers}.cpp
  - Extracted the two 40-line `load_wav_pcm16le_mono` / `file_exists`
    duplicates into a shared inline header in the `parakeet_test`
    namespace. The duplicate copies and the "duplicated here on
    purpose" comment block in test_sortformer_aosc_speakers.cpp
    are gone; both tests `#include "test_utils.h"` and use
    `using parakeet_test::...`.

Build + ctest verification
  - `cmake --build build -j` clean (no new warnings).
  - `ctest -R 'test-sortformer-(streaming |aosc-speakers)'`:
      test-sortformer-streaming ........  Passed   8.23 s
      test-sortformer-aosc-speakers-abcba . Passed  33.80 s
      test-sortformer-aosc-speakers-abcdba  Passed  36.91 s
    The locally-symlinked v2.1 GGUF predates the `parakeet.model_variant`
    key, so the AOSC tests passing here also verifies the shape-fallback
    path. Re-running the converter on the v2.1 .nemo will populate
    the new key for the strict-tag path.

Reviewer comments deferred / skipped (rationale):
  - Encoder graph cache thrashing during FIFO ramp-up (#4): perf
    only; proper fix wants pre-build-at-diarize_start + silence
    padding or a mask argument, not minimal. Tracked for a follow-up
    perf PR.
  - WAV fixtures committed as ~11 MB binaries (#8): project-wide
    Git LFS adoption decision, not a code change.
  - `ring.erase` O(n) under AOSC's aggressive trim (#10): pre-existing
    on the v1 path; wants a std::deque refactor, out of scope.
  - `encoder_ms` attribution surprising (#12): code is correct and
    matches sibling paths; the user explicitly opted against
    comment-only "clarifications".
gianni-cor pushed a commit that referenced this pull request May 28, 2026
[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)
gianni-cor pushed a commit that referenced this pull request May 28, 2026
Resolves the review comments on the merged AOSC v2.1 PR
(#22, merge commit e6ba38c). All
eight changes are minimal and behaviour-preserving except the v2.1
detection upgrade (now strict-tag with shape fallback) and the
degenerate-config guard (silence-only fallback instead of UB-adjacent
boost arithmetic). Reviewer comments classified as "perf only / out
of scope / would only add a TODO" are intentionally not addressed in
this commit -- see the plan file referenced in the PR description.

src/parakeet_sortformer.cpp -- `compress_speaker_cache`
  - Early-return when `spkcache_len_per_spk <= 0`
    (`num_spks * A_sil >= spkcache_len`). The downstream boost/top-K
    stages are mostly defended (`boost_topk_scores` already returns
    early on non-positive k), but the function was otherwise running
    a no-op pass that produced an all-silence cache via the slow
    path. Fall back to an explicit silence-only profile and bail.
  - Renamed `streaming_update`'s `chunk_pre_encode_lc` parameter to
    `committed_chunk_pre_encode`. The call site already advances
    past the left context (`chunk_pre_committed = ... + lc * D`),
    so the old `_lc` suffix was misleading. `int lc` stays -- it's
    used inside the function to index into `preds_full`, which
    still contains the left-context preds.
  - Replaced the magic `-1.0e30f` / `+1.0e30f` sentinels (4 sites)
    with named constants `k_score_neg_inf` / `k_score_pos_inf`
    backed by `std::numeric_limits<float>::{lowest,max}()`. Dropped
    the inline "-inf is UB with current FP flags" comments: IEEE
    754 +/-inf is well-defined; the original concern (avoiding
    NaN-on-arithmetic) still holds because we only store and
    compare the sentinels.

src/parakeet_engine.cpp
  - On the AOSC path, skip the `for (cur_full) remap_id(...)` loop
    and the `prev_chunk_full_segments = std::move(cur_full)` store:
    `compute_slot_remap_` is never consulted when `cache_active` is
    true (AOSC anchors slot identity through the speaker cache), so
    the work was dead.
  - Switched v2.1 detection from pure-shape to "prefer the
    converter's `parakeet.model_variant` GGUF tag; fall back to
    `(n_layers == 17, n_mels == 128)` for legacy GGUFs". This
    prevents a future v2.2/v3 variant that happens to share v2.1's
    encoder shape from silently opting into AOSC.

include/parakeet/diarization.h
  - Moved the v1-vs-v2.1 detection rationale comment out of
    parakeet_engine.cpp and into the `SortformerStreamingOptions::
    spkcache_enable` block, with a paragraph on the tag-first /
    shape-fallback policy.

src/parakeet_ctc.{h,cpp}
  - Added `std::string ParakeetCtcModel::model_variant` (optional
    GGUF metadata mirror; empty on legacy GGUFs).
  - Loader reads `parakeet.model_variant` next to the existing
    `parakeet.model.type` read; absent key -> empty string ->
    detection falls back to shape.

scripts/convert-nemo-to-gguf.py
  - New `detect_sortformer_variant(ckpt: Path)` derives a stable
    variant tag from the source .nemo filename
    (`sortformer-v1` / `sortformer-streaming-v2` /
    `sortformer-streaming-v2.1-aosc`); empty string for unknown
    checkpoints.
  - Sortformer branch of `write_gguf` writes
    `parakeet.model_variant` when the tag is non-empty.
  - `write_gguf` signature extended with `ckpt: Path`; only the
    one internal call site adjusted.

scripts/download-all-models.sh
  - Added the diar_streaming_sortformer_4spk-v2.1 fetch block (the
    AOSC fine-tune that this PR's tests target); bumped the budget
    comment from "~14 GiB" to "~14.5 GiB" and listed v2.1 in the
    contents line.

CMakeLists.txt + test/test_sortformer_streaming.cpp
  - Streaming ctest now consumes `${_qvp_sfsv21_q8_gguf}` (was
    `${_qvp_sfs_q8_gguf}`, the v2 model). The in-binary default
    GGUF path is the matching v2.1 q8_0. Aligns the test with the
    line-299 comment that says the binary "reflects the production
    v2.1 AOSC config out of the box".

test/test_utils.h (new) + test/test_sortformer_{streaming,aosc_speakers}.cpp
  - Extracted the two 40-line `load_wav_pcm16le_mono` / `file_exists`
    duplicates into a shared inline header in the `parakeet_test`
    namespace. The duplicate copies and the "duplicated here on
    purpose" comment block in test_sortformer_aosc_speakers.cpp
    are gone; both tests `#include "test_utils.h"` and use
    `using parakeet_test::...`.

Build + ctest verification
  - `cmake --build build -j` clean (no new warnings).
  - `ctest -R 'test-sortformer-(streaming |aosc-speakers)'`:
      test-sortformer-streaming ........  Passed   8.23 s
      test-sortformer-aosc-speakers-abcba . Passed  33.80 s
      test-sortformer-aosc-speakers-abcdba  Passed  36.91 s
    The locally-symlinked v2.1 GGUF predates the `parakeet.model_variant`
    key, so the AOSC tests passing here also verifies the shape-fallback
    path. Re-running the converter on the v2.1 .nemo will populate
    the new key for the strict-tag path.

Reviewer comments deferred / skipped (rationale):
  - Encoder graph cache thrashing during FIFO ramp-up (#4): perf
    only; proper fix wants pre-build-at-diarize_start + silence
    padding or a mask argument, not minimal. Tracked for a follow-up
    perf PR.
  - WAV fixtures committed as ~11 MB binaries (#8): project-wide
    Git LFS adoption decision, not a code change.
  - `ring.erase` O(n) under AOSC's aggressive trim (#10): pre-existing
    on the v1 path; wants a std::deque refactor, out of scope.
  - `encoder_ms` attribution surprising (#12): code is correct and
    matches sibling paths; the user explicitly opted against
    comment-only "clarifications".
@gianni-cor gianni-cor deleted the feat/bci-patches-v184 branch May 28, 2026 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants