
vad : add streaming detect + explicit state reset #3677

Merged
danbev merged 1 commit into ggml-org:master from danielbodart:streaming-vad-state-upstream on Apr 17, 2026
@danielbodart danielbodart commented Feb 23, 2026

Contributor

Summary

  • Add whisper_vad_detect_speech_no_reset() — identical to whisper_vad_detect_speech but does not reset LSTM hidden/cell state, enabling temporal continuity when calling per-chunk in a streaming loop
  • Add whisper_vad_reset_state() — explicit state reset for use between utterances
  • Refactor whisper_vad_detect_speech as a thin wrapper (reset + no_reset) — zero behavior change for existing callers

Motivation

whisper_vad_detect_speech calls ggml_backend_buffer_clear(vctx->buffer, 0) on every invocation, which resets the Silero LSTM hidden/cell states. This is correct for batch processing (the current use case), but prevents temporal continuity when calling per-chunk in a streaming loop — the LSTM effectively degrades to a feedforward classifier with no memory between chunks.

For streaming applications that call VAD once per chunk (e.g. 512 samples at 16kHz = 32ms), the model needs to carry state across calls to make use of its recurrent architecture.
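The chunk arithmetic and loop shape described above can be sketched in C. This is a minimal sketch, not code from this PR: `chunk_duration_ms`, `chunk_fn`, `for_each_chunk`, and `count_chunk` are hypothetical helpers, and leaving a trailing partial chunk unconsumed is an assumption about how a caller might handle it.

```c
#include <assert.h>
#include <stddef.h>

/* Duration of one chunk in milliseconds, e.g. 512 * 1000 / 16000 = 32. */
int chunk_duration_ms(int n_samples, int sample_rate_hz) {
    assert(sample_rate_hz > 0);
    return n_samples * 1000 / sample_rate_hz;
}

/* Hypothetical per-chunk callback. In a real streaming loop the body would
 * call whisper_vad_detect_speech_no_reset(vctx, chunk, n_samples) so that
 * the LSTM state carries over from one chunk to the next. */
typedef void (*chunk_fn)(const float * chunk, int n_samples, void * user);

/* Deliver every full chunk in order and return how many were delivered.
 * Assumption: a trailing partial chunk is left for the caller to buffer. */
int for_each_chunk(const float * samples, int n_samples, int chunk_size,
                   chunk_fn fn, void * user) {
    assert(chunk_size > 0);
    int n_chunks = 0;
    for (int off = 0; off + chunk_size <= n_samples; off += chunk_size) {
        fn(samples + off, chunk_size, user);
        n_chunks++;
    }
    return n_chunks;
}

/* Example callback: counts invocations through the user pointer. */
void count_chunk(const float * chunk, int n_samples, void * user) {
    (void) chunk;
    (void) n_samples;
    (*(int *) user)++;
}
```

With 1600 buffered samples and a chunk size of 512, `for_each_chunk` delivers three full chunks and leaves 64 samples for the next call.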

Changes

Two new public API functions following existing naming conventions:

// Like whisper_vad_detect_speech, but does not reset LSTM state.
// Use for streaming: call whisper_vad_reset_state() between utterances.
WHISPER_API bool whisper_vad_detect_speech_no_reset(
        struct whisper_vad_context * vctx,
        const float * samples,
        int   n_samples);

// Reset LSTM hidden/cell states to zero.
WHISPER_API void whisper_vad_reset_state(struct whisper_vad_context * vctx);

whisper_vad_detect_speech is now reset + no_reset — existing callers (including whisper_vad_segments_from_samples, test-vad.cpp, examples/speech.cpp) are completely unaffected.
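The "reset + no_reset" relationship can be illustrated with a toy stand-in. Everything named `toy_vad_*` below is hypothetical and is not whisper.cpp code: a single float plays the role of the LSTM hidden/cell state, the update rule and threshold are invented, and the toy returns its own speech decision, which says nothing about what the real functions return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy context: one float stands in for the recurrent state. */
struct toy_vad_context {
    float h;
};

/* Toy counterpart of whisper_vad_reset_state: zero the recurrent state. */
void toy_vad_reset_state(struct toy_vad_context * ctx) {
    ctx->h = 0.0f;
}

/* Toy counterpart of whisper_vad_detect_speech_no_reset: fold the chunk's
 * mean absolute amplitude into the existing state without clearing it. */
bool toy_vad_detect_speech_no_reset(struct toy_vad_context * ctx,
                                    const float * samples, int n_samples) {
    assert(n_samples > 0);
    float sum = 0.0f;
    for (int i = 0; i < n_samples; i++) {
        sum += samples[i] < 0.0f ? -samples[i] : samples[i];
    }
    ctx->h = 0.5f * ctx->h + sum / (float) n_samples;
    return ctx->h > 0.5f; /* invented threshold */
}

/* Toy counterpart of whisper_vad_detect_speech: reset, then delegate. */
bool toy_vad_detect_speech(struct toy_vad_context * ctx,
                           const float * samples, int n_samples) {
    toy_vad_reset_state(ctx);
    return toy_vad_detect_speech_no_reset(ctx, samples, n_samples);
}
```

In this toy, a quiet chunk that follows a loud one is still flagged when fed through the no-reset variant, because the state from the loud chunk is carried over. The resetting wrapper judges the same quiet chunk in isolation and does not flag it.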

whisper_vad_detect_speech resets LSTM state on every call, which is
correct for batch processing but prevents temporal continuity when
calling per-chunk in a streaming loop.

Add whisper_vad_detect_speech_no_reset (skips buffer clear) and
whisper_vad_reset_state (explicit clear between utterances).
Existing whisper_vad_detect_speech is now a thin wrapper — zero
behavior change for current callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@marek-hradil


This seems like exactly what I need! Any updates on this?

@marek-hradil


@ggerganov Hey, sorry, I don't want to be annoying, but could you have a quick look? This seems like a small change that has been open since February, and it would make the VAD in whisper more useful.

@KitaitiMakoto

Contributor

danbev, who implemented the VAD feature, reacted to this pull request with the eyes emoji, so he must be aware of it. I guess he's just busy.

@danbev danbev merged commit 166c20b into ggml-org:master Apr 17, 2026
bygreencn added a commit to bygreencn/whisper.cpp that referenced this pull request Apr 28, 2026
* ggerganov/master: (162 commits)
  bench : sync submit-results URL to ggml-org (ggml-org#3769)
  whisper : add stateless VAD detect + explicit state reset for streaming (ggml-org#3677)
  sync : ggml
  vulkan: add noncontiguous GLU support (llama/21081)
  hexagon: support for IQ4_NL and MXFP4 (llama/21018)
  rpc : proper handling of data pointers to CPU buffers (llama/21030)
  metal : Fix dimension constraint violation in matmul2d descriptor (llama/21048)
  hip: use fnuz fp8 for conversion on CDNA3 (llama/21040)
  opencl: allow large buffer for adreno (llama/20997)
  fix(ggml): correct RISC-V ISA string canonical ordering for RVV in CMake (llama/20888)
  ggml-cuda: Add NVFP4 dp4a kernel (llama/20644)
  CUDA & CPU: support F32 kernel type for `CONV_TRANSPOSE_2D` (llama/17094)
  mtmd: Add DeepSeekOCR Support (llama/17400)
  llama: fix llama-model-saver (llama/20503)
  sycl : fix wrong variable check by assert (llama/20903)
  metal : add FLOOR, CEIL, ROUND, TRUNC unary ops (llama/20930)
  metal : add FA instantiations for HSK=512, HSV=512 (llama/20902)
  hexagon: general DMA and Binary Op fixes for large strides (llama/20918)
  opencl: add q6_K gemm and gemv kernels for Adreno (llama/20089)
  rpc : RCE patch (llama/20908)
  ...
GustavoA1604 pushed a commit to tetherto/qvac-ext-lib-whisper.cpp that referenced this pull request May 20, 2026


Upstream ggml-org/whisper.cpp PR ggml-org#3677 added the streaming VAD entry
points but shipped no test. Lock the public contract on the tetherto
fork so regressions surface immediately:

  - whisper_vad_detect_speech idempotent (reset is implicit)
  - whisper_vad_reset_state restores LSTM state exactly
  - detect_speech == reset_state + detect_speech_no_reset
  - detect_speech_no_reset on contiguous halves == single-shot
    detect_speech (state carries across no-reset call boundary)

Splits at a 512-sample boundary (Silero v6.2.0 window size) so no
mid-stream zero padding is introduced. Uses the bundled silero VAD
model and samples/jfk.wav; no whisper transcribe model needed.

QVAC-18991

Co-authored-by: Cursor <cursoragent@cursor.com>
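
The last property in the list above (two contiguous no-reset calls match a single-shot pass) can be sketched against a toy recurrence. This is not the fork's test and does not touch whisper.cpp: `toy_state`, `toy_reset`, `toy_feed_no_reset`, and the two checker functions are hypothetical, and the per-sample update rule is invented purely to make the state dependence visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a recurrent detector: one float of state. */
struct toy_state {
    float h;
};

void toy_reset(struct toy_state * s) {
    s->h = 0.0f;
}

/* Per-sample recurrence; returns the state after the last sample. */
float toy_feed_no_reset(struct toy_state * s, const float * x, int n) {
    for (int i = 0; i < n; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        s->h = 0.5f * s->h + a;
    }
    return s->h;
}

/* One pass over x[0..n) versus two no-reset passes split at `split`. */
bool split_matches_single_shot(const float * x, int n, int split) {
    assert(split >= 0 && split <= n);
    struct toy_state a, b;
    toy_reset(&a);
    float single = toy_feed_no_reset(&a, x, n);
    toy_reset(&b);
    toy_feed_no_reset(&b, x, split);
    float halves = toy_feed_no_reset(&b, x + split, n - split);
    return single == halves;
}

/* Same split, but with a reset at the boundary, which is what calling a
 * resetting entry point once per half would amount to. */
bool split_with_reset_matches_single_shot(const float * x, int n, int split) {
    assert(split >= 0 && split <= n);
    struct toy_state a, b;
    toy_reset(&a);
    float single = toy_feed_no_reset(&a, x, n);
    toy_reset(&b);
    toy_feed_no_reset(&b, x, split);
    toy_reset(&b);
    float halves = toy_feed_no_reset(&b, x + split, n - split);
    return single == halves;
}
```

For the toy, the no-reset split reproduces the single-shot state exactly, while a reset at the boundary discards the contribution of the first half.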
gianni-cor pushed a commit to tetherto/qvac-ext-lib-whisper.cpp that referenced this pull request May 28, 2026

