vad : add streaming detect + explicit state reset by danielbodart · Pull Request #3677 · ggml-org/whisper.cpp

danielbodart · 2026-02-23T17:15:15Z

Summary

Add whisper_vad_detect_speech_no_reset() — identical to whisper_vad_detect_speech but does not reset LSTM hidden/cell state, enabling temporal continuity when calling per-chunk in a streaming loop
Add whisper_vad_reset_state() — explicit state reset for use between utterances
Refactor whisper_vad_detect_speech as a thin wrapper (reset + no_reset) — zero behavior change for existing callers

Motivation

whisper_vad_detect_speech calls ggml_backend_buffer_clear(vctx->buffer, 0) on every invocation, which resets the Silero LSTM hidden/cell states. This is correct for batch processing (the current use case), but prevents temporal continuity when calling per-chunk in a streaming loop — the LSTM effectively degrades to a feedforward classifier with no memory between chunks.

For streaming applications that call VAD once per chunk (e.g. 512 samples at 16kHz = 32ms), the model needs to carry state across calls to make use of its recurrent architecture.
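The chunk arithmetic quoted above (512 samples at 16 kHz = 32 ms) can be checked with a small sketch. Both helper names are hypothetical and are not part of whisper.cpp; they only illustrate how often a per-chunk streaming loop would invoke the detector.

```c
#include <assert.h>

/* Hypothetical helper (not part of whisper.cpp): duration of one chunk in
 * milliseconds, given its length in samples and the sample rate in Hz. */
static double chunk_duration_ms(int n_samples, int sample_rate_hz) {
    assert(sample_rate_hz > 0);
    return 1000.0 * (double) n_samples / (double) sample_rate_hz;
}

/* Hypothetical helper: how many VAD calls one second of audio produces
 * when the detector is called once per chunk. */
static double vad_calls_per_second(int n_samples, int sample_rate_hz) {
    assert(n_samples > 0);
    return (double) sample_rate_hz / (double) n_samples;
}
```

With 512-sample chunks at 16 kHz, the first helper gives 32 ms and the second gives 31.25 calls per second, which is why resetting the recurrent state on every call discards context so frequently.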

Changes

Two new public API functions following existing naming conventions:

// Like whisper_vad_detect_speech, but does not reset LSTM state.
// Use for streaming: call whisper_vad_reset_state() between utterances.
WHISPER_API bool whisper_vad_detect_speech_no_reset(
        struct whisper_vad_context * vctx,
        const float * samples,
        int   n_samples);

// Reset LSTM hidden/cell states to zero.
WHISPER_API void whisper_vad_reset_state(struct whisper_vad_context * vctx);

whisper_vad_detect_speech is now reset + no_reset — existing callers (including whisper_vad_segments_from_samples, test-vad.cpp, examples/speech.cpp) are completely unaffected.
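The "reset + no_reset" wrapper pattern described above can be illustrated with a toy stand-in. This is a minimal sketch, not the whisper.cpp implementation: `struct toy_vad` and the `toy_vad_*` functions are hypothetical, and a single float replaces the Silero LSTM hidden/cell tensors. Only the call structure mirrors the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the VAD context (hypothetical): one recurrent value
 * plays the role of the LSTM hidden/cell state. */
struct toy_vad {
    float h;         /* state carried across calls          */
    float last_prob; /* pseudo speech probability, 0..1     */
};

/* Counterpart of whisper_vad_reset_state: zero the recurrent state. */
static void toy_vad_reset_state(struct toy_vad * v) {
    assert(v != NULL);
    v->h = 0.0f;
}

/* Counterpart of whisper_vad_detect_speech_no_reset: the update depends
 * on the previous state, so consecutive chunks influence each other. */
static bool toy_vad_detect_no_reset(struct toy_vad * v, const float * samples, int n_samples) {
    assert(v != NULL);
    if (samples == NULL || n_samples <= 0) {
        return false;
    }
    float energy = 0.0f;
    for (int i = 0; i < n_samples; ++i) {
        energy += samples[i] * samples[i];
    }
    energy /= (float) n_samples;
    v->h = 0.5f * v->h + energy;          /* recurrent update */
    v->last_prob = v->h / (1.0f + v->h);  /* squash into 0..1 */
    return true;
}

/* Counterpart of whisper_vad_detect_speech: a thin wrapper that resets
 * first, so batch callers see the same result on every call. */
static bool toy_vad_detect(struct toy_vad * v, const float * samples, int n_samples) {
    toy_vad_reset_state(v);
    return toy_vad_detect_no_reset(v, samples, n_samples);
}
```

In a streaming loop under these assumptions, a caller would use `toy_vad_detect_no_reset` for each incoming chunk and call `toy_vad_reset_state` once between utterances. The wrapper variant gives the same answer for the same chunk no matter what came before it.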

whisper_vad_detect_speech resets LSTM state on every call, which is correct for batch processing but prevents temporal continuity when calling per-chunk in a streaming loop.

Add whisper_vad_detect_speech_no_reset (skips buffer clear) and whisper_vad_reset_state (explicit clear between utterances). Existing whisper_vad_detect_speech is now a thin wrapper: zero behavior change for current callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

marek-hradil · 2026-04-10T11:43:48Z

This seems like exactly what I need! Any updates on this?

marek-hradil · 2026-04-14T15:02:21Z

@ggerganov Heyy, sorry, don't want to be annoying, but could you have a quick look? Seems like a small change opened since February, which would make the VAD in whisper more useful.

KitaitiMakoto · 2026-04-15T17:17:22Z

danbev, who implemented the VAD feature, reacted to this pull request with the eyes emoji, so he must be aware of it. I guess he's just busy.

* ggerganov/master: (162 commits)
  bench : sync submit-results URL to ggml-org (ggml-org#3769)
  whisper : add stateless VAD detect + explicit state reset for streaming (ggml-org#3677)
  sync : ggml
  vulkan: add noncontiguous GLU support (llama/21081)
  hexagon: support for IQ4_NL and MXFP4 (llama/21018)
  rpc : proper handling of data pointers to CPU buffers (llama/21030)
  metal : Fix dimension constraint violation in matmul2d descriptor (llama/21048)
  hip: use fnuz fp8 for conversion on CDNA3 (llama/21040)
  opencl: allow large buffer for adreno (llama/20997)
  fix(ggml): correct RISC-V ISA string canonical ordering for RVV in CMake (llama/20888)
  ggml-cuda: Add NVFP4 dp4a kernel (llama/20644)
  CUDA & CPU: support F32 kernel type for `CONV_TRANSPOSE_2D` (llama/17094)
  mtmd: Add DeepSeekOCR Support (llama/17400)
  llama: fix llama-model-saver (llama/20503)
  sycl : fix wrong variable check by assert (llama/20903)
  metal : add FLOOR, CEIL, ROUND, TRUNC unary ops (llama/20930)
  metal : add FA instantiations for HSK=512, HSV=512 (llama/20902)
  hexagon: general DMA and Binary Op fixes for large strides (llama/20918)
  opencl: add q6_K gemm and gemv kernels for Adreno (llama/20089)
  rpc : RCE patch (llama/20908)
  ...

Upstream ggml-org/whisper.cpp PR ggml-org#3677 added the streaming VAD entry points but shipped no test. Lock the public contract on the tetherto fork so regressions surface immediately:

- whisper_vad_detect_speech idempotent (reset is implicit)
- whisper_vad_reset_state restores LSTM state exactly
- detect_speech == reset_state + detect_speech_no_reset
- detect_speech_no_reset on contiguous halves == single-shot detect_speech (state carries across no-reset call boundary)

Splits at a 512-sample boundary (Silero v6.2.0 window size) so no mid-stream zero padding is introduced. Uses the bundled silero VAD model and samples/jfk.wav; no whisper transcribe model needed.

QVAC-18991

Co-authored-by: Cursor <cursoragent@cursor.com>
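The split point mentioned in the commit message above (a 512-sample boundary, so the first half ends on a whole window) could be computed along these lines. This is a sketch under stated assumptions: `split_at_window_boundary` is a hypothetical helper, not a function from whisper.cpp or from the fork's test, and the rule of picking the boundary nearest below the midpoint is an illustrative choice.

```c
#include <assert.h>

/* Hypothetical helper: largest multiple of `window` that does not exceed
 * half of `n_samples`. Splitting a buffer there means the first part
 * consists only of whole windows, so a detector that pads incomplete
 * windows with zeros would not insert padding in the middle of the stream. */
static int split_at_window_boundary(int n_samples, int window) {
    assert(n_samples >= 0 && window > 0);
    return (n_samples / 2 / window) * window;
}
```

The second part of the buffer may still end on a partial window, but that is the same trailing position at which a single-shot call over the whole buffer would end, so the two runs stay comparable.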

danielbodart mentioned this pull request Feb 24, 2026

Make Silero VAD stateful across calls (carry LSTM state) danielbodart/capsper#1

Closed

danbev approved these changes Apr 15, 2026

danbev merged commit 166c20b into ggml-org:master Apr 17, 2026

Zbig9000 mentioned this pull request May 19, 2026

QVAC-18991: pull latest whisper.cpp from upstream (+ VAD-streaming regression test) tetherto/qvac-ext-lib-whisper.cpp#25

Merged

danbev merged 1 commit into ggml-org:master from danielbodart:streaming-vad-state-upstream

4 participants
