Docs/voice assistant #1816
Merged
6 commits
cf51ddc doc: create new page - sdk - new ai capability - voice assistant (BrunoCampana)
04a956f doc: create new page - sdk - voice assistant (BrunoCampana)
c831bcf doc: remove temporary doc that leaked into commit (BrunoCampana)
b93f04b Merge branch 'main' into docs/voice-assistant (BrunoCampana)
97a796b doc: content new - sdk - voice assistant - PR review (BrunoCampana)
4e95971 Merge branch 'main' into docs/voice-assistant (BrunoCampana)
169 changes: 169 additions & 0 deletions
docs/website/content/docs/sdk/examples/ai-tasks/voice-assistant.mdx
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,169 @@ | ||
| --- | ||
| title: Voice assistant | ||
| description: Real-time voice conversation pipeline — microphone → transcription → LLM → text-to-speech → speakers. | ||
| --- | ||
|
|
||
| ## Overview | ||
|
|
||
| A voice assistant chains three AI capabilities into a continuous conversation loop: | ||
|
|
||
| ```mermaid | ||
| flowchart LR | ||
| Mic([Microphone]) -->|PCM audio| ASR[ASR<br/>Whisper + Silero VAD] | ||
| ASR -->|utterance| LLM[LLM<br/>completion] | ||
| LLM -->|response text| TTS[TTS<br/>Supertonic] | ||
| TTS -->|audio| Spk([Speakers]) | ||
| Spk -.conversation loop.-> Mic | ||
|
|
||
| click ASR "/sdk/examples/ai-tasks/transcription" "Transcription" | ||
| click LLM "/sdk/examples/ai-tasks/completion" "Completion" | ||
| click TTS "/sdk/examples/ai-tasks/text-to-speech" "Text-to-speech" | ||
| ``` | ||
|
|
||
| Compared to using each capability individually, the key differences are: | ||
| - You need to coordinate **three model loads** simultaneously (Whisper + VAD, LLM, and TTS bundle) — they all stay loaded for the duration of the session. | ||
| - VAD parameters need **conservative tuning** to avoid the assistant transcribing its own TTS output (self-hearing feedback loop). | ||
| - You should **gate the microphone during TTS playback** and apply a short post-playback cooldown so room reverb doesn't bleed into the next utterance. | ||
| - You should **filter short or non-linguistic transcripts** (e.g. `"."`, `"[BLANK_AUDIO]"`) since Whisper hallucinates them from near-silent audio. | ||
|
|
||
| ## Functions | ||
|
|
||
| Use the following sequence of function calls: | ||
| 1. [`loadModel()`](/sdk/api#loadmodel) three times — once per `modelType` (`"whisper"`, `"llm"`, `"tts"`). | ||
| 2. [`transcribeStream()`](/sdk/api#transcribestream) — open a streaming session that emits utterances on VAD-detected pauses. | ||
| 3. [`completion()`](/sdk/api#completion) — generate a response from the rolling conversation history (streamed). | ||
| 4. [`textToSpeech()`](/sdk/api#texttospeech) — synthesize the response into a PCM buffer. | ||
| 5. [`unloadModel()`](/sdk/api#unloadmodel) for each loaded model on shutdown. | ||
|
|
||
| For details on each function, see [SDK — API reference](/sdk/api/). | ||
|
|
||
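Step 3 above relies on a rolling conversation history. The sketch below shows one way such a history could be capped so that prompts stay short over a long session. The `ChatMessage` shape, the `appendToHistory` helper, and the `maxMessages` cap are all hypothetical and are not taken from the SDK or the example script.

```typescript
// Hypothetical message shape; the SDK's actual history type may differ.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Append a message, then keep every system message plus only the newest
// `maxMessages` non-system messages. The cap of 8 is an assumed default.
function appendToHistory(
  history: ChatMessage[],
  message: ChatMessage,
  maxMessages: number = 8,
): ChatMessage[] {
  const all = [...history, message];
  const system = all.filter((m) => m.role === "system");
  const dialogue = all.filter((m) => m.role !== "system");
  // slice(-0) would return everything, so handle a zero cap explicitly.
  const recent = maxMessages > 0 ? dialogue.slice(-maxMessages) : [];
  return [...system, ...recent];
}
```

The helper returns a new array rather than mutating its input, so a caller can pass the result straight to the completion step and store it for the next turn.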
| ## Models | ||
|
|
||
| You load four models in total, across the three `loadModel()` calls: | ||
| - A `qvac-ext-lib-whisper.cpp`-compatible model for transcription, plus a Silero VAD model. | ||
| - A `llama.cpp`-compatible LLM for response generation. | ||
| - A Supertonic TTS bundle (text encoder, duration predictor, vector estimator, vocoder, unicode indexer, config, and voice style). | ||
|
|
||
| Recommended defaults (used in the example below): | ||
|
|
||
| | Stage | Model | | ||
| | --- | --- | | ||
| | ASR | `WHISPER_TINY` | | ||
| | VAD | `VAD_SILERO_5_1_2` | | ||
| | LLM | `LLAMA_3_2_1B_INST_Q4_0` | | ||
| | TTS | Supertonic2 (English) | | ||
|
|
||
| For models available as constants, see [SDK — Models](/sdk/getting-started#models). | ||
|
|
||
| ## Example | ||
|
|
||
| This example is **desktop-only**. Mobile (React Native / Expo) needs a different audio path and isn't covered here. | ||
|
|
||
| ### Requirements | ||
|
|
||
| - **FFmpeg** (with `ffplay`) on `PATH` — `ffmpeg` captures mic audio, `ffplay` plays back TTS output. | ||
| - **Microphone** access (on macOS, the running shell needs mic permission in *System Settings → Privacy & Security → Microphone*). | ||
| - **Speakers** connected and selected as the default output device. | ||
|
|
||
| <Accordions type="multiple"> | ||
| <Accordion title="Installing FFmpeg"> | ||
| | Platform | Command | | ||
| | --- | --- | | ||
| | macOS (Homebrew) | `brew install ffmpeg` | | ||
| | Debian / Ubuntu | `sudo apt update && sudo apt install ffmpeg` | | ||
| | Fedora / RHEL | `sudo dnf install ffmpeg` (enable [RPM Fusion](https://rpmfusion.org/Configuration) first if needed) | | ||
| | Arch Linux | `sudo pacman -S ffmpeg` | | ||
|
|
||
| Verify the install with: | ||
|
|
||
| ```bash | ||
| ffmpeg -version | ||
| ``` | ||
| </Accordion> | ||
|
|
||
| <Accordion title="Selecting a microphone"> | ||
| By default the example uses the system default mic on each OS: | ||
| - **macOS:** AVFoundation audio device `:0`. | ||
| - **Linux:** PulseAudio source `default`. | ||
|
|
||
| To use a different mic, set the `MIC_DEVICE` environment variable: | ||
|
|
||
| ```bash | ||
| # macOS — pick by index (list with `ffmpeg -f avfoundation -list_devices true -i ""`) | ||
| MIC_DEVICE=":1" bun run examples/voice-assistant/voice-assistant.ts | ||
|
|
||
| # Linux — pick a PulseAudio source (list with `pactl list short sources`) | ||
| MIC_DEVICE="alsa_input.usb-Blue_Microphones_Yeti-00" \ | ||
| bun run examples/voice-assistant/voice-assistant.ts | ||
| ``` | ||
| </Accordion> | ||
| </Accordions> | ||
|
|
||
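As a rough illustration of the device selection described above, the following sketch builds an FFmpeg argument list from the platform and an optional `MIC_DEVICE` override. The `buildMicArgs` helper is hypothetical. The `-f avfoundation`, `-f pulse`, `-ac`, `-ar`, and `-f s16le` options are standard FFmpeg flags, but the 16 kHz mono 16-bit output format is an assumption here and may not match what the example script actually requests.

```typescript
// Hypothetical helper, not part of the SDK or the example script.
function buildMicArgs(platform: string, micDevice?: string): string[] {
  let inputFormat: string;
  let defaultDevice: string;

  if (platform === "darwin") {
    inputFormat = "avfoundation"; // macOS capture backend
    defaultDevice = ":0"; // first audio device, no video
  } else if (platform === "linux") {
    inputFormat = "pulse"; // PulseAudio capture backend
    defaultDevice = "default";
  } else {
    throw new Error(`Unsupported platform: ${platform}`);
  }

  return [
    "-f", inputFormat,
    "-i", micDevice || defaultDevice, // empty or missing override falls back
    "-ac", "1", // mono (assumed)
    "-ar", "16000", // 16 kHz sample rate (assumed)
    "-f", "s16le", // raw 16-bit little-endian PCM (assumed)
    "pipe:1", // write to stdout
  ];
}
```

In a Node or Bun script this could be called as `buildMicArgs(process.platform, process.env.MIC_DEVICE)` and the result passed to a spawned `ffmpeg` process.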
| ### Running it | ||
|
|
||
| The following script implements the full loop with VAD tuning, mic gating during playback, and short-utterance filtering: | ||
|
|
||
| <Tabs> | ||
| <Tab value="js" label="JavaScript" default> | ||
| <WrapCode> | ||
|
|
||
| ```js file=<rootDir>/packages/sdk/dist/examples/voice-assistant/voice-assistant.js title="voice-assistant.js" lineNumbers | ||
| ``` | ||
| </WrapCode> | ||
| </Tab> | ||
|
|
||
| <Tab value="ts" label="TypeScript"> | ||
| <WrapCode> | ||
|
|
||
| ```ts file=<rootDir>/packages/sdk/examples/voice-assistant/voice-assistant.ts title="voice-assistant.ts" lineNumbers | ||
| ``` | ||
| </WrapCode> | ||
| </Tab> | ||
| </Tabs> | ||
|
|
||
| Speak into the mic; transcriptions and the assistant's spoken responses will follow. Press `Ctrl+C` to quit. Models are downloaded on first run (~1 GB total) and cached locally; subsequent runs work fully offline. | ||
|
|
||
| ### Tuning | ||
|
|
||
| The defaults are deliberately conservative to prevent the assistant from hearing its own TTS output and looping on itself (a classic failure mode when mic and speakers share the same room). The relevant VAD parameters in the script: | ||
|
|
||
| ```ts | ||
| { | ||
| threshold: 0.6, // less sensitive than Silero's default | ||
| min_speech_duration_ms: 300, // drops short clicks / breaths / stray words | ||
| min_silence_duration_ms: 700, // long quiet tail before committing a segment | ||
| max_speech_duration_s: 15.0, // caps runaway utterances | ||
| speech_pad_ms: 200, // edge padding improves accuracy | ||
| } | ||
| ``` | ||
|
|
||
| The script also applies three additional safeguards: | ||
|
|
||
| - **Mic gate during TTS:** incoming audio is dropped while the assistant speaks, so it cannot transcribe its own output. | ||
| - **Post-playback cooldown** (`POST_PLAYBACK_COOLDOWN_MS = 300`): keeps the mic gated for a moment after playback so speaker/room reverb doesn't bleed into the next VAD segment. | ||
| - **Minimum utterance length** (`MIN_UTTERANCE_CHARS = 3`): drops one- or two-character phantom transcripts like `"."` that Whisper hallucinates from near-silent audio. | ||
|
|
||
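The three safeguards above can be sketched as a small gate object plus a transcript filter. This is a minimal illustration, not the script's actual code: the `MicGate` class, its method names, the injectable clock, and the bracket check in `isUsableUtterance` are assumptions. Only the two constant names and their values come from the text above.

```typescript
const POST_PLAYBACK_COOLDOWN_MS = 300;
const MIN_UTTERANCE_CHARS = 3;

// Hypothetical gate: closed while TTS plays and for a cooldown afterwards.
class MicGate {
  private speaking = false;
  private reopenAt = 0;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(cooldownMs: number = POST_PLAYBACK_COOLDOWN_MS, now: () => number = Date.now) {
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, useful for testing
  }

  beginPlayback(): void {
    this.speaking = true;
  }

  endPlayback(): void {
    this.speaking = false;
    this.reopenAt = this.now() + this.cooldownMs;
  }

  // Audio chunks should be forwarded to ASR only while this returns true.
  isOpen(): boolean {
    return !this.speaking && this.now() >= this.reopenAt;
  }
}

// Reject bracketed markers such as "[BLANK_AUDIO]" and very short transcripts.
function isUsableUtterance(text: string, minChars: number = MIN_UTTERANCE_CHARS): boolean {
  const trimmed = text.trim();
  if (/^\[.*\]$/.test(trimmed)) return false;
  return trimmed.length >= minChars;
}
```

A caller would typically invoke `beginPlayback()` just before starting playback, `endPlayback()` when playback finishes, and check `isOpen()` for each incoming audio chunk.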
| ### Troubleshooting | ||
|
|
||
| If you run into common issues, adjust the values above: | ||
|
|
||
| | Symptom | Fix | | ||
| | --- | --- | | ||
| | Assistant cuts you off mid-sentence | Raise `min_silence_duration_ms` to `900-1000` | | ||
| | Assistant talks over itself / loops forever | Raise `threshold` to `0.7`; raise `min_silence_duration_ms` to `900`; raise `POST_PLAYBACK_COOLDOWN_MS` to `500` | | ||
| | Slow to respond after you stop talking | Lower `min_silence_duration_ms` to `500` | | ||
| | Picks up background typing / keyboard | Raise `threshold` to `0.7` and `min_speech_duration_ms` to `400` | | ||
| Two-letter commands ("no", "ok") are ignored | Lower `MIN_UTTERANCE_CHARS` to `2` | ||
|
|
||
| If you're running with headphones (mic cannot hear the speaker), you can loosen everything: `threshold: 0.5`, `min_silence_duration_ms: 500`, `POST_PLAYBACK_COOLDOWN_MS: 0`. | ||
|
|
||
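One way to keep such adjustments manageable is to hold the defaults in a single object and layer named overrides on top. The sketch below is illustrative only: the `AssistantTuning` shape, `applyOverrides`, and the `HEADPHONES` preset are hypothetical names, while the numeric values are the ones quoted in this section.

```typescript
interface VadParams {
  threshold: number;
  min_speech_duration_ms: number;
  min_silence_duration_ms: number;
  max_speech_duration_s: number;
  speech_pad_ms: number;
}

// Hypothetical grouping of the tunables discussed above.
interface AssistantTuning {
  vad: VadParams;
  postPlaybackCooldownMs: number;
  minUtteranceChars: number;
}

interface TuningOverrides {
  vad?: Partial<VadParams>;
  postPlaybackCooldownMs?: number;
  minUtteranceChars?: number;
}

const DEFAULT_TUNING: AssistantTuning = {
  vad: {
    threshold: 0.6,
    min_speech_duration_ms: 300,
    min_silence_duration_ms: 700,
    max_speech_duration_s: 15.0,
    speech_pad_ms: 200,
  },
  postPlaybackCooldownMs: 300,
  minUtteranceChars: 3,
};

// Looser settings for when the mic cannot hear the speakers.
const HEADPHONES: TuningOverrides = {
  vad: { threshold: 0.5, min_silence_duration_ms: 500 },
  postPlaybackCooldownMs: 0,
};

// Returns a new tuning object; the base is left untouched.
function applyOverrides(base: AssistantTuning, overrides: TuningOverrides): AssistantTuning {
  return {
    vad: { ...base.vad, ...overrides.vad },
    postPlaybackCooldownMs: overrides.postPlaybackCooldownMs ?? base.postPlaybackCooldownMs,
    minUtteranceChars: overrides.minUtteranceChars ?? base.minUtteranceChars,
  };
}
```

Using `??` rather than `||` matters here, because a cooldown of `0` is a valid override and must not fall back to the default.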
| ### Customizing | ||
|
|
||
| - **Different ASR model:** swap `WHISPER_TINY` for a larger Whisper model for better transcription accuracy (e.g. `WHISPER_BASE_Q8_0`, `WHISPER_SMALL_Q8_0`, or `WHISPER_LARGE_V3_TURBO`). | ||
| - **Different LLM:** swap `LLAMA_3_2_1B_INST_Q4_0` for any LLM constant from `@qvac/sdk`. Larger models give better answers at the cost of latency. | ||
| - **Different voice:** replace the Supertonic constants with another TTS model (e.g. Chatterbox — see [Text-to-Speech](/sdk/examples/ai-tasks/text-to-speech)). | ||
| - **System prompt:** edit `SYSTEM_PROMPT` at the top of the script. The default instructs the LLM to be concise and avoid markdown so responses are pleasant to listen to. | ||
|
|
||
| <Callout type="success"> | ||
| **Tip:** all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see [SDK quickstart](/sdk/getting-started/quickstart). | ||
| </Callout> | ||