Merged
169 changes: 169 additions & 0 deletions docs/website/content/docs/sdk/examples/ai-tasks/voice-assistant.mdx
@@ -0,0 +1,169 @@
---
title: Voice assistant
description: Real-time voice conversation pipeline — microphone → transcription → LLM → text-to-speech → speakers.
---

## Overview

A voice assistant chains three AI capabilities into a continuous conversation loop:

```mermaid
flowchart LR
Mic([Microphone]) -->|PCM audio| ASR[ASR<br/>Whisper + Silero VAD]
ASR -->|utterance| LLM[LLM<br/>completion]
LLM -->|response text| TTS[TTS<br/>Supertonic]
TTS -->|audio| Spk([Speakers])
Spk -.conversation loop.-> Mic

click ASR "/sdk/examples/ai-tasks/transcription" "Transcription"
click LLM "/sdk/examples/ai-tasks/completion" "Completion"
click TTS "/sdk/examples/ai-tasks/text-to-speech" "Text-to-speech"
```

Compared to using each capability individually, the key differences are:
- You need to coordinate **three model loads** simultaneously (Whisper + VAD, LLM, and TTS bundle) — they all stay loaded for the duration of the session.
- VAD parameters need **conservative tuning** to avoid the assistant transcribing its own TTS output (self-hearing feedback loop).
- You should **gate the microphone during TTS playback** and apply a short post-playback cooldown so room reverb doesn't bleed into the next utterance.
- You should **filter short or non-linguistic transcripts** (e.g. `"."`, `"[BLANK_AUDIO]"`) since Whisper hallucinates them from near-silent audio.
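
The transcript filtering in the last point can be sketched as a small predicate. This is a minimal illustration rather than the example script's code: `isUsableTranscript`, the bracket pattern, and the punctuation set are assumptions, and `MIN_UTTERANCE_CHARS` mirrors the constant described under Tuning.

```ts
// Assumed default, mirroring the MIN_UTTERANCE_CHARS value described under Tuning.
const MIN_UTTERANCE_CHARS = 3;

// Matches transcripts that are entirely one bracketed tag, e.g. "[BLANK_AUDIO]" or "(music)".
// Assumption: non-speech markers always arrive wrapped in [] or ().
const NON_SPEECH_TAG = /^[\[(].*[\])]$/;

// Common ASCII punctuation and whitespace, removed before counting characters.
const PUNCTUATION_AND_SPACE = /[\s.,!?;:'"()\[\]-]/g;

// Returns true when a transcript looks like real speech worth sending to the LLM.
function isUsableTranscript(
  text: string,
  minChars: number = MIN_UTTERANCE_CHARS,
): boolean {
  const trimmed = text.trim();
  if (NON_SPEECH_TAG.test(trimmed)) return false;
  // Count only what is left after stripping punctuation, so "." or "..." never passes.
  const content = trimmed.replace(PUNCTUATION_AND_SPACE, "");
  return content.length >= minChars;
}
```

Counting characters after stripping punctuation keeps the check independent of how the transcriber pads or punctuates its output.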

## Functions

Use the following sequence of function calls:
1. [`loadModel()`](/sdk/api#loadmodel) three times — once per `modelType` (`"whisper"`, `"llm"`, `"tts"`).
2. [`transcribeStream()`](/sdk/api#transcribestream) — open a streaming session that emits utterances on VAD-detected pauses.
3. [`completion()`](/sdk/api#completion) — generate a response from the rolling conversation history (streamed).
4. [`textToSpeech()`](/sdk/api#texttospeech) — synthesize the response into a PCM buffer.
5. [`unloadModel()`](/sdk/api#unloadmodel) for each loaded model on shutdown.

For how to use each function, see [SDK — API reference](/sdk/api/).
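
The call sequence above can be sketched as a per-utterance handler. The stage functions are injected stand-ins, so the sketch does not assume the real signatures of `completion()` or `textToSpeech()`; `VoiceStages`, `handleUtterance`, `trimHistory`, and `MAX_HISTORY_MESSAGES` are hypothetical names used only for illustration.

```ts
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical stage interface; the real SDK calls may take different arguments.
interface VoiceStages {
  complete(history: ChatMessage[]): Promise<string>; // stands in for completion()
  speak(text: string): Promise<void>; // stands in for textToSpeech() plus playback
}

// Assumption: cap on the number of non-system messages kept in the rolling history.
const MAX_HISTORY_MESSAGES = 20;

// Keeps every system message plus the most recent `max` user/assistant messages.
function trimHistory(
  history: ChatMessage[],
  max: number = MAX_HISTORY_MESSAGES,
): ChatMessage[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  return [...system, ...(max > 0 ? rest.slice(-max) : [])];
}

// Handles one transcribed utterance: extend history, generate a reply, speak it.
// Returns the updated history instead of mutating the input.
async function handleUtterance(
  stages: VoiceStages,
  history: ChatMessage[],
  utterance: string,
): Promise<ChatMessage[]> {
  const withUser = trimHistory([...history, { role: "user", content: utterance }]);
  const reply = await stages.complete(withUser);
  await stages.speak(reply);
  return trimHistory([...withUser, { role: "assistant", content: reply }]);
}
```

Passing the stages in as an object keeps the loop logic testable without loading any models; in a real script each method would wrap the corresponding SDK call.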

## Models

You load four models in total, grouped into three bundles (one per `loadModel()` call):
- A `qvac-ext-lib-whisper.cpp`-compatible model for transcription, plus a Silero VAD model.
- A `llama.cpp`-compatible LLM for response generation.
- A Supertonic TTS bundle (text encoder, duration predictor, vector estimator, vocoder, unicode indexer, config, and voice style).

Recommended defaults (used in the example below):

| Stage | Model |
| --- | --- |
| ASR | `WHISPER_TINY` |
| VAD | `VAD_SILERO_5_1_2` |
| LLM | `LLAMA_3_2_1B_INST_Q4_0` |
| TTS | Supertonic2 (English) |

For models available as constants, see [SDK — Models](/sdk/getting-started#models).

## Example

This example is **desktop-only**. Mobile (React Native / Expo) needs a different audio path and isn't covered here.

### Requirements

- **FFmpeg** (with `ffplay`) on `PATH` — `ffmpeg` captures mic audio, `ffplay` plays back TTS output.
- **Microphone** access (on macOS, the running shell needs mic permission in *System Settings → Privacy & Security → Microphone*).
- **Speakers** connected and selected as the default output device.

<Accordions type="multiple">
<Accordion title="Installing FFmpeg">
| Platform | Command |
| --- | --- |
| macOS (Homebrew) | `brew install ffmpeg` |
| Debian / Ubuntu | `sudo apt update && sudo apt install ffmpeg` |
| Fedora / RHEL | `sudo dnf install ffmpeg` (enable [RPM Fusion](https://rpmfusion.org/Configuration) first if needed) |
| Arch Linux | `sudo pacman -S ffmpeg` |

Verify the install with:

```bash
ffmpeg -version
```
</Accordion>

<Accordion title="Selecting a microphone">
By default the example uses the system default mic on each OS:
- **macOS:** AVFoundation audio device `:0`.
- **Linux:** PulseAudio source `default`.

To use a different mic, set the `MIC_DEVICE` environment variable:

```bash
# macOS — pick by index (list with `ffmpeg -f avfoundation -list_devices true -i ""`)
MIC_DEVICE=":1" bun run examples/voice-assistant/voice-assistant.ts

# Linux — pick a PulseAudio source (list with `pactl list short sources`)
MIC_DEVICE="alsa_input.usb-Blue_Microphones_Yeti-00" \
bun run examples/voice-assistant/voice-assistant.ts
```
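
One way the per-OS defaults and the `MIC_DEVICE` override could map onto FFmpeg arguments is sketched below. `micCaptureArgs` is a hypothetical helper, and the 16 kHz mono 16-bit PCM output format is an assumption (a common input format for Whisper), not necessarily what the example script requests.

```ts
type Platform = "darwin" | "linux";

// Builds ffmpeg arguments that capture the microphone and write raw PCM to stdout.
// An empty or missing micDevice falls back to the per-OS default described above.
function micCaptureArgs(platform: Platform, micDevice?: string): string[] {
  const input =
    platform === "darwin"
      ? ["-f", "avfoundation", "-i", micDevice || ":0"] // AVFoundation audio device
      : ["-f", "pulse", "-i", micDevice || "default"]; // PulseAudio source

  return [
    "-hide_banner",
    "-loglevel", "error",
    ...input,
    "-ac", "1", // mono (assumed)
    "-ar", "16000", // 16 kHz sample rate (assumed)
    "-f", "s16le", // raw signed 16-bit little-endian PCM (assumed)
    "pipe:1", // write to stdout
  ];
}
```

A script could pass the result to a child process running `ffmpeg`, with the device taken from the `MIC_DEVICE` environment variable.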
</Accordion>
</Accordions>

### Running it

The following script implements the full loop with VAD tuning, mic gating during playback, and short-utterance filtering:

<Tabs>
<Tab value="js" label="JavaScript" default>
<WrapCode>

```js file=<rootDir>/packages/sdk/dist/examples/voice-assistant/voice-assistant.js title="voice-assistant.js" lineNumbers
```
</WrapCode>
</Tab>

<Tab value="ts" label="TypeScript">
<WrapCode>

```ts file=<rootDir>/packages/sdk/examples/voice-assistant/voice-assistant.ts title="voice-assistant.ts" lineNumbers
```
</WrapCode>
</Tab>
</Tabs>

Speak into the mic; transcriptions and the assistant's spoken responses will follow. Press `Ctrl+C` to quit. Models are downloaded on first run (~1 GB total) and cached locally; subsequent runs work fully offline.

### Tuning

The defaults are deliberately conservative to prevent the assistant from hearing its own TTS output and looping on itself (a classic failure mode when mic and speakers share the same room). The relevant VAD parameters in the script:

```ts
{
  threshold: 0.6,                // less sensitive than Silero's default
  min_speech_duration_ms: 300,   // drops short clicks / breaths / stray words
  min_silence_duration_ms: 700,  // long quiet tail before committing a segment
  max_speech_duration_s: 15.0,   // caps runaway utterances
  speech_pad_ms: 200,            // edge padding improves accuracy
}
```

Plus three additional safeguards:

- **Mic gate during TTS:** incoming audio is dropped while the assistant speaks, so it cannot transcribe its own output.
- **Post-playback cooldown** (`POST_PLAYBACK_COOLDOWN_MS = 300`): keeps the mic gated for a moment after playback so speaker/room reverb doesn't bleed into the next VAD segment.
- **Minimum utterance length** (`MIN_UTTERANCE_CHARS = 3`): drops one- or two-character phantom transcripts like `"."` that Whisper hallucinates from near-silent audio.
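
The first two safeguards can be sketched as a single gate object. This is an illustrative sketch under stated assumptions, not the script's implementation: `MicGate` and its method names are hypothetical, and the clock is injectable only to make the behaviour easy to test.

```ts
// Mirrors the POST_PLAYBACK_COOLDOWN_MS default described above.
const POST_PLAYBACK_COOLDOWN_MS = 300;

// Tracks whether incoming mic audio should be forwarded to the transcriber.
class MicGate {
  private speaking = false;
  private reopenAt = 0;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    cooldownMs: number = POST_PLAYBACK_COOLDOWN_MS,
    now: () => number = () => Date.now(),
  ) {
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Call right before TTS audio starts playing.
  playbackStarted(): void {
    this.speaking = true;
  }

  // Call once playback has finished; starts the cooldown window.
  playbackEnded(): void {
    this.speaking = false;
    this.reopenAt = this.now() + this.cooldownMs;
  }

  // True when an audio chunk may be forwarded; false while speaking or cooling down.
  isOpen(): boolean {
    return !this.speaking && this.now() >= this.reopenAt;
  }
}
```

In the audio callback, a chunk would be forwarded only when `isOpen()` returns `true`, and dropped otherwise.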

### Troubleshooting

If you hit one of the issues below, adjust the values above:

| Symptom | Fix |
| --- | --- |
| Assistant cuts you off mid-sentence | Raise `min_silence_duration_ms` to `900-1000` |
| Assistant talks over itself / loops forever | Raise `threshold` to `0.7`; raise `min_silence_duration_ms` to `900`; raise `POST_PLAYBACK_COOLDOWN_MS` to `500` |
| Slow to respond after you stop talking | Lower `min_silence_duration_ms` to `500` |
| Picks up background typing / keyboard | Raise `threshold` to `0.7` and `min_speech_duration_ms` to `400` |
| Two-letter commands such as "no" are ignored | Lower `MIN_UTTERANCE_CHARS` to `2` |

If you're running with headphones (mic cannot hear the speaker), you can loosen everything: `threshold: 0.5`, `min_silence_duration_ms: 500`, `POST_PLAYBACK_COOLDOWN_MS: 0`.
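
The speaker and headphone settings can be kept side by side as named profiles. The sketch below uses the values quoted in this section; `VadParams`, `TuningProfile`, `SPEAKER_PROFILE`, and `headphoneProfile` are hypothetical names, and the field layout is an assumption about how a script might group these settings.

```ts
interface VadParams {
  threshold: number;
  min_speech_duration_ms: number;
  min_silence_duration_ms: number;
  max_speech_duration_s: number;
  speech_pad_ms: number;
}

interface TuningProfile {
  vad: VadParams;
  postPlaybackCooldownMs: number;
}

// Conservative defaults for a mic and speakers sharing the same room.
const SPEAKER_PROFILE: TuningProfile = {
  vad: {
    threshold: 0.6,
    min_speech_duration_ms: 300,
    min_silence_duration_ms: 700,
    max_speech_duration_s: 15.0,
    speech_pad_ms: 200,
  },
  postPlaybackCooldownMs: 300,
};

// Looser settings for headphones, where the mic cannot hear the TTS output.
// Returns a new object so the base profile is left untouched.
function headphoneProfile(base: TuningProfile = SPEAKER_PROFILE): TuningProfile {
  return {
    vad: { ...base.vad, threshold: 0.5, min_silence_duration_ms: 500 },
    postPlaybackCooldownMs: 0,
  };
}
```

Deriving the headphone profile from the speaker profile keeps the untouched parameters (`min_speech_duration_ms`, `max_speech_duration_s`, `speech_pad_ms`) in one place.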

### Customizing

- **Different ASR model:** swap `WHISPER_TINY` for a larger Whisper model for better transcription accuracy (e.g. `WHISPER_BASE_Q8_0`, `WHISPER_SMALL_Q8_0`, or `WHISPER_LARGE_V3_TURBO`).
- **Different LLM:** swap `LLAMA_3_2_1B_INST_Q4_0` for any LLM constant from `@qvac/sdk`. Larger models give better answers at the cost of latency.
- **Different voice:** replace the Supertonic constants with another TTS model (e.g. Chatterbox — see [Text-to-Speech](/sdk/examples/ai-tasks/text-to-speech)).
- **System prompt:** edit `SYSTEM_PROMPT` at the top of the script. The default instructs the LLM to be concise and avoid markdown so responses are pleasant to listen to.

<Callout type="success">
**Tip:** all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see [SDK quickstart](/sdk/getting-started/quickstart).
</Callout>
1 change: 1 addition & 0 deletions docs/website/content/docs/sdk/getting-started/index.mdx
@@ -55,6 +55,7 @@ The JS SDK is cross-platform, type-safe, and pluggable, exposing all QVAC capabi
* [**Multimodal:**](/sdk/examples/ai-tasks/multimodal) LLM inference over text, images, and other media within a single conversation context.
* [**Fine-tuning:**](/sdk/examples/ai-tasks/fine-tuning) adapting LLMs to domain-specific tasks via LoRA.
* [**RAG:**](/sdk/examples/ai-tasks/rag) out-of-the-box retrieval-augmented generation workflow.
* [**Voice assistant:**](/sdk/examples/ai-tasks/voice-assistant) real-time voice conversation pipeline chaining transcription, LLM completion, and text-to-speech.

### P2P capabilities

3 changes: 2 additions & 1 deletion docs/website/src/lib/custom-tree.ts
@@ -107,13 +107,14 @@ export const customTree: Node[] = [
  { name: 'Completion', url: '/sdk/examples/ai-tasks/completion', type: 'page', icon: resolveIcon('MessagesSquare') },
  { name: 'Text embeddings', url: '/sdk/examples/ai-tasks/text-embeddings', type: 'page', icon: resolveIcon('Hash') },
  { name: 'Translation', url: '/sdk/examples/ai-tasks/translation', type: 'page', icon: resolveIcon('Languages') },
- { name: 'Transcription', url: '/sdk/examples/ai-tasks/transcription', type: 'page', icon: resolveIcon('Mic') },
+ { name: 'Transcription', url: '/sdk/examples/ai-tasks/transcription', type: 'page', icon: resolveIcon('Speech') },
  { name: 'Text-to-Speech', url: '/sdk/examples/ai-tasks/text-to-speech', type: 'page', icon: resolveIcon('Volume2') },
  { name: 'OCR', url: '/sdk/examples/ai-tasks/ocr', type: 'page', icon: resolveIcon('ScanText') },
  { name: 'Image generation', url: '/sdk/examples/ai-tasks/image-generation', type: 'page', icon: resolveIcon('Image') },
  { name: 'Multimodal', url: '/sdk/examples/ai-tasks/multimodal', type: 'page', icon: resolveIcon('GalleryHorizontal') },
  { name: 'Fine-tuning', url: '/sdk/examples/ai-tasks/fine-tuning', type: 'page', icon: resolveIcon('FlaskConical') },
  { name: 'RAG', url: '/sdk/examples/ai-tasks/rag', type: 'page', icon: resolveIcon('ScanSearch') },
+ { name: 'Voice assistant', url: '/sdk/examples/ai-tasks/voice-assistant', type: 'page', icon: resolveIcon('Mic') },
],
},
{
3 changes: 3 additions & 0 deletions docs/website/src/mdx-components.tsx
@@ -7,6 +7,7 @@ import * as React from "react";
import Link from "next/link";
import { GithubInfo } from 'fumadocs-ui/components/github-info';
import { CustomTabs, CustomTabsItem } from "@/components/custom-tabs";
import { Accordion, Accordions } from 'fumadocs-ui/components/accordion';

function WrapCode({ children }: { children: React.ReactNode }) {
return <div className="fd-code-wrap">{children}</div>;
@@ -51,6 +52,8 @@ export function getMDXComponents(components?: MDXComponents): MDXComponents {
...TabsComponents,
Tabs: CustomTabs,
Tab: CustomTabsItem,
Accordion,
Accordions,
...StepComponents,
...components,
};