@@ -3,11 +3,11 @@
Measures TTFP improvement from DAC-code caching when using uploaded voices.

Setup:
1. Start vllm-omni with Fish Speech S2 Pro (use our feat branch)
1. Start vllm-omni with Fish Speech S2 Pro
2. Provide a reference audio file for voice cloning

Usage:
python bench_voice_cache.py \
python bench_speaker_cache.py \
--ref-audio /path/to/reference.wav \
--ref-text "Transcript of the reference audio." \
--num-prompts 20 \
49 changes: 49 additions & 0 deletions docs/serving/speech_api.md
@@ -358,6 +358,31 @@ curl -X POST http://localhost:8091/v1/audio/speech \
}' --output cloned.wav
```

### Voice Storage & Caching

Each uploaded voice is persisted to disk as a single `.safetensors` file that
holds the audio samples, with the metadata (`name`, `consent`, `ref_text`,
`sample_rate`, `created_at`) stored in the file header. On server restart, the
directory is scanned and all previously uploaded voices are restored
automatically, so uploads survive process restarts.
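
As a rough illustration of the layout described above, the sketch below writes and reads the metadata header of a `.safetensors` file using only the standard library. It relies on the published safetensors layout (an 8-byte little-endian header length, a JSON header whose `__metadata__` entry maps strings to strings, then raw tensor bytes). The helper names `write_voice_file`, `read_voice_metadata`, and `scan_voices` and the tensor key `audio` are assumptions made for this example, not the server's actual code, which would normally go through the `safetensors` library.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_voice_file(path: Path, samples: list, metadata: dict) -> None:
    """Write a minimal single-tensor safetensors file (hypothetical helper)."""
    data = struct.pack(f"<{len(samples)}f", *samples)  # float32, little-endian
    header = {
        "__metadata__": metadata,  # safetensors metadata values must be strings
        "audio": {"dtype": "F32", "shape": [len(samples)], "data_offsets": [0, len(data)]},
    }
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    header_bytes += b" " * (-len(header_bytes) % 8)  # pad header to an 8-byte boundary
    path.write_bytes(struct.pack("<Q", len(header_bytes)) + header_bytes + data)


def read_voice_metadata(path: Path) -> dict:
    """Read only the header, so the audio tensor itself is never loaded."""
    with path.open("rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})


def scan_voices(directory: Path) -> dict:
    """Rebuild a name -> metadata registry from the files on disk."""
    voices = {}
    for file in sorted(directory.glob("*.safetensors")):
        meta = read_voice_metadata(file)
        voices[meta.get("name", file.stem)] = meta
    return voices


# Usage: persist one voice, then restore the registry as a restart would.
with tempfile.TemporaryDirectory() as tmp:
    voice_dir = Path(tmp)
    write_voice_file(
        voice_dir / "alice.safetensors",
        [0.0, 0.5, -0.5],
        {"name": "alice", "consent": "user_consent_123", "sample_rate": "44100"},
    )
    print(scan_voices(voice_dir))
```

Because the metadata sits in the header, a startup scan of this kind only needs to read the first few hundred bytes of each file.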

Uploading an existing name overwrites the previous entry (a warning is logged).

Feature-extraction artifacts (ref_code, speaker_embedding, DAC codes, etc.)
are cached in-process in a shared LRU cache, so repeated requests with the same
`voice=...` skip the extraction pipeline. A single cache instance is shared by
all TTS model types; deleting a voice invalidates its entries for every model
type at once.
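
The caching behaviour described above can be sketched as a byte-budgeted LRU keyed by `(model_type, voice)`. This is a minimal illustration under stated assumptions: the class name `ByteBudgetLRU`, its method names, and the key layout are hypothetical and are not taken from the vllm-omni source. Only the 512 MiB default mirrors the budget stated in this section.

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """Hypothetical shared cache: evicts least-recently-used entries by size."""

    def __init__(self, max_bytes: int = 512 * 1024 * 1024) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # (model_type, voice) -> (artifact, nbytes)

    def get(self, model_type: str, voice: str):
        key = (model_type, voice)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def put(self, model_type: str, voice: str, artifact, nbytes: int) -> None:
        key = (model_type, voice)
        if key in self._entries:  # replace an existing entry
            self.used_bytes -= self._entries.pop(key)[1]
        if nbytes > self.max_bytes:
            return  # larger than the whole budget, so never cached
        self._entries[key] = (artifact, nbytes)
        self.used_bytes += nbytes
        while self.used_bytes > self.max_bytes:  # evict oldest entries first
            _, (_, evicted_bytes) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_bytes

    def invalidate_voice(self, voice: str) -> int:
        """Drop the voice's entries for every model type; return how many."""
        stale = [key for key in self._entries if key[1] == voice]
        for key in stale:
            self.used_bytes -= self._entries.pop(key)[1]
        return len(stale)


# Usage: two model types share one cache, and one delete clears both slots.
cache = ByteBudgetLRU(max_bytes=100)
cache.put("fish_speech", "alice", "dac-codes", 40)
cache.put("cosyvoice3", "alice", "speaker-embedding", 40)
print(cache.invalidate_voice("alice"), cache.used_bytes)
```

Keying on the voice name alone would not be enough here, since each model type extracts different artifacts from the same upload.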

**Configuration (environment variables):**

| Variable | Default | Description |
|----------|---------|-------------|
| `SPEAKER_SAMPLES_DIR` | `~/.cache/vllm-omni/speakers` | Directory for persisted uploaded speakers (`.safetensors` files). |
| `SPEAKER_MAX_UPLOADED` | `1000` | Maximum number of uploaded speakers kept on disk. Uploads beyond this cap are rejected with HTTP 400. |

The in-memory LRU has a fixed 512 MiB byte budget.
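
For readers wiring this up in their own tooling, the two environment variables in the table could be resolved roughly as follows. The function names `load_speaker_config` and `upload_status` are hypothetical. Treating an overwrite of an existing name as exempt from the cap is an assumption of this sketch, not documented behaviour.

```python
import os
from pathlib import Path

DEFAULT_SAMPLES_DIR = "~/.cache/vllm-omni/speakers"
DEFAULT_MAX_UPLOADED = 1000


def load_speaker_config(env=None):
    """Resolve the storage directory and upload cap from the environment."""
    env = os.environ if env is None else env
    samples_dir = Path(env.get("SPEAKER_SAMPLES_DIR", DEFAULT_SAMPLES_DIR)).expanduser()
    max_uploaded = int(env.get("SPEAKER_MAX_UPLOADED", DEFAULT_MAX_UPLOADED))
    return samples_dir, max_uploaded


def upload_status(existing_names, new_name, max_uploaded):
    """Return the HTTP status an upload would receive under this sketch."""
    if new_name in existing_names:
        return 200  # assumption: overwriting adds no new file, so the cap is not hit
    return 200 if len(existing_names) < max_uploaded else 400


# Usage with an explicit mapping instead of the real process environment.
samples_dir, cap = load_speaker_config(
    {"SPEAKER_SAMPLES_DIR": "/tmp/voices", "SPEAKER_MAX_UPLOADED": "2"}
)
print(samples_dir, cap, upload_status({"alice", "bob"}, "carol", cap))
```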

## Batch Speech Generation

The batch endpoint synthesizes multiple texts in a single request, returning all results as JSON with base64-encoded audio.
@@ -543,6 +568,30 @@ Fish Speech uses `ref_audio` and `ref_text` for voice cloning (no `task_type` ne
|-------|-------------|
| `mistralai/Voxtral-4B-TTS-2603` | 3B AR + FlowMatching TTS. Supports text-to-speech with preset voices. |

### CosyVoice3

| Model | Description |
|-------|-------------|
| `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | Voice cloning from `ref_audio` + `ref_text`. No built-in voice presets — upload a voice or pass `ref_audio`/`ref_text` per request. |

### OmniVoice

| Model | Description |
|-------|-------------|
| `k2-fsa/OmniVoice` | Pure-diffusion TTS. Supports voice cloning via `ref_audio` (with optional `ref_text`); no built-in voice presets. |

### VoxCPM2

| Model | Description |
|-------|-------------|
| `openbmb/VoxCPM2` | TTS + voice cloning with built-in speaker presets and uploaded-voice support. Accepts `voice` (preset or uploaded) or `ref_audio` + optional `ref_text`. |

### MOSS-TTS-Nano

| Model | Description |
|-------|-------------|
| `OpenMOSS-Team/MOSS-TTS-Nano` | Voice cloning only. Requires `ref_audio` (or an uploaded `voice`); no built-in voice presets. `ref_text` is accepted but ignored — upstream's `voice_clone` mode does not consume a transcript. |

## Error Responses

### 400 Bad Request
68 changes: 68 additions & 0 deletions docs/user_guide/examples/offline_inference/voxtral_tts.md
@@ -0,0 +1,68 @@
# Voxtral TTS Offline Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/voxtral_tts>.


`end2end.py` runs Voxtral TTS end-to-end offline inference using vLLM. It supports both blocking (`Omni`) and streaming (`AsyncOmni`) generation, batched prompts with configurable concurrency, and voice selection via preset name or reference audio file.

When `mistral_common` has `SpeechRequest` support, prompt token IDs are built via `encode_speech_request`. Otherwise, the script falls back to manual token construction.

## Usage Examples


```bash
# Basic single-prompt with cheerful_female voice preset
python3 examples/offline_inference/voxtral_tts/end2end.py \
--write-audio --voice cheerful_female \
--model mistralai/Voxtral-4B-TTS-2603 \
--text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"

# 32 replicate prompts with cheerful_female voice preset
python3 examples/offline_inference/voxtral_tts/end2end.py \
--num-prompts 32 --write-audio --voice cheerful_female \
--model mistralai/Voxtral-4B-TTS-2603 \
--text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"

# Streaming with neutral_female voice preset
python3 examples/offline_inference/voxtral_tts/end2end.py \
--streaming --write-audio --voice neutral_female \
--model mistralai/Voxtral-4B-TTS-2603 \
--text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"

# 32 prompts, 8 concurrent requests per wave, streaming with neutral_female voice
python3 examples/offline_inference/voxtral_tts/end2end.py \
--num-prompts 32 --concurrency 8 --streaming --write-audio --voice neutral_female \
--model mistralai/Voxtral-4B-TTS-2603 \
--text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"

# Short debug prompt with reference audio
python3 examples/offline_inference/voxtral_tts/end2end.py \
--write-audio \
--model mistralai/Voxtral-4B-TTS-2603 \
--text "This is a test message." \
--audio-path path/to/reference_audio.wav
```

## Arguments

| Argument | Description |
|---|---|
| `--model PATH` | Hugging Face repo ID or local directory path (default: `mistralai/Voxtral-4B-TTS-2603`) |
| `--text TEXT` | Text to synthesize (default: `"This is a test message."`) |
| `--audio-path PATH` | Path to reference audio file for voice cloning |
| `--output-dir DIR` | Directory to write output WAV files (default: `output_audio`) |
| `--deploy-config PATH` | Override the deploy config path. If unset, auto-loads `vllm_omni/deploy/voxtral_tts.yaml` from the HF `model_type`. |
| `--num-prompts N` | Number of replicate prompts to run for measuring performance (default: 1) |
| `--streaming` | Use streaming generation via `AsyncOmni` (default: blocking `Omni`) |
| `--concurrency N` | Max concurrent requests per wave (must be used with `--streaming`, must evenly divide `--num-prompts`) |
| `--voice NAME` | Voice preset to use instead of a reference audio file (e.g., `casual_female`, `casual_male`, `cheerful_female`, `neutral_female`, `neutral_male`) |
| `--write-audio` | Write generated audio to WAV files |
| `--profiling-mode` | Enable profiling mode (reduces max tokens to 50) |
| `--log-stats` | Enable detailed statistics logging |
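
The `--concurrency` constraint in the table (requests run in waves, and the wave size must evenly divide `--num-prompts`) can be pictured with a small `asyncio` sketch. It is not the scheduling code from `end2end.py`: `run_in_waves` and `fake_generate` are hypothetical stand-ins, and the real script drives `AsyncOmni` rather than a dummy coroutine.

```python
import asyncio


async def run_in_waves(prompts, concurrency, generate):
    """Run `generate` over prompts, `concurrency` requests at a time."""
    if concurrency <= 0 or len(prompts) % concurrency != 0:
        raise ValueError("concurrency must evenly divide the number of prompts")
    results = []
    for start in range(0, len(prompts), concurrency):
        wave = prompts[start:start + concurrency]
        # Every request in a wave runs concurrently; the next wave
        # starts only after the whole wave has finished.
        results.extend(await asyncio.gather(*(generate(p) for p in wave)))
    return results


async def fake_generate(prompt):
    """Stand-in for a streaming TTS request (assumption for this sketch)."""
    await asyncio.sleep(0)
    return f"audio for {prompt}"


# Usage: 8 prompts in waves of 4, so two waves in total.
outputs = asyncio.run(run_in_waves([f"prompt {i}" for i in range(8)], 4, fake_generate))
print(len(outputs))
```

Requiring an even split keeps every wave the same size, which makes per-wave timings directly comparable.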

## Example materials

??? abstract "end2end.py"
``````py
--8<-- "examples/offline_inference/voxtral_tts/end2end.py"
``````
58 changes: 58 additions & 0 deletions examples/offline_inference/voxtral_tts/README.md
@@ -0,0 +1,58 @@
# Voxtral TTS Offline Inference

`end2end.py` runs Voxtral TTS end-to-end offline inference using vLLM. It supports both blocking (`Omni`) and streaming (`AsyncOmni`) generation, batched prompts with configurable concurrency, and voice selection via preset name or reference audio file.

When `mistral_common` has `SpeechRequest` support, prompt token IDs are built via `encode_speech_request`. Otherwise, the script falls back to manual token construction.

## Usage Examples


```bash
# Basic single-prompt with cheerful_female voice preset
python3 examples/offline_inference/voxtral_tts/end2end.py \
--write-audio --voice cheerful_female \
--model mistralai/Voxtral-4B-TTS-2603 \
--text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"

# 32 replicate prompts with cheerful_female voice preset
python3 examples/offline_inference/voxtral_tts/end2end.py \
--num-prompts 32 --write-audio --voice cheerful_female \
--model mistralai/Voxtral-4B-TTS-2603 \
--text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"

# Streaming with neutral_female voice preset
python3 examples/offline_inference/voxtral_tts/end2end.py \
--streaming --write-audio --voice neutral_female \
--model mistralai/Voxtral-4B-TTS-2603 \
--text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"

# 32 prompts, 8 concurrent requests per wave, streaming with neutral_female voice
python3 examples/offline_inference/voxtral_tts/end2end.py \
--num-prompts 32 --concurrency 8 --streaming --write-audio --voice neutral_female \
--model mistralai/Voxtral-4B-TTS-2603 \
--text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"

# Short debug prompt with reference audio
python3 examples/offline_inference/voxtral_tts/end2end.py \
--write-audio \
--model mistralai/Voxtral-4B-TTS-2603 \
--text "This is a test message." \
--audio-path path/to/reference_audio.wav
```

## Arguments

| Argument | Description |
|---|---|
| `--model PATH` | Hugging Face repo ID or local directory path (default: `mistralai/Voxtral-4B-TTS-2603`) |
| `--text TEXT` | Text to synthesize (default: `"This is a test message."`) |
| `--audio-path PATH` | Path to reference audio file for voice cloning |
| `--output-dir DIR` | Directory to write output WAV files (default: `output_audio`) |
| `--deploy-config PATH` | Override the deploy config path. If unset, auto-loads `vllm_omni/deploy/voxtral_tts.yaml` from the HF `model_type`. |
| `--num-prompts N` | Number of replicate prompts to run for measuring performance (default: 1) |
| `--streaming` | Use streaming generation via `AsyncOmni` (default: blocking `Omni`) |
| `--concurrency N` | Max concurrent requests per wave (must be used with `--streaming`, must evenly divide `--num-prompts`) |
| `--voice NAME` | Voice preset to use instead of a reference audio file. See the `mistralai/Voxtral-4B-TTS-2603` model page on Hugging Face for the list of available voices. |
| `--write-audio` | Write generated audio to WAV files |
| `--profiling-mode` | Enable profiling mode (reduces max tokens to 50) |
| `--log-stats` | Enable detailed statistics logging |
1 change: 1 addition & 0 deletions tests/conftest.py
@@ -13,6 +13,7 @@
"tests.helpers.fixtures.log",
"tests.helpers.fixtures.run_args",
"tests.helpers.fixtures.runtime",
"tests.helpers.fixtures.speaker_cache",
)


100 changes: 46 additions & 54 deletions tests/entrypoints/openai_api/test_serving_speech.py
@@ -300,6 +300,24 @@ def client(test_app):


class TestSpeechAPI:
    @pytest.fixture(autouse=True)
    def _mock_upload_io(self, mocker: MockerFixture):
        """Mock soundfile/safetensors so upload accepts fake audio bytes."""
        samples = np.zeros(88200, dtype=np.float32)  # 2s @ 44.1 kHz
        mocker.patch("soundfile.read", return_value=(samples, 44100))

        def _fake_save_file(tensors, path, metadata=None):
            Path(path).touch()

        mocker.patch("safetensors.torch.save_file", side_effect=_fake_save_file)
        mock_ctx = mocker.MagicMock()
        mock_ctx.keys.return_value = ["audio"]
        mock_ctx.get_tensor.return_value = torch.zeros(88200)
        mock_ctx.metadata.return_value = {"sample_rate": "44100"}
        mock_safe_open = mocker.MagicMock()
        mock_safe_open.return_value.__enter__.return_value = mock_ctx
        mocker.patch("safetensors.safe_open", mock_safe_open)

    def test_create_speech_success(self, client):
        payload = {
            "input": "Hello world",
@@ -470,27 +488,17 @@ def test_upload_voice_invalid_mime_type(self, client):
assert "MIME type" in result["detail"]

def test_upload_voice_name_collision(self, client):
"""Test voice upload with duplicate name."""
# First upload
"""Re-uploading the same name overwrites the previous entry (no 400)."""
audio_content = b"fake audio content"
files = {
"audio_sample": ("test.wav", audio_content, "audio/wav"),
}
data = {
"consent": "user_consent_123",
"name": "test_voice",
}
files = {"audio_sample": ("test.wav", audio_content, "audio/wav")}
data = {"consent": "user_consent_123", "name": "test_voice"}

response = client.post("/v1/audio/voices", files=files, data=data)
assert response.status_code == 200

# Second upload with same name
response = client.post("/v1/audio/voices", files=files, data=data)
assert response.status_code == 400
result = response.json()
assert "detail" in result
assert "already exists" in result["detail"]
response = client.delete("/v1/audio/voices/test_voice")
assert response.status_code == 200
client.delete("/v1/audio/voices/test_voice")

def test_upload_voice_missing_parameters(self, client):
"""Test voice upload with missing required parameters."""
@@ -970,7 +978,7 @@ def test_build_tts_params_with_uploaded_voice(self, speech_server, mocker: Mocke
"file_path": "/tmp/voice_samples/custom_voice_consent_123.wav",
"mime_type": "audio/wav",
"ref_text": None,
"created_at": 1711234567.89,
"created_at": 1711234567,
}
}
speech_server.supported_speakers = {"ryan", "vivian", "custom_voice"}
@@ -983,7 +991,7 @@ def test_build_tts_params_with_uploaded_voice(self, speech_server, mocker: Mocke
assert params["ref_audio"] == ["data:audio/wav;base64,ZmFrZWF1ZGlv"]
assert params["x_vector_only_mode"] == [True]
assert params["task_type"] == ["Base"]
assert params["voice_created_at"] == [1711234567.89]
assert params["voice_created_at"] == [1711234567]
assert "ref_text" not in params

def test_build_tts_params_with_uploaded_voice_ref_text(self, speech_server, mocker: MockerFixture):
@@ -994,7 +1002,7 @@ def test_build_tts_params_with_uploaded_voice_ref_text(self, speech_server, mock
"file_path": "/tmp/voice_samples/custom_voice_consent_123.wav",
"mime_type": "audio/wav",
"ref_text": "Hello world transcript",
"created_at": 1711234567.89,
"created_at": 1711234567,
}
}
speech_server.supported_speakers = {"ryan", "vivian", "custom_voice"}
@@ -1008,7 +1016,7 @@ def test_build_tts_params_with_uploaded_voice_ref_text(self, speech_server, mock
assert params["x_vector_only_mode"] == [False]
assert params["task_type"] == ["Base"]
assert params["ref_text"] == ["Hello world transcript"]
assert params["voice_created_at"] == [1711234567.89]
assert params["voice_created_at"] == [1711234567]

def test_build_tts_params_without_uploaded_voice(self, speech_server):
"""Test _build_tts_params does not auto-set ref_audio for non-uploaded voices."""
@@ -1051,28 +1059,29 @@ def test_build_tts_params_with_explicit_ref_audio(self, speech_server):
assert "x_vector_only_mode" not in params

def test_get_uploaded_audio_data(self, speech_server, mocker: MockerFixture):
"""Test _get_uploaded_audio_data function."""
# Mock file operations
mock_open = mocker.patch("builtins.open", create=True)
mock_b64encode = mocker.patch("base64.b64encode")
mock_exists = mocker.patch("pathlib.Path.exists")
mock_exists.return_value = True
mock_b64encode.return_value = b"ZmFrZWF1ZGlv"

# Setup mock file
mock_file = mocker.MagicMock()
mock_file.read.return_value = b"fakeaudio"
mock_open.return_value.__enter__.return_value = mock_file
"""Returns a data URL by loading audio via safetensors + re-encoding WAV."""
mocker.patch("pathlib.Path.exists", return_value=True)
mocker.patch("soundfile.write")
mocker.patch("base64.b64encode", return_value=b"ZmFrZWF1ZGlv")
mock_ctx = mocker.MagicMock()
mock_ctx.keys.return_value = ["audio"]
mock_ctx.get_tensor.return_value = torch.zeros(88200)
mock_ctx.metadata.return_value = {"sample_rate": "44100"}
mock_safe_open = mocker.MagicMock()
mock_safe_open.return_value.__enter__.return_value = mock_ctx
mocker.patch("safetensors.safe_open", mock_safe_open)

# Setup uploaded speaker
speech_server.uploaded_speakers = {
"test_voice": {"name": "test_voice", "file_path": "/tmp/test.wav", "mime_type": "audio/wav"}
"test_voice": {
"name": "test_voice",
"file_path": "/tmp/test.safetensors",
"mime_type": "audio/wav",
"embedding_source": "audio",
"sample_rate": 44100,
}
}
result = speech_server._get_uploaded_audio_data("test_voice")

assert result == "data:audio/wav;base64,ZmFrZWF1ZGlv"
mock_open.assert_called_once_with(Path("/tmp/test.wav"), "rb")
mock_b64encode.assert_called_once_with(b"fakeaudio")

def test_get_uploaded_audio_data_missing_file(self, speech_server, mocker: MockerFixture):
"""Test _get_uploaded_audio_data when file is missing."""
@@ -1230,24 +1239,6 @@ def test_regression_1603_speaker_key_with_uploaded_embedding_voice(self, speech_
# Must NOT have ref_audio — that would fail for safetensors files
assert "ref_audio" not in params

def test_validate_rejects_embedding_voice_with_pending_cache(self, speech_server, mocker: MockerFixture):
"""Validation should reject embedding voices whose cache is not yet ready."""
speech_server.uploaded_speakers = {
"myvoice": {
"name": "myvoice",
"file_path": "/tmp/myvoice.safetensors",
"mime_type": "application/x-safetensors",
"embedding_source": "direct",
"cache_status": "pending",
"cache_file": None,
}
}
req = OpenAICreateSpeechRequest.model_validate({"input": "Hello", "speaker": "myvoice", "task_type": "Base"})
mocker.patch("pathlib.Path.exists", return_value=True)
err = speech_server._validate_qwen_tts_request(req)
assert err is not None
assert "not yet ready" in err

def test_x_vector_only_mode_not_overwritten_for_uploaded_embedding(self, speech_server, mocker: MockerFixture):
"""x_vector_only_mode set by uploaded embedding must not be overwritten by request field."""
speech_server.uploaded_speakers = {
@@ -2294,6 +2285,7 @@ def test_prepare_speech_generation_cosyvoice3(self, cosyvoice3_server, mocker: M
"mm_processor_kwargs": {"prompt_text": "ref text", "sample_rate": 24000},
}
)
cosyvoice3_server._apply_cosyvoice3_dynamic_tokens = mocker.MagicMock(side_effect=lambda spl, req: spl)

request = OpenAICreateSpeechRequest(
input="Hello",