vllm-project · linyueqian · May 5, 2026 · Apr 28, 2026
@@ -3,11 +3,11 @@
 Measures TTFP improvement from DAC-code caching when using uploaded voices.
 
 Setup:
-  1. Start vllm-omni with Fish Speech S2 Pro (use our feat branch)
+  1. Start vllm-omni with Fish Speech S2 Pro
   2. Provide a reference audio file for voice cloning
 
 Usage:
-    python bench_voice_cache.py \
+    python bench_speaker_cache.py \
         --ref-audio /path/to/reference.wav \
         --ref-text "Transcript of the reference audio." \
         --num-prompts 20 \

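The benchmark docstring above says the script measures the TTFP improvement from caching. As a rough illustration only, the sketch below times how long a streamed response takes to deliver its first non-empty chunk and compares one cold request against a few warm ones. It assumes TTFP means time to first packet, and the names `measure_ttfp`, `compare_cold_warm`, and `make_stream` are hypothetical; none of this is taken from `bench_speaker_cache.py`.

```python
import time
from typing import Callable, Dict, Iterable


def measure_ttfp(stream: Iterable[bytes]) -> float:
    """Return seconds elapsed until the first non-empty chunk arrives."""
    start = time.perf_counter()
    for chunk in stream:
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended before any audio chunk arrived")


def compare_cold_warm(
    make_stream: Callable[[], Iterable[bytes]], warm_runs: int = 3
) -> Dict[str, float]:
    """Time one cold request, then average several warm (cached) requests.

    `make_stream` is a hypothetical factory that issues one request and
    returns an iterable of audio chunks.
    """
    cold = measure_ttfp(make_stream())
    warm = [measure_ttfp(make_stream()) for _ in range(warm_runs)]
    return {"cold_s": cold, "warm_mean_s": sum(warm) / len(warm)}
```

In a real run, `make_stream` would wrap a streaming HTTP request for the same uploaded voice, so the first call pays for feature extraction and later calls can hit the cache.
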
@@ -358,6 +358,31 @@ curl -X POST http://localhost:8091/v1/audio/speech \
     }' --output cloned.wav
 ```
 
+### Voice Storage & Caching
+
+Each uploaded voice is persisted to disk as a single `.safetensors` file: the
+audio samples are stored as a tensor, and the metadata (name, consent,
+ref_text, sample_rate, created_at) is stored in the file header. On server
+restart the directory is scanned and all previously uploaded voices are
+restored automatically, so uploads survive process restarts.
+
+Uploading an existing name overwrites the previous entry (a warning is logged).
+
+Feature extraction artifacts (ref_code, speaker_embedding, DAC codes, etc.)
+are cached in-process with a shared LRU so repeated requests with the same
+`voice=...` skip the extraction pipeline. The cache is a true singleton across
+all TTS model types; deleting a voice invalidates every model-type slot at
+once.
+
+**Configuration (environment variables):**
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `SPEAKER_SAMPLES_DIR` | `~/.cache/vllm-omni/speakers` | Directory for persisted uploaded speakers (`.safetensors` files). |
+| `SPEAKER_MAX_UPLOADED` | `1000` | Maximum number of uploaded speakers kept on disk. Upload requests past the cap return 400. |
+
+The in-memory LRU has a fixed 512 MiB byte budget.
+
 ## Batch Speech Generation
 
 The batch endpoint synthesizes multiple texts in a single request, returning all results as JSON with base64-encoded audio.
@@ -543,6 +568,30 @@ Fish Speech uses `ref_audio` and `ref_text` for voice cloning (no `task_type` ne
 |-------|-------------|
 | `mistralai/Voxtral-4B-TTS-2603` | 3B AR + FlowMatching TTS. Supports text-to-speech with preset voices. |
 
+### CosyVoice3
+
+| Model | Description |
+|-------|-------------|
+| `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | Voice cloning from `ref_audio` + `ref_text`. No built-in voice presets — upload a voice or pass `ref_audio`/`ref_text` per request. |
+
+### OmniVoice
+
+| Model | Description |
+|-------|-------------|
+| `k2-fsa/OmniVoice` | Pure-diffusion TTS. Supports voice cloning via `ref_audio` (with optional `ref_text`); no built-in voice presets. |
+
+### VoxCPM2
+
+| Model | Description |
+|-------|-------------|
+| `openbmb/VoxCPM2` | TTS + voice cloning with built-in speaker presets and uploaded-voice support. Accepts `voice` (preset or uploaded) or `ref_audio` + optional `ref_text`. |
+
+### MOSS-TTS-Nano
+
+| Model | Description |
+|-------|-------------|
+| `OpenMOSS-Team/MOSS-TTS-Nano` | Voice cloning only. Requires `ref_audio` (or an uploaded `voice`); no built-in voice presets. `ref_text` is accepted but ignored — upstream's `voice_clone` mode does not consume a transcript. |
+
 ## Error Responses
 
 ### 400 Bad Request

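The "Voice Storage & Caching" section in the hunk above describes a shared in-process LRU with a byte budget, where deleting a voice invalidates every model-type slot. The sketch below is one way such a cache could be structured, not the implementation in this PR: the class name `ByteBudgetLRU`, the `(model_type, voice)` key layout, and the caller-supplied `nbytes` are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Any, Optional


class ByteBudgetLRU:
    """Illustrative LRU keyed by (model_type, voice) with a total byte budget."""

    def __init__(self, max_bytes: int = 512 * 1024 * 1024) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        # Maps (model_type, voice) -> (value, nbytes); oldest entry first.
        self._entries = OrderedDict()

    def get(self, model_type: str, voice: str) -> Optional[Any]:
        key = (model_type, voice)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def put(self, model_type: str, voice: str, value: Any, nbytes: int) -> None:
        key = (model_type, voice)
        if key in self._entries:
            # Replace an existing entry and release its bytes first.
            self.used_bytes -= self._entries.pop(key)[1]
        if nbytes > self.max_bytes:
            return  # larger than the whole budget, so never cached
        self._entries[key] = (value, nbytes)
        self.used_bytes += nbytes
        # Evict least recently used entries until the budget is respected.
        while self.used_bytes > self.max_bytes:
            _, (_, evicted_bytes) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_bytes

    def invalidate_voice(self, voice: str) -> int:
        """Drop the voice from every model-type slot; return how many were removed."""
        stale = [key for key in self._entries if key[1] == voice]
        for key in stale:
            self.used_bytes -= self._entries.pop(key)[1]
        return len(stale)
```

Keying on both model type and voice lets one cache instance serve several TTS models while still allowing a single delete to clear every artifact derived from that voice.
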
@@ -0,0 +1,68 @@
+# Voxtral TTS Offline Inference
+
+Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/voxtral_tts>.
+
+
+`end2end.py` runs Voxtral TTS end-to-end offline inference using vLLM. It supports both blocking (`Omni`) and streaming (`AsyncOmni`) generation, batched prompts with configurable concurrency, and voice selection via preset name or reference audio file.
+
+When `mistral_common` has `SpeechRequest` support, prompt token IDs are built via `encode_speech_request`. Otherwise, the script falls back to manual token construction.
+
+## Usage Examples
+
+
+```bash
+# Basic single-prompt with cheerful_female voice preset
+python3 examples/offline_inference/voxtral_tts/end2end.py \
+    --write-audio --voice cheerful_female \
+    --model mistralai/Voxtral-4B-TTS-2603 \
+    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
+
+# 32 replicate prompts with cheerful_female voice preset
+python3 examples/offline_inference/voxtral_tts/end2end.py \
+    --num-prompts 32 --write-audio --voice cheerful_female \
+    --model mistralai/Voxtral-4B-TTS-2603 \
+    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
+
+# Streaming with neutral_female voice preset
+python3 examples/offline_inference/voxtral_tts/end2end.py \
+    --streaming --write-audio --voice neutral_female \
+    --model mistralai/Voxtral-4B-TTS-2603 \
+    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
+
+# 32 prompts, 8 concurrent requests per wave, streaming with neutral_female voice
+python3 examples/offline_inference/voxtral_tts/end2end.py \
+    --num-prompts 32 --concurrency 8 --streaming --write-audio --voice neutral_female \
+    --model mistralai/Voxtral-4B-TTS-2603 \
+    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
+
+# Short debug prompt with reference audio
+python3 examples/offline_inference/voxtral_tts/end2end.py \
+    --write-audio \
+    --model mistralai/Voxtral-4B-TTS-2603 \
+    --text "This is a test message." \
+    --audio-path path/to/reference_audio.wav
+```
+
+## Arguments
+
+| Argument | Description |
+|---|---|
+| `--model PATH` | Hugging Face repo ID or local directory path (default: `mistralai/Voxtral-4B-TTS-2603`) |
+| `--text TEXT` | Text to synthesize (default: `"This is a test message."`) |
+| `--audio-path PATH` | Path to reference audio file for voice cloning |
+| `--output-dir DIR` | Directory to write output WAV files (default: `output_audio`) |
+| `--deploy-config PATH` | Override the deploy config path. If unset, auto-loads `vllm_omni/deploy/voxtral_tts.yaml` from the HF `model_type`. |
+| `--num-prompts N` | Number of replicate prompts to run for measuring performance (default: 1) |
+| `--streaming` | Use streaming generation via `AsyncOmni` (default: blocking `Omni`) |
+| `--concurrency N` | Max concurrent requests per wave (must be used with `--streaming`, must evenly divide `--num-prompts`) |
+| `--voice NAME` | Voice preset to use instead of a reference audio file (e.g., `casual_female`, `casual_male`, `cheerful_female`, `neutral_female`, `neutral_male`) |
+| `--write-audio` | Write generated audio to WAV files |
+| `--profiling-mode` | Enable profiling mode (reduces max tokens to 50) |
+| `--log-stats` | Enable detailed statistics logging |
+
+## Example materials
+
+??? abstract "end2end.py"
+    ``````py
+    --8<-- "examples/offline_inference/voxtral_tts/end2end.py"
+    ``````
@@ -0,0 +1,58 @@
+# Voxtral TTS Offline Inference
+
+`end2end.py` runs Voxtral TTS end-to-end offline inference using vLLM. It supports both blocking (`Omni`) and streaming (`AsyncOmni`) generation, batched prompts with configurable concurrency, and voice selection via preset name or reference audio file.
+
+When `mistral_common` has `SpeechRequest` support, prompt token IDs are built via `encode_speech_request`. Otherwise, the script falls back to manual token construction.
+
+## Usage Examples
+
+
+```bash
+# Basic single-prompt with cheerful_female voice preset
+python3 examples/offline_inference/voxtral_tts/end2end.py \
+    --write-audio --voice cheerful_female \
+    --model mistralai/Voxtral-4B-TTS-2603 \
+    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
+
+# 32 replicate prompts with cheerful_female voice preset
+python3 examples/offline_inference/voxtral_tts/end2end.py \
+    --num-prompts 32 --write-audio --voice cheerful_female \
+    --model mistralai/Voxtral-4B-TTS-2603 \
+    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
+
+# Streaming with neutral_female voice preset
+python3 examples/offline_inference/voxtral_tts/end2end.py \
+    --streaming --write-audio --voice neutral_female \
+    --model mistralai/Voxtral-4B-TTS-2603 \
+    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
+
+# 32 prompts, 8 concurrent requests per wave, streaming with neutral_female voice
+python3 examples/offline_inference/voxtral_tts/end2end.py \
+    --num-prompts 32 --concurrency 8 --streaming --write-audio --voice neutral_female \
+    --model mistralai/Voxtral-4B-TTS-2603 \
+    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
+
+# Short debug prompt with reference audio
+python3 examples/offline_inference/voxtral_tts/end2end.py \
+    --write-audio \
+    --model mistralai/Voxtral-4B-TTS-2603 \
+    --text "This is a test message." \
+    --audio-path path/to/reference_audio.wav
+```
+
+## Arguments
+
+| Argument | Description |
+|---|---|
+| `--model PATH` | Hugging Face repo ID or local directory path (default: `mistralai/Voxtral-4B-TTS-2603`) |
+| `--text TEXT` | Text to synthesize (default: `"This is a test message."`) |
+| `--audio-path PATH` | Path to reference audio file for voice cloning |
+| `--output-dir DIR` | Directory to write output WAV files (default: `output_audio`) |
+| `--deploy-config PATH` | Override the deploy config path. If unset, auto-loads `vllm_omni/deploy/voxtral_tts.yaml` from the HF `model_type`. |
+| `--num-prompts N` | Number of replicate prompts to run for measuring performance (default: 1) |
+| `--streaming` | Use streaming generation via `AsyncOmni` (default: blocking `Omni`) |
+| `--concurrency N` | Max concurrent requests per wave (must be used with `--streaming`, must evenly divide `--num-prompts`) |
+| `--voice NAME` | Voice preset to use instead of a reference audio file. See the `mistralai/Voxtral-4B-TTS-2603` model page on Hugging Face for the list of available voices. |
+| `--write-audio` | Write generated audio to WAV files |
+| `--profiling-mode` | Enable profiling mode (reduces max tokens to 50) |
+| `--log-stats` | Enable detailed statistics logging |
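
The arguments table above states that `--concurrency` must evenly divide `--num-prompts` and that requests run in waves. The sketch below shows one plausible reading of that constraint using `asyncio.gather`. How `end2end.py` actually schedules requests is not shown in this diff, so `plan_waves`, `run_in_waves`, and the `synthesize` callback are hypothetical names used only for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


def plan_waves(num_prompts: int, concurrency: int) -> List[range]:
    """Split prompt indices into equal-sized waves of `concurrency` requests."""
    if concurrency <= 0 or num_prompts % concurrency != 0:
        raise ValueError(
            "--concurrency must be positive and evenly divide --num-prompts"
        )
    return [
        range(start, start + concurrency)
        for start in range(0, num_prompts, concurrency)
    ]


async def run_in_waves(
    num_prompts: int,
    concurrency: int,
    synthesize: Callable[[int], Awaitable[Any]],
) -> List[Any]:
    """Run each wave concurrently and wait for it to finish before the next."""
    results: List[Any] = []
    for wave in plan_waves(num_prompts, concurrency):
        # gather preserves the order of its arguments in the returned list.
        results.extend(await asyncio.gather(*(synthesize(i) for i in wave)))
    return results
```

With `--num-prompts 32 --concurrency 8`, this plan yields four waves of eight requests each, matching the fourth usage example above.
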
@@ -13,6 +13,7 @@
     "tests.helpers.fixtures.log",
     "tests.helpers.fixtures.run_args",
     "tests.helpers.fixtures.runtime",
+    "tests.helpers.fixtures.speaker_cache",
 )
 
 

@@ -300,6 +300,24 @@ def client(test_app):
 
 
 class TestSpeechAPI:
+    @pytest.fixture(autouse=True)
+    def _mock_upload_io(self, mocker: MockerFixture):
+        """Mock soundfile/safetensors so upload accepts fake audio bytes."""
+        samples = np.zeros(88200, dtype=np.float32)  # 2s @ 44.1 kHz
+        mocker.patch("soundfile.read", return_value=(samples, 44100))
+
+        def _fake_save_file(tensors, path, metadata=None):
+            Path(path).touch()
+
+        mocker.patch("safetensors.torch.save_file", side_effect=_fake_save_file)
+        mock_ctx = mocker.MagicMock()
+        mock_ctx.keys.return_value = ["audio"]
+        mock_ctx.get_tensor.return_value = torch.zeros(88200)
+        mock_ctx.metadata.return_value = {"sample_rate": "44100"}
+        mock_safe_open = mocker.MagicMock()
+        mock_safe_open.return_value.__enter__.return_value = mock_ctx
+        mocker.patch("safetensors.safe_open", mock_safe_open)
+
     def test_create_speech_success(self, client):
         payload = {
             "input": "Hello world",
@@ -470,27 +488,17 @@ def test_upload_voice_invalid_mime_type(self, client):
         assert "MIME type" in result["detail"]
 
     def test_upload_voice_name_collision(self, client):
-        """Test voice upload with duplicate name."""
-        # First upload
+        """Re-uploading the same name overwrites the previous entry (no 400)."""
         audio_content = b"fake audio content"
-        files = {
-            "audio_sample": ("test.wav", audio_content, "audio/wav"),
-        }
-        data = {
-            "consent": "user_consent_123",
-            "name": "test_voice",
-        }
+        files = {"audio_sample": ("test.wav", audio_content, "audio/wav")}
+        data = {"consent": "user_consent_123", "name": "test_voice"}
 
         response = client.post("/v1/audio/voices", files=files, data=data)
         assert response.status_code == 200
 
-        # Second upload with same name
         response = client.post("/v1/audio/voices", files=files, data=data)
-        assert response.status_code == 400
-        result = response.json()
-        assert "detail" in result
-        assert "already exists" in result["detail"]
-        response = client.delete("/v1/audio/voices/test_voice")
+        assert response.status_code == 200
+        client.delete("/v1/audio/voices/test_voice")
 
     def test_upload_voice_missing_parameters(self, client):
         """Test voice upload with missing required parameters."""
@@ -970,7 +978,7 @@ def test_build_tts_params_with_uploaded_voice(self, speech_server, mocker: Mocke
                 "file_path": "/tmp/voice_samples/custom_voice_consent_123.wav",
                 "mime_type": "audio/wav",
                 "ref_text": None,
-                "created_at": 1711234567.89,
+                "created_at": 1711234567,
             }
         }
         speech_server.supported_speakers = {"ryan", "vivian", "custom_voice"}
@@ -983,7 +991,7 @@ def test_build_tts_params_with_uploaded_voice(self, speech_server, mocker: Mocke
         assert params["ref_audio"] == ["data:audio/wav;base64,ZmFrZWF1ZGlv"]
         assert params["x_vector_only_mode"] == [True]
         assert params["task_type"] == ["Base"]
-        assert params["voice_created_at"] == [1711234567.89]
+        assert params["voice_created_at"] == [1711234567]
         assert "ref_text" not in params
 
     def test_build_tts_params_with_uploaded_voice_ref_text(self, speech_server, mocker: MockerFixture):
@@ -994,7 +1002,7 @@ def test_build_tts_params_with_uploaded_voice_ref_text(self, speech_server, mock
                 "file_path": "/tmp/voice_samples/custom_voice_consent_123.wav",
                 "mime_type": "audio/wav",
                 "ref_text": "Hello world transcript",
-                "created_at": 1711234567.89,
+                "created_at": 1711234567,
             }
         }
         speech_server.supported_speakers = {"ryan", "vivian", "custom_voice"}
@@ -1008,7 +1016,7 @@ def test_build_tts_params_with_uploaded_voice_ref_text(self, speech_server, mock
         assert params["x_vector_only_mode"] == [False]
         assert params["task_type"] == ["Base"]
         assert params["ref_text"] == ["Hello world transcript"]
-        assert params["voice_created_at"] == [1711234567.89]
+        assert params["voice_created_at"] == [1711234567]
 
     def test_build_tts_params_without_uploaded_voice(self, speech_server):
         """Test _build_tts_params does not auto-set ref_audio for non-uploaded voices."""
@@ -1051,28 +1059,29 @@ def test_build_tts_params_with_explicit_ref_audio(self, speech_server):
         assert "x_vector_only_mode" not in params
 
     def test_get_uploaded_audio_data(self, speech_server, mocker: MockerFixture):
-        """Test _get_uploaded_audio_data function."""
-        # Mock file operations
-        mock_open = mocker.patch("builtins.open", create=True)
-        mock_b64encode = mocker.patch("base64.b64encode")
-        mock_exists = mocker.patch("pathlib.Path.exists")
-        mock_exists.return_value = True
-        mock_b64encode.return_value = b"ZmFrZWF1ZGlv"
-
-        # Setup mock file
-        mock_file = mocker.MagicMock()
-        mock_file.read.return_value = b"fakeaudio"
-        mock_open.return_value.__enter__.return_value = mock_file
+        """Returns a data URL by loading audio via safetensors + re-encoding WAV."""
+        mocker.patch("pathlib.Path.exists", return_value=True)
+        mocker.patch("soundfile.write")
+        mocker.patch("base64.b64encode", return_value=b"ZmFrZWF1ZGlv")
+        mock_ctx = mocker.MagicMock()
+        mock_ctx.keys.return_value = ["audio"]
+        mock_ctx.get_tensor.return_value = torch.zeros(88200)
+        mock_ctx.metadata.return_value = {"sample_rate": "44100"}
+        mock_safe_open = mocker.MagicMock()
+        mock_safe_open.return_value.__enter__.return_value = mock_ctx
+        mocker.patch("safetensors.safe_open", mock_safe_open)
 
-        # Setup uploaded speaker
         speech_server.uploaded_speakers = {
-            "test_voice": {"name": "test_voice", "file_path": "/tmp/test.wav", "mime_type": "audio/wav"}
+            "test_voice": {
+                "name": "test_voice",
+                "file_path": "/tmp/test.safetensors",
+                "mime_type": "audio/wav",
+                "embedding_source": "audio",
+                "sample_rate": 44100,
+            }
         }
         result = speech_server._get_uploaded_audio_data("test_voice")
-
         assert result == "data:audio/wav;base64,ZmFrZWF1ZGlv"
-        mock_open.assert_called_once_with(Path("/tmp/test.wav"), "rb")
-        mock_b64encode.assert_called_once_with(b"fakeaudio")
 
     def test_get_uploaded_audio_data_missing_file(self, speech_server, mocker: MockerFixture):
         """Test _get_uploaded_audio_data when file is missing."""
@@ -1230,24 +1239,6 @@ def test_regression_1603_speaker_key_with_uploaded_embedding_voice(self, speech_
         # Must NOT have ref_audio — that would fail for safetensors files
         assert "ref_audio" not in params
 
-    def test_validate_rejects_embedding_voice_with_pending_cache(self, speech_server, mocker: MockerFixture):
-        """Validation should reject embedding voices whose cache is not yet ready."""
-        speech_server.uploaded_speakers = {
-            "myvoice": {
-                "name": "myvoice",
-                "file_path": "/tmp/myvoice.safetensors",
-                "mime_type": "application/x-safetensors",
-                "embedding_source": "direct",
-                "cache_status": "pending",
-                "cache_file": None,
-            }
-        }
-        req = OpenAICreateSpeechRequest.model_validate({"input": "Hello", "speaker": "myvoice", "task_type": "Base"})
-        mocker.patch("pathlib.Path.exists", return_value=True)
-        err = speech_server._validate_qwen_tts_request(req)
-        assert err is not None
-        assert "not yet ready" in err
-
     def test_x_vector_only_mode_not_overwritten_for_uploaded_embedding(self, speech_server, mocker: MockerFixture):
         """x_vector_only_mode set by uploaded embedding must not be overwritten by request field."""
         speech_server.uploaded_speakers = {
@@ -2294,6 +2285,7 @@ def test_prepare_speech_generation_cosyvoice3(self, cosyvoice3_server, mocker: M
                 "mm_processor_kwargs": {"prompt_text": "ref text", "sample_rate": 24000},
             }
         )
+        cosyvoice3_server._apply_cosyvoice3_dynamic_tokens = mocker.MagicMock(side_effect=lambda spl, req: spl)
 
         request = OpenAICreateSpeechRequest(
             input="Hello",