diff --git a/.claude/skills/add-tts-model/SKILL.md b/.claude/skills/add-tts-model/SKILL.md index 963ffb4f64d..811a359d0e4 100644 --- a/.claude/skills/add-tts-model/SKILL.md +++ b/.claude/skills/add-tts-model/SKILL.md @@ -211,8 +211,8 @@ See `plan/voxcpm2_native_ar_design.md`. - Model files in `vllm_omni/model_executor/models//` - Stage config YAML -- Working `end2end.py` with correct audio output -- README.md in the example directory +- Working `end2end.py` at `examples/offline_inference/text_to_speech//end2end.py` +- New section in `examples/offline_inference/text_to_speech/README.md` (table row + per-model section). Do **not** create a top-level `examples/offline_inference//` dir or a per-model `README.md` inside `text_to_speech//` — the hub README is the documented surface and the mkdocs `generate_examples` hook only descends one level into `examples//`. ## Phase 3: Online Serving @@ -308,11 +308,11 @@ def build_voice_clone_prompt(ref_audio_path: str, text: str, codec) -> list: ### Deliverables - Updated `serving_speech.py` with all 5 integration points (single commit) -- Client scripts and server launcher -- Gradio demo with streaming and voice cloning UI +- Client scripts and server launcher under `examples/online_serving/text_to_speech//` +- Gradio demo with streaming and voice cloning UI in the same dir - E2E online serving test (`tests/e2e/online_serving/test_.py`) - Buildkite CI entry in `.buildkite/test-merge.yml` -- Documentation (offline + online serving docs) +- New section in `examples/online_serving/text_to_speech/README.md` (table row + per-model section). Do **not** create a top-level `examples/online_serving//` dir or a per-model `README.md` inside `text_to_speech//`. 
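The placement rule in the deliverables above can also be checked mechanically. The following is a minimal sketch, not part of the repository: `check_example_path` is a hypothetical helper, and the caller is assumed to supply the set of TTS model directory names. It encodes only the two rules stated above: no top-level per-model TTS directory, and no per-model `README.md` inside the `text_to_speech/` hub.

```python
from pathlib import PurePosixPath
from typing import Optional, Set

HUB_DIR = "text_to_speech"
EXAMPLE_KINDS = {"offline_inference", "online_serving"}


def check_example_path(path: str, tts_models: Set[str]) -> Optional[str]:
    """Return a complaint if `path` breaks the placement rule, else None.

    Hypothetical helper: `tts_models` is the caller-supplied set of TTS
    model directory names, since a path alone does not say whether a
    directory holds a TTS example.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 4 or parts[0] != "examples" or parts[1] not in EXAMPLE_KINDS:
        return None  # not an example file this rule covers
    if parts[2] in tts_models:
        # Top-level per-model directory: should live under the hub instead.
        return f"{path}: move under examples/{parts[1]}/{HUB_DIR}/{parts[2]}/"
    if parts[2] == HUB_DIR and len(parts) == 5 and parts[4] == "README.md":
        # Per-model README inside the hub: the hub README is the documented surface.
        return f"{path}: add a section to the hub README instead"
    return None


if __name__ == "__main__":
    models = {"moss_tts_nano", "ming_flash_omni_tts"}
    for candidate in (
        "examples/offline_inference/moss_tts_nano/end2end.py",
        "examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py",
        "examples/online_serving/text_to_speech/moss_tts_nano/README.md",
    ):
        print(candidate, "->", check_example_path(candidate, models))
```

A check like this could be run over the files a PR touches before review; whether it belongs in CI is a separate decision this sketch does not make.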
### E2E test pitfalls to avoid diff --git a/docs/contributing/model/adding_tts_model.md b/docs/contributing/model/adding_tts_model.md index 34fd2dbb503..3e5ae30df6d 100644 --- a/docs/contributing/model/adding_tts_model.md +++ b/docs/contributing/model/adding_tts_model.md @@ -186,6 +186,17 @@ vllm_omni/model_executor/stage_configs/ your_model_name_async_chunk.yaml # Streaming mode config ``` +### Example placement + +TTS examples live in the consolidated text-to-speech hub, **not** in their +own top-level directory. Place per-model scripts under +`examples/offline_inference/text_to_speech//` and +`examples/online_serving/text_to_speech//`, and add a section +to the hub `README.md` files (table row + per-model section) instead of a +new per-model `README.md`. The mkdocs `generate_examples` hook treats the +`text_to_speech/` parent as a single example, so per-model READMEs inside +it would not be picked up — the hub README is the documented surface. + **Qwen3-TTS reference files:** | File | Purpose | diff --git a/docs/user_guide/examples/offline_inference/voxtral_tts.md b/docs/user_guide/examples/offline_inference/voxtral_tts.md deleted file mode 100644 index c6f41ac0875..00000000000 --- a/docs/user_guide/examples/offline_inference/voxtral_tts.md +++ /dev/null @@ -1,68 +0,0 @@ -# Voxtral TTS Offline Inference - -Source . - - -`end2end.py` runs Voxtral TTS end-to-end offline inference using vLLM. It supports both blocking (`Omni`) and streaming (`AsyncOmni`) generation, batched prompts with configurable concurrency, and voice selection via preset name or reference audio file. - -When `mistral_common` has `SpeechRequest` support, prompt token IDs are built via `encode_speech_request`. Otherwise, the script falls back to manual token construction. 
- -## Usage Examples - - -```bash -# Basic single-prompt with cheerful_female voice preset -python3 examples/offline_inference/voxtral_tts/end2end.py \ - --write-audio --voice cheerful_female \ - --model mistralai/Voxtral-4B-TTS-2603 \ - --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?" - -# 32 replicate prompts with cheerful_female voice preset -python3 examples/offline_inference/voxtral_tts/end2end.py \ - --num-prompts 32 --write-audio --voice cheerful_female \ - --model mistralai/Voxtral-4B-TTS-2603 \ - --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?" - -# Streaming with neutral_female voice preset -python3 examples/offline_inference/voxtral_tts/end2end.py \ - --streaming --write-audio --voice neutral_female \ - --model mistralai/Voxtral-4B-TTS-2603 \ - --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?" - -# 32 prompts, 8 concurrent requests per wave, streaming with neutral_female voice -python3 examples/offline_inference/voxtral_tts/end2end.py \ - --num-prompts 32 --concurrency 8 --streaming --write-audio --voice neutral_female \ - --model mistralai/Voxtral-4B-TTS-2603 \ - --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?" - -# Short debug prompt with reference audio -python3 examples/offline_inference/voxtral_tts/end2end.py \ - --write-audio \ - --model mistralai/Voxtral-4B-TTS-2603 \ - --text "This is a test message." 
\ - --audio-path path/to/reference_audio.wav -``` - -## Arguments - -| Argument | Description | -|---|---| -| `--model PATH` | HuggingFace repo ID or local directory path (default: `mistralai/Voxtral-4B-TTS-2603`) | -| `--text TEXT` | Text to synthesize (default: `"This is a test message."`) | -| `--audio-path PATH` | Path to reference audio file for voice cloning | -| `--output-dir DIR` | Directory to write output WAV files (default: `output_audio`) | -| `--deploy-config PATH` | Override the deploy config path. If unset, auto-loads `vllm_omni/deploy/voxtral_tts.yaml` from the HF `model_type`. | -| `--num-prompts N` | Number of replicate prompts to run for measuring performance (default: 1) | -| `--streaming` | Use streaming generation via `AsyncOmni` (default: blocking `Omni`) | -| `--concurrency N` | Max concurrent requests per wave (must be used with `--streaming`, must evenly divide `--num-prompts`) | -| `--voice NAME` | Voice preset to use instead of reference audio file (e.g., casual_female, casual_male, cheerful_female, neutral_female, neutral_male) | -| `--write-audio` | Write generated audio to WAV files | -| `--profiling-mode` | Enable profiling mode (reduces max tokens to 50) | -| `--log-stats` | Enable detailed statistics logging | - -## Example materials - -??? 
abstract "end2end.py" - ``````py - --8<-- "examples/offline_inference/voxtral_tts/end2end.py" - `````` diff --git a/examples/offline_inference/ming_flash_omni/README.md b/examples/offline_inference/ming_flash_omni/README.md index 179925bb68e..4b798619e2f 100644 --- a/examples/offline_inference/ming_flash_omni/README.md +++ b/examples/offline_inference/ming_flash_omni/README.md @@ -9,7 +9,7 @@ vLLM-Omni supports two deployment modes: | Thinker + Talker (omni-speech, default) | `vllm_omni/deploy/ming_flash_omni.yaml` | Text + Audio | | Thinker only (multimodal understanding) | `vllm_omni/deploy/ming_flash_omni_thinker_only.yaml` | Text | -For standalone TTS (talker only), see [`examples/offline_inference/ming_flash_omni_tts/`](../ming_flash_omni_tts/). +For standalone TTS (talker only), see the [Ming-flash-omni-TTS section in the Text-To-Speech hub](../text_to_speech/README.md#ming-flash-omni-tts). ## Setup diff --git a/examples/offline_inference/ming_flash_omni_tts/README.md b/examples/offline_inference/ming_flash_omni_tts/README.md deleted file mode 100644 index d0ad9b30d2f..00000000000 --- a/examples/offline_inference/ming_flash_omni_tts/README.md +++ /dev/null @@ -1,47 +0,0 @@ -# Ming-flash-omni Standalone TTS (Offline) - -This example runs **Ming-flash-omni-2.0 talker-only** offline inference with: - -- `model`: `Jonathan1909/Ming-flash-omni-2.0` -- `deploy config`: `vllm_omni/deploy/ming_flash_omni_tts.yaml` - -It follows the Ming cookbook parameter style: - -- `prompt`: `"Please generate speech based on the following description.\n"` -- `max_decode_steps`: `200` -- `cfg`: `2.0` -- `sigma`: `0.25` -- `temperature`: `0.0` - -## Quick Start - -```bash -python examples/offline_inference/ming_flash_omni_tts/end2end.py --case style -``` - -## Cases - -```bash -# Style -python examples/offline_inference/ming_flash_omni_tts/end2end.py --case style - -# IP -python examples/offline_inference/ming_flash_omni_tts/end2end.py --case ip - -# Basic (speed/pitch/volume 
control) -python examples/offline_inference/ming_flash_omni_tts/end2end.py --case basic -``` - -## Useful Arguments - -- `--text`: override default text in the selected case -- `--output`: custom output wav path -- `--model`: local model path or HF repo id -- `--deploy-config`: custom talker deploy YAML path -- `--log-stats`: enable runtime stats logs - -## Notes - -- This directory is for **standalone talker deployment (TTS)**. -- For Ming thinker multimodal understanding examples, see: - `examples/offline_inference/ming_flash_omni/`. diff --git a/examples/offline_inference/moss_tts_nano/README.md b/examples/offline_inference/moss_tts_nano/README.md deleted file mode 100644 index d2a7051400b..00000000000 --- a/examples/offline_inference/moss_tts_nano/README.md +++ /dev/null @@ -1,97 +0,0 @@ -# MOSS-TTS-Nano Offline Inference - -## Overview - -Single-stage offline TTS pipeline using the 0.1B MOSS-TTS-Nano AR LM and MOSS-Audio-Tokenizer-Nano codec. Outputs 48 kHz mono WAV (the upstream tokenizer is stereo at 48 kHz; the wrapper mixes down to mono so it lines up with the rest of the engine's single-channel audio path). - -> **No built-in speaker presets.** Every request needs `--prompt-audio` -> (a reference clip). The default `--mode voice_clone` is upstream's -> recommended workflow and is the only mode the OpenAI server exposes; -> the offline CLI also exposes `--mode continuation` for completeness, -> but note that upstream's continuation-with-prompt path emits very -> short / near-silent output, so it is rarely useful in practice. See -> upstream's `infer.py` for the full surface. -> -> Sample reference clips ship in the upstream repo under -> [`assets/audio/`](https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio) -> (e.g. `zh_1.wav`, `en_2.wav`, `jp_2.wav`). - -## Quick Start - -```bash -# Fetch a sample reference clip from upstream (one-off, user-scoped cache). 
-REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano" -mkdir -p "$REF_DIR" -[ -s "$REF_DIR/zh_1.wav" ] || \ - curl -L -o "$REF_DIR/zh_1.wav" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav - -python end2end.py \ - --text "你好,这是MOSS-TTS-Nano的语音合成演示。" \ - --prompt-audio "$REF_DIR/zh_1.wav" -``` - -The first run downloads `OpenMOSS-Team/MOSS-TTS-Nano` and `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano` from Hugging Face. - -## Usage - -``` -python end2end.py [OPTIONS] - -Required: - --prompt-audio PATH Reference WAV/MP3 for voice cloning / continuation - -Options: - --text TEXT Text to synthesize (default: "Hello, this is MOSS-TTS-Nano speaking.") - --prompt-text TEXT Optional. Required only with --mode continuation; - rejected by upstream in --mode voice_clone. - --mode MODE voice_clone (default) or continuation - --max-new-frames N Max AR frames, default 375 (~14 s audio) - --seed INT Random seed for reproducibility - --audio-temperature F Audio sampling temperature (default: 0.8) - --audio-top-k N Audio top-k sampling (default: 25) - --audio-top-p F Audio top-p sampling (default: 0.95) - --text-temperature F Text layer temperature (default: 1.0) - --output-dir DIR Directory for WAV outputs (default: $XDG_CACHE_HOME/moss_tts_nano_output, falls back to ~/.cache/...) - --deploy-config PATH Override deploy YAML (defaults to vllm_omni/deploy/moss_tts_nano.yaml) - --stage-init-timeout INT Timeout in seconds for stage init (default: 120) -``` - -## Examples - -```bash -REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano" - -# Chinese reference clip → Chinese synthesis (voice_clone, default) -python end2end.py \ - --text "你好,这是 MOSS-TTS-Nano 的语音合成测试。" \ - --prompt-audio "$REF_DIR/zh_1.wav" - -# Reproducible output -python end2end.py \ - --text "Deterministic test." 
\ - --prompt-audio "$REF_DIR/en_2.wav" \ - --seed 42 -``` - -## Deploy Config - -Runtime knobs live in `vllm_omni/deploy/moss_tts_nano.yaml` (auto-loaded; -override with `--deploy-config PATH`). Key stage-level settings: - -```yaml -stages: - - stage_id: 0 - gpu_memory_utilization: 0.3 # ~2 GB VRAM; increase for faster init - max_num_seqs: 4 # concurrent requests - max_model_len: 4096 -``` - -## Output Format - -WAV files, 48 kHz, mono. The MOSS audio tokenizer is internally stereo (2-channel) at 48 kHz; the wrapper averages the two channels into mono before reaching the engine, so playback duration / pitch are correct against the WAV header's 48 kHz rate. - -## Troubleshooting - -- **`libnvrtc.so.13: cannot open shared object file`**: torchaudio 2.10+ torchcodec backend requires NVRTC. The model patches `torchaudio.load/save` automatically at load time to fall back to soundfile. -- **`flash_attn not installed`**: The model falls back to `sdpa` attention automatically. -- **Empty audio**: Check that `--text` is non-empty and the model loaded successfully (look for "MOSS-TTS-Nano LM loaded" in logs). 
diff --git a/examples/offline_inference/text_to_speech/README.md b/examples/offline_inference/text_to_speech/README.md index a457c6c0a91..ddc5f11c16b 100644 --- a/examples/offline_inference/text_to_speech/README.md +++ b/examples/offline_inference/text_to_speech/README.md @@ -16,9 +16,11 @@ list of supported architectures across all modalities, see |---|---|---|---|---|---|---| | CosyVoice3 | `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | 2 (talker + code2wav) | ✓ | ✓ | — | 22.05 kHz | | Fish Speech S2 Pro | `fishaudio/s2-pro` | dual-AR | ✓ | ✓ | — | 44.1 kHz | +| Ming-flash-omni-TTS | `Jonathan1909/Ming-flash-omni-2.0` | single (talker only) | — (caption-controlled) | — | style / IP / basic captions | 44.1 kHz | +| MOSS-TTS-Nano | `OpenMOSS-Team/MOSS-TTS-Nano` | single (AR + codec) | ✓ (required) | ✓ | voice_clone, continuation | 48 kHz | | OmniVoice | `k2-fsa/OmniVoice` | 2 (gen + dec) | ✓ | — | voice design, language hint | 24 kHz | | Qwen3-TTS | `Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base}` | 2 (talker + code2wav) | ✓ (Base) | ✓ | 3 task variants | 24 kHz | -| VoxCPM | local model dir | split | ✓ | ✓ | — | 24 kHz | +| VoxCPM | `openbmb/VoxCPM-0.5B` | split | ✓ | ✓ | — | 24 kHz | | VoxCPM2 | `openbmb/VoxCPM2` | single (native AR) | ✓ | ✓ (online) | continuation | 48 kHz | | Voxtral TTS | `mistralai/Voxtral-4B-TTS-2603` | varies | ✓ | ✓ | voice presets | 24 kHz | @@ -126,6 +128,76 @@ Streaming requires `async_chunk: true` in the stage config. --- +## Ming-flash-omni-TTS + +Standalone talker-only deployment of Ming-flash-omni-2.0 at 44.1 kHz. Voice is controlled through caption fields (`风格` / `IP` / `语速`/`基频`/`音量`) rather than reference audio. + +### Prerequisites +The example calls into `vllm_omni.model_executor.models.ming_flash_omni.prompt_utils` for the default prompt and instruction builder; no extra pip install on top of the base vLLM-Omni install. 
+ +### Quick start +```bash +python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style +``` + +### Cases +```bash +# ASMR-style whisper (caption-driven) +python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style + +# IP voice (preset character voice via caption) +python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case ip + +# Basic speed/pitch/volume control +python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case basic +``` + +Override the default text per case with `--text`, write to a custom path with `--output`. + +### Notes +- Talker-only deployment — for the multimodal Ming-flash-omni example, see [`examples/offline_inference/ming_flash_omni/`](../../ming_flash_omni/). +- Deploy config: `vllm_omni/deploy/ming_flash_omni_tts.yaml` (single GPU, `enforce_eager`, `max_num_seqs: 1`). +- Decode defaults from the Ming cookbook: `max_decode_steps=200`, `cfg=2.0`, `sigma=0.25`, `temperature=0.0`, `use_zero_spk_emb=True`. + +--- + +## MOSS-TTS-Nano + +Single-stage 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec at 48 kHz mono (mixed down from upstream stereo). ZH / EN / JA. Every request requires a reference clip via `--ref-audio`. + +> **No built-in speaker presets.** `--ref-audio` is required on every call. Default `--mode voice_clone` matches upstream's recommended workflow; `--mode continuation` is exposed for completeness but upstream's continuation-with-prompt path emits very short / near-silent output, so it is rarely useful in practice. Sample reference clips ship in the upstream repo under [`assets/audio/`](https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio) (e.g. `zh_1.wav`, `en_2.wav`, `jp_2.wav`). + +### Quick start +```bash +# Fetch a sample reference clip (one-off, user-scoped cache). 
+REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano" +mkdir -p "$REF_DIR" +[ -s "$REF_DIR/zh_1.wav" ] || \ + curl -L -o "$REF_DIR/zh_1.wav" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav + +python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \ + --text "你好,这是MOSS-TTS-Nano的语音合成演示。" \ + --ref-audio "$REF_DIR/zh_1.wav" +``` +The first run downloads `OpenMOSS-Team/MOSS-TTS-Nano` and `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano` from Hugging Face. + +### Reproducible runs +```bash +python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \ + --text "Deterministic test." \ + --ref-audio "$REF_DIR/en_2.wav" \ + --seed 42 +``` + +### Notes +- Output: 48 kHz mono WAV (the tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine). +- Deploy config: `vllm_omni/deploy/moss_tts_nano.yaml` (auto-loaded; override with `--deploy-config`). +- Default `--max-new-frames 375` ≈ 14 s of audio; raise for longer outputs. +- `--ref-text` is rejected in `voice_clone` mode and required only with `--mode continuation`. +- Run `--help` for the full sampling-knob surface (`--audio-temperature`, `--audio-top-k`, `--audio-top-p`, `--text-temperature`). + +--- + ## OmniVoice Zero-shot multilingual TTS supporting 600+ languages, with three modes (auto / clone / design). 
diff --git a/examples/offline_inference/ming_flash_omni_tts/end2end.py b/examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py similarity index 100% rename from examples/offline_inference/ming_flash_omni_tts/end2end.py rename to examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py diff --git a/examples/offline_inference/moss_tts_nano/end2end.py b/examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py similarity index 92% rename from examples/offline_inference/moss_tts_nano/end2end.py rename to examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py index 1d2b21043be..3a27ad04dea 100644 --- a/examples/offline_inference/moss_tts_nano/end2end.py +++ b/examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py @@ -9,20 +9,20 @@ MOSS-TTS-Nano upstream supports two modes (matching ``infer.py``): -* ``voice_clone`` (recommended): only ``--prompt-audio`` is required. -* ``continuation``: ``--prompt-audio`` + ``--prompt-text`` together. +* ``voice_clone`` (recommended): only ``--ref-audio`` is required. +* ``continuation``: ``--ref-audio`` + ``--ref-text`` together. Usage: # Voice clone (recommended): ref audio only, no transcript needed. python end2end.py \\ --text "Hello!" \\ - --prompt-audio /path/to/ref.wav + --ref-audio /path/to/ref.wav # Continuation: ref audio + its transcript. python end2end.py \\ --text "Hello!" \\ - --prompt-audio /path/to/ref.wav \\ - --prompt-text "Transcript of the reference clip." \\ + --ref-audio /path/to/ref.wav \\ + --ref-text "Transcript of the reference clip." 
\\ --mode continuation # Sample reference clips ship in the upstream repo: @@ -120,11 +120,11 @@ def main(args) -> None: output_dir.mkdir(parents=True, exist_ok=True) print(f"Synthesizing: {args.text!r}") - print(f" ref_audio: {args.prompt_audio}") + print(f" ref_audio: {args.ref_audio}") inputs = build_request( text=args.text, - prompt_audio_path=args.prompt_audio, - prompt_text=args.prompt_text, + prompt_audio_path=args.ref_audio, + prompt_text=args.ref_text, mode=args.mode, max_new_frames=args.max_new_frames, seed=args.seed, @@ -158,15 +158,15 @@ def parse_args(): parser = FlexibleArgumentParser(description="MOSS-TTS-Nano offline inference") parser.add_argument("--text", default="Hello, this is MOSS-TTS-Nano speaking.", help="Text to synthesize.") parser.add_argument( - "--prompt-audio", + "--ref-audio", required=True, help="Path to reference audio for voice cloning / continuation (required).", ) parser.add_argument( - "--prompt-text", + "--ref-text", default=None, help=( - "Optional transcript of --prompt-audio. Required (and only meaningful) " + "Optional transcript of --ref-audio. Required (and only meaningful) " "in --mode continuation; rejected by upstream in --mode voice_clone." ), ) diff --git a/examples/offline_inference/voxtral_tts/README.md b/examples/offline_inference/voxtral_tts/README.md deleted file mode 100644 index bbe317798a8..00000000000 --- a/examples/offline_inference/voxtral_tts/README.md +++ /dev/null @@ -1,58 +0,0 @@ -# Voxtral TTS Offline Inference - -`end2end.py` runs Voxtral TTS end-to-end offline inference using vLLM. It supports both blocking (`Omni`) and streaming (`AsyncOmni`) generation, batched prompts with configurable concurrency, and voice selection via preset name or reference audio file. - -When `mistral_common` has `SpeechRequest` support, prompt token IDs are built via `encode_speech_request`. Otherwise, the script falls back to manual token construction. 
- -## Usage Examples - - -```bash -# Basic single-prompt with cheerful_female voice preset -python3 examples/offline_inference/voxtral_tts/end2end.py \ - --write-audio --voice cheerful_female \ - --model mistralai/Voxtral-4B-TTS-2603 \ - --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?" - -# 32 replicate prompts with cheerful_female voice preset -python3 examples/offline_inference/voxtral_tts/end2end.py \ - --num-prompts 32 --write-audio --voice cheerful_female \ - --model mistralai/Voxtral-4B-TTS-2603 \ - --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?" - -# Streaming with neutral_female voice preset -python3 examples/offline_inference/voxtral_tts/end2end.py \ - --streaming --write-audio --voice neutral_female \ - --model mistralai/Voxtral-4B-TTS-2603 \ - --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?" - -# 32 prompts, 8 concurrent requests per wave, streaming with neutral_female voice -python3 examples/offline_inference/voxtral_tts/end2end.py \ - --num-prompts 32 --concurrency 8 --streaming --write-audio --voice neutral_female \ - --model mistralai/Voxtral-4B-TTS-2603 \ - --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?" - -# Short debug prompt with reference audio -python3 examples/offline_inference/voxtral_tts/end2end.py \ - --write-audio \ - --model mistralai/Voxtral-4B-TTS-2603 \ - --text "This is a test message." 
\ - --audio-path path/to/reference_audio.wav -``` - -## Arguments - -| Argument | Description | -|---|---| -| `--model PATH` | HuggingFace repo ID or local directory path (default: `mistralai/Voxtral-4B-TTS-2603`) | -| `--text TEXT` | Text to synthesize (default: `"This is a test message."`) | -| `--audio-path PATH` | Path to reference audio file for voice cloning | -| `--output-dir DIR` | Directory to write output WAV files (default: `output_audio`) | -| `--deploy-config PATH` | Override the deploy config path. If unset, auto-loads `vllm_omni/deploy/voxtral_tts.yaml` from the HF `model_type`. | -| `--num-prompts N` | Number of replicate prompts to run for measuring performance (default: 1) | -| `--streaming` | Use streaming generation via `AsyncOmni` (default: blocking `Omni`) | -| `--concurrency N` | Max concurrent requests per wave (must be used with `--streaming`, must evenly divide `--num-prompts`) | -| `--voice NAME` | Voice preset to use instead of reference audio file. Check Huggingface `mistralai/Voxtral-4B-TTS-2603` to get the list of available voices | -| `--write-audio` | Write generated audio to WAV files | -| `--profiling-mode` | Enable profiling mode (reduces max tokens to 50) | -| `--log-stats` | Enable detailed statistics logging | diff --git a/examples/online_serving/ming_flash_omni/README.md b/examples/online_serving/ming_flash_omni/README.md index dd8c8aa8186..533a30e1cb0 100644 --- a/examples/online_serving/ming_flash_omni/README.md +++ b/examples/online_serving/ming_flash_omni/README.md @@ -11,7 +11,7 @@ Please refer to [README.md](../../../README.md) | Thinker + Talker (omni-speech, default) | `vllm serve ... --omni` | Text + Audio | | Thinker only (multimodal understanding) | `vllm serve ... --omni --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml` | Text | -For standalone TTS (talker only), see [`examples/online_serving/ming_flash_omni_tts/`](../ming_flash_omni_tts/). 
+For standalone TTS (talker only), see the [Ming-flash-omni-TTS section in the Text-To-Speech hub](../text_to_speech/README.md#ming-flash-omni-tts). ## Run examples (Ming-flash-omni 2.0) diff --git a/examples/online_serving/ming_flash_omni_tts/README.md b/examples/online_serving/ming_flash_omni_tts/README.md deleted file mode 100644 index 1b372e3897e..00000000000 --- a/examples/online_serving/ming_flash_omni_tts/README.md +++ /dev/null @@ -1,54 +0,0 @@ -# Ming-flash-omni Standalone TTS (Online Serving) - -This directory contains online e2e examples for **Ming-flash-omni-2.0 standalone talker deployment**. - -Server uses: - -- `model`: `Jonathan1909/Ming-flash-omni-2.0` -- `deploy config`: `vllm_omni/deploy/ming_flash_omni_tts.yaml` - -## Launch the Server - -```bash -# from repo root -bash examples/online_serving/ming_flash_omni_tts/run_server.sh -``` - -Equivalent manual command: - -```bash -vllm serve Jonathan1909/Ming-flash-omni-2.0 \ - --deploy-config vllm_omni/deploy/ming_flash_omni_tts.yaml \ - --host 0.0.0.0 \ - --port 8091 \ - --trust-remote-code \ - --omni -``` - -## Send TTS Request - -Python client: - -```bash -python examples/online_serving/ming_flash_omni_tts/speech_client.py \ - --text "我们当迎着阳光辛勤耕作,去摘取,去制作,去品尝,去馈赠。" \ - --output ming_online.wav -``` - -Long-form `instructions` (e.g. ASMR whisper style) via the client: - -```bash -python examples/online_serving/ming_flash_omni_tts/speech_client.py \ - --text "我会一直在这里陪着你,直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗?" \ - --instructions "这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。音量极低,紧贴麦克风,语速极慢,旨在制造触发听者颅内快感的声学刺激。" \ - --output ming_online_asmr.wav -``` - -## Notes - -- This is the **online serving** counterpart of `examples/offline_inference/ming_flash_omni_tts/`. -- The server uses `use_zero_spk_emb=True` and the default decode args - (`max_decode_steps=200`, `cfg=2.0`, `sigma=0.25`, `temperature=0.0`). - For other caption fields (`语速`, `基频`, `IP`, BGM, etc.) 
or overriding - decode args, use the offline e2e example where `additional_information` - is set explicitly. diff --git a/examples/online_serving/moss_tts_nano/README.md b/examples/online_serving/moss_tts_nano/README.md deleted file mode 100644 index b6c47322520..00000000000 --- a/examples/online_serving/moss_tts_nano/README.md +++ /dev/null @@ -1,147 +0,0 @@ -# MOSS-TTS-Nano - -## Model checkpoint - -| Model | Description | -|-------|-------------| -| `OpenMOSS-Team/MOSS-TTS-Nano` | 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec, 48 kHz mono (mixed down from upstream stereo), ZH/EN/JA | - -> **No built-in speaker presets.** Every request must include `ref_audio`. -> The server uses upstream's recommended `voice_clone` mode (per -> upstream's README and `infer.py` example). The OpenAI-schema `voice` -> and `ref_text` fields are accepted but ignored — `voice_clone` does -> not consume a transcript, and upstream's `continuation` mode (the only -> path that accepts `prompt_text`) emits near-silent output with a -> reference clip + transcript pair, so it is not exposed here. -> -> Sample reference clips are available in the upstream repo under -> [`assets/audio/`](https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio) -> (e.g. `zh_1.wav`, `en_2.wav`, `jp_2.wav`). - -## Gradio Demo - -An interactive Gradio demo is available with custom voice cloning and -streaming support. Upload your own reference audio in the UI. - -```bash -# Option 1: Launch server + Gradio together -./run_gradio_demo.sh - -# Option 2: If server is already running -python gradio_demo.py --api-base http://localhost:8091 -``` - -Then open http://localhost:7860 in your browser. - -## Launch the Server - -```bash -vllm serve OpenMOSS-Team/MOSS-TTS-Nano --omni --port 8091 -``` - -The deploy config at `vllm_omni/deploy/moss_tts_nano.yaml` auto-loads; no -`--stage-configs-path`, `--trust-remote-code`, or `--enforce-eager` flags -are needed. 
- -Or use the convenience script: - -```bash -./run_server.sh -``` - -## Send TTS Request - -Every request needs `ref_audio` (base64 data URL). Reuse a saved sample: - -```bash -# Fetch a sample reference clip from the upstream repo (one-off). -# Cache under XDG_CACHE_HOME so it survives across runs and stays user-scoped. -REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano" -mkdir -p "$REF_DIR" -REF_WAV="$REF_DIR/zh_1.wav" -[ -s "$REF_WAV" ] || curl -L -o "$REF_WAV" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav -REF_AUDIO=$(base64 -w 0 "$REF_WAV") - -curl -X POST http://localhost:8091/v1/audio/speech \ - -H "Content-Type: application/json" \ - -d "{ - \"input\": \"你好,这是语音合成测试。\", - \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\", - \"response_format\": \"wav\" - }" --output output.wav -``` - -### Using Python - -```python -import base64 -import os -import urllib.request -from pathlib import Path - -import httpx - -ref_dir = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache")) / "moss-tts-nano" -ref_dir.mkdir(parents=True, exist_ok=True) -ref_wav = ref_dir / "zh_1.wav" -if not ref_wav.exists() or ref_wav.stat().st_size == 0: - urllib.request.urlretrieve( - "https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav", - ref_wav, - ) - -with ref_wav.open("rb") as f: - ref_audio_b64 = base64.b64encode(f.read()).decode("ascii") - -response = httpx.post( - "http://localhost:8091/v1/audio/speech", - json={ - "input": "你好,这是语音合成测试。", - "ref_audio": f"data:audio/wav;base64,{ref_audio_b64}", - "response_format": "wav", - }, - timeout=300.0, -) - -with open("output.wav", "wb") as f: - f.write(response.content) -``` - -### Streaming - -```bash -curl -X POST http://localhost:8091/v1/audio/speech \ - -H "Content-Type: application/json" \ - -d "{ - \"input\": \"Hello, streaming output from MOSS-TTS-Nano.\", - \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\", - \"stream\": true, - 
\"response_format\": \"pcm\" - }" --no-buffer | play -t raw -r 48000 -e signed -b 16 -c 1 - -``` - -**Note:** Output is 48 kHz mono PCM. Upstream's audio tokenizer is internally stereo at 48 kHz; the model wrapper averages the two channels into mono before reaching the engine, so playback duration / pitch are correct against the WAV header's 48 kHz rate. - -## API Parameters - -MOSS-TTS-Nano uses the standard `/v1/audio/speech` endpoint. - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `input` | string | **required** | Text to synthesize (ZH / EN / JA) | -| `ref_audio` | string | **required** | Base64 data URL of the reference audio clip | -| `ref_text` | string | accepted, ignored | Schema-compatible field; voice_clone mode does not consume a transcript | -| `response_format` | string | `"wav"` | Audio format: wav, mp3, flac, pcm | -| `stream` | bool | false | Stream raw PCM chunks | -| `max_new_tokens` | int | 4096 | Maximum tokens to generate | - -The `voice` and `ref_text` fields from the OpenAI schema are accepted but -ignored — there are no built-in speaker presets in MOSS-TTS-Nano, and -upstream's voice_clone mode does not consume a transcript. - -## Troubleshooting - -1. **`libnvrtc.so.13: cannot open shared object file`**: torchaudio 2.10+ defaults to torchcodec which requires NVRTC. vLLM-Omni patches this automatically at model load time to use soundfile instead. -2. **Connection refused**: Ensure the server is running on the correct port. -3. **Flashinfer version mismatch**: Set `FLASHINFER_DISABLE_VERSION_CHECK=1` if you see version warnings. -4. **Out of memory**: The default `gpu_memory_utilization=0.3` is conservative. Increase it in the stage config if you have more VRAM available. 
diff --git a/examples/online_serving/text_to_speech/README.md b/examples/online_serving/text_to_speech/README.md
index 836b8ed7ac8..8481e723ef9 100644
--- a/examples/online_serving/text_to_speech/README.md
+++ b/examples/online_serving/text_to_speech/README.md
@@ -15,9 +15,11 @@ For the full list of supported architectures across all modalities, see
 | Model | HuggingFace repo | Voice cloning | Streaming | Voice presets / upload | Gradio demo |
 |---|---|---|---|---|---|
 | Fish Speech S2 Pro | `fishaudio/s2-pro` | ✓ (`ref_audio`+`ref_text`) | ✓ (PCM stream) | — | ✓ |
+| Ming-flash-omni-TTS | `Jonathan1909/Ming-flash-omni-2.0` | — (caption-controlled) | — | caption fields (`instructions`) | — |
+| MOSS-TTS-Nano | `OpenMOSS-Team/MOSS-TTS-Nano` | ✓ (`ref_audio` required) | ✓ (PCM stream) | — | ✓ |
 | OmniVoice | `k2-fsa/OmniVoice` | (offline only) | — | — | — |
 | Qwen3-TTS | `Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base}` | ✓ (Base) | ✓ (PCM + WebSocket) | ✓ (presets + `/v1/audio/voices` upload) | ✓ (standard + FastRTC) |
-| VoxCPM | local model dir | ✓ | ✓ (PCM stream) | — | — |
+| VoxCPM | `openbmb/VoxCPM-0.5B` | ✓ | ✓ (PCM stream) | — | — |
 | VoxCPM2 | `openbmb/VoxCPM2` | ✓ | ✓ (AudioWorklet via gradio) | — | ✓ |
 | Voxtral TTS | `mistralai/Voxtral-4B-TTS-2603` | ✓ (gated upstream) | ✓ | ✓ (presets) | ✓ |

@@ -141,6 +143,105 @@ python fish_speech/gradio_demo.py --api-base http://localhost:8091 # if server

 ---

+## Ming-flash-omni-TTS
+
+Standalone talker-only deployment of Ming-flash-omni-2.0. Voice is controlled through caption text passed via `instructions`.
+
+### Launch
+```bash
+# from repo root
+bash examples/online_serving/text_to_speech/ming_flash_omni_tts/run_server.sh
+```
+Equivalent manual command:
+```bash
+vllm serve Jonathan1909/Ming-flash-omni-2.0 \
+    --deploy-config vllm_omni/deploy/ming_flash_omni_tts.yaml \
+    --host 0.0.0.0 --port 8091 \
+    --trust-remote-code --omni
+```
+
+### Sending requests
+```bash
+python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
+    --text "我们当迎着阳光辛勤耕作,去摘取,去制作,去品尝,去馈赠。" \
+    --output ming_online.wav
+```
+
+ASMR-style caption via `instructions`:
+```bash
+python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
+    --text "我会一直在这里陪着你,直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗?" \
+    --instructions "这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。" \
+    --output ming_online_asmr.wav
+```
+
+### Notes
+- Server uses `use_zero_spk_emb=True` and the cookbook decode defaults (`max_decode_steps=200`, `cfg=2.0`, `sigma=0.25`, `temperature=0.0`). For other caption fields (`语速`, `基频`, `IP`, BGM, etc.) or overriding decode args, use the offline example where `additional_information` is set explicitly.
+- This is the online counterpart of [`examples/offline_inference/text_to_speech/ming_flash_omni_tts/`](../../offline_inference/text_to_speech/ming_flash_omni_tts/).
+- For multimodal Ming-flash-omni online serving, see [`examples/online_serving/ming_flash_omni/`](../../ming_flash_omni/).
+
+---
+
+## MOSS-TTS-Nano
+
+Single-stage 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec at 48 kHz mono. Every request must include `ref_audio`; there are no built-in speaker presets.
+
+> The OpenAI-schema `voice` and `ref_text` fields are accepted but ignored — `voice_clone` does not consume a transcript, and upstream's `continuation` mode (the only path that accepts `prompt_text`) emits near-silent output, so it is not exposed here. Sample reference clips ship in the upstream repo under [`assets/audio/`](https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio).
+
+### Launch
+```bash
+vllm serve OpenMOSS-Team/MOSS-TTS-Nano --omni --port 8091
+# or:
+./moss_tts_nano/run_server.sh
+```
+The deploy config at `vllm_omni/deploy/moss_tts_nano.yaml` auto-loads; no `--stage-configs-path`, `--trust-remote-code`, or `--enforce-eager` flags are needed.
+
+### Sending requests
+```bash
+# One-off fetch of a sample reference clip; cache under XDG_CACHE_HOME.
+REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
+mkdir -p "$REF_DIR"
+REF_WAV="$REF_DIR/zh_1.wav"
+[ -s "$REF_WAV" ] || curl -L -o "$REF_WAV" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav
+REF_AUDIO=$(base64 -w 0 "$REF_WAV")
+
+curl -X POST http://localhost:8091/v1/audio/speech \
+    -H "Content-Type: application/json" \
+    -d "{
+        \"input\": \"你好,这是语音合成测试。\",
+        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
+        \"response_format\": \"wav\"
+    }" --output output.wav
+```
+
+### Streaming PCM
+```bash
+curl -X POST http://localhost:8091/v1/audio/speech \
+    -H "Content-Type: application/json" \
+    -d "{
+        \"input\": \"Hello, streaming output from MOSS-TTS-Nano.\",
+        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
+        \"stream\": true,
+        \"response_format\": \"pcm\"
+    }" --no-buffer | play -t raw -r 48000 -e signed -b 16 -c 1 -
+```
+
+### Gradio demo
+```bash
+# Option 1: launch server + Gradio together
+./moss_tts_nano/run_gradio_demo.sh
+
+# Option 2: server already running
+python moss_tts_nano/gradio_demo.py --api-base http://localhost:8091
+```
+Then open http://localhost:7860 in your browser.
+
+### Notes
+- Output is 48 kHz mono PCM (the upstream tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine).
+- Standard `/v1/audio/speech` request shape: `input`, `ref_audio` (base64 data URL), `response_format`, `stream`, `max_new_tokens`. The `voice` and `ref_text` fields from the OpenAI schema are accepted but ignored.
+
+---
+
 ## OmniVoice

 Zero-shot multilingual TTS (600+ languages). Online serving currently exposes **auto voice** only; voice cloning and voice design are available offline.
diff --git a/examples/online_serving/ming_flash_omni_tts/run_server.sh b/examples/online_serving/text_to_speech/ming_flash_omni_tts/run_server.sh
similarity index 100%
rename from examples/online_serving/ming_flash_omni_tts/run_server.sh
rename to examples/online_serving/text_to_speech/ming_flash_omni_tts/run_server.sh
diff --git a/examples/online_serving/ming_flash_omni_tts/speech_client.py b/examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py
similarity index 100%
rename from examples/online_serving/ming_flash_omni_tts/speech_client.py
rename to examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py
diff --git a/examples/online_serving/moss_tts_nano/gradio_demo.py b/examples/online_serving/text_to_speech/moss_tts_nano/gradio_demo.py
similarity index 100%
rename from examples/online_serving/moss_tts_nano/gradio_demo.py
rename to examples/online_serving/text_to_speech/moss_tts_nano/gradio_demo.py
diff --git a/examples/online_serving/moss_tts_nano/run_gradio_demo.sh b/examples/online_serving/text_to_speech/moss_tts_nano/run_gradio_demo.sh
similarity index 100%
rename from examples/online_serving/moss_tts_nano/run_gradio_demo.sh
rename to examples/online_serving/text_to_speech/moss_tts_nano/run_gradio_demo.sh
diff --git a/examples/online_serving/moss_tts_nano/run_server.sh b/examples/online_serving/text_to_speech/moss_tts_nano/run_server.sh
similarity index 100%
rename from examples/online_serving/moss_tts_nano/run_server.sh
rename to examples/online_serving/text_to_speech/moss_tts_nano/run_server.sh
diff --git a/recipes/inclusionAI/Ming-flash-omni-2.0.md b/recipes/inclusionAI/Ming-flash-omni-2.0.md
index 124b94c3872..71a90cbe7c8 100644
--- a/recipes/inclusionAI/Ming-flash-omni-2.0.md
+++ b/recipes/inclusionAI/Ming-flash-omni-2.0.md
@@ -22,9 +22,12 @@ Use this recipe when you want a known-good starting point for serving
 - Upstream model: [`inclusionAI/Ming`](https://github.com/inclusionAI/Ming)
-- For offline inference and additional client variants, see
-  `examples/offline_inference/ming_flash_omni{,_tts}/` and
-  `examples/online_serving/ming_flash_omni{,_tts}/`.
+- For offline inference and additional client variants, see the
+  multimodal example dirs `examples/offline_inference/ming_flash_omni/` and
+  `examples/online_serving/ming_flash_omni/`. The standalone TTS variant
+  lives under the consolidated text-to-speech hub at
+  `examples/offline_inference/text_to_speech/ming_flash_omni_tts/` and
+  `examples/online_serving/text_to_speech/ming_flash_omni_tts/`.

 ## Hardware Support