Merged
43 commits
ce774a3
add draft
yuanheng-zhao Apr 7, 2026
d80c101
temp draft upd
yuanheng-zhao Apr 10, 2026
e71e910
apply x_transformers utils
yuanheng-zhao Apr 10, 2026
0cb5db2
add e2e TTS
yuanheng-zhao Apr 10, 2026
be36fc0
upd ming online TTS
yuanheng-zhao Apr 11, 2026
9d048b0
upd thinker -> talker (1/N)
yuanheng-zhao Apr 11, 2026
8737080
cleanup omni unified model
yuanheng-zhao Apr 11, 2026
dd24397
upd tts and omni speech paths via ming task type
yuanheng-zhao Apr 12, 2026
a44a751
segment text (without TalkerTN)
yuanheng-zhao Apr 12, 2026
14785df
omni-speech path: default zero spk emb; port voice settings
yuanheng-zhao Apr 12, 2026
812e479
voice register quick fix
yuanheng-zhao Apr 12, 2026
67a2bdd
upd ming yaml
yuanheng-zhao Apr 14, 2026
a881381
quick fix local voice preset path
yuanheng-zhao Apr 16, 2026
3b3fc6a
fix(ming-talker): preserve voice reference across segments and improv…
LHXuuu Apr 16, 2026
461e643
trivial: rm unused code
yuanheng-zhao Apr 17, 2026
903748b
trivial: cleanup talker/text processing comments
yuanheng-zhao Apr 17, 2026
f7a1f87
fix code consistency
yuanheng-zhao Apr 17, 2026
fa06e47
trivial: ruff
yuanheng-zhao Apr 18, 2026
4f95a8d
upd use mrope handling
yuanheng-zhao Apr 18, 2026
2dc24b8
upd Ming e2e and readme
yuanheng-zhao Apr 18, 2026
b762095
trivial: fix pre-commit
yuanheng-zhao Apr 18, 2026
f276db1
complement ming tests
yuanheng-zhao Apr 19, 2026
3a1fd78
rm training args
yuanheng-zhao Apr 19, 2026
f62c371
code cleanup
yuanheng-zhao Apr 19, 2026
8ae4a56
cleanup code talker CFM
yuanheng-zhao Apr 19, 2026
179e44b
upd Ming serving speech args
yuanheng-zhao Apr 19, 2026
c839df9
Canonicalize ref headers to Ming repo
yuanheng-zhao Apr 19, 2026
7a42f5e
Merge branch 'main' into model/ming-omni-talker-draft
yuanheng-zhao Apr 19, 2026
226b3a5
upd talker modules type annot
yuanheng-zhao Apr 20, 2026
aceb49a
upd checks in talker module
yuanheng-zhao Apr 20, 2026
4f6d0a3
refactor talker cls
yuanheng-zhao Apr 20, 2026
faea882
upd ref headers
yuanheng-zhao Apr 20, 2026
3aa5d9e
Add ming recipe and trim example readme
yuanheng-zhao Apr 20, 2026
cfda612
upd recipe
yuanheng-zhao Apr 20, 2026
3e161af
audio generator step debug log
yuanheng-zhao Apr 20, 2026
0b1c105
trim readme
yuanheng-zhao Apr 20, 2026
91ee8c3
Merge from main
yuanheng-zhao Apr 20, 2026
1482a8b
upd e2e test imports
yuanheng-zhao Apr 21, 2026
4a7548c
Merge branch 'main' into model/ming-omni-talker-draft
hsliuustc0106 Apr 21, 2026
b07d48e
rm ming expansion test; add module dummy tests
yuanheng-zhao Apr 21, 2026
3a1ba78
Merge branch 'main' into model/ming-omni-talker-draft
yuanheng-zhao Apr 21, 2026
b1aaf56
put talker modules into a single file
yuanheng-zhao Apr 22, 2026
00a5009
Merge branch 'main' into model/ming-omni-talker-draft
yuanheng-zhao Apr 22, 2026
94 changes: 55 additions & 39 deletions examples/offline_inference/ming_flash_omni/README.md
@@ -1,75 +1,91 @@
# Ming-flash-omni 2.0

[Ming-flash-omni-2.0](https://github.com/inclusionAI/Ming) is an omni-modal model supporting text, image, video, and audio understanding, with outputs in text, image, and audio. For now, Ming-flash-omni-2.0 in vLLM-Omni is supported with thinker stage (multi-modal understanding).
[Ming-flash-omni-2.0](https://github.com/inclusionAI/Ming) is an omni-modal model supporting text, image, video, and audio understanding, with text and speech outputs.

vLLM-Omni supports two deployment modes:

| Mode | Stage config | Output |
|------|-------------|--------|
| Thinker only (multimodal understanding) | `ming_flash_omni_thinker.yaml` (default `--omni`) | Text |
| Thinker + Talker (omni-speech) | `ming_flash_omni.yaml` | Text + Audio |

For standalone TTS (talker only), see [`examples/offline_inference/ming_flash_omni_tts/`](../ming_flash_omni_tts/).

## Setup

Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.

The default `--omni` flag runs thinker only. For omni-speech, pass the two-stage config explicitly:

```bash
--stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml
```

## Run examples

### Text-only
The end-to-end script defaults to built-in assets; pass `--image-path`,
`--audio-path`, or `--video-path` to override.

```bash
# Text-only
python examples/offline_inference/ming_flash_omni/end2end.py --query-type text

# Image / audio / video / mixed understanding
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_image
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_audio
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_video --num-frames 16
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_mixed_modalities \
--image-path /path/to/image.jpg --audio-path /path/to/audio.wav
```

#### Reasoning (Thinking Mode)

Reasoning (Thinking) mode is enabled via applying "detailed thinking on" when building the system prompt template (in `apply_chat_template`).

In the end2end example, a default problem for thinking mode is provided, as referred to the example usage of Ming's cookbook;
To utilize it, you have to download the example figure from https://github.com/inclusionAI/Ming/blob/3954fcb880ff5e61ff128bcf7f1ec344d46a6fe3/figures/cases/3_0.png
Reasoning ("detailed thinking on") is applied by the script when
`--query-type reasoning` is set. The default prompt matches Ming's cookbook
and expects the reference figure from the upstream repo — see
`get_reasoning_query` in `end2end.py`.

```bash
python examples/offline_inference/ming_flash_omni/end2end.py -q reasoning --image-path ./3_0.png
```

### Image understanding
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_image
### Omni-speech (thinker + talker)

# With a local image
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_image --image-path /path/to/image.jpg
```
To enable spoken output, use the two-stage config and request `audio` (or `text,audio`) modalities.
The thinker processes your multimodal input and generates text; the talker then synthesises the response as speech.

### Audio understanding
**Audio-only output** (speech response, no text):
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_audio

# With a local audio file
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_audio --audio-path /path/to/audio.wav
python examples/offline_inference/ming_flash_omni/end2end.py \
--query-type text \
--stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml \
--modalities audio \
--output-dir output_ming_omni_speech
```

### Video understanding
**Both text and audio output**:
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_video

# With a local video and custom frame count
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_video --video-path /path/to/video.mp4 --num-frames 16
python examples/offline_inference/ming_flash_omni/end2end.py \
--query-type use_audio \
--stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml \
--modalities text,audio \
--output-dir output_ming_omni_speech
```

### Mixed modalities (image + audio)
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_mixed_modalities \
--image-path /path/to/image.jpg \
--audio-path /path/to/audio.wav
```
Generated `.wav` files are saved to `--output-dir` (default `output_ming`), one per request.

If media file paths are not provided, the script uses built-in default assets.
The stage config allocates thinker on GPUs 0–3 and talker on GPU 3 by default. Adjust `devices` in the YAML to match your hardware.
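
To sanity-check a generated clip, you can read it back with `soundfile` (already a dependency of the example). The file name below is a placeholder; one `.wav` is written per request id:

```python
import soundfile as sf

# "output_ming/<request_id>.wav" is a placeholder path; one file is written per request.
data, sr = sf.read("output_ming/<request_id>.wav")
print(f"{len(data) / sr:.2f}s at {sr} Hz")
```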

### Modality control
To control output modalities (e.g. text-only output):
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_audio --modalities text
```

*For now, only text output is supported*
| `--modalities` | Thinker output | Talker | Saved files |
|---------------|----------------|--------|-------------|
| `text` (default) | Text | Not run | `<id>.txt` |
| `audio` | Text (internal) | Runs | `<id>.wav` |
| `text,audio` | Text | Runs | `<id>.txt` + `<id>.wav` |

### Custom stage config
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_image \
--stage-configs-path /path/to/your_config.yaml
```
Pass `--stage-configs-path /path/to/your_config.yaml` to any of the commands
above to override the stage config.
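
For reference, this is the setup portion of the offline Python API behind these commands, condensed from `end2end.py` in this directory (the thinker `max_tokens` value is illustrative; see the script for the full generate and save loop):

```python
from vllm import SamplingParams
from vllm_omni.entrypoints.omni import Omni

# Two-stage (thinker + talker) deployment; adjust paths and devices for your setup.
omni = Omni(
    model="Jonathan1909/Ming-flash-omni-2.0",
    stage_configs_path="vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml",
    trust_remote_code=True,
)

thinker_sp = SamplingParams(max_tokens=512, detokenize=True)  # max_tokens here is illustrative
# Talker sampling is a no-op (it runs its own CFM + AudioVAE loop); max_tokens=1 satisfies the scheduler.
talker_sp = SamplingParams(temperature=0.0, max_tokens=1)
sampling_params_list = [thinker_sp, talker_sp][: omni.num_stages]
```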

## Online serving

26 changes: 24 additions & 2 deletions examples/offline_inference/ming_flash_omni/end2end.py
@@ -7,6 +7,7 @@
from typing import NamedTuple

import numpy as np
import soundfile as sf
import vllm
from PIL import Image
from transformers import AutoProcessor
@@ -319,7 +320,16 @@ def main(args):
seed=SEED,
detokenize=True,
)
sampling_params_list = [thinker_sampling_params]
# Talker (ming_tts) uses a custom generation loop (CFM + AudioVAE);
# vLLM sampling is a no-op here — max_tokens=1 just satisfies the scheduler.
talker_sampling_params = SamplingParams(
temperature=0.0,
max_tokens=1,
)
all_sampling_params = [thinker_sampling_params, talker_sampling_params]
# Match sampling params to the number of configured stages
# (thinker-only yaml → 1, thinker+talker yaml → 2).
sampling_params_list = all_sampling_params[: omni.num_stages]

prompts = [query_result.inputs for _ in range(args.num_prompts)]

@@ -362,7 +372,19 @@ def main(args):
print(f"Failed to write output file {out_txt}: {e}")

elif stage_outputs.final_output_type == "audio":
raise NotImplementedError("Add audio example after talker supported.")
request_id = output.request_id
mm = output.outputs[0].multimodal_output
if mm and "audio" in mm:
audio = mm["audio"]
sr_raw = mm.get("sr", 44100)
sample_rate = int(sr_raw.item() if hasattr(sr_raw, "item") else sr_raw)
audio_numpy = audio.float().squeeze().cpu().numpy()
output_wav = os.path.join(output_dir, f"{request_id}.wav")
sf.write(output_wav, audio_numpy, samplerate=sample_rate, format="WAV")
print(
f"Request ID: {request_id}, audio saved to {output_wav} "
f"({len(audio_numpy) / sample_rate:.2f}s, {sample_rate}Hz)"
)

processed_count += 1
if profiler_enabled and processed_count >= total_requests:
47 changes: 47 additions & 0 deletions examples/offline_inference/ming_flash_omni_tts/README.md
@@ -0,0 +1,47 @@
# Ming-flash-omni Standalone TTS (Offline)

This example runs **Ming-flash-omni-2.0 talker-only** offline inference with:

- `model`: `Jonathan1909/Ming-flash-omni-2.0`
- `stage config`: `vllm_omni/model_executor/stage_configs/ming_flash_omni_tts.yaml`

It follows the Ming cookbook parameter style:

- `prompt`: `"Please generate speech based on the following description.\n"`
- `max_decode_steps`: `200`
- `cfg`: `2.0`
- `sigma`: `0.25`
- `temperature`: `0.0`
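
These values are passed through the request's `additional_information` (see `end2end.py` below); condensed:

```python
decode_args = {
    "ming_task": "instruct",   # standalone TTS path
    "max_decode_steps": 200,
    "cfg": 2.0,
    "sigma": 0.25,
    "temperature": 0.0,
}
# merged with the case's prompt/text/instruction before building OmniTokensPrompt
```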

## Quick Start

```bash
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case style
```

## Cases

```bash
# Style
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case style

# IP
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case ip

# Basic (speed/pitch/volume control)
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case basic
```

## Useful Arguments

- `--text`: override default text in the selected case
- `--output`: custom output wav path
- `--model`: local model path or HF repo id
- `--stage-configs-path`: custom talker stage config path
- `--log-stats`: enable runtime stats logs

## Notes

- This directory is for **standalone talker deployment (TTS)**.
- For Ming thinker multimodal understanding examples, see:
`examples/offline_inference/ming_flash_omni/`.
129 changes: 129 additions & 0 deletions examples/offline_inference/ming_flash_omni_tts/end2end.py
@@ -0,0 +1,129 @@
"""Offline e2e example for Ming-flash-omni-2.0 standalone talker (TTS)."""

import os
from typing import Any

import soundfile as sf
import torch
from vllm.utils.argparse_utils import FlexibleArgumentParser

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniTokensPrompt
from vllm_omni.model_executor.models.ming_flash_omni.prompt_utils import (
DEFAULT_PROMPT,
create_instruction,
)

MODEL_NAME = "Jonathan1909/Ming-flash-omni-2.0"
DEFAULT_STAGE_CONFIG = "vllm_omni/model_executor/stage_configs/ming_flash_omni_tts.yaml"


def get_messages(case: str, text_override: str | None) -> dict[str, Any]:
if case == "style":
text = text_override or "我会一直在这里陪着你,直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗?"
instruction = create_instruction(
{
"风格": "这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。音量极低,紧贴麦克风,语速极慢,旨在制造触发听者颅内快感的声学刺激。",
}
)
return {
"prompt": DEFAULT_PROMPT,
"text": text,
"instruction": instruction,
"use_zero_spk_emb": True,
}
if case == "ip":
text = text_override or "这款产品的名字,叫变态坑爹牛肉丸。"
return {
"prompt": DEFAULT_PROMPT,
"text": text,
"instruction": create_instruction({"IP": "灵小甄"}),
"use_zero_spk_emb": True,
}
if case == "basic":
text = text_override or "我们当迎着阳光辛勤耕作,去摘取,去制作,去品尝,去馈赠。"
return {
"prompt": DEFAULT_PROMPT,
"text": text,
"instruction": create_instruction({"语速": "快速", "基频": "中", "音量": "中"}),
"use_zero_spk_emb": True,
}
raise ValueError(f"Unknown case: {case}")


def save_audio(mm: dict[str, Any], output_path: str) -> None:
if not mm or "audio" not in mm:
raise RuntimeError("No audio found in model output")
audio = mm["audio"]
sr_raw = mm.get("sr", 44100)
if isinstance(sr_raw, torch.Tensor):
sample_rate = int(sr_raw.item())
else:
sample_rate = int(sr_raw)
waveform = audio.squeeze().float().cpu().numpy()
sf.write(output_path, waveform, sample_rate)
print(f"Saved {output_path} ({len(waveform) / sample_rate:.2f}s, {sample_rate}Hz)")


def parse_args():
parser = FlexibleArgumentParser(description="Ming-flash-omni standalone talker offline e2e example")
parser.add_argument("--model", type=str, default=MODEL_NAME, help="Model name or local path.")
parser.add_argument(
"--stage-configs-path",
type=str,
default=DEFAULT_STAGE_CONFIG,
help="Path to stage configs yaml for standalone talker deployment.",
)
parser.add_argument(
"--case",
type=str,
default="style",
choices=["style", "ip", "basic"],
help="Example case.",
)
parser.add_argument("--text", type=str, default=None, help="Override default text for the selected case.")
parser.add_argument("--output", type=str, default=None, help="Output wav path.")
parser.add_argument("--log-stats", action="store_true", default=False, help="Enable stats logging.")
parser.add_argument("--init-timeout", type=int, default=600, help="Engine init timeout in seconds.")
parser.add_argument("--stage-init-timeout", type=int, default=300, help="Single stage init timeout in seconds.")
return parser.parse_args()


def main():
args = parse_args()

omni = Omni(
model=args.model,
stage_configs_path=args.stage_configs_path,
trust_remote_code=True,
log_stats=args.log_stats,
init_timeout=args.init_timeout,
stage_init_timeout=args.stage_init_timeout,
)

messages = get_messages(args.case, args.text)
decode_args = {
# Standalone TTS deployment
"ming_task": "instruct",
"max_decode_steps": 200,
"cfg": 2.0,
"sigma": 0.25,
"temperature": 0.0,
}
req = OmniTokensPrompt(
prompt_token_ids=[0],
additional_information={**messages, **decode_args},
)

outputs = omni.generate(req)
mm = outputs[0].outputs[0].multimodal_output

output_path = args.output or f"output_{args.case}.wav"
save_audio(mm, output_path)
omni.close()


if __name__ == "__main__":
main()