Merged
43 commits
ce774a3
add draft
yuanheng-zhao Apr 7, 2026
d80c101
temp draft upd
yuanheng-zhao Apr 10, 2026
e71e910
apply x_transformers utils
yuanheng-zhao Apr 10, 2026
0cb5db2
add e2e TTS
yuanheng-zhao Apr 10, 2026
be36fc0
upd ming online TTS
yuanheng-zhao Apr 11, 2026
9d048b0
upd thinker -> talker (1/N)
yuanheng-zhao Apr 11, 2026
8737080
cleanup omni unified model
yuanheng-zhao Apr 11, 2026
dd24397
upd tts and omni speech paths via ming task type
yuanheng-zhao Apr 12, 2026
a44a751
segment text (without TalkerTN)
yuanheng-zhao Apr 12, 2026
14785df
omni-speech path: default zero spk emb; port voice settings
yuanheng-zhao Apr 12, 2026
812e479
voice register quick fix
yuanheng-zhao Apr 12, 2026
67a2bdd
upd ming yaml
yuanheng-zhao Apr 14, 2026
a881381
quick fix local voice preset path
yuanheng-zhao Apr 16, 2026
3b3fc6a
fix(ming-talker): preserve voice reference across segments and improv…
LHXuuu Apr 16, 2026
461e643
trivial: rm unused code
yuanheng-zhao Apr 17, 2026
903748b
trivial: cleanup talker/text processing comments
yuanheng-zhao Apr 17, 2026
f7a1f87
fix code consistency
yuanheng-zhao Apr 17, 2026
fa06e47
trivial: ruff
yuanheng-zhao Apr 18, 2026
4f95a8d
upd use mrope handling
yuanheng-zhao Apr 18, 2026
2dc24b8
upd Ming e2e and readme
yuanheng-zhao Apr 18, 2026
b762095
trivial: fix pre-commit
yuanheng-zhao Apr 18, 2026
f276db1
complement ming tests
yuanheng-zhao Apr 19, 2026
3a1fd78
rm training args
yuanheng-zhao Apr 19, 2026
f62c371
code cleanup
yuanheng-zhao Apr 19, 2026
8ae4a56
cleanup code talker CFM
yuanheng-zhao Apr 19, 2026
179e44b
upd Ming serving speech args
yuanheng-zhao Apr 19, 2026
c839df9
Canonicalize ref headers to Ming repo
yuanheng-zhao Apr 19, 2026
7a42f5e
Merge branch 'main' into model/ming-omni-talker-draft
yuanheng-zhao Apr 19, 2026
226b3a5
upd talker modules type annot
yuanheng-zhao Apr 20, 2026
aceb49a
upd checks in talker module
yuanheng-zhao Apr 20, 2026
4f6d0a3
refactor talker cls
yuanheng-zhao Apr 20, 2026
faea882
upd ref headers
yuanheng-zhao Apr 20, 2026
3aa5d9e
Add ming recipe and trim example readme
yuanheng-zhao Apr 20, 2026
cfda612
upd recipe
yuanheng-zhao Apr 20, 2026
3e161af
audio generator step debug log
yuanheng-zhao Apr 20, 2026
0b1c105
trim readme
yuanheng-zhao Apr 20, 2026
91ee8c3
Merge from main
yuanheng-zhao Apr 20, 2026
1482a8b
upd e2e test imports
yuanheng-zhao Apr 21, 2026
4a7548c
Merge branch 'main' into model/ming-omni-talker-draft
hsliuustc0106 Apr 21, 2026
b07d48e
rm ming expansion test; add module dummy tests
yuanheng-zhao Apr 21, 2026
3a1ba78
Merge branch 'main' into model/ming-omni-talker-draft
yuanheng-zhao Apr 21, 2026
b1aaf56
put talker modules into a single file
yuanheng-zhao Apr 22, 2026
00a5009
Merge branch 'main' into model/ming-omni-talker-draft
yuanheng-zhao Apr 22, 2026
94 changes: 55 additions & 39 deletions examples/offline_inference/ming_flash_omni/README.md
@@ -1,75 +1,91 @@
# Ming-flash-omni 2.0

[Ming-flash-omni-2.0](https://github.com/inclusionAI/Ming) is an omni-modal model supporting text, image, video, and audio understanding, with outputs in text, image, and audio. For now, Ming-flash-omni-2.0 in vLLM-Omni is supported with thinker stage (multi-modal understanding).
[Ming-flash-omni-2.0](https://github.com/inclusionAI/Ming) is an omni-modal model supporting text, image, video, and audio understanding, with text and speech outputs.

vLLM-Omni supports two deployment modes:

| Mode | Stage config | Output |
|------|-------------|--------|
| Thinker only (multimodal understanding) | `ming_flash_omni_thinker.yaml` (default `--omni`) | Text |
| Thinker + Talker (omni-speech) | `ming_flash_omni.yaml` | Text + Audio |

For standalone TTS (talker only), see [`examples/offline_inference/ming_flash_omni_tts/`](../ming_flash_omni_tts/).

## Setup

Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.

The default `--omni` flag runs thinker only. For omni-speech, pass the two-stage config explicitly:

```bash
--stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml
```

## Run examples

### Text-only
The end-to-end script defaults to built-in assets; pass `--image-path`,
`--audio-path`, or `--video-path` to override.

```bash
# Text-only
python examples/offline_inference/ming_flash_omni/end2end.py --query-type text

# Image / audio / video / mixed understanding
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_image
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_audio
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_video --num-frames 16
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_mixed_modalities \
--image-path /path/to/image.jpg --audio-path /path/to/audio.wav
```

#### Reasoning (Thinking Mode)

Reasoning (Thinking) mode is enabled via applying "detailed thinking on" when building the system prompt template (in `apply_chat_template`).

In the end2end example, a default problem for thinking mode is provided, as referred to the example usage of Ming's cookbook;
To utilize it, you have to download the example figure from https://github.com/inclusionAI/Ming/blob/3954fcb880ff5e61ff128bcf7f1ec344d46a6fe3/figures/cases/3_0.png
Reasoning ("detailed thinking on") is applied by the script when
`--query-type reasoning` is set. The default prompt matches Ming's cookbook
and expects the reference figure from the upstream repo — see
`get_reasoning_query` in `end2end.py`.

```bash
python examples/offline_inference/ming_flash_omni/end2end.py -q reasoning --image-path ./3_0.png
```

### Image understanding
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_image
### Omni-speech (thinker + talker)

# With a local image
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_image --image-path /path/to/image.jpg
```
To enable spoken output, use the two-stage config and request `audio` (or `text,audio`) modalities.
The thinker processes your multimodal input and generates text; the talker then synthesises the response as speech.

### Audio understanding
**Audio-only output** (speech response, no text):
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_audio

# With a local audio file
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_audio --audio-path /path/to/audio.wav
python examples/offline_inference/ming_flash_omni/end2end.py \
--query-type text \
--stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml \
--modalities audio \
--output-dir output_ming_omni_speech
```

### Video understanding
**Both text and audio output**:
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_video

# With a local video and custom frame count
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_video --video-path /path/to/video.mp4 --num-frames 16
python examples/offline_inference/ming_flash_omni/end2end.py \
--query-type use_audio \
--stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml \
--modalities text,audio \
--output-dir output_ming_omni_speech
```

### Mixed modalities (image + audio)
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_mixed_modalities \
--image-path /path/to/image.jpg \
--audio-path /path/to/audio.wav
```
Generated `.wav` files are saved to `--output-dir` (default `output_ming`), one per request.

If media file paths are not provided, the script uses built-in default assets.
The stage config allocates thinker on GPUs 0–3 and talker on GPU 3 by default. Adjust `devices` in the YAML to match your hardware.
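
To sanity-check a generated clip, you can read it back with `soundfile` (already a dependency of the example). The file name below is a placeholder; one `.wav` is written per request id:

```python
import soundfile as sf

# "output_ming/<request_id>.wav" is a placeholder path; one file is written per request.
data, sr = sf.read("output_ming/<request_id>.wav")
print(f"{len(data) / sr:.2f}s at {sr} Hz")
```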

### Modality control
To control output modalities (e.g. text-only output):
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_audio --modalities text
```

*For now, only text output is supported*
| `--modalities` | Thinker output | Talker | Saved files |
|---------------|----------------|--------|-------------|
| `text` (default) | Text | Not run | `<id>.txt` |
| `audio` | Text (internal) | Runs | `<id>.wav` |
| `text,audio` | Text | Runs | `<id>.txt` + `<id>.wav` |

### Custom stage config
```bash
python examples/offline_inference/ming_flash_omni/end2end.py --query-type use_image \
--stage-configs-path /path/to/your_config.yaml
```
Pass `--stage-configs-path /path/to/your_config.yaml` to any of the commands
above to override the stage config.
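
For reference, this is the setup portion of the offline Python API behind these commands, condensed from `end2end.py` in this directory (the thinker `max_tokens` value is illustrative; see the script for the full generate and save loop):

```python
from vllm import SamplingParams
from vllm_omni.entrypoints.omni import Omni

# Two-stage (thinker + talker) deployment; adjust paths and devices for your setup.
omni = Omni(
    model="Jonathan1909/Ming-flash-omni-2.0",
    stage_configs_path="vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml",
    trust_remote_code=True,
)

thinker_sp = SamplingParams(max_tokens=512, detokenize=True)  # max_tokens here is illustrative
# Talker sampling is a no-op (it runs its own CFM + AudioVAE loop); max_tokens=1 satisfies the scheduler.
talker_sp = SamplingParams(temperature=0.0, max_tokens=1)
sampling_params_list = [thinker_sp, talker_sp][: omni.num_stages]
```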

## Online serving

26 changes: 24 additions & 2 deletions examples/offline_inference/ming_flash_omni/end2end.py
@@ -7,6 +7,7 @@
from typing import NamedTuple

import numpy as np
import soundfile as sf
import vllm
from PIL import Image
from transformers import AutoProcessor
@@ -319,7 +320,16 @@ def main(args):
seed=SEED,
detokenize=True,
)
sampling_params_list = [thinker_sampling_params]
# Talker (ming_tts) uses a custom generation loop (CFM + AudioVAE);
# vLLM sampling is a no-op here — max_tokens=1 just satisfies the scheduler.
talker_sampling_params = SamplingParams(
temperature=0.0,
max_tokens=1,
)
all_sampling_params = [thinker_sampling_params, talker_sampling_params]
# Match sampling params to the number of configured stages
# (thinker-only yaml → 1, thinker+talker yaml → 2).
sampling_params_list = all_sampling_params[: omni.num_stages]

prompts = [query_result.inputs for _ in range(args.num_prompts)]

@@ -362,7 +372,19 @@ def main(args):
print(f"Failed to write output file {out_txt}: {e}")

elif stage_outputs.final_output_type == "audio":
raise NotImplementedError("Add audio example after talker supported.")
request_id = output.request_id
mm = output.outputs[0].multimodal_output
if mm and "audio" in mm:
audio = mm["audio"]
sr_raw = mm.get("sr", 44100)
sample_rate = int(sr_raw.item() if hasattr(sr_raw, "item") else sr_raw)
audio_numpy = audio.float().squeeze().cpu().numpy()
output_wav = os.path.join(output_dir, f"{request_id}.wav")
sf.write(output_wav, audio_numpy, samplerate=sample_rate, format="WAV")
print(
f"Request ID: {request_id}, audio saved to {output_wav} "
f"({len(audio_numpy) / sample_rate:.2f}s, {sample_rate}Hz)"
)

processed_count += 1
if profiler_enabled and processed_count >= total_requests:
47 changes: 47 additions & 0 deletions examples/offline_inference/ming_flash_omni_tts/README.md
@@ -0,0 +1,47 @@
# Ming-flash-omni Standalone TTS (Offline)

This example runs **Ming-flash-omni-2.0 talker-only** offline inference with:

- `model`: `Jonathan1909/Ming-flash-omni-2.0`
- `stage config`: `vllm_omni/model_executor/stage_configs/ming_flash_omni_tts.yaml`

It follows the Ming cookbook parameter style:

- `prompt`: `"Please generate speech based on the following description.\n"`
- `max_decode_steps`: `200`
- `cfg`: `2.0`
- `sigma`: `0.25`
- `temperature`: `0.0`
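
These values are passed through the request's `additional_information` (see `end2end.py` below); condensed:

```python
decode_args = {
    "ming_task": "instruct",   # standalone TTS path
    "max_decode_steps": 200,
    "cfg": 2.0,
    "sigma": 0.25,
    "temperature": 0.0,
}
# merged with the case's prompt/text/instruction before building OmniTokensPrompt
```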

## Quick Start

```bash
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case style
```

## Cases

```bash
# Style
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case style

# IP
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case ip

# Basic (speed/pitch/volume control)
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case basic
```

## Useful Arguments

- `--text`: override default text in the selected case
- `--output`: custom output wav path
- `--model`: local model path or HF repo id
- `--stage-configs-path`: custom talker stage config path
- `--log-stats`: enable runtime stats logs

## Notes

- This directory is for **standalone talker deployment (TTS)**.
- For Ming thinker multimodal understanding examples, see:
`examples/offline_inference/ming_flash_omni/`.
129 changes: 129 additions & 0 deletions examples/offline_inference/ming_flash_omni_tts/end2end.py
@@ -0,0 +1,129 @@
"""Offline e2e example for Ming-flash-omni-2.0 standalone talker (TTS)."""

import os
from typing import Any

import soundfile as sf
import torch
from vllm.utils.argparse_utils import FlexibleArgumentParser

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniTokensPrompt
from vllm_omni.model_executor.models.ming_flash_omni.prompt_utils import (
DEFAULT_PROMPT,
create_instruction,
)

MODEL_NAME = "Jonathan1909/Ming-flash-omni-2.0"
DEFAULT_STAGE_CONFIG = "vllm_omni/model_executor/stage_configs/ming_flash_omni_tts.yaml"


def get_messages(case: str, text_override: str | None) -> dict[str, Any]:
if case == "style":
text = text_override or "我会一直在这里陪着你,直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗?"
instruction = create_instruction(
{
"风格": "这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。音量极低,紧贴麦克风,语速极慢,旨在制造触发听者颅内快感的声学刺激。",
}
)
return {
"prompt": DEFAULT_PROMPT,
"text": text,
"instruction": instruction,
"use_zero_spk_emb": True,
}
if case == "ip":
text = text_override or "这款产品的名字,叫变态坑爹牛肉丸。"
return {
"prompt": DEFAULT_PROMPT,
"text": text,
"instruction": create_instruction({"IP": "灵小甄"}),
"use_zero_spk_emb": True,
}
if case == "basic":
text = text_override or "我们当迎着阳光辛勤耕作,去摘取,去制作,去品尝,去馈赠。"
return {
"prompt": DEFAULT_PROMPT,
"text": text,
"instruction": create_instruction({"语速": "快速", "基频": "中", "音量": "中"}),
"use_zero_spk_emb": True,
}
raise ValueError(f"Unknown case: {case}")


def save_audio(mm: dict[str, Any], output_path: str) -> None:
if not mm or "audio" not in mm:
raise RuntimeError("No audio found in model output")
audio = mm["audio"]
sr_raw = mm.get("sr", 44100)
if isinstance(sr_raw, torch.Tensor):
sample_rate = int(sr_raw.item())
else:
sample_rate = int(sr_raw)
waveform = audio.squeeze().float().cpu().numpy()
sf.write(output_path, waveform, sample_rate)
print(f"Saved {output_path} ({len(waveform) / sample_rate:.2f}s, {sample_rate}Hz)")


def parse_args():
parser = FlexibleArgumentParser(description="Ming-flash-omni standalone talker offline e2e example")
parser.add_argument("--model", type=str, default=MODEL_NAME, help="Model name or local path.")
parser.add_argument(
"--stage-configs-path",
type=str,
default=DEFAULT_STAGE_CONFIG,
help="Path to stage configs yaml for standalone talker deployment.",
)
parser.add_argument(
"--case",
type=str,
default="style",
choices=["style", "ip", "basic"],
help="Example case.",
)
parser.add_argument("--text", type=str, default=None, help="Override default text for the selected case.")
parser.add_argument("--output", type=str, default=None, help="Output wav path.")
parser.add_argument("--log-stats", action="store_true", default=False, help="Enable stats logging.")
parser.add_argument("--init-timeout", type=int, default=600, help="Engine init timeout in seconds.")
parser.add_argument("--stage-init-timeout", type=int, default=300, help="Single stage init timeout in seconds.")
return parser.parse_args()


def main():
args = parse_args()

omni = Omni(
model=args.model,
stage_configs_path=args.stage_configs_path,
trust_remote_code=True,
log_stats=args.log_stats,
init_timeout=args.init_timeout,
stage_init_timeout=args.stage_init_timeout,
)

messages = get_messages(args.case, args.text)
decode_args = {
# Standalone TTS deployment
"ming_task": "instruct",
"max_decode_steps": 200,
"cfg": 2.0,
"sigma": 0.25,
"temperature": 0.0,
}
req = OmniTokensPrompt(
prompt_token_ids=[0],
additional_information={**messages, **decode_args},
)

outputs = omni.generate(req)
mm = outputs[0].outputs[0].multimodal_output

output_path = args.output or f"output_{args.case}.wav"
save_audio(mm, output_path)
omni.close()


if __name__ == "__main__":
main()