1 change: 1 addition & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -16,6 +16,7 @@ PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTT
- [ ] The test results. Please paste the results comparison before and after, or e2e results.
- [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model. **Please run `mkdocs serve` to sync the documentation editions to `./docs`.**
- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft.
- [ ] (HT fork) If this PR adds or changes HT-specific functionality, update the **HT Fork Changes** section in `README.md`.
</details>

**BEFORE SUBMITTING, PLEASE READ <https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md>** (anything written below this line will be removed by GitHub Actions)
34 changes: 31 additions & 3 deletions README.md
@@ -1,7 +1,7 @@
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/logos/vllm-omni-logo.png">
<img alt="vllm-omni" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/logos/vllm-omni-logo.png" width=55%>
<source media="(prefers-color-scheme: dark)" src="docs/source/logos/ht-vllm-omni-logo.png">
<img alt="ht-vllm-omni" src="docs/source/logos/ht-vllm-omni-logo.png" width=55%>
</picture>
</p>
<h3 align="center">
@@ -13,11 +13,39 @@ Easy, fast, and cheap omni-modality model serving for everyone
</p>


---

## HT Fork Changes

This is the [Heiervang Technologies](https://github.com/heiervang-technologies) fork of vLLM-Omni. The `ht` branch contains the following changes on top of upstream `main`:

### Qwen3 TTS Streaming
- HTTP-level streaming for TTS speech API (`/v1/audio/speech`)
- Model-level streaming for TTS with chunked audio output
- `tts-stream` bash tool for low-latency streaming playback
- Preset voice support for tts-stream
- Stride-0 tensor serialization fix for streaming TTS
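
A minimal client sketch for the streaming endpoint listed above, using only the Python standard library. The path `/v1/audio/speech` comes from this list; the JSON field names (`model`, `input`, `voice`) follow the OpenAI speech API convention and are assumptions here, as are the helper names `build_speech_request` and `stream_to_file`. Whether the server needs an extra flag to enable streaming is not stated in this section.

```python
import json
import urllib.request


def build_speech_request(base_url: str, text: str, voice: str, model: str) -> urllib.request.Request:
    """Build a POST request for the speech endpoint (payload field names are assumed)."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_to_file(request: urllib.request.Request, out_path: str, chunk_size: int = 4096) -> int:
    """Write the response body to disk chunk by chunk; return the byte count."""
    total = 0
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)  # returns b"" once the body is exhausted
            if not chunk:
                break
            out.write(chunk)
            total += len(chunk)
    return total
```

A call might look like `stream_to_file(build_speech_request("http://localhost:8000", "Hello", "Vivian", "<model id>"), "speech.out")`, where the host, port, and output name are hypothetical. Consuming the body incrementally is what lets a player start before synthesis has finished.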

### Qwen3 TTS Performance
- Manual KV-cached loop for code predictor — avoids redundant recomputation *(upstream has this for qwen3_omni; HT's version targets qwen3_tts)*
- Regional `torch.compile` for code predictor decoder layers *(upstream has this for qwen3_omni; HT's version targets qwen3_tts)*
- CUDA graph support for speech tokenizer decoder *(cherry-picked from unmerged upstream PR [#1205](https://github.com/vllm-project/vllm-omni/pull/1205))*

### Qwen3 TTS Bug Fixes
- ~~Fix Qwen3 TTS 0.6B profile run hang~~ *(now in upstream)*
- ~~Cap `max_new_tokens` during profile run instead of short-circuiting~~ *(now in upstream)*
- SDPA attention fallback when flash-attn is unavailable
- ~~Handle single tensor in audio frame metrics for non-streaming TTS~~ *(superseded by upstream fix)*
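
The SDPA fallback listed above can be sketched as a small selection helper. This is an illustration, not the fork's code: `pick_attn_implementation` is a hypothetical name, and `flash_attention_2` and `sdpa` are the strings Hugging Face Transformers accepts for `attn_implementation`. Whether the fork selects its backend this way is an assumption.

```python
import importlib.util


def pick_attn_implementation(module_name: str = "flash_attn") -> str:
    """Return "flash_attention_2" if the given package is installed, else "sdpa"."""
    # find_spec only checks that the package can be located; it does not
    # prove that importing it will succeed on the current GPU/driver.
    if importlib.util.find_spec(module_name) is not None:
        return "flash_attention_2"
    return "sdpa"
```

The result could then be passed as `attn_implementation=pick_attn_implementation()` to a `from_pretrained` call that supports that keyword.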

### Speaker Embedding
- Speaker embedding support for voice cloning (`speaker_embedding` parameter)
- Speaker embedding examples and inference scripts

---

*Latest News* 🔥

- [2026/02] We released [0.14.0](https://github.com/vllm-project/vllm-omni/releases/tag/v0.14.0) - This is the first **stable release** of vLLM-Omni that expands Omnis diffusion / image-video generation and audio / TTS stack, improves distributed execution and memory efficiency, and broadens platform/backend coverage (GPU/ROCm/NPU/XPU). It also brings meaningful upgrades to serving APIs, profiling & benchmarking, and overall stability. Please check our latest [paper](https://arxiv.org/abs/2602.02204) for architecture design and performance results.
- [2026/02] We released [0.14.0](https://github.com/vllm-project/vllm-omni/releases/tag/v0.14.0) - This is the first **stable release** of vLLM-Omni that expands Omni's diffusion / image-video generation and audio / TTS stack, improves distributed execution and memory efficiency, and broadens platform/backend coverage (GPU/ROCm/NPU/XPU). It also brings meaningful upgrades to serving APIs, profiling & benchmarking, and overall stability. Please check our latest [paper](https://arxiv.org/abs/2602.02204) for architecture design and performance results.
- [2026/01] We released [0.12.0rc1](https://github.com/vllm-project/vllm-omni/releases/tag/v0.12.0rc1) - a major RC milestone focused on maturing the diffusion stack, strengthening OpenAI-compatible serving, expanding omni-model coverage, and improving stability across platforms (GPU/NPU/ROCm).
- [2025/11] vLLM community officially released [vllm-project/vllm-omni](https://github.com/vllm-project/vllm-omni) in order to support omni-modality models serving.

178 changes: 178 additions & 0 deletions benchmarks/audio_quality_test.py
@@ -0,0 +1,178 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: Apache-2.0
"""Generate audio samples to verify generate_codes() produces correct output.

Loads the real Qwen3-TTS model and generates speech through the full pipeline:
    talker → code predictor (generate_codes()) → speech tokenizer decoder

Usage:
    python benchmarks/audio_quality_test.py [--model MODEL] [--device DEVICE]

Output WAV files are saved to output_audio/ for manual listening.
"""

import argparse
import logging
import os
import sys
import time
import types
from pathlib import Path

import numpy as np
import soundfile as sf
import torch

# ---------------------------------------------------------------------------
# Bootstrap: stub vllm modules so qwen3_tts.py can be imported standalone.
# The actual TTS model uses only transformers/torch — vllm is only needed
# by the Qwen3TTSModelForGeneration wrapper which we don't use here.
# ---------------------------------------------------------------------------
_REPO = Path(__file__).resolve().parents[1]

_STUB_FQNS = [
    "vllm", "vllm.config", "vllm.logger", "vllm.sequence",
    "vllm_omni", "vllm_omni.patch",
    "vllm_omni.diffusion", "vllm_omni.diffusion.compile",
    "vllm_omni.model_executor",
    "vllm_omni.model_executor.model_loader",
    "vllm_omni.model_executor.model_loader.weight_utils",
    "vllm_omni.model_executor.models",
    "vllm_omni.model_executor.models.output_templates",
    "vllm_omni.model_executor.models.qwen3_omni",
    "vllm_omni.model_executor.models.registry",
    "vllm_omni.model_executor.models.qwen3_tts",
]


def _setup_stubs():
    from typing import NamedTuple

    for fqn in _STUB_FQNS:
        if fqn not in sys.modules:
            mod = types.ModuleType(fqn)
            mod.__path__ = [str(_REPO / fqn.replace(".", "/"))]
            mod.__package__ = fqn
            mod.__spec__ = None
            sys.modules[fqn] = mod

    sys.modules["vllm.logger"].init_logger = lambda name: logging.getLogger(name)
    sys.modules["vllm.config"].VllmConfig = type("VllmConfig", (), {})

    class IntermediateTensors:
        def __init__(self, d=None):
            self.tensors = d or {}
    sys.modules["vllm.sequence"].IntermediateTensors = IntermediateTensors

    class OmniOutput(NamedTuple):
        text_hidden_states: object
        multimodal_outputs: dict | None = None
        intermediate_tensors: object | None = None
        next_token_id: object | None = None
    sys.modules["vllm_omni.model_executor.models.output_templates"].OmniOutput = OmniOutput

    sys.modules["vllm_omni.diffusion.compile"].regionally_compile = lambda *a, **kw: None
    sys.modules["vllm_omni.model_executor.model_loader.weight_utils"].download_weights_from_hf_specific = (
        lambda *a, **kw: None
    )
    sys.modules["vllm_omni.model_executor.models.qwen3_omni"].Qwen3OmniMoeForConditionalGeneration = type(
        "Stub", (), {}
    )
    sys.modules["vllm_omni.model_executor.models.registry"].OmniModelRegistry = type("Stub", (), {})

    for fqn in _STUB_FQNS:
        parts = fqn.split(".")
        if len(parts) > 1:
            parent = sys.modules.get(".".join(parts[:-1]))
            child = sys.modules.get(fqn)
            if parent and child:
                setattr(parent, parts[-1], child)


_setup_stubs()


def main():
    parser = argparse.ArgumentParser(description="TTS audio quality test")
    parser.add_argument(
        "--model",
        default="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
        help="HuggingFace model ID or local path",
    )
    parser.add_argument("--device", default="cuda:0")
    parser.add_argument("--output-dir", default="output_audio")
    args = parser.parse_args()

    device = torch.device(args.device)
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # ── Load model ──────────────────────────────────────────────────────
    print(f"Loading model: {args.model}")
    print(f"Device: {device}")
    t0 = time.perf_counter()

    from vllm_omni.model_executor.models.qwen3_tts.qwen3_tts import Qwen3TTSModel

    model = Qwen3TTSModel.from_pretrained(
        args.model,
        torch_dtype=torch.bfloat16,
        device_map=str(device),
    )
    load_time = time.perf_counter() - t0
    print(f"Model loaded in {load_time:.1f}s")
    print()

    # ── Test samples ────────────────────────────────────────────────────
    samples = [
        {
            "name": "english_hello",
            "text": "Hello! This is a test of the text to speech system. The quick brown fox jumps over the lazy dog.",
            "speaker": "Vivian",
            "language": "English",
        },
        {
            "name": "english_numbers",
            "text": "One, two, three, four, five. The year is twenty twenty six.",
            "speaker": "Vivian",
            "language": "English",
        },
    ]

    # ── Generate ────────────────────────────────────────────────────────
    for sample in samples:
        name = sample["name"]
        print(f"Generating: {name}")
        print(f" Text: {sample['text'][:60]}{'...' if len(sample['text']) > 60 else ''}")

        t0 = time.perf_counter()
        with torch.no_grad():
            wavs, sr = model.generate_custom_voice(
                text=sample["text"],
                speaker=sample["speaker"],
                language=sample["language"],
            )
        gen_time = time.perf_counter() - t0

        wav = wavs[0]
        duration = len(wav) / sr
        rms = np.sqrt(np.mean(wav.astype(np.float64) ** 2))

        out_path = out_dir / f"{name}.wav"
        sf.write(str(out_path), wav, sr)

        print(f" Duration: {duration:.2f}s")
        print(f" Sample rate: {sr} Hz")
        print(f" Samples: {len(wav):,}")
        print(f" RMS: {rms:.4f}")
        print(f" Min/Max: {wav.min():.4f} / {wav.max():.4f}")
        print(f" Gen time: {gen_time:.2f}s")
        print(f" RTF: {gen_time / duration:.2f}x")
        print(f" Saved: {out_path}")
        print()

    print("Done. Listen to the WAV files to verify audio quality.")


if __name__ == "__main__":
    main()
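
The bootstrap step in this script relies on a general Python mechanism: a module object placed in `sys.modules` satisfies later `import` statements without any file on disk. The sketch below shows that mechanism in isolation. The names `install_stub` and `fakepkg_demo` are hypothetical and exist only for this illustration; it is not the script's own helper.

```python
import sys
import types


def install_stub(fqn: str) -> types.ModuleType:
    """Register an empty placeholder module under `fqn` and link it to its parent."""
    mod = sys.modules.get(fqn)
    if mod is None:
        mod = types.ModuleType(fqn)
        mod.__path__ = []  # having __path__ makes the import system treat it as a package
        sys.modules[fqn] = mod
    parent_name, _, child_name = fqn.rpartition(".")
    if parent_name in sys.modules:
        # Expose the child as an attribute so `parent.child` access also works.
        setattr(sys.modules[parent_name], child_name, mod)
    return mod


# Stub a package and a submodule, then attach the one symbol callers need.
install_stub("fakepkg_demo")
logger_stub = install_stub("fakepkg_demo.logger")
logger_stub.init_logger = lambda name: f"logger:{name}"

# The import resolves from sys.modules; no real package is installed.
from fakepkg_demo.logger import init_logger
```

Stubbing keeps a standalone script light, at the cost that every symbol the imported code touches at import time has to be supplied by hand.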