[Model] Add GLM-TTS text-to-speech model support #3141

Open
BeatSeat wants to merge 3 commits into vllm-project:main from BeatSeat:feat/glm-tts-support

Conversation

@BeatSeat

This PR adds support for GLM-TTS, a two-stage text-to-speech model from Zhipu AI.

Architecture:

  • Stage 0 (AR): Llama-based autoregressive model for text to speech tokens
  • Stage 1 (DiT): Flow matching DiT for speech tokens to audio waveform

Key features:

  • Dynamic min_len/max_len based on text token length (min/max_token_text_ratio)
  • RAS (Repetition Aware Sampling) for stable generation
  • Voice cloning support via reference audio
  • OpenAI-compatible /v1/audio/speech API

Signed-off-by: BeatSeat wendavid552@gmail.com

Purpose

Add GLM-TTS (Zhipu AI) as a two-stage TTS pipeline in vLLM-Omni.

Stage 0 — AR Code Predictor:

  • Llama-architecture autoregressive model converting text into speech tokens (single VQ codebook, 32K vocab)
  • WhisperVQ encoder for reference audio encoding (voice cloning)
  • Text frontend with Chinese/English normalization (WeTextProcessing + inflect)
  • RAS (Repetition-Aware Sampling): top_p=0.8, top_k=25, win_size=10, tau_r=0.1
  • Dynamic min_len/max_len based on min_token_text_ratio=2.0 / max_token_text_ratio=20.0
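The sampling behavior above can be sketched as follows. This is an illustrative reconstruction from the listed parameters, not the PR's actual implementation: the helper names are hypothetical, and the RAS fallback strategy (uniform resampling when a token repeats too often in the recent window) is one common interpretation of repetition-aware sampling.

```python
import random
from collections import Counter

MIN_TOKEN_TEXT_RATIO = 2.0
MAX_TOKEN_TEXT_RATIO = 20.0

def dynamic_length_bounds(num_text_tokens: int) -> tuple[int, int]:
    """Derive speech-token length bounds from the text token count."""
    min_len = int(num_text_tokens * MIN_TOKEN_TEXT_RATIO)
    max_len = int(num_text_tokens * MAX_TOKEN_TEXT_RATIO)
    return min_len, max_len

def ras_sample(candidate: int, history: list[int],
               win_size: int = 10, tau_r: float = 0.1,
               vocab_size: int = 32768) -> int:
    """Repetition-Aware Sampling sketch: if `candidate` appears in more
    than tau_r of the last win_size tokens, fall back to uniform random
    sampling to break the repetition loop."""
    window = history[-win_size:]
    if window and Counter(window)[candidate] / len(window) > tau_r:
        return random.randrange(vocab_size)
    return candidate
```

For example, a 10-token text prompt would yield speech-token bounds of (20, 200) under these ratios.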

Stage 1 — Flow Matching DiT:

  • DiT transformer: speech tokens → mel-spectrogram via flow matching (Euler ODE, 32 steps)
  • HiFT vocoder: mel → 22050 Hz mono waveform
  • CAMPPlus speaker embedding for voice cloning
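Conceptually, the 32-step Euler ODE solve in the DiT stage integrates a learned velocity field from noise at t=0 toward the mel-spectrogram at t=1. A minimal sketch, where the toy `velocity` function stands in for the DiT network (which in the real pipeline is conditioned on speech tokens and the speaker embedding):

```python
import numpy as np

def velocity(x: np.ndarray, t: float) -> np.ndarray:
    # Toy stand-in for the DiT's predicted flow velocity.
    return -x

def euler_flow(x0: np.ndarray, num_steps: int = 32) -> np.ndarray:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0.copy(), 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x
```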

Online Serving:

  • All five integration points in serving_speech.py implemented, following the _tts_model_type string-dispatch pattern
  • Voice cloning via ref_audio (URL / data URI) + ref_text
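Since `ref_audio` accepts a data URI as well as a URL, a local reference clip can be inlined into the request. A sketch of that encoding (the `ref_audio` field name follows this PR's description; the helper itself is illustrative):

```python
import base64
import mimetypes
import pathlib

def audio_data_uri(path: str) -> str:
    """Encode a local audio file as a data URI, e.g. data:audio/...;base64,..."""
    mime = mimetypes.guess_type(path)[0] or "audio/wav"
    payload = base64.b64encode(pathlib.Path(path).read_bytes()).decode()
    return f"data:{mime};base64,{payload}"
```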

Closes #821
Related: #808, #834

Files

Category Path
AR model (6 files) vllm_omni/model_executor/models/glm_tts/
DiT pipeline (3 files) vllm_omni/diffusion/models/glm_tts/
Stage configs vllm_omni/model_executor/stage_configs/glm_tts.yaml, glm_tts_async_chunk.yaml
Input processor vllm_omni/model_executor/stage_input_processors/glm_tts.py
Config registration vllm_omni/transformers_utils/configs/glm_tts.py
Registries vllm_omni/model_executor/models/registry.py, vllm_omni/diffusion/registry.py
Serving vllm_omni/entrypoints/openai/serving_speech.py
Examples examples/offline_inference/glm_tts/, examples/online_serving/glm_tts/
E2E tests tests/e2e/offline_inference/test_glm_tts.py, tests/e2e/online_serving/test_glm_tts.py
CI .buildkite/test-ready.yml, .buildkite/test-merge.yml

Test Plan

Automated E2E Tests

Offline (tests/e2e/offline_inference/test_glm_tts.py):

  • test_offline_tts_zh — basic Chinese TTS (core_model)
  • test_offline_tts_long_text — longer Chinese text (core_model)
  • test_offline_voice_clone_zh — voice cloning with public URL ref audio (advanced_model)

Online (tests/e2e/online_serving/test_glm_tts.py):

  • test_basic_tts_zh — /v1/audio/speech basic TTS (core_model)
  • test_basic_tts_long_text — longer text via API (core_model)
  • test_voice_clone_zh — voice cloning via API (advanced_model)
  • test_models_endpoint — /v1/models endpoint (advanced_model)

Buildkite CI:

  • test-ready.yml: core_model, gpu_1_queue, 20min
  • test-merge.yml: advanced_model, gpu_1_queue, 20min

Manual Commands

# Offline
python examples/offline_inference/glm_tts/end2end.py \
  --model THUDM/GLM-TTS --text "你好,这是一个语音合成测试。" --output-dir ./output

# Server
vllm-omni serve THUDM/GLM-TTS \
  --stage-configs-path vllm_omni/model_executor/stage_configs/glm_tts.yaml \
  --trust-remote-code --enforce-eager

# Client
python examples/online_serving/glm_tts/openai_speech_client.py \
  --text "你好,这是一个语音合成测试。"

# Voice cloning
python examples/online_serving/glm_tts/openai_speech_client.py \
  --text "你好" \
  --ref-audio "https://raw.githubusercontent.com/zai-org/GLM-TTS/main/examples/prompt/jiayan_zh.wav" \
  --ref-text "他当时还跟线下其他的站姐吵架,然后,打架进局子了。"

Test Result

Environment

Tested on my own device:

GPU: NVIDIA A40 (48GB), Python 3.10, CUDA 12.x
Config: enforce_eager=true, async_scheduling=false

Offline Inference (10 samples, 12-50 chars)

| Mode | Samples | Avg Wall | Avg Audio | Total RTF | Throughput |
|---|---|---|---|---|---|
| No-clone | 10 | 4.28s | 6.82s | 0.628 | 1.59x realtime |
| Voice clone | 10 | 4.66s | 6.84s | 0.682 | 1.47x realtime |
Per-sample breakdown (no-clone):

| # | Chars | Wall(s) | Audio(s) | RTF |
|---|---|---|---|---|
| 0 | 14 | 3.44 | 2.98 | 1.154 |
| 1 | 12 | 2.05 | 2.74 | 0.749 |
| 2 | 26 | 5.03 | 5.50 | 0.914 |
| 3 | 24 | 3.21 | 5.06 | 0.635 |
| 4 | 33 | 4.41 | 7.54 | 0.584 |
| 5 | 50 | 6.20 | 11.34 | 0.547 |
| 6 | 45 | 6.02 | 11.34 | 0.531 |
| 7 | 35 | 4.12 | 7.14 | 0.577 |
| 8 | 32 | 4.21 | 7.46 | 0.564 |
| 9 | 27 | 4.15 | 7.06 | 0.587 |
  • Both modes faster than realtime (RTF < 1.0)
  • Clone mode ~10% slower (WhisperVQ encoding overhead)
  • DiT stage ~700ms/request; AR stage is the bottleneck
  • No existing model behavior modified
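For reference, the real-time factor (RTF) here is wall-clock time divided by generated audio duration, so values below 1.0 are faster than realtime; throughput is the inverse ratio. Using the no-clone averages from the table above:

```python
def rtf(wall_s: float, audio_s: float) -> float:
    """Real-time factor: seconds of compute per second of audio."""
    return wall_s / audio_s

def throughput(wall_s: float, audio_s: float) -> float:
    """Realtime multiple: seconds of audio produced per second of compute."""
    return audio_s / wall_s

# No-clone averages: 4.28 s wall-clock for 6.82 s of audio.
print(round(rtf(4.28, 6.82), 3))         # 0.628
print(round(throughput(4.28, 6.82), 2))  # 1.59
```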

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md

@BeatSeat BeatSeat changed the title [Model] Add GLM-TTS text-to-speech model support [WIP][Model] Add GLM-TTS text-to-speech model support Apr 25, 2026
@BeatSeat BeatSeat force-pushed the feat/glm-tts-support branch from 5b65770 to 7b1f0c3 Compare April 25, 2026 09:23
@BeatSeat
Author

Since @zhangj1an is still working on Kimi Audio, I tried to make a contribution first. This is my first time contributing to the vLLM community. Please let me know if I missed anything. Thank you!

@zhangj1an
Contributor

zhangj1an commented Apr 25, 2026

I'm new to adding models too; here is some feedback I received that may be helpful to you:

  • Regarding model configs, refer to #2072 (the part on How to Add a New Model) and #3065.
  • Regarding coding style, I cross-checked against the review comments in #2906 and the code in #2890 (in general I was asked to reuse as much code as possible).
  • It is optional but encouraged to add a model recipe; see this example: https://github.com/vllm-project/vllm-omni/blob/main/recipes/Qwen/Qwen3-Omni.md
  • Regarding adding tests, refer to this guide: https://github.com/vllm-project/vllm-omni/blob/main/docs/contributing/ci/CI_5levels.md
  • Fix the pre-commit checks.
  • When the PR is ready, convert it to ready for review and cc yueqian.

@hsliuustc0106
Collaborator

@JaredforReal

@BeatSeat BeatSeat force-pushed the feat/glm-tts-support branch from 1a7f46c to 33abb4b Compare April 26, 2026 13:37
Signed-off-by: BeatSeat <wendavid552@gmail.com>
@BeatSeat BeatSeat force-pushed the feat/glm-tts-support branch from 66da285 to 6087baa Compare April 26, 2026 18:36
@BeatSeat BeatSeat changed the title [WIP][Model] Add GLM-TTS text-to-speech model support [Model] Add GLM-TTS text-to-speech model support Apr 26, 2026
@BeatSeat BeatSeat marked this pull request as ready for review April 26, 2026 18:37
@BeatSeat BeatSeat requested a review from hsliuustc0106 as a code owner April 26, 2026 18:37
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.


Development

Successfully merging this pull request may close these issues.

[New Model]: https://huggingface.co/zai-org/GLM-TTS
