[Model] Add GLM-TTS text-to-speech model support #3141

Open
BeatSeat wants to merge 3 commits into vllm-project:main from BeatSeat:feat/glm-tts-support

Conversation

@BeatSeat

This PR adds support for GLM-TTS, a two-stage text-to-speech model from Zhipu AI.

Architecture:

  • Stage 0 (AR): Llama-based autoregressive model for text to speech tokens
  • Stage 1 (DiT): Flow matching DiT for speech tokens to audio waveform

Key features:

  • Dynamic min_len/max_len based on text token length (min/max_token_text_ratio)
  • RAS (Repetition Aware Sampling) for stable generation
  • Voice cloning support via reference audio
  • OpenAI-compatible /v1/audio/speech API

Signed-off-by: BeatSeat wendavid552@gmail.com

Purpose

Add GLM-TTS (Zhipu AI) as a two-stage TTS pipeline in vLLM-Omni.

Stage 0 — AR Code Predictor:

  • Llama-architecture autoregressive model converting text into speech tokens (single VQ codebook, 32K vocab)
  • WhisperVQ encoder for reference audio encoding (voice cloning)
  • Text frontend with Chinese/English normalization (WeTextProcessing + inflect)
  • RAS (Repetition-Aware Sampling): top_p=0.8, top_k=25, win_size=10, tau_r=0.1
  • Dynamic min_len/max_len based on min_token_text_ratio=2.0 / max_token_text_ratio=20.0
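The sampling behavior above can be sketched as follows. This is an illustrative reconstruction from the listed parameters, not the PR's actual implementation: the helper names are hypothetical, and the RAS fallback strategy (uniform resampling when a token repeats too often in the recent window) is one common interpretation of repetition-aware sampling.

```python
import random
from collections import Counter

MIN_TOKEN_TEXT_RATIO = 2.0
MAX_TOKEN_TEXT_RATIO = 20.0

def dynamic_length_bounds(num_text_tokens: int) -> tuple[int, int]:
    """Derive speech-token length bounds from the text token count."""
    min_len = int(num_text_tokens * MIN_TOKEN_TEXT_RATIO)
    max_len = int(num_text_tokens * MAX_TOKEN_TEXT_RATIO)
    return min_len, max_len

def ras_sample(candidate: int, history: list[int],
               win_size: int = 10, tau_r: float = 0.1,
               vocab_size: int = 32768) -> int:
    """Repetition-Aware Sampling sketch: if `candidate` appears in more
    than tau_r of the last win_size tokens, fall back to uniform random
    sampling to break the repetition loop."""
    window = history[-win_size:]
    if window and Counter(window)[candidate] / len(window) > tau_r:
        return random.randrange(vocab_size)
    return candidate
```

For example, a 10-token text prompt would yield speech-token bounds of (20, 200) under these ratios.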

Stage 1 — Flow Matching DiT:

  • DiT transformer: speech tokens → mel-spectrogram via flow matching (Euler ODE, 32 steps)
  • HiFT vocoder: mel → 22050 Hz mono waveform
  • CAMPPlus speaker embedding for voice cloning
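Conceptually, the 32-step Euler ODE solve in the DiT stage integrates a learned velocity field from noise at t=0 toward the mel-spectrogram at t=1. A minimal sketch, where the toy `velocity` function stands in for the DiT network (which in the real pipeline is conditioned on speech tokens and the speaker embedding):

```python
import numpy as np

def velocity(x: np.ndarray, t: float) -> np.ndarray:
    # Toy stand-in for the DiT's predicted flow velocity.
    return -x

def euler_flow(x0: np.ndarray, num_steps: int = 32) -> np.ndarray:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0.copy(), 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x
```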

Online Serving:

  • All five integration points in serving_speech.py implemented, following the _tts_model_type string-dispatch pattern
  • Voice cloning via ref_audio (URL / data URI) + ref_text
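Since `ref_audio` accepts a data URI as well as a URL, a local reference clip can be inlined into the request. A sketch of that encoding (the `ref_audio` field name follows this PR's description; the helper itself is illustrative):

```python
import base64
import mimetypes
import pathlib

def audio_data_uri(path: str) -> str:
    """Encode a local audio file as a data URI, e.g. data:audio/...;base64,..."""
    mime = mimetypes.guess_type(path)[0] or "audio/wav"
    payload = base64.b64encode(pathlib.Path(path).read_bytes()).decode()
    return f"data:{mime};base64,{payload}"
```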

Closes #821
Related: #808, #834

Files

Category Path
AR model (6 files) vllm_omni/model_executor/models/glm_tts/
DiT pipeline (3 files) vllm_omni/diffusion/models/glm_tts/
Stage configs vllm_omni/model_executor/stage_configs/glm_tts.yaml, glm_tts_async_chunk.yaml
Input processor vllm_omni/model_executor/stage_input_processors/glm_tts.py
Config registration vllm_omni/transformers_utils/configs/glm_tts.py
Registries vllm_omni/model_executor/models/registry.py, vllm_omni/diffusion/registry.py
Serving vllm_omni/entrypoints/openai/serving_speech.py
Examples examples/offline_inference/glm_tts/, examples/online_serving/glm_tts/
E2E tests tests/e2e/offline_inference/test_glm_tts.py, tests/e2e/online_serving/test_glm_tts.py
CI .buildkite/test-ready.yml, .buildkite/test-merge.yml

Test Plan

Automated E2E Tests

Offline (tests/e2e/offline_inference/test_glm_tts.py):

  • test_offline_tts_zh — basic Chinese TTS (core_model)
  • test_offline_tts_long_text — longer Chinese text (core_model)
  • test_offline_voice_clone_zh — voice cloning with public URL ref audio (advanced_model)

Online (tests/e2e/online_serving/test_glm_tts.py):

  • test_basic_tts_zh — /v1/audio/speech basic TTS (core_model)
  • test_basic_tts_long_text — longer text via API (core_model)
  • test_voice_clone_zh — voice cloning via API (advanced_model)
  • test_models_endpoint — /v1/models endpoint (advanced_model)

Buildkite CI:

  • test-ready.yml: core_model, gpu_1_queue, 20min
  • test-merge.yml: advanced_model, gpu_1_queue, 20min

Manual Commands

# Offline
python examples/offline_inference/glm_tts/end2end.py \
  --model THUDM/GLM-TTS --text "你好,这是一个语音合成测试。" --output-dir ./output

# Server
vllm-omni serve THUDM/GLM-TTS \
  --stage-configs-path vllm_omni/model_executor/stage_configs/glm_tts.yaml \
  --trust-remote-code --enforce-eager

# Client
python examples/online_serving/glm_tts/openai_speech_client.py \
  --text "你好,这是一个语音合成测试。"

# Voice cloning
python examples/online_serving/glm_tts/openai_speech_client.py \
  --text "你好" \
  --ref-audio "https://raw.githubusercontent.com/zai-org/GLM-TTS/main/examples/prompt/jiayan_zh.wav" \
  --ref-text "他当时还跟线下其他的站姐吵架,然后,打架进局子了。"

Test Result

Environment

Tested on my own device:

GPU: NVIDIA A40 (48GB), Python 3.10, CUDA 12.x
Config: enforce_eager=true, async_scheduling=false

Offline Inference (10 samples, 12-50 chars)

| Mode | Samples | Avg Wall | Avg Audio | Total RTF | Throughput |
|---|---|---|---|---|---|
| No-clone | 10 | 4.28s | 6.82s | 0.628 | 1.59x realtime |
| Voice clone | 10 | 4.66s | 6.84s | 0.682 | 1.47x realtime |
Per-sample breakdown (no-clone):

| # | Chars | Wall(s) | Audio(s) | RTF |
|---|---|---|---|---|
| 0 | 14 | 3.44 | 2.98 | 1.154 |
| 1 | 12 | 2.05 | 2.74 | 0.749 |
| 2 | 26 | 5.03 | 5.50 | 0.914 |
| 3 | 24 | 3.21 | 5.06 | 0.635 |
| 4 | 33 | 4.41 | 7.54 | 0.584 |
| 5 | 50 | 6.20 | 11.34 | 0.547 |
| 6 | 45 | 6.02 | 11.34 | 0.531 |
| 7 | 35 | 4.12 | 7.14 | 0.577 |
| 8 | 32 | 4.21 | 7.46 | 0.564 |
| 9 | 27 | 4.15 | 7.06 | 0.587 |
  • Both modes faster than realtime (RTF < 1.0)
  • Clone mode ~10% slower (WhisperVQ encoding overhead)
  • DiT stage ~700ms/request; AR stage is the bottleneck
  • No existing model behavior modified
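For reference, the real-time factor (RTF) here is wall-clock time divided by generated audio duration, so values below 1.0 are faster than realtime; throughput is the inverse ratio. Using the no-clone averages from the table above:

```python
def rtf(wall_s: float, audio_s: float) -> float:
    """Real-time factor: seconds of compute per second of audio."""
    return wall_s / audio_s

def throughput(wall_s: float, audio_s: float) -> float:
    """Realtime multiple: seconds of audio produced per second of compute."""
    return audio_s / wall_s

# No-clone averages: 4.28 s wall-clock for 6.82 s of audio.
print(round(rtf(4.28, 6.82), 3))         # 0.628
print(round(throughput(4.28, 6.82), 2))  # 1.59
```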

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md

@BeatSeat BeatSeat changed the title [Model] Add GLM-TTS text-to-speech model support [WIP][Model] Add GLM-TTS text-to-speech model support Apr 25, 2026
@BeatSeat BeatSeat force-pushed the feat/glm-tts-support branch from 5b65770 to 7b1f0c3 Compare April 25, 2026 09:23
@BeatSeat
Author

Since @zhangj1an is still working on Kimi Audio, I tried to make a contribution first. This is my first time contributing to the vLLM community. Please let me know if I missed anything. Thank you!

@zhangj1an
Contributor

zhangj1an commented Apr 25, 2026

I'm new to adding models too; here is some feedback I received that may be helpful to you:

  • Regarding model configs, refer to #2072 (the part on How to Add a New Model) and #3065.
  • Regarding coding style, I cross-checked against the review comments in #2906 and the code in #2890 (in general I was asked to reuse as much code as possible).
  • It is optional but encouraged to add a model recipe; see this example: https://github.com/vllm-project/vllm-omni/blob/main/recipes/Qwen/Qwen3-Omni.md
  • Regarding adding tests, refer to this guide: https://github.com/vllm-project/vllm-omni/blob/main/docs/contributing/ci/CI_5levels.md
  • Fix the pre-commit checks.
  • When the PR is ready, convert it to ready for review and cc yueqian.

@hsliuustc0106
Collaborator

@JaredforReal

@BeatSeat BeatSeat force-pushed the feat/glm-tts-support branch from 1a7f46c to 33abb4b Compare April 26, 2026 13:37
Signed-off-by: BeatSeat <wendavid552@gmail.com>
@BeatSeat BeatSeat force-pushed the feat/glm-tts-support branch from 66da285 to 6087baa Compare April 26, 2026 18:36
@BeatSeat BeatSeat changed the title [WIP][Model] Add GLM-TTS text-to-speech model support [Model] Add GLM-TTS text-to-speech model support Apr 26, 2026
@BeatSeat BeatSeat marked this pull request as ready for review April 26, 2026 18:37
@BeatSeat BeatSeat requested a review from hsliuustc0106 as a code owner April 26, 2026 18:37
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.


Development

Successfully merging this pull request may close these issues.

[New Model]: https://huggingface.co/zai-org/GLM-TTS
