[Model] Add GLM-TTS text-to-speech model support #834
Conversation
Add initial support for the GLM-TTS (zai-org/GLM-TTS) text-to-speech model in the X2S (text-to-speech) pipeline.

Changes:
- Add GLM-TTS DiT model with flow matching for mel-spectrogram generation
- Add GLM-TTS pipeline following the Stable Audio pattern
- Register GLMTTSPipeline in the diffusion registry
- Add tests for the GLM-TTS model

The implementation supports:
- Flow matching with Euler ODE integration
- Speech token to mel-spectrogram conversion
- Optional Vocos vocoder for waveform generation
- Speaker embedding for voice characteristics

Closes vllm-project#821

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
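For context, flow matching with Euler ODE integration amounts to a fixed-step solve of dx/dt = v(x, t) from noise toward data. A minimal sketch follows; the function name, signature, and the t=0→1 noise-to-data convention are illustrative assumptions, not the PR's actual code:

```python
import torch

def euler_flow_matching_sample(velocity_model, latents, num_steps=10):
    """Integrate the flow-matching ODE dx/dt = v(x, t) from t=0 (noise)
    toward t=1 (data) with fixed-step Euler integration."""
    timesteps = torch.linspace(0.0, 1.0, num_steps + 1, device=latents.device)
    for i in range(num_steps):
        t = timesteps[i].expand(latents.shape[0])  # per-sample timestep
        velocity = velocity_model(latents, t)      # predicted velocity field
        dt = timesteps[i + 1] - timesteps[i]
        latents = latents + velocity * dt          # Euler step
    return latents  # mel-spectrogram latents at t=1
```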
Force-pushed from d00c864 to b598dfa
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d00c86473e
```python
if latents is None:
    latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
    latents = latents.to(device)
```
Cast provided latents to model dtype
When callers pass latents (e.g., for reproducibility or partial sampling), the code only moves them to the device and does not cast to dtype. If the model weights are bf16/fp16 (typical for diffusion), a float32 latents tensor will cause a dtype mismatch error in linear layers or force unintended upcasting. Other pipelines in this repo cast provided latents to dtype, so this should too.
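A minimal sketch of the suggested fix, applied to the snippet above; the cast via `.to(device=..., dtype=...)` follows the convention the review attributes to the repo's other pipelines:

```python
if latents is None:
    latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
    # Cast caller-provided latents to the model dtype as well as the device,
    # so bf16/fp16 weights never see a float32 input in linear layers.
    latents = latents.to(device=device, dtype=dtype)
```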
```python
if prompt is not None and isinstance(prompt, str):
    batch_size = 1
elif prompt is not None and isinstance(prompt, list):
    batch_size = len(prompt)
elif speech_tokens is not None:
```
Derive batch size from speech_tokens when provided
If speech_tokens are supplied but prompt is also present (a common pattern for logging), batch size is derived from the prompt and the speech_tokens batch is ignored. When their batch sizes differ, the pipeline will later build latents/speaker_embedding with the prompt batch and then fail in torch.cat with a shape mismatch. Prefer using speech_tokens.shape[0] when speech_tokens is provided or validate that both inputs agree.
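One way to implement the suggestion against the snippet above, preferring speech_tokens for the batch size and validating any prompt against it (a sketch of the reviewer's idea, not the PR's final code):

```python
# Derive batch size from speech_tokens when provided; a prompt passed
# alongside it (e.g., for logging) must agree on batch size.
if speech_tokens is not None:
    batch_size = speech_tokens.shape[0]
    if prompt is not None:
        prompt_batch = 1 if isinstance(prompt, str) else len(prompt)
        if prompt_batch != batch_size:
            raise ValueError(
                f"prompt batch size ({prompt_batch}) does not match "
                f"speech_tokens batch size ({batch_size})"
            )
elif isinstance(prompt, str):
    batch_size = 1
elif isinstance(prompt, list):
    batch_size = len(prompt)
```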
gcanlin left a comment:
Thanks for contributing! As I recall, this is a two-stage model comprising an LLM and a DiT, but the current implementation covers only the diffusion part. Is it possible to split it into two stages like the omni model?
Fix pre-commit and add test plan & results.
Implement GLM-TTS as a proper two-stage model following the Qwen-Omni pattern:
- Stage 0 (LLM): Text → Speech tokens (Llama-based, via vLLM)
- Stage 1 (DiT): Speech tokens → Mel → Audio (flow matching)

Changes:
- Add stage input processor (llm2dit) for the LLM→DiT transition
- Add YAML stage config for the two-stage pipeline
- Update DiT pipeline to accept speech tokens from the stage processor
- Add tests for the stage input processor and two-stage pipeline

The LLM stage uses vLLM's native LlamaForCausalLM support with GLM-TTS special tokens (BOA, EOA) for speech token generation.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
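For illustration, the llm2dit handoff described above reduces to slicing the token span between the BOA and EOA markers out of the LLM output. A sketch with a hypothetical function name and arguments (the actual stage processor's interface is not shown in this thread):

```python
import torch

def extract_speech_tokens(output_ids: torch.Tensor, boa_id: int, eoa_id: int) -> torch.Tensor:
    """Slice the speech tokens emitted between the BOA and EOA special
    tokens from one LLM output sequence (a 1-D tensor of token ids)."""
    starts = (output_ids == boa_id).nonzero(as_tuple=True)[0]
    if len(starts) == 0:
        raise ValueError("no BOA token found in LLM output")
    start = int(starts[0]) + 1
    ends = (output_ids[start:] == eoa_id).nonzero(as_tuple=True)[0]
    end = start + int(ends[0]) if len(ends) else len(output_ids)
    return output_ids[start:end]  # speech tokens handed to the DiT stage
```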
- Remove unused F841 variable (sample_rate in test)
- Remove unused torch.nn.functional import
- Apply ruff format to multi-line strings
- Fix import ordering in __init__.py

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed from 353c295 to bf8d685
Signed-off-by: Itay Etelis <92247226+Etelis@users.noreply.github.com>
- Sort imports alphabetically in test_glm_tts_model.py
- Fix line length in glm_tts.py stage input processor

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
- Condense multi-line assignments to single lines
- Fix line length issues per ruff formatting rules

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed from f97abd2 to 065c5d9
Thank you!
Sorry, done.
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed from 13c8d0f to 7f43f50
- GLM-TTS requires a stage config for the LLM + DiT two-stage architecture
- Add stage_config parameter to properly initialize the pipeline
- Use SamplingParams matching the qwen3_omni example pattern
- Auto-resolve the stage config path relative to the repository root

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
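For context, the LLM-stage sampling setup would go through vLLM's standard SamplingParams. The values below are placeholders, since the qwen3_omni example's actual settings are not reproduced in this thread:

```python
from vllm import SamplingParams

EOA_TOKEN_ID = 0  # hypothetical placeholder for the GLM-TTS end-of-audio token id

# Placeholder values; not the qwen3_omni example's actual settings.
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=2048,
    stop_token_ids=[EOA_TOKEN_ID],  # stop generation at the EOA marker
)
```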
…onfig Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
- Add explicit model path for each stage
- Remove gpu_memory_utilization from the diffusion stage (not a valid param)
- Add model_stage: flow for the DiT stage to load from the flow/ subdirectory
- Add comments explaining the HuggingFace model structure

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
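Putting these commit notes together, the stage config presumably looks roughly like the following, rendered here as a Python dict for illustration; beyond model and model_stage, the key names and schema are assumptions:

```python
# Rough shape of the two-stage config described above. The flow/
# subdirectory follows the HuggingFace repo layout noted in the commit;
# everything else is an assumption.
stage_config = {
    "stages": [
        {
            "name": "llm",               # Stage 0: text -> speech tokens
            "model": "zai-org/GLM-TTS",  # Llama-based LLM weights
        },
        {
            "name": "dit",               # Stage 1: speech tokens -> mel -> audio
            "model": "zai-org/GLM-TTS",
            "model_stage": "flow",       # load the DiT from the flow/ subdirectory
            # no gpu_memory_utilization here: not a valid diffusion-stage param
        },
    ],
}
```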
- Remove 'model' from engine_args for the LLM stage to avoid a duplicate argument
- Remove 'max_num_seqs' from engine_args for the diffusion stage (not a valid param)
- Use pop with a default of None instead of a bare pop for model_stage

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
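The pop change in the last bullet is the usual guard against a missing key; a tiny sketch:

```python
engine_args = {"model": "zai-org/GLM-TTS"}  # e.g., the LLM stage: no model_stage key

# pop with a default avoids a KeyError when model_stage is absent,
# unlike a bare engine_args.pop("model_stage").
model_stage = engine_args.pop("model_stage", None)  # -> None here
```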
Adds initial support for GLM-TTS in the X2S pipeline.
Changes
- vllm_omni/diffusion/models/glm_tts/ - GLM-TTS model implementation
- vllm_omni/diffusion/registry.py - Register GLMTTSPipeline
- tests/e2e/offline_inference/test_glm_tts_model.py - Tests

Implementation
Closes #821
Related: #808