
[Model] Add GLM-TTS text-to-speech model support #834

Closed
Etelis wants to merge 12 commits into vllm-project:main from Etelis:feature/glm-tts-support

[Model] Add GLM-TTS text-to-speech model support#834
Etelis wants to merge 12 commits into
vllm-project:mainfrom
Etelis:feature/glm-tts-support

Conversation

@Etelis Etelis commented Jan 18, 2026

Adds initial support for GLM-TTS in the X2S pipeline.

Changes

  • vllm_omni/diffusion/models/glm_tts/ - GLM-TTS model implementation
  • vllm_omni/diffusion/registry.py - Register GLMTTSPipeline
  • tests/e2e/offline_inference/test_glm_tts_model.py - Tests

Implementation

  • Flow matching DiT model for speech token → mel-spectrogram
  • Pipeline following Stable Audio pattern
  • Optional Vocos vocoder for waveform output
  • Speaker embedding support

Closes #821
Related: #808
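
As background for the flow-matching sampler described in the implementation notes above, here is a minimal sketch of flow matching with Euler ODE integration. Everything here (euler_flow_matching_sample, the dit call signature) is illustrative, not the actual code in vllm_omni/diffusion/models/glm_tts/:

    import torch

    @torch.no_grad()
    def euler_flow_matching_sample(dit, cond, shape, num_steps=10, generator=None):
        # Start from Gaussian noise at t=0 and integrate the learned velocity
        # field dx/dt = v(x_t, t, cond) to t=1 with fixed-size Euler steps.
        x = torch.randn(shape, generator=generator, device=cond.device, dtype=cond.dtype)
        ts = torch.linspace(0.0, 1.0, num_steps + 1, device=cond.device)
        for i in range(num_steps):
            t, dt = ts[i], ts[i + 1] - ts[i]
            v = dit(x, t.expand(shape[0]), cond)  # predicted velocity at time t
            x = x + dt * v                        # Euler update
        return x  # mel-spectrogram; an optional Vocos vocoder renders the waveform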

@Etelis Etelis requested a review from hsliuustc0106 as a code owner January 18, 2026 19:43
Add initial support for GLM-TTS (zai-org/GLM-TTS) text-to-speech model
in the X2S (text-to-speech) pipeline.

Changes:
- Add GLM-TTS DiT model with flow matching for mel-spectrogram generation
- Add GLM-TTS pipeline following the Stable Audio pattern
- Register GLMTTSPipeline in diffusion registry
- Add tests for GLM-TTS model

The implementation supports:
- Flow matching with Euler ODE integration
- Speech token to mel-spectrogram conversion
- Optional Vocos vocoder for waveform generation
- Speaker embedding for voice characteristics

Closes vllm-project#821

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis force-pushed the feature/glm-tts-support branch from d00c864 to b598dfa on January 18, 2026 19:47

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d00c86473e


Comment on lines +275 to +279
if latents is None:
    latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
    latents = latents.to(device)


P2: Cast provided latents to model dtype

When callers pass latents (e.g., for reproducibility or partial sampling), the code only moves them to the device and does not cast to dtype. If the model weights are bf16/fp16 (typical for diffusion), a float32 latents tensor will cause a dtype mismatch error in linear layers or force unintended upcasting. Other pipelines in this repo cast provided latents to dtype, so this should too.
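
A minimal fix along the lines suggested, keeping the snippet's existing names:

    if latents is None:
        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
    else:
        # Cast caller-provided latents to the execution dtype as well as the
        # device, mirroring the randn_tensor branch above.
        latents = latents.to(device=device, dtype=dtype)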


Comment on lines +386 to +390
if prompt is not None and isinstance(prompt, str):
    batch_size = 1
elif prompt is not None and isinstance(prompt, list):
    batch_size = len(prompt)
elif speech_tokens is not None:

P2: Derive batch size from speech_tokens when provided

If speech_tokens are supplied but prompt is also present (a common pattern for logging), batch size is derived from the prompt and the speech_tokens batch is ignored. When their batch sizes differ, the pipeline will later build latents/speaker_embedding with the prompt batch and then fail in torch.cat with a shape mismatch. Prefer using speech_tokens.shape[0] when speech_tokens is provided or validate that both inputs agree.
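
A sketch of the suggested change, preferring speech_tokens for the batch size and validating that both inputs agree (variable names as in the snippet above):

    if speech_tokens is not None:
        batch_size = speech_tokens.shape[0]
        if prompt is not None:
            num_prompts = 1 if isinstance(prompt, str) else len(prompt)
            if num_prompts != batch_size:
                raise ValueError(
                    f"prompt batch size ({num_prompts}) does not match "
                    f"speech_tokens batch size ({batch_size})"
                )
    elif isinstance(prompt, str):
        batch_size = 1
    elif isinstance(prompt, list):
        batch_size = len(prompt)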


@david6666666 mentioned this pull request Jan 19, 2026
@gcanlin gcanlin (Collaborator) left a comment

Thanks for contributing! I remember this model is a two-stage model, consisting of an LLM and a DiT, but the current implementation seems to cover only the diffusion part. Is it possible to separate it into two stages like the omni model?

@hsliuustc0106 hsliuustc0106 (Collaborator) commented

fix pre-commit and add test plan & results

Implement GLM-TTS as a proper two-stage model following the Qwen-Omni
pattern:
- Stage 0 (LLM): Text → Speech tokens (Llama-based via vLLM)
- Stage 1 (DiT): Speech tokens → Mel → Audio (Flow matching)

Changes:
- Add stage input processor (llm2dit) for LLM→DiT transition
- Add YAML stage config for two-stage pipeline
- Update DiT pipeline to accept speech tokens from stage processor
- Add tests for stage input processor and two-stage pipeline

The LLM stage uses vLLM's native LlamaForCausalLM support with
GLM-TTS special tokens (BOA, EOA) for speech token generation.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
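
To illustrate the LLM→DiT handoff this commit describes, here is a hypothetical sketch of what the llm2dit stage input processor does; the actual function name, signature, and token handling in vllm_omni may differ:

    def llm2dit(llm_output_ids: list[int], boa_id: int, eoa_id: int) -> list[int]:
        # Stage 0 (LLM) emits speech tokens delimited by the BOA/EOA special
        # tokens; slice them out so Stage 1 (DiT) receives only speech tokens.
        start = llm_output_ids.index(boa_id) + 1
        end = llm_output_ids.index(eoa_id, start)
        return llm_output_ids[start:end]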
- Remove unused F841 variable (sample_rate in test)
- Remove unused torch.nn.functional import
- Apply ruff format to multi-line strings
- Fix import ordering in __init__.py

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis force-pushed the feature/glm-tts-support branch from 353c295 to bf8d685 on January 19, 2026 07:02
Etelis and others added 4 commits January 19, 2026 09:03
Signed-off-by: Itay Etelis <92247226+Etelis@users.noreply.github.com>
- Sort imports alphabetically in test_glm_tts_model.py
- Fix line length in glm_tts.py stage input processor

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
- Condense multi-line assignments to single lines
- Fix line length issues per ruff formatting rules

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis force-pushed the feature/glm-tts-support branch from f97abd2 to 065c5d9 on January 19, 2026 07:42
@Etelis Etelis (Author) commented Jan 19, 2026

Thanks for contributing! I remember this model is a two-stage model, consisting of an LLM and a DiT, but the current implementation seems to cover only the diffusion part. Is it possible to separate it into two stages like the omni model?

Thank you!
How do you feel about that change?

@Etelis Etelis (Author) commented Jan 19, 2026

fix pre-commit and add test plan & results

Sorry, done.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis force-pushed the feature/glm-tts-support branch from 13c8d0f to 7f43f50 on January 19, 2026 08:19
- GLM-TTS requires a stage config for LLM + DiT two-stage architecture
- Add stage_config parameter to properly initialize the pipeline
- Use SamplingParams matching the qwen3_omni example pattern
- Auto-resolve stage config path relative to repository root

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
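
For context, vLLM sampling parameters in the style this commit refers to might look like the following; the concrete values in the qwen3_omni example may differ:

    from vllm import SamplingParams

    # Illustrative values only; see the qwen3_omni example for the real ones.
    sampling_params = SamplingParams(temperature=0.9, top_p=0.9, max_tokens=2048)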
…onfig

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
- Add explicit model path for each stage
- Remove gpu_memory_utilization from diffusion stage (not a valid param)
- Add model_stage: flow for DiT stage to load from flow/ subdirectory
- Add comments explaining HuggingFace model structure

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
- Remove 'model' from engine_args for LLM stage to avoid duplicate argument
- Remove 'max_num_seqs' from engine_args for diffusion stage (not valid param)
- Use pop with default None instead of bare pop for model_stage

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
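
A sketch of the engine_args cleanup this commit describes (key names are taken from the commit message; the surrounding code is assumed):

    # LLM stage: drop 'model' so it is not passed to the engine twice.
    engine_args.pop("model", None)
    # Diffusion stage: 'max_num_seqs' is not a valid parameter there.
    engine_args.pop("max_num_seqs", None)
    # Use a default so a missing 'model_stage' key does not raise KeyError.
    model_stage = engine_args.pop("model_stage", None)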
Development

Successfully merging this pull request may close these issues:

[New Model]: https://huggingface.co/zai-org/GLM-TTS