[Model] Add GLM-TTS text-to-speech model support #834
Conversation
Add initial support for the GLM-TTS (zai-org/GLM-TTS) text-to-speech model in the X2S (text-to-speech) pipeline.

Changes:
- Add GLM-TTS DiT model with flow matching for mel-spectrogram generation
- Add GLM-TTS pipeline following the Stable Audio pattern
- Register GLMTTSPipeline in the diffusion registry
- Add tests for the GLM-TTS model

The implementation supports:
- Flow matching with Euler ODE integration
- Speech token to mel-spectrogram conversion
- Optional Vocos vocoder for waveform generation
- Speaker embedding for voice characteristics

Closes vllm-project#821

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
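For context, flow matching with Euler ODE integration amounts to a fixed-step solve of dx/dt = v(x, t) from noise toward data. A minimal sketch follows; the function name, signature, and the t=0→1 noise-to-data convention are illustrative assumptions, not the PR's actual code:

```python
import torch

def euler_flow_matching_sample(velocity_model, latents, num_steps=10):
    """Integrate the flow-matching ODE dx/dt = v(x, t) from t=0 (noise)
    toward t=1 (data) with fixed-step Euler integration."""
    timesteps = torch.linspace(0.0, 1.0, num_steps + 1, device=latents.device)
    for i in range(num_steps):
        t = timesteps[i].expand(latents.shape[0])  # per-sample timestep
        velocity = velocity_model(latents, t)      # predicted velocity field
        dt = timesteps[i + 1] - timesteps[i]
        latents = latents + velocity * dt          # Euler step
    return latents  # mel-spectrogram latents at t=1
```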
Force-pushed from d00c864 to b598dfa
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d00c86473e
```python
if latents is None:
    latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
    latents = latents.to(device)
```
Cast provided latents to model dtype
When callers pass latents (e.g., for reproducibility or partial sampling), the code only moves them to the device and does not cast to dtype. If the model weights are bf16/fp16 (typical for diffusion), a float32 latents tensor will cause a dtype mismatch error in linear layers or force unintended upcasting. Other pipelines in this repo cast provided latents to dtype, so this should too.
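A minimal sketch of the suggested fix, applied to the snippet above; the cast via `.to(device=..., dtype=...)` follows the convention the review attributes to the repo's other pipelines:

```python
if latents is None:
    latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
    # Cast caller-provided latents to the model dtype as well as the device,
    # so bf16/fp16 weights never see a float32 input in linear layers.
    latents = latents.to(device=device, dtype=dtype)
```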
```python
if prompt is not None and isinstance(prompt, str):
    batch_size = 1
elif prompt is not None and isinstance(prompt, list):
    batch_size = len(prompt)
elif speech_tokens is not None:
```
Derive batch size from speech_tokens when provided
If speech_tokens are supplied but prompt is also present (a common pattern for logging), batch size is derived from the prompt and the speech_tokens batch is ignored. When their batch sizes differ, the pipeline will later build latents/speaker_embedding with the prompt batch and then fail in torch.cat with a shape mismatch. Prefer using speech_tokens.shape[0] when speech_tokens is provided or validate that both inputs agree.
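One way to implement the suggestion against the snippet above, preferring speech_tokens for the batch size and validating any prompt against it (a sketch of the reviewer's idea, not the PR's final code):

```python
# Derive batch size from speech_tokens when provided; a prompt passed
# alongside it (e.g., for logging) must agree on batch size.
if speech_tokens is not None:
    batch_size = speech_tokens.shape[0]
    if prompt is not None:
        prompt_batch = 1 if isinstance(prompt, str) else len(prompt)
        if prompt_batch != batch_size:
            raise ValueError(
                f"prompt batch size ({prompt_batch}) does not match "
                f"speech_tokens batch size ({batch_size})"
            )
elif isinstance(prompt, str):
    batch_size = 1
elif isinstance(prompt, list):
    batch_size = len(prompt)
```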
gcanlin left a comment:
Thanks for contributing! As I recall, this is a two-stage model comprising an LLM and a DiT, but the current implementation covers only the diffusion part. Is it possible to split it into two stages like the omni model?
Fix pre-commit and add test plan & results.
Implement GLM-TTS as a proper two-stage model following the Qwen-Omni pattern:
- Stage 0 (LLM): Text → Speech tokens (Llama-based, via vLLM)
- Stage 1 (DiT): Speech tokens → Mel → Audio (flow matching)

Changes:
- Add stage input processor (llm2dit) for the LLM→DiT transition
- Add YAML stage config for the two-stage pipeline
- Update DiT pipeline to accept speech tokens from the stage processor
- Add tests for the stage input processor and two-stage pipeline

The LLM stage uses vLLM's native LlamaForCausalLM support with GLM-TTS special tokens (BOA, EOA) for speech token generation.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
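For illustration, the llm2dit handoff described above reduces to slicing the token span between the BOA and EOA markers out of the LLM output. A sketch with a hypothetical function name and arguments (the actual stage processor's interface is not shown in this thread):

```python
import torch

def extract_speech_tokens(output_ids: torch.Tensor, boa_id: int, eoa_id: int) -> torch.Tensor:
    """Slice the speech tokens emitted between the BOA and EOA special
    tokens from one LLM output sequence (a 1-D tensor of token ids)."""
    starts = (output_ids == boa_id).nonzero(as_tuple=True)[0]
    if len(starts) == 0:
        raise ValueError("no BOA token found in LLM output")
    start = int(starts[0]) + 1
    ends = (output_ids[start:] == eoa_id).nonzero(as_tuple=True)[0]
    end = start + int(ends[0]) if len(ends) else len(output_ids)
    return output_ids[start:end]  # speech tokens handed to the DiT stage
```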
- Remove unused F841 variable (sample_rate in test)
- Remove unused torch.nn.functional import
- Apply ruff format to multi-line strings
- Fix import ordering in __init__.py

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed from 353c295 to bf8d685
Signed-off-by: Itay Etelis <92247226+Etelis@users.noreply.github.com>
- Sort imports alphabetically in test_glm_tts_model.py
- Fix line length in glm_tts.py stage input processor

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
- Condense multi-line assignments to single lines
- Fix line length issues per ruff formatting rules

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed from f97abd2 to 065c5d9
Thank you!
Sorry, done.
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed from 13c8d0f to 7f43f50
- GLM-TTS requires a stage config for the LLM + DiT two-stage architecture
- Add stage_config parameter to properly initialize the pipeline
- Use SamplingParams matching the qwen3_omni example pattern
- Auto-resolve the stage config path relative to the repository root

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
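For context, the LLM-stage sampling setup would go through vLLM's standard SamplingParams. The values below are placeholders, since the qwen3_omni example's actual settings are not reproduced in this thread:

```python
from vllm import SamplingParams

EOA_TOKEN_ID = 0  # hypothetical placeholder for the GLM-TTS end-of-audio token id

# Placeholder values; not the qwen3_omni example's actual settings.
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=2048,
    stop_token_ids=[EOA_TOKEN_ID],  # stop generation at the EOA marker
)
```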
…onfig Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
- Add explicit model path for each stage
- Remove gpu_memory_utilization from the diffusion stage (not a valid param)
- Add model_stage: flow for the DiT stage to load from the flow/ subdirectory
- Add comments explaining the HuggingFace model structure

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
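Putting these commit notes together, the stage config presumably looks roughly like the following, rendered here as a Python dict for illustration; beyond model and model_stage, the key names and schema are assumptions:

```python
# Rough shape of the two-stage config described above. The flow/
# subdirectory follows the HuggingFace repo layout noted in the commit;
# everything else is an assumption.
stage_config = {
    "stages": [
        {
            "name": "llm",               # Stage 0: text -> speech tokens
            "model": "zai-org/GLM-TTS",  # Llama-based LLM weights
        },
        {
            "name": "dit",               # Stage 1: speech tokens -> mel -> audio
            "model": "zai-org/GLM-TTS",
            "model_stage": "flow",       # load the DiT from the flow/ subdirectory
            # no gpu_memory_utilization here: not a valid diffusion-stage param
        },
    ],
}
```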
- Remove 'model' from engine_args for the LLM stage to avoid a duplicate argument
- Remove 'max_num_seqs' from engine_args for the diffusion stage (not a valid param)
- Use pop with a default of None instead of a bare pop for model_stage

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
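The pop change in the last bullet is the usual guard against a missing key; a tiny sketch:

```python
engine_args = {"model": "zai-org/GLM-TTS"}  # e.g., the LLM stage: no model_stage key

# pop with a default avoids a KeyError when model_stage is absent,
# unlike a bare engine_args.pop("model_stage").
model_stage = engine_args.pop("model_stage", None)  # -> None here
```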
Adds initial support for GLM-TTS in the X2S pipeline.
Changes
- vllm_omni/diffusion/models/glm_tts/ - GLM-TTS model implementation
- vllm_omni/diffusion/registry.py - Register GLMTTSPipeline
- tests/e2e/offline_inference/test_glm_tts_model.py - Tests

Implementation
Closes #821
Related: #808