[Model] Add GLM-TTS text-to-speech model support #3141
BeatSeat wants to merge 3 commits into vllm-project:main
Conversation
Since @zhangj1an is still working at Kimi Audio, I tried to make a contribution first. This is my first time contributing to the vLLM community, so please let me know if I missed anything. Thank you!
I'm new to adding models too; here is some feedback I received that may be helpful to you. Regarding model configs, you can refer to #2072 (the part on
Signed-off-by: BeatSeat <wendavid552@gmail.com>
This PR adds support for GLM-TTS, a two-stage text-to-speech model from Zhipu AI.
Architecture:
Key features:
Purpose
Add GLM-TTS (Zhipu AI) as a two-stage TTS pipeline in vLLM-Omni.
Stage 0 — AR Code Predictor:

- Sampling: `top_p=0.8`, `top_k=25`, `win_size=10`, `tau_r=0.1`
- `min_len`/`max_len` based on `min_token_text_ratio=2.0` / `max_token_text_ratio=20.0`

Stage 1 — Flow Matching DiT:
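The `min_len`/`max_len` rule above can be illustrated with a small helper. This is a hypothetical sketch of the ratio computation described in the PR text, not the PR's actual code:

```python
def ar_length_bounds(num_text_tokens: int,
                     min_token_text_ratio: float = 2.0,
                     max_token_text_ratio: float = 20.0) -> tuple[int, int]:
    """Derive min/max AR audio-code lengths from the text-token count.

    Hypothetical helper: the PR states that min_len/max_len are based on
    min_token_text_ratio=2.0 / max_token_text_ratio=20.0; the real GLM-TTS
    implementation may compute this differently.
    """
    min_len = int(num_text_tokens * min_token_text_ratio)
    max_len = int(num_text_tokens * max_token_text_ratio)
    return min_len, max_len


print(ar_length_bounds(12))  # -> (24, 240)
```

So a 12-token text prompt would bound the AR stage to between 24 and 240 audio codes, under these assumptions.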
Online Serving:

- `serving_speech.py` following the `_tts_model_type` string pattern
- Voice cloning via `ref_audio` (URL / data URI) + `ref_text`

Closes #821
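A voice-cloning request to the OpenAI-compatible speech endpoint might look like the following. The `ref_audio` / `ref_text` field names and the `/v1/audio/speech` route come from this PR description; the model id and exact request schema are my assumptions:

```python
import json


def build_speech_request(text: str, ref_audio_url: str, ref_text: str) -> dict:
    """Build a request body for the /v1/audio/speech route.

    ref_audio accepts a URL or data URI of the reference clip, and
    ref_text is its transcript (per the PR description). The model id
    below is an assumption for illustration.
    """
    return {
        "model": "zai-org/GLM-TTS",     # assumed model id
        "input": text,                   # text to synthesize
        "ref_audio": ref_audio_url,      # URL / data URI of reference audio
        "ref_text": ref_text,            # transcript of the reference audio
    }


payload = build_speech_request("你好，世界", "https://example.com/ref.wav", "参考文本")
print(json.dumps(payload, ensure_ascii=False))
# Send with e.g. requests.post("http://localhost:8000/v1/audio/speech", json=payload)
```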
Related: #808, #834
Files

- `vllm_omni/model_executor/models/glm_tts/`
- `vllm_omni/diffusion/models/glm_tts/`
- `vllm_omni/model_executor/stage_configs/glm_tts.yaml`, `glm_tts_async_chunk.yaml`
- `vllm_omni/model_executor/stage_input_processors/glm_tts.py`
- `vllm_omni/transformers_utils/configs/glm_tts.py`
- `vllm_omni/model_executor/models/registry.py`, `vllm_omni/diffusion/registry.py`
- `vllm_omni/entrypoints/openai/serving_speech.py`
- `examples/offline_inference/glm_tts/`, `examples/online_serving/glm_tts/`
- `tests/e2e/offline_inference/test_glm_tts.py`, `tests/e2e/online_serving/test_glm_tts.py`
- `.buildkite/test-ready.yml`, `.buildkite/test-merge.yml`

Test Plan
Automated E2E Tests
Offline (`tests/e2e/offline_inference/test_glm_tts.py`):

- `test_offline_tts_zh` — basic Chinese TTS (core_model)
- `test_offline_tts_long_text` — longer Chinese text (core_model)
- `test_offline_voice_clone_zh` — voice cloning with public URL ref audio (advanced_model)

Online (`tests/e2e/online_serving/test_glm_tts.py`):

- `test_basic_tts_zh` — `/v1/audio/speech` basic TTS (core_model)
- `test_basic_tts_long_text` — longer text via API (core_model)
- `test_voice_clone_zh` — voice cloning via API (advanced_model)
- `test_models_endpoint` — `/v1/models` endpoint (advanced_model)

Buildkite CI:

- `test-ready.yml`: core_model, gpu_1_queue, 20 min
- `test-merge.yml`: advanced_model, gpu_1_queue, 20 min

Manual Commands
Test Result
Environment
Tested on my own device:
Offline Inference (10 samples, 12-50 chars)
Per-sample breakdown (no-clone)
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md